2025 / SimonW Summary

Source: https://simonwillison.net/2025/Dec/31/the-year-in-llms/

2025 in LLMs (Simon Willison) — clustered

1) Capability shift: “reasoning” becomes the new default

  • RLVR/inference-scaling drives 2025 progress: not necessarily bigger base models, but longer RL runs and “reasoning modes/dials.”
  • The practical unlock isn’t puzzles—it’s tool-using planning, especially for research/search and debugging gnarly code.

2) Agents become real (by a pragmatic definition)

  • Agents “happen” once defined as: LLMs running tools in a loop to achieve goals.
  • Two breakout agent use-cases: search/research and coding.
  • “Deep Research” long-running reports peak early, then fade as faster reasoning/search products catch up.

3) Coding agents take over dev workflows (and go async)

  • Claude Code is framed as the year’s most impactful release; then a wave of CLI/IDE agents follows (Codex CLI, Gemini CLI, Qwen Code, etc.).
  • A key pattern: asynchronous coding agents (prompt → wait → PR appears), safer by default (sandboxed) and parallelizable—even from a phone.

4) The terminal becomes mainstream again (via LLMs)

  • CLI becomes a first-class interface because agents can generate the scary commands (bash/sed/ffmpeg) and wire them into workflows.
  • “LLMs on the command line” stops being niche once the harness + models are good.

5) Safety culture tension: YOLO + normalization of deviance

  • “YOLO mode” (auto-approvals) feels dramatically better… which is exactly why it’s risky.
  • Repeated “nothing bad happened” encourages unsafe defaults → “Normalization of Deviance” framing (Challenger analogy).
  • Security concern crystallizes around prompt injection + agents + real access.

6) Security concepts sharpen (browser agents + “lethal trifecta”)

  • AI-enabled browsers are powerful but scary because the browser holds your life.
  • “Lethal trifecta” = private data access + external communication + untrusted content exposure (the dangerous subset of prompt injection).

7) Economics: premium subscriptions normalize

  • A new price anchor emerges: $200/month tiers for power users running token-hungry agents.
  • The logic: heavy agent usage can burn enough tokens that “all-you-can-eat” becomes attractive.

8) Open-weight geopolitics: China leads the charts

  • 2025 is framed as the year Chinese open-weight models top rankings (multiple labs, strong releases, often permissive licenses).
  • They’re not “just catching up”—they’re competitive enough to move markets and narratives.

9) Modality + consumer virality: prompt-driven image editing

  • “Upload an image, edit it with prompts” becomes a massive mainstream moment.
  • Google’s “Nano Banana” models stand out for instruction-following and text-heavy images; OpenAI ships iterative image model APIs too.

10) Benchmarks get real: gold medals & longer task horizons

  • Models hit gold-level performance in elite academic competitions (math/programming) as a public capability signal.
  • “Long tasks” become a headline: time-horizon for autonomous SWE tasks stretches from ~minutes to multiple hours.

11) Standards & scaffolding: MCP vs Skills, plus conformance suites

  • MCP explodes, then feels less central as “Bash is the universal tool” for coding agents.
  • “Skills” (simple files + scripts) feel like a more ergonomic, lower-overhead primitive.
  • Biggest practical unlock for agents: conformance suites / good test harnesses—give agents executable truth and they get much more reliable.

12) Culture & craft: vibe coding, tool-making, phone programming

  • “Vibe coding” names a real behavior: fast, prompt-led prototyping where you don’t read diffs and ship “mostly works.”
  • Personal productivity angle: build lots of small tools; do meaningful chunks of dev from a phone via async agents.

13) Quality + infrastructure backlash: slop + datacenters

  • “Slop” becomes a mainstream label for low-quality AI output; emphasis shifts to curation and signal.
  • Public sentiment turns against data centers: energy/noise/carbon politics rise, with Jevons-paradox vibes (efficiency → more use).