Source: https://simonwillison.net/2025/Dec/31/the-year-in-llms/
2025 in LLMs (Simon Willison) — clustered
1) Capability shift: “reasoning” becomes the new default
- RLVR/inference-scaling drives 2025 progress: not necessarily bigger base models, but longer RL runs and “reasoning modes/dials.”
- The practical unlock isn’t puzzles—it’s tool-using planning, especially for research/search and debugging gnarly code.
2) Agents become real (by a pragmatic definition)
- Agents “happen” once defined as: LLMs running tools in a loop to achieve goals.
- Two breakout agent use-cases: search/research and coding.
- “Deep Research” long-running reports peak early, then fade as faster reasoning/search products catch up.
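The pragmatic definition above (“LLMs running tools in a loop to achieve goals”) fits in a few lines of code. This is an illustrative skeleton only, not any product's implementation; `call_model` and the tool registry are hypothetical stand-ins for a real LLM API and real tools.

```python
# Minimal agent-loop sketch: the model either requests a tool or answers.
# `call_model` fakes an LLM: one search request, then a final answer.

def call_model(messages):
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "search", "args": {"query": "example"}}
    return {"answer": "done"}

TOOLS = {
    # Hypothetical tool: a real agent would hit a search API here.
    "search": lambda query: f"results for {query!r}",
}

def run_agent(goal, max_steps=10):
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        reply = call_model(messages)
        if "answer" in reply:          # model decided the goal is met
            return reply["answer"]
        result = TOOLS[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("step budget exhausted")
```

The loop structure, not the model, is what makes it an “agent”: tool call, observe result, repeat until done or out of budget.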
3) Coding agents take over dev workflows (and go async)
- Claude Code is framed as the year’s most impactful release; then a wave of CLI/IDE agents follows (Codex CLI, Gemini CLI, Qwen Code, etc.).
- A key pattern: asynchronous coding agents (prompt → wait → PR appears), safer by default (sandboxed) and parallelizable—even from a phone.
4) The terminal becomes mainstream again (via LLMs)
- CLI becomes a first-class interface because agents can generate the scary commands (bash/sed/ffmpeg) and wire them into workflows.
- “LLMs on the command line” stops being niche once the harness + models are good.
5) Safety culture tension: YOLO + normalization of deviance
- “YOLO mode” (auto-approvals) feels dramatically better… which is exactly why it’s risky.
- Repeated “nothing bad happened” encourages unsafe defaults → “Normalization of Deviance” framing (Challenger analogy).
- Security concern crystallizes around prompt injection + agents + real access.
6) Security concepts sharpen (browser agents + “lethal trifecta”)
- AI-enabled browsers are powerful but scary because the browser holds your life.
- “Lethal trifecta” = private data access + external communication + untrusted content exposure (the dangerous subset of prompt injection).
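The trifecta framing is a conjunction: the dangerous zone is only when all three properties hold at once, and removing any single leg defuses it. A toy audit check makes that logic explicit (property names here are my own shorthand, not from the post):

```python
# Toy "lethal trifecta" check: the combination is dangerous only when
# ALL THREE legs are present; dropping any one of them defuses it.

def lethal_trifecta(private_data: bool,
                    external_comms: bool,
                    untrusted_content: bool) -> bool:
    return private_data and external_comms and untrusted_content

# A browser agent typically has all three legs:
assert lethal_trifecta(True, True, True) is True
# Cutting the exfiltration channel (external comms) defuses it:
assert lethal_trifecta(True, False, True) is False
```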
7) Economics: premium subscriptions normalize
- A new price anchor emerges: $200/month tiers for power users running token-hungry agents.
- The logic: heavy agent usage can burn enough tokens that “all-you-can-eat” becomes attractive.
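The break-even logic is simple arithmetic. The per-token price below is purely illustrative (not a quote from any vendor); the point is that agent workloads measured in tens of millions of tokens per month make a flat tier attractive.

```python
# Hypothetical break-even sketch: at $P per million tokens (illustrative
# number, not a real price), how many tokens/month make a $200 tier win?
price_per_mtok = 10.0                  # assumed blended $/1M tokens
flat_tier = 200.0                      # $/month subscription
break_even_tokens = flat_tier / price_per_mtok * 1_000_000
assert break_even_tokens == 20_000_000  # ~20M tokens/month to break even
```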
8) Open-weight geopolitics: China leads the charts
- 2025 is framed as the year Chinese open-weight models top rankings (multiple labs, strong releases, often permissive licenses).
- They’re not “just catching up”—they’re competitive enough to move markets and narratives.
9) Modality + consumer virality: prompt-driven image editing
- “Upload an image, edit it with prompts” becomes a massive mainstream moment.
- Google’s “Nano Banana” models stand out for instruction-following and text-heavy images; OpenAI ships iterative image model APIs too.
10) Benchmarks get real: gold medals & longer task horizons
- Models hit gold-level performance in elite academic competitions (math/programming) as a public capability signal.
- “Long tasks” become a headline: the time horizon for autonomous SWE tasks stretches from ~minutes to multiple hours.
11) Standards & scaffolding: MCP vs Skills, plus conformance suites
- MCP adoption explodes early, then MCP feels less central once “Bash is the universal tool” becomes the working assumption for coding agents.
- “Skills” (simple files + scripts) feel like a more ergonomic, lower-overhead primitive.
- Biggest practical unlock for agents: conformance suites / good test harnesses—give agents executable truth and they get much more reliable.
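The “executable truth” point can be made concrete: hand the agent a conformance suite it can run after every edit. A minimal sketch, with a hypothetical `slugify` as the function under test:

```python
# Minimal conformance-suite sketch: a table of (input, expected) cases
# an agent can re-run after each change to verify its work.
# `slugify` is a hypothetical function under test.

def slugify(title: str) -> str:
    return "-".join(title.lower().split())

CASES = [
    ("Hello World", "hello-world"),
    ("  spaced  out  ", "spaced-out"),
]

def run_suite():
    # Collect failing cases; an empty list means the change conforms.
    return [(inp, exp, slugify(inp))
            for inp, exp in CASES if slugify(inp) != exp]

assert run_suite() == []
```

The agent's feedback loop is then “edit, run `run_suite()`, read failures, repeat”, which is far more reliable than the model judging its own output.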
12) Culture & craft: vibe coding, tool-making, phone programming
- “Vibe coding” names a real behavior: fast, prompt-led prototyping where you don’t read diffs and ship “mostly works.”
- Personal productivity angle: build lots of small tools; do meaningful chunks of dev from a phone via async agents.
13) Quality + infrastructure backlash: slop + datacenters
- “Slop” becomes a mainstream label for low-quality AI output; emphasis shifts to curation and signal.
- Public sentiment turns against data centers: energy/noise/carbon politics rise, with Jevons-paradox vibes (efficiency → more use).