2024–2026: The Frontier
Coding agents, agentic products, new protocols, and where the field is now
Overview
2024–2026 witnessed a fundamental shift: LLM agents moved from research prototypes to mainstream products. 2025 will be remembered as the year reasoning models became agents, browser automation went viral, and the infrastructure for multi-agent systems matured into production-ready frameworks.
Key storylines:
- Coding agents became the killer app — SWE-bench progress from ~12% (2024) to 77%+ (2025)
- Computer use went mass market — Anthropic Computer Use, OpenAI Operator, dozens of open-source tools
- Protocol wars — MCP (Anthropic, Nov 2024) and A2A (Google, Apr 2025) competed to become the standard
- Framework consolidation — Microsoft merged AutoGen + Semantic Kernel; LangGraph hit v1.0
- “2025 is the year of the agent” — became the defining claim of every AI company’s roadmap
Coding & Software Engineering Agents
This became the most competitive sub-field of agent research, driven by clear benchmarks and commercial demand.
SWE-bench: The Scoreboard
SWE-bench (Jimenez et al., 2023) measures the ability of agents to resolve real GitHub issues. Progress has been stunning:
Note: SWE-bench Verified (500 tasks, human-validated) became the standard leaderboard from Aug 2024. Earlier entries used SWE-bench Full (2,294 tasks) or Lite (570 tasks).
| Date | System | Benchmark | Score |
|---|---|---|---|
| Mar 2024 | Devin (Cognition AI) | Lite | 13.86% |
| May 2024 | SWE-agent + GPT-4 Turbo (Princeton) | Full | 12.47% |
| Late 2024 | Claude 3.5 Sonnet + scaffolding | Verified | ~49% |
| Oct 2025 | Claude Sonnet 4.5 | Verified | 77.2% (82.0% with parallel compute) |
| Nov 2025 | Gemini 3 Pro + Live-SWE-agent | Verified | 77.4% |
A separate harder benchmark, SWE-bench Pro (Scale Labs), tests generalization on never-before-seen private codebases. GPT-5 scores 23.3% and Claude Opus 4.1 scores 23.1% there (Scale Labs leaderboard) vs. 70%+ on Verified, revealing substantial overfitting to public repos.
Key takeaway: The progress is real but benchmarks are being saturated. Evaluation is moving to harder, less contaminated tasks.
SWE-agent (2024)
Yang et al. (Princeton NLP) · arXiv:2405.15793
The benchmark-setting open-source coding agent. Uses the Agent-Computer Interface (ACI) — a carefully designed shell environment for agents, with tools optimized for code understanding and editing. Achieved 12.47% on SWE-bench Full at release (per the official leaderboard); seminal for the field.
- GitHub: princeton-nlp/SWE-agent — 15k+ stars
OpenHands (formerly OpenDevin)
All-Hands AI · github.com/All-Hands-AI/OpenHands — 45k+ stars
The leading open-source autonomous software agent. Supports web browsing, file editing, shell execution, and code generation. Pluggable LLM backends (Claude, GPT-4, open models). Became the go-to open alternative to proprietary coding agents through 2025.
Devin (Cognition AI, 2024)
Cognition AI · cognition.ai
Launched March 2024 as “the first AI software engineer.” Raised significant venture funding and generated enormous press coverage. Uses a persistent developer environment (browser, terminal, editor). While its initial SWE-bench numbers were later revised, it validated commercial demand for autonomous coding.
Claude Code → Claude Agent SDK (2025)
Anthropic · anthropic.com/engineering/building-agents-with-the-claude-agent-sdk · Sep 2025
Started as Claude Code — an agentic coding tool for the terminal. By September 2025, Anthropic renamed the underlying SDK to the Claude Agent SDK, recognizing it had become a general-purpose agent harness powering deep research, video creation, note-taking, and “almost all of our major agent loops.”
- In March 2026, Anthropic launched Code Review for Claude Code — parallel multi-agent code review dispatching specialized reviewer agents
GitHub Copilot Coding Agent (May 2025)
GitHub · github.blog
GitHub Copilot gained autonomous agent mode, integrating with IDEs (VS Code, Xcode, Eclipse, JetBrains, Visual Studio). Agent mode allows multi-step autonomous task execution within the GitHub ecosystem. Introduced GitHub Agent HQ at Universe 2025 (Oct 2025) — a unified workflow for orchestrating any agent.
Agents 101: How to Work with Coding Agents (Devin / Cognition AI)
Cognition AI · devin.ai/agents101
Cognition’s practical guide for engineers integrating autonomous coding agents into their workflows — distilled from building Devin and observing thousands of users. The framing: “A human paired with an AI assistant can achieve more than any AI alone… turning every engineer into an engineering manager.”
Six key principles:
- Say how you want things done, not just what — specify the approach and architecture upfront; don’t just state the goal
- Tell the agent where to start — point it to the right files, repos, and docs; minimize wasted exploration
- Practice defensive prompting — anticipate where a “junior intern” would get confused; preemptively clarify
- Give access to CI, tests, types, and linters — feedback loops via tools are “the magic” of agents; typed Python > untyped; TypeScript > JavaScript
- Leverage your expertise — human oversight of correctness remains non-negotiable; you own the code
- Delegate immediately — when a side task comes up, delegate to the agent and refocus; agents enable async multi-tasking
Note: the guide references a growing ecosystem of similar agents (OpenAI Codex, Google Jules, Cursor, Claude Code) and integration patterns (Slack @agent, GitHub, Linear/Jira).
Aider
Paul Gauthier · aider.chat · GitHub
The practical human-in-the-loop coding assistant. Works in the terminal, handles git diffs, supports Claude, GPT-4, and local models. Emphasizes transparency and control over full autonomy. Widely used and respected for its pragmatic design. Maintains an LLM coding leaderboard ranking models by real-world coding ability.
Major Agentic Products (2025)
OpenAI Operator + Deep Research + ChatGPT Agent
Operator launched January 2025 — an autonomous browser-use agent powered by the Computer-Using Agent (CUA) model combining GPT-4o vision with RL-trained reasoning. Could book reservations, fill forms, shop online.
Deep Research launched February 2025 — powered by o3, conducts multi-step asynchronous research across the web, synthesizing findings into comprehensive reports. Hours of research in minutes.
In July 2025, these were merged into the ChatGPT Agent — a unified system combining Operator’s browser control, Deep Research’s information synthesis, and ChatGPT’s conversational ability. The first mass-market general-purpose agent.
Anthropic Computer Use (Oct 2024)
Anthropic
Launched October 2024, enabling Claude to directly control a computer: move the mouse, type, take screenshots, interpret the screen. First major LLM provider to offer this natively. Spawned a wave of open-source computer-use implementations.
Manus AI (Mar 2025 → Acquired by Meta, Dec 2025)
Butterfly Effect / Monica.im (Singapore, Chinese-founded) · manus.im · arXiv:2505.02024
Launched March 2025 as a general-purpose autonomous agent capable of completing complex real-world tasks with minimal human intervention — writing and deploying code, conducting research, managing files. Described as a “turning point” in AI development for its fully autonomous operation. Acquired by Meta in December 2025.
Google: ADK + Gemini Agents
Google · google.github.io/adk-docs · github.com/google/adk-python
Google’s Agent Development Kit (ADK) — an open-source Python framework for building, evaluating, and deploying AI agents. Model-agnostic but optimized for Gemini. Supports multi-agent workflows, tool integration, and deployment to Vertex AI. Introduced alongside Gemini 2.0’s native multimodal agent capabilities and Project Astra (real-time multimodal assistant).
New Protocols: The Infrastructure Layer
2025 saw the emergence of open protocols for agent interoperability — a sign that the field was maturing beyond one-off integrations.
Model Context Protocol (MCP) — Anthropic, Nov 2024
Anthropic · modelcontextprotocol.io
An open standard for connecting AI agents to external tools and data sources. Defines a client-server architecture: MCP servers expose tools/resources; agents connect as MCP clients. Quickly adopted across the industry — by early 2025, hundreds of MCP servers existed (GitHub, Slack, databases, filesystems, browsers).
The analogy: MCP is to AI agents what HTTP is to web browsers. OpenAI, Google, and all major framework builders endorsed it.
- Microsoft launched Playwright MCP (March 2025) — browser automation via MCP
- Key blog: Simon Willison on MCP
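The client-server exchange can be sketched with plain JSON-RPC 2.0 messages, the wire format MCP builds on. The `tools/list` and `tools/call` method names follow the published spec; the `get_weather` tool and the in-process dispatcher below are invented for illustration:

```python
# An illustrative MCP-style exchange over JSON-RPC 2.0. The "tools/list" and
# "tools/call" method names follow the published spec; the get_weather tool
# and this in-process dispatcher are made up for illustration.

def list_tools_request(req_id: int) -> dict:
    """Client asks the server which tools it exposes."""
    return {"jsonrpc": "2.0", "id": req_id, "method": "tools/list"}

def call_tool_request(req_id: int, name: str, arguments: dict) -> dict:
    """Client invokes one tool by name with JSON arguments."""
    return {
        "jsonrpc": "2.0",
        "id": req_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }

# A toy in-process "server": a registry of callables the agent may invoke.
TOOLS = {"get_weather": lambda args: f"Sunny in {args['city']}"}

def handle(request: dict) -> dict:
    """Dispatch one request the way an MCP server would, as a JSON-RPC response."""
    if request["method"] == "tools/list":
        result = {"tools": [{"name": n} for n in TOOLS]}
    elif request["method"] == "tools/call":
        p = request["params"]
        text = TOOLS[p["name"]](p["arguments"])
        result = {"content": [{"type": "text", "text": text}]}
    else:
        return {"jsonrpc": "2.0", "id": request["id"],
                "error": {"code": -32601, "message": "Method not found"}}
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}

resp = handle(call_tool_request(1, "get_weather", {"city": "Paris"}))
print(resp["result"]["content"][0]["text"])  # → Sunny in Paris
```

The real protocol adds capability negotiation, resources, and prompts on top, but the tool discovery/invocation loop above is the core of what the hundreds of MCP servers implement.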
Agent2Agent Protocol (A2A) — Google, Apr 2025
Google · developers.googleblog.com · Donated to Linux Foundation Jun 2025
Announced April 2025 as a complement to MCP. While MCP handles agent-to-tool connections, A2A handles agent-to-agent communication across different vendors and frameworks. Donated to the Linux Foundation in June 2025 for neutral governance.
- Enables: cross-vendor agent delegation, trusted multi-agent pipelines, interoperability at scale
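In the same illustrative spirit, cross-vendor delegation can be sketched as a task envelope handed to a remote agent that advertises its skills. The field names below (`task_id`, `skill`, `status`) are simplified stand-ins, not A2A's normative schema:

```python
import uuid

# A hypothetical sketch of A2A-style delegation: one agent wraps a request
# as a task and hands it to another vendor's agent. Field names are
# simplified for illustration, not the normative A2A schema.

def make_task(skill: str, payload: dict) -> dict:
    """A client agent addresses a task to a remote agent's advertised skill."""
    return {"task_id": str(uuid.uuid4()), "skill": skill,
            "payload": payload, "status": "submitted"}

class RemoteAgent:
    """Stand-in for another vendor's agent, with a skill registry."""
    def __init__(self, skills):
        self.skills = skills  # skill name -> handler

    def handle(self, task: dict) -> dict:
        handler = self.skills.get(task["skill"])
        if handler is None:
            return {**task, "status": "rejected"}
        return {**task, "status": "completed", "result": handler(task["payload"])}

summarizer = RemoteAgent({"summarize": lambda p: p["text"][:20] + "..."})
done = summarizer.handle(
    make_task("summarize", {"text": "A2A lets agents from different vendors cooperate."}))
print(done["status"], done["result"])
```

The actual protocol layers discovery (agent cards), streaming updates, and authentication on top of this submit/complete lifecycle.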
Framework Consolidation
Microsoft Agent Framework (Oct 2025)
Microsoft · azure.microsoft.com/blog/introducing-microsoft-agent-framework
Microsoft merged AutoGen (research multi-agent framework) and Semantic Kernel (enterprise SDK) into a single Microsoft Agent Framework — public preview October 2025, broader announcement December 2025, GA Q1 2026. Supports Python and .NET. Represents Microsoft’s bet on production-ready agentic AI for enterprise.
LangGraph v1.0 (Nov 2025)
LangChain · github.com/langchain-ai/langgraph
LangGraph hit production-ready v1.0 in November 2025, becoming the dominant framework for complex stateful agent workflows in production. Used by Klarna, Replit, Elastic, and many others. Positioned as the “right tool” for agents needing fine-grained control over state and flow.
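The "fine-grained control over state and flow" idea reduces to a small pattern: nodes are functions over a shared state dict, and a routing function decides which node runs next. A minimal plain-Python sketch of that pattern follows (this is the underlying idea, not LangGraph's actual API):

```python
# A plain-Python sketch of the graph-of-nodes pattern LangGraph popularized:
# nodes transform a shared state dict; edges (here, a routing function) pick
# the next node, possibly conditionally. Node names and the toy acceptance
# rule are invented for illustration.

def draft(state):
    state["text"] = f"draft of: {state['topic']}"
    return state

def review(state):
    state["approved"] = state["revisions"] >= 1  # toy acceptance rule
    return state

def revise(state):
    state["revisions"] += 1
    state["text"] += " (revised)"
    return state

NODES = {"draft": draft, "review": review, "revise": revise}

def route(current, state):
    """Conditional edges: review loops back to revise until approved."""
    if current == "draft":
        return "review"
    if current == "review":
        return None if state["approved"] else "revise"
    return "review"  # after revise, review again

def run(state, start="draft"):
    node = start
    while node is not None:
        state = NODES[node](state)
        node = route(node, state)
    return state

final = run({"topic": "agents", "revisions": 0})
print(final["text"])  # → draft of: agents (revised)
```

LangGraph adds what this sketch lacks for production use: persistence of the state between runs, checkpointing, streaming, and human-in-the-loop interrupts.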
Goose (Jan 2025)
Block (Jack Dorsey’s company) · github.com/block/goose
Open-source, extensible AI agent that goes beyond code suggestions — installs, executes, edits, and tests with any LLM. Launched January 28, 2025. Popular in open-source circles for its extensibility and independence from proprietary ecosystems.
Browser & Computer Use Ecosystem
browser-use
github.com/browser-use/browser-use — 78,000+ stars
The #1 open-source browser automation platform. Makes websites accessible for AI agents, combining LLMs with visual recognition for real-time browser control. Exploded in 2025 following Anthropic Computer Use’s release, providing an open-source implementation for any model.
Skyvern
AI agent for browser workflows using computer vision and LLMs — no CSS selectors or DOM knowledge needed. Visual workflow builder. Inspired by BabyAGI/AutoGPT but grounded in visual understanding.
Stagehand (Browserbase)
TypeScript-first browser automation framework for AI agents. Pairs well with browser-use (Python). Strong developer adoption in 2025.
Key Blog Posts & Practitioner Resources (2024–2026)
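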
Anthropic: Building Effective Agents (Dec 2024)
Erik Schluntz & Barry Zhang (Anthropic) · anthropic.com/research/building-effective-agents
The most practical and widely-cited 2024 post on agent design. Defines the crucial distinction:
- Workflows = multiple LLMs orchestrated with pre-defined patterns
- Agents = LLMs that dynamically direct their own processes and tool usage
Describes 5 core workflow patterns (prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer) and when to build truly autonomous agents vs. structured workflows. Essential reading.
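Two of the five patterns, prompt chaining and routing, can be sketched with a deterministic stub standing in for the model call; `fake_llm` and the handler names below are invented for illustration:

```python
# Prompt chaining and routing, two of the five workflow patterns, with a
# deterministic stub in place of a real LLM so the control flow is visible.
# fake_llm and the handler names are invented for illustration.

def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a model call."""
    if prompt.startswith("classify:"):
        return "billing" if "invoice" in prompt else "general"
    return f"answer({prompt})"

# Pattern 1: prompt chaining. Each step's output feeds the next prompt.
def chain(question: str) -> str:
    outline = fake_llm(f"outline: {question}")
    return fake_llm(f"expand: {outline}")

# Pattern 2: routing. Classify first, then dispatch to a specialized prompt.
HANDLERS = {
    "billing": lambda q: fake_llm(f"billing expert: {q}"),
    "general": lambda q: fake_llm(f"general helper: {q}"),
}

def route(question: str) -> str:
    label = fake_llm(f"classify: {question}")
    return HANDLERS[label](question)

print(chain("what is MCP?"))
print(route("where is my invoice?"))
```

The remaining patterns (parallelization, orchestrator-workers, evaluator-optimizer) compose the same primitive: fixed code deciding when and how the model is called, versus an agent deciding for itself.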
Google Cloud: Lessons from 2025 on Agents and Trust (Dec 2025)
Google Cloud CTO Office · cloud.google.com
Retrospective from Google’s enterprise deployments. Key insight: agents need “agent undo stacks” — idempotent tools and checkpointing that trigger safe rollbacks on failure. “AI grew up and got a job in 2025.”
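The "agent undo stack" idea can be sketched as a stack of inverse actions plus checkpoint markers; the key-value "database" and the `set_key` tool below are hypothetical, chosen only to make the rollback mechanics concrete:

```python
# A toy "agent undo stack": every state-changing tool call records an
# inverse action, and on failure the agent rolls back to the last
# checkpoint. The key-value "database" and set_key tool are hypothetical.

class UndoStack:
    def __init__(self):
        self._undos = []        # list of (description, inverse_fn)
        self._checkpoints = []  # indices into _undos

    def record(self, description, inverse_fn):
        self._undos.append((description, inverse_fn))

    def checkpoint(self):
        self._checkpoints.append(len(self._undos))

    def rollback(self):
        """Undo everything since the most recent checkpoint, newest first."""
        mark = self._checkpoints.pop() if self._checkpoints else 0
        while len(self._undos) > mark:
            _desc, inverse = self._undos.pop()
            inverse()

db = {"plan": "free"}
stack = UndoStack()
stack.checkpoint()

def set_key(key, value):
    old = db.get(key)
    db[key] = value
    # Inverse restores the previous value, or deletes the key if it was new.
    stack.record(f"set {key}",
                 lambda: db.__setitem__(key, old) if old is not None
                 else db.pop(key, None))

set_key("plan", "pro")
set_key("seats", 10)
stack.rollback()  # simulate a downstream failure: revert both writes
print(db)  # → {'plan': 'free'}
```

The idempotency requirement from the post matters here: an inverse that is safe to apply more than once lets a crashed rollback itself be retried.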
Agentic AI, MCP, and Spec-Driven Development (Jan 2026)
GitHub Blog · github.blog
GitHub’s summary of the most-read technical content of 2025. The top themes: agentic coding, MCP adoption, and spec-driven development (writing AGENTS.md files to guide coding agents).
What We Learned from a Year of Building with LLMs (2024)
Eugene Yan, Bryan Bischof, Charles Frye, Hamel Husain et al. · applied-llms.org
Practitioner wisdom from engineers who shipped production LLM systems. Heavy on agent reliability, evals, and avoiding common failure modes. Widely cited as the most actionable guide of 2024.
AI in Production: Frameworks, Protocols, and What Actually Works in 2026
47 Billion · 47billion.com
Four months of hands-on production building. Verdict: “the agent landscape in 2025 is simultaneously more capable and more fragile than the marketing suggests.”
New Papers: 2025 Surveys & Architecture Research
Agentic AI: A Comprehensive Survey of Architectures, Applications, and Future Directions (Oct 2025)
Abou Ali & Dornaika · arXiv:2510.25445
End-to-end survey covering January 2018–March 2025. Covers classical symbolic and modern LLM-orchestrated frameworks.
Agentic AI Frameworks: Architectures, Protocols, and Design Challenges (Aug 2025)
Derouiche et al. · arXiv:2508.10146
Focused survey on the architectural and protocol layer — MCP, A2A, orchestration patterns.
Architectures for Building Agentic AI (Dec 2025)
Proposes a practical taxonomy: tool-using agents, memory-augmented agents, planning/self-improvement agents, multi-agent systems, and embodied/web agents. Analyzes how each reshapes system design.
AI Agents vs. Agentic AI: A Conceptual Taxonomy (May 2025)
Distinguishes AI agents (single-purpose, reactive) from Agentic AI (multi-agent collaboration, dynamic decomposition, persistent memory, coordinated autonomy). Useful conceptual clarification.
Towards a Science of Scaling Agent Systems (Dec 2025)
Kim et al. · arXiv:2512.08296
Examines how agent system performance scales with the number of agents, compute, and interaction patterns.
The 2025 AI Agent Index (Feb 2026)
Documents technical and safety features of deployed agentic AI systems across product overview, company accountability, technical capabilities, autonomy & control, ecosystem interaction, and safety evaluation.
AI Agent Systems: Architectures, Applications, and Evaluation (Jan 2026)
Synthesizes emerging agent architectures for reasoning, planning, tool use, and deployment. Analyzes system-level trade-offs: autonomy vs. controllability, latency vs. reliability, capability vs. safety.
MIT AI Agent Index (2025)
MIT · aiagentindex.mit.edu
Systematic evaluation of 30 prominent AI agents based on publicly available information. Documents origins, design, capabilities, ecosystem, and safety features. Key findings:
- 24/30 agents were released or received major agentic updates in 2024-2025
- Autonomy split: Chat agents at Level 1-3; browser agents at Level 4-5; enterprise agents from Level 1-2 (design) to Level 3-5 (deployment)
- Transparency gap: Of 13 frontier-autonomy agents, only 4 disclose any agentic safety evaluations. Developers share far more about capabilities than safety.
- Foundation model concentration: Nearly all depend on GPT, Claude, or Gemini — structural ecosystem risk
- No web conduct standards: Some agents explicitly designed to bypass anti-bot protections
- Geographic divergence: 21/30 US-based, 5/30 Chinese, markedly different safety frameworks
Companion paper: arXiv:2602.17753
Safety — prompt injection, trust hierarchies, reversibility, the transparency gap — became a first-class concern as agents took real-world actions. Full coverage in Safety & Alignment →
METR: Task Time Horizons as the New Benchmark
METR (Model Evaluation & Threat Research) · metr.org/time-horizons
METR proposed measuring agents not by task accuracy but by task time horizon — the longest task an agent can complete with 50% reliability. This reframing captures something accuracy-based benchmarks miss: can agents sustain coherent action over minutes, hours, or days?
Key findings (March 2025):
- Claude 3.7 Sonnet: ~50-minute time horizon (50%-task-completion time horizon, per arXiv:2503.14499)
- GPT-5 (later 2025): ~2 hours 17 minutes
- The time horizon has been doubling approximately every 7 months since 2019
- Extrapolation: agents capable of multi-day tasks within a few years
This metric directly maps to real-world utility. A 1-hour agent can do a coding task or research summary. A 24-hour agent could run a full experiment pipeline or manage a complex project.
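The doubling claim turns into simple arithmetic. Assuming the ~7-month doubling period and the ~137-minute (2h17m) GPT-5 horizon quoted above, a rough sketch of the extrapolation:

```python
import math

# Back-of-envelope extrapolation of METR's time-horizon trend, using the
# ~7-month doubling period and the ~137-minute GPT-5 horizon cited above.

DOUBLING_MONTHS = 7.0

def horizon_after(months: float, baseline_minutes: float) -> float:
    """Projected 50%-reliability horizon after `months` of the trend."""
    return baseline_minutes * 2 ** (months / DOUBLING_MONTHS)

def months_to_reach(target_minutes: float, baseline_minutes: float) -> float:
    """Months of doubling needed to grow baseline to target."""
    return DOUBLING_MONTHS * math.log2(target_minutes / baseline_minutes)

# From a ~137-minute horizon to a 24-hour task: about two years if the
# trend holds (a big "if" — this is extrapolation, not prediction).
print(round(months_to_reach(24 * 60, 137), 1), "months")
```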
See Science & Research Agents → for more.
OpenAI Harness Engineering & Long-Horizon Codex (2025)
OpenAI Harness Engineering
OpenAI · openai.com/index/harness-engineering
OpenAI’s own internal experience with agentic software development. A team of 3 (growing to 7) engineers built a product with zero manually written lines of code and over 1M lines of generated code in ~5 months, with agents handling architecture, implementation, testing, and PR review.
Documents the workflow, failure modes, and lessons learned from living at the frontier of AI-assisted development. Key insight: at this level, the human’s job shifts from writing code to writing specs and reviewing agent outputs.
Running Long-Horizon Tasks with Codex
OpenAI · developers.openai.com/blog/run-long-horizon-tasks-with-codex
Guidance on using OpenAI Codex for extended, multi-step software tasks. Covers context management, checkpointing, failure recovery, and task decomposition for tasks that span hours rather than minutes.
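The checkpointing and failure-recovery advice can be sketched generically (this is not Codex-specific tooling, just the pattern): persist each completed step so a crashed or restarted run resumes where it left off rather than redoing work.

```python
import json
import os
import tempfile

# A generic checkpoint/resume loop in the spirit of the guidance above.
# Step names and the JSON progress file are invented for illustration.

def run_with_checkpoints(steps, checkpoint_path):
    """Run (name, fn) steps in order, persisting progress after each one."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)
    for name, fn in steps:
        if name in done:
            continue  # finished in an earlier run; skip on resume
        fn()
        done.append(name)
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)  # persist progress after every step
    return done

log = []
steps = [("fetch", lambda: log.append("fetch")),
         ("build", lambda: log.append("build")),
         ("test", lambda: log.append("test"))]

path = os.path.join(tempfile.mkdtemp(), "progress.json")
run_with_checkpoints(steps, path)          # first run: all three steps execute
log.clear()
print(run_with_checkpoints(steps, path))   # resumed run: nothing re-executes
```

For real agent tasks the "steps" come from task decomposition, and the checkpoint would also capture enough context (files touched, decisions made) to rebuild the agent's working state, not just a list of names.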
From Vibe Coding to Agentic Engineering
GLM-5: From Vibe Coding to Agentic Engineering (2026)
GLM-5-Team (Zhipu AI & Tsinghua University) · arXiv:2602.15763
The GLM team (China’s Zhipu AI) traces the transition from informal prompt-based “vibe coding” to systematic agentic software engineering — with structured verification loops, multi-agent coordination, and principled task decomposition. Their GLM series is competitive with international frontier models on coding benchmarks, making this a significant non-Western perspective on the agentic coding frontier.
What’s Next
The open questions for 2026 and beyond:
- Long-horizon reliability — agents still fail on tasks requiring 100+ sequential steps
- Trust & verification — how do you audit what an autonomous agent did?
- Cost — complex agent workflows burn through API credits fast
- Benchmark saturation — as SWE-bench fills up, what’s the next hard benchmark?
- Agent interoperability at scale — MCP + A2A adoption race
- Safety for real-world deployment — reversibility, sandboxing, human oversight patterns
References
Papers & Benchmarks
- SWE-bench: Resolving Real-World GitHub Issues (Jimenez et al., 2023) — arXiv:2310.06770
- SWE-agent: Agent-Computer Interface Enables Automated Software Engineering (Yang et al., 2024, Princeton NLP) — arXiv:2405.15793
- Agentic AI: A Comprehensive Survey of Architectures, Applications, and Future Directions (Abou Ali & Dornaika, 2025) — arXiv:2510.25445
- Agentic AI Frameworks: Architectures, Protocols, and Design Challenges (Derouiche et al., 2025) — arXiv:2508.10146
- Architectures for Building Agentic AI (2025) — arXiv:2512.09458
- AI Agents vs. Agentic AI: A Conceptual Taxonomy (2025) — arXiv:2505.10468
- Towards a Science of Scaling Agent Systems (Kim et al., 2025) — arXiv:2512.08296
- The 2025 AI Agent Index (2026) — arXiv:2602.17753
- AI Agent Systems: Architectures, Applications, and Evaluation (2026) — arXiv:2601.01743
- GLM-5: From Vibe Coding to Agentic Engineering (GLM-5-Team, Zhipu AI & Tsinghua University, 2026) — arXiv:2602.15763
- From Mind to Machine: The Rise of Manus AI as a Fully Autonomous Digital Agent (Yang et al., 2025; independent academic overview) — arXiv:2505.02024
Open-Source Projects & Frameworks
- SWE-agent — GitHub: princeton-nlp/SWE-agent
- OpenHands (formerly OpenDevin) — GitHub: All-Hands-AI/OpenHands
- browser-use — GitHub: browser-use/browser-use
- Skyvern — GitHub: Skyvern-AI/skyvern
- Aider — aider.chat · GitHub: paul-gauthier/aider
- Goose — GitHub: block/goose
- LangGraph — GitHub: langchain-ai/langgraph
- Google ADK (Agent Development Kit) — google.github.io/adk-docs · GitHub: google/adk-python
Official Framework & Protocol Documentation
- Model Context Protocol (MCP) — modelcontextprotocol.io
- Agent2Agent Protocol (A2A) — developers.googleblog.com · Donated to Linux Foundation Jun 2025
- Microsoft Agent Framework — azure.microsoft.com/blog/introducing-microsoft-agent-framework
- Claude Agent SDK — anthropic.com/engineering/building-agents-with-the-claude-agent-sdk
- OpenAI Operator & Deep Research — Introducing Operator · Introducing Deep Research · Introducing ChatGPT Agent
- Anthropic Computer Use — anthropic.com
- GitHub Copilot Coding Agent — github.blog
Key Blog Posts & Practitioner Resources
- Building Effective Agents (Erik Schluntz & Barry Zhang, Anthropic, Dec 2024) — anthropic.com/research/building-effective-agents
- Simon Willison’s summary: simonwillison.net/2024/Dec/20/building-effective-agents/
- Reference Implementations for Agent Patterns (Anthropic Cookbook) — github.com/anthropics/anthropic-cookbook
- Lessons from 2025 on Agents and Trust (Google Cloud CTO Office, Dec 2025) — cloud.google.com
- Agentic AI, MCP, and Spec-Driven Development (GitHub Blog, Jan 2026) — github.blog
- What We Learned from a Year of Building with LLMs (Eugene Yan et al., 2024) — applied-llms.org
- Model Context Protocol Overview (Simon Willison, Nov 2024) — simonwillison.net/2024/Nov/25/model-context-protocol/
- AI in Production: Frameworks, Protocols, and What Actually Works in 2026 (47 Billion) — 47billion.com/blog/ai-agents-in-production-frameworks-protocols-and-what-actually-works-in-2026/
- Agents 101: How to Work with Coding Agents (Cognition AI) — devin.ai/agents101
Enterprise Products & Evaluations
- MIT AI Agent Index (2025) — aiagentindex.mit.edu
- Companion paper: The 2025 AI Agent Index (Staufer et al., 2026) — arXiv:2602.17753
- Devin — cognition.ai
- Manus AI — manus.im
- Aider Coding Leaderboard — aider.chat/docs/leaderboards
Measurement & Research
- METR: Task Time Horizons as a Measure of Agent Capability (METR, 2025) — metr.org/time-horizons
- OpenAI Harness Engineering — openai.com/index/harness-engineering
- Running Long-Horizon Tasks with Codex (OpenAI) — developers.openai.com/blog/run-long-horizon-tasks-with-codex
Back to Overview → · See Resources → for a curated reading list