2024–2026: The Frontier
Coding agents, agentic products, new protocols, and where the field is now
Overview
2024–2026 witnessed a fundamental shift: LLM agents moved from research prototypes to mainstream products. 2025 will be remembered as the year reasoning models became agents, browser automation went viral, and the infrastructure for multi-agent systems matured into production-ready frameworks.
Key storylines:
- Coding agents became the killer app — SWE-bench progress from ~12% (2024) to 77%+ (2025)
- Computer use went mass market — Anthropic Computer Use, OpenAI Operator, dozens of open-source tools
- Protocol wars — MCP (Anthropic, Nov 2024) and A2A (Google, Apr 2025) competed to become the standard
- Framework consolidation — Microsoft merged AutoGen + Semantic Kernel; LangGraph hit v1.0
- “2025 is the year of the agent” — became the defining claim of every AI company’s roadmap
Coding & Software Engineering Agents
This became the most competitive sub-field of agent research, driven by clear benchmarks and commercial demand.
SWE-bench: The Scoreboard
SWE-bench (Jimenez et al., 2023) measures the ability of agents to resolve real GitHub issues. Progress has been stunning:
Note: SWE-bench Verified (500 tasks, human-validated) became the standard leaderboard from Aug 2024. Earlier entries used SWE-bench Full (2,294 tasks) or Lite (570 tasks).
| Date | System | Benchmark | Score |
|---|---|---|---|
| Mar 2024 | Devin (Cognition AI) | Lite | 13.86% |
| May 2024 | SWE-agent + GPT-4 Turbo (Princeton) | Full | 12.47% |
| Late 2024 | Claude 3.5 Sonnet + scaffolding | Verified | ~49% |
| Oct 2025 | Claude Sonnet 4.5 | Verified | 77.2% (82.0% with parallel compute) |
| Nov 2025 | Gemini 3 Pro + Live-SWE-agent | Verified | 77.4% |
A separate harder benchmark, SWE-bench Pro (Scale Labs), tests generalization on never-before-seen private codebases. GPT-5 scores 23.3% and Claude Opus 4.1 scores 23.1% there (Scale Labs leaderboard) vs. 70%+ on Verified, revealing substantial overfitting to public repos.
Key takeaway: The progress is real but benchmarks are being saturated. Evaluation is moving to harder, less contaminated tasks.
SWE-agent (2024)
Yang et al. (Princeton NLP) · arXiv:2405.15793
The benchmark-setting open-source coding agent. Uses the Agent-Computer Interface (ACI) — a carefully designed shell environment for agents, with tools optimized for code understanding and editing. Achieved 12.47% on SWE-bench Full at release (per the official leaderboard); seminal for the field.
- GitHub: princeton-nlp/SWE-agent — 15k+ stars
OpenHands (formerly OpenDevin)
All-Hands AI · github.com/All-Hands-AI/OpenHands — 45k+ stars
The leading open-source autonomous software agent. Supports web browsing, file editing, shell execution, and code generation. Pluggable LLM backends (Claude, GPT-4, open models). Became the go-to open alternative to proprietary coding agents through 2025.
Devin (Cognition AI, 2024)
Cognition AI · cognition.ai
Launched March 2024 as “the first AI software engineer.” Raised significant venture funding and generated enormous press coverage. Uses a persistent developer environment (browser, terminal, editor). While its initial SWE-bench numbers were later revised, it validated commercial demand for autonomous coding.
Claude Code → Claude Agent SDK (2025)
Anthropic · anthropic.com/engineering/building-agents-with-the-claude-agent-sdk · Sep 2025
Started as Claude Code — an agentic coding tool for the terminal. By September 2025, Anthropic renamed the underlying SDK to the Claude Agent SDK, recognizing it had become a general-purpose agent harness powering deep research, video creation, note-taking, and “almost all of our major agent loops.”
- In March 2026, Anthropic launched Code Review for Claude Code — parallel multi-agent code review dispatching specialized reviewer agents
GitHub Copilot Coding Agent (May 2025)
GitHub · github.blog
GitHub Copilot gained autonomous agent mode, integrating with IDEs (VS Code, Xcode, Eclipse, JetBrains, Visual Studio). Agent mode allows multi-step autonomous task execution within the GitHub ecosystem. Introduced GitHub Agent HQ at Universe 2025 (Oct 2025) — a unified workflow for orchestrating any agent.
Agents 101: How to Work with Coding Agents (Devin / Cognition AI)
Cognition AI · devin.ai/agents101
Cognition’s practical guide for engineers integrating autonomous coding agents into their workflows — distilled from building Devin and observing thousands of users. The framing: “A human paired with an AI assistant can achieve more than any AI alone… turning every engineer into an engineering manager.”
Six key principles:
- Say how you want things done, not just what — specify the approach and architecture upfront; don’t just state the goal
- Tell the agent where to start — point it to the right files, repos, and docs; minimize wasted exploration
- Practice defensive prompting — anticipate where a “junior intern” would get confused; preemptively clarify
- Give access to CI, tests, types, and linters — feedback loops via tools are “the magic” of agents; typed Python > untyped; TypeScript > JavaScript
- Leverage your expertise — human oversight of correctness remains non-negotiable; you own the code
- Delegate immediately — when a side task comes up, delegate to the agent and refocus; agents enable async multi-tasking
Note: the guide references a growing ecosystem of similar agents (OpenAI Codex, Google Jules, Cursor, Claude Code) and integration patterns (Slack @agent, GitHub, Linear/Jira).
Aider
Paul Gauthier · aider.chat · GitHub
The practical human-in-the-loop coding assistant. Works in the terminal, handles git diffs, supports Claude, GPT-4, and local models. Emphasizes transparency and control over full autonomy. Widely used and respected for its pragmatic design. Maintains an LLM coding leaderboard ranking models by real-world coding ability.
Major Agentic Products (2025)
OpenAI Operator + Deep Research + ChatGPT Agent
Operator launched January 2025 — an autonomous browser-use agent powered by the Computer-Using Agent (CUA) model combining GPT-4o vision with RL-trained reasoning. Could book reservations, fill forms, shop online.
Deep Research launched February 2025 — powered by o3, conducts multi-step asynchronous research across the web, synthesizing findings into comprehensive reports. Hours of research in minutes.
In July 2025, these were merged into the ChatGPT Agent — a unified system combining Operator’s browser control, Deep Research’s information synthesis, and ChatGPT’s conversational ability. The first mass-market general-purpose agent.
Anthropic Computer Use (Oct 2024)
Anthropic
Launched October 2024, enabling Claude to directly control a computer: move the mouse, type, take screenshots, interpret the screen. First major LLM provider to offer this natively. Spawned a wave of open-source computer-use implementations.
Manus AI (Mar 2025 → Acquired by Meta, Dec 2025)
Butterfly Effect / Monica.im (Singapore, Chinese-founded) · manus.im · arXiv:2505.02024
Launched March 2025 as a general-purpose autonomous agent capable of completing complex real-world tasks with minimal human intervention — writing and deploying code, conducting research, managing files. Described as a “turning point” in AI development for its fully autonomous operation. Acquired by Meta in December 2025.
Google: ADK + Gemini Agents
Google · google.github.io/adk-docs · github.com/google/adk-python
Google’s Agent Development Kit (ADK) — an open-source Python framework for building, evaluating, and deploying AI agents. Model-agnostic but optimized for Gemini. Supports multi-agent workflows, tool integration, and deployment to Vertex AI. Introduced alongside Gemini 2.0’s native multimodal agent capabilities and Project Astra (real-time multimodal assistant).
New Protocols: The Infrastructure Layer
2025 saw the emergence of open protocols for agent interoperability — a sign that the field was maturing beyond one-off integrations.
Model Context Protocol (MCP) — Anthropic, Nov 2024
Anthropic · modelcontextprotocol.io
An open standard for connecting AI agents to external tools and data sources. Defines a client-server architecture: MCP servers expose tools/resources; agents connect as MCP clients. Quickly adopted across the industry — by early 2025, hundreds of MCP servers existed (GitHub, Slack, databases, filesystems, browsers).
The analogy: MCP is to AI agents what HTTP is to web browsers. OpenAI, Google, and all major framework builders endorsed it.
- Microsoft launched Playwright MCP (March 2025) — browser automation via MCP
- Key blog: Simon Willison on MCP
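The client-server exchange can be sketched with plain JSON-RPC 2.0 messages, the wire format MCP builds on. The `tools/list` and `tools/call` method names follow the published spec; the `get_weather` tool and the in-process dispatcher below are invented for illustration:

```python
# An illustrative MCP-style exchange over JSON-RPC 2.0. The "tools/list" and
# "tools/call" method names follow the published spec; the get_weather tool
# and this in-process dispatcher are made up for illustration.

def list_tools_request(req_id: int) -> dict:
    """Client asks the server which tools it exposes."""
    return {"jsonrpc": "2.0", "id": req_id, "method": "tools/list"}

def call_tool_request(req_id: int, name: str, arguments: dict) -> dict:
    """Client invokes one tool by name with JSON arguments."""
    return {
        "jsonrpc": "2.0",
        "id": req_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }

# A toy in-process "server": a registry of callables the agent may invoke.
TOOLS = {"get_weather": lambda args: f"Sunny in {args['city']}"}

def handle(request: dict) -> dict:
    """Dispatch one request the way an MCP server would, as a JSON-RPC response."""
    if request["method"] == "tools/list":
        result = {"tools": [{"name": n} for n in TOOLS]}
    elif request["method"] == "tools/call":
        p = request["params"]
        text = TOOLS[p["name"]](p["arguments"])
        result = {"content": [{"type": "text", "text": text}]}
    else:
        return {"jsonrpc": "2.0", "id": request["id"],
                "error": {"code": -32601, "message": "Method not found"}}
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}

resp = handle(call_tool_request(1, "get_weather", {"city": "Paris"}))
print(resp["result"]["content"][0]["text"])  # → Sunny in Paris
```

The real protocol adds capability negotiation, resources, and prompts on top, but the tool discovery/invocation loop above is the core of what the hundreds of MCP servers implement.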
Agent2Agent Protocol (A2A) — Google, Apr 2025
Google · developers.googleblog.com · Donated to Linux Foundation Jun 2025
Announced April 2025 as a complement to MCP. While MCP handles agent-to-tool connections, A2A handles agent-to-agent communication across different vendors and frameworks. Donated to the Linux Foundation in June 2025 for neutral governance.
- Enables: cross-vendor agent delegation, trusted multi-agent pipelines, interoperability at scale
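In the same illustrative spirit, cross-vendor delegation can be sketched as a task envelope handed to a remote agent that advertises its skills. The field names below (`task_id`, `skill`, `status`) are simplified stand-ins, not A2A's normative schema:

```python
import uuid

# A hypothetical sketch of A2A-style delegation: one agent wraps a request
# as a task and hands it to another vendor's agent. Field names are
# simplified for illustration, not the normative A2A schema.

def make_task(skill: str, payload: dict) -> dict:
    """A client agent addresses a task to a remote agent's advertised skill."""
    return {"task_id": str(uuid.uuid4()), "skill": skill,
            "payload": payload, "status": "submitted"}

class RemoteAgent:
    """Stand-in for another vendor's agent, with a skill registry."""
    def __init__(self, skills):
        self.skills = skills  # skill name -> handler

    def handle(self, task: dict) -> dict:
        handler = self.skills.get(task["skill"])
        if handler is None:
            return {**task, "status": "rejected"}
        return {**task, "status": "completed", "result": handler(task["payload"])}

summarizer = RemoteAgent({"summarize": lambda p: p["text"][:20] + "..."})
done = summarizer.handle(
    make_task("summarize", {"text": "A2A lets agents from different vendors cooperate."}))
print(done["status"], done["result"])
```

The actual protocol layers discovery (agent cards), streaming updates, and authentication on top of this submit/complete lifecycle.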
Framework Consolidation
Microsoft Agent Framework (Oct 2025)
Microsoft · azure.microsoft.com/blog/introducing-microsoft-agent-framework
Microsoft merged AutoGen (research multi-agent framework) and Semantic Kernel (enterprise SDK) into a single Microsoft Agent Framework — public preview October 2025, broader announcement December 2025, GA Q1 2026. Supports Python and .NET. Represents Microsoft’s bet on production-ready agentic AI for enterprise.
LangGraph v1.0 (Nov 2025)
LangChain · github.com/langchain-ai/langgraph
LangGraph hit production-ready v1.0 in November 2025, becoming the dominant framework for complex stateful agent workflows in production. Used by Klarna, Replit, Elastic, and many others. Positioned as the “right tool” for agents needing fine-grained control over state and flow.
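The "fine-grained control over state and flow" idea reduces to a small pattern: nodes are functions over a shared state dict, and a routing function decides which node runs next. A minimal plain-Python sketch of that pattern follows (this is the underlying idea, not LangGraph's actual API):

```python
# A plain-Python sketch of the graph-of-nodes pattern LangGraph popularized:
# nodes transform a shared state dict; edges (here, a routing function) pick
# the next node, possibly conditionally. Node names and the toy acceptance
# rule are invented for illustration.

def draft(state):
    state["text"] = f"draft of: {state['topic']}"
    return state

def review(state):
    state["approved"] = state["revisions"] >= 1  # toy acceptance rule
    return state

def revise(state):
    state["revisions"] += 1
    state["text"] += " (revised)"
    return state

NODES = {"draft": draft, "review": review, "revise": revise}

def route(current, state):
    """Conditional edges: review loops back to revise until approved."""
    if current == "draft":
        return "review"
    if current == "review":
        return None if state["approved"] else "revise"
    return "review"  # after revise, review again

def run(state, start="draft"):
    node = start
    while node is not None:
        state = NODES[node](state)
        node = route(node, state)
    return state

final = run({"topic": "agents", "revisions": 0})
print(final["text"])  # → draft of: agents (revised)
```

LangGraph adds what this sketch lacks for production use: persistence of the state between runs, checkpointing, streaming, and human-in-the-loop interrupts.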
Goose (Jan 2025)
Block (Jack Dorsey’s company) · github.com/block/goose
Open-source, extensible AI agent that goes beyond code suggestions — installs, executes, edits, and tests with any LLM. Launched January 28, 2025. Popular in open-source circles for its extensibility and independence from proprietary ecosystems.
Browser & Computer Use Ecosystem
browser-use
github.com/browser-use/browser-use — 78,000+ stars
The #1 open-source browser automation platform. Makes websites accessible for AI agents, combining LLMs with visual recognition for real-time browser control. Exploded in 2025 following Anthropic Computer Use’s release, providing an open-source implementation for any model.
Skyvern
AI agent for browser workflows using computer vision and LLMs — no CSS selectors or DOM knowledge needed. Visual workflow builder. Inspired by BabyAGI/AutoGPT but grounded in visual understanding.
Stagehand (Browserbase)
TypeScript-first browser automation framework for AI agents. Pairs well with browser-use (Python). Strong developer adoption in 2025.
Key Blog Posts & Practitioner Resources (2024–2026)
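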
Anthropic: Building Effective Agents (Dec 2024)
Erik Schluntz & Barry Zhang (Anthropic) · anthropic.com/research/building-effective-agents
The most practical and widely-cited 2024 post on agent design. Defines the crucial distinction:
- Workflows = multiple LLMs orchestrated with pre-defined patterns
- Agents = LLMs that dynamically direct their own processes and tool usage
Describes 5 core workflow patterns (prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer) and when to build truly autonomous agents vs. structured workflows. Essential reading.
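Two of the five patterns, prompt chaining and routing, can be sketched with a deterministic stub standing in for the model call; `fake_llm` and the handler names below are invented for illustration:

```python
# Prompt chaining and routing, two of the five workflow patterns, with a
# deterministic stub in place of a real LLM so the control flow is visible.
# fake_llm and the handler names are invented for illustration.

def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a model call."""
    if prompt.startswith("classify:"):
        return "billing" if "invoice" in prompt else "general"
    return f"answer({prompt})"

# Pattern 1: prompt chaining. Each step's output feeds the next prompt.
def chain(question: str) -> str:
    outline = fake_llm(f"outline: {question}")
    return fake_llm(f"expand: {outline}")

# Pattern 2: routing. Classify first, then dispatch to a specialized prompt.
HANDLERS = {
    "billing": lambda q: fake_llm(f"billing expert: {q}"),
    "general": lambda q: fake_llm(f"general helper: {q}"),
}

def route(question: str) -> str:
    label = fake_llm(f"classify: {question}")
    return HANDLERS[label](question)

print(chain("what is MCP?"))
print(route("where is my invoice?"))
```

The remaining patterns (parallelization, orchestrator-workers, evaluator-optimizer) compose the same primitive: fixed code deciding when and how the model is called, versus an agent deciding for itself.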
Google Cloud: Lessons from 2025 on Agents and Trust (Dec 2025)
Google Cloud CTO Office · cloud.google.com
Retrospective from Google’s enterprise deployments. Key insight: agents need “agent undo stacks” — idempotent tools and checkpointing that trigger safe rollbacks on failure. “AI grew up and got a job in 2025.”
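The "agent undo stack" idea can be sketched as a stack of inverse actions plus checkpoint markers; the key-value "database" and the `set_key` tool below are hypothetical, chosen only to make the rollback mechanics concrete:

```python
# A toy "agent undo stack": every state-changing tool call records an
# inverse action, and on failure the agent rolls back to the last
# checkpoint. The key-value "database" and set_key tool are hypothetical.

class UndoStack:
    def __init__(self):
        self._undos = []        # list of (description, inverse_fn)
        self._checkpoints = []  # indices into _undos

    def record(self, description, inverse_fn):
        self._undos.append((description, inverse_fn))

    def checkpoint(self):
        self._checkpoints.append(len(self._undos))

    def rollback(self):
        """Undo everything since the most recent checkpoint, newest first."""
        mark = self._checkpoints.pop() if self._checkpoints else 0
        while len(self._undos) > mark:
            _desc, inverse = self._undos.pop()
            inverse()

db = {"plan": "free"}
stack = UndoStack()
stack.checkpoint()

def set_key(key, value):
    old = db.get(key)
    db[key] = value
    # Inverse restores the previous value, or deletes the key if it was new.
    stack.record(f"set {key}",
                 lambda: db.__setitem__(key, old) if old is not None
                 else db.pop(key, None))

set_key("plan", "pro")
set_key("seats", 10)
stack.rollback()  # simulate a downstream failure: revert both writes
print(db)  # → {'plan': 'free'}
```

The idempotency requirement from the post matters here: an inverse that is safe to apply more than once lets a crashed rollback itself be retried.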
Agentic AI, MCP, and Spec-Driven Development (Jan 2026)
GitHub Blog · github.blog
GitHub’s summary of the most-read technical content of 2025. The top themes: agentic coding, MCP adoption, and spec-driven development (writing AGENTS.md files to guide coding agents).
What We Learned from a Year of Building with LLMs (2024)
Eugene Yan, Bryan Bischof, Charles Frye, Hamel Husain et al. · applied-llms.org
Practitioner wisdom from engineers who shipped production LLM systems. Heavy on agent reliability, evals, and avoiding common failure modes. Widely cited as the most actionable guide of 2024.
AI in Production: Frameworks, Protocols, and What Actually Works in 2026
47 Billion · 47billion.com
Four months of hands-on production building. Verdict: “the agent landscape in 2025 is simultaneously more capable and more fragile than the marketing suggests.”
New Papers: 2025 Surveys & Architecture Research
Agentic AI: A Comprehensive Survey of Architectures, Applications, and Future Directions (Oct 2025)
Abou Ali & Dornaika · arXiv:2510.25445
End-to-end survey covering January 2018–March 2025. Covers classical symbolic and modern LLM-orchestrated frameworks.
Agentic AI Frameworks: Architectures, Protocols, and Design Challenges (Aug 2025)
Derouiche et al. · arXiv:2508.10146
Focused survey on the architectural and protocol layer — MCP, A2A, orchestration patterns.
Architectures for Building Agentic AI (Dec 2025)
Proposes a practical taxonomy: tool-using agents, memory-augmented agents, planning/self-improvement agents, multi-agent systems, and embodied/web agents. Analyzes how each reshapes system design.
AI Agents vs. Agentic AI: A Conceptual Taxonomy (May 2025)
Distinguishes AI agents (single-purpose, reactive) from Agentic AI (multi-agent collaboration, dynamic decomposition, persistent memory, coordinated autonomy). Useful conceptual clarification.
Towards a Science of Scaling Agent Systems (Dec 2025)
Kim et al. · arXiv:2512.08296
Examines how agent system performance scales with the number of agents, compute, and interaction patterns.
The 2025 AI Agent Index (Feb 2026)
Documents technical and safety features of deployed agentic AI systems across product overview, company accountability, technical capabilities, autonomy & control, ecosystem interaction, and safety evaluation.
AI Agent Systems: Architectures, Applications, and Evaluation (Jan 2026)
Synthesizes emerging agent architectures for reasoning, planning, tool use, and deployment. Analyzes system-level trade-offs: autonomy vs. controllability, latency vs. reliability, capability vs. safety.
MIT AI Agent Index (2025)
MIT · aiagentindex.mit.edu
Systematic evaluation of 30 prominent AI agents based on publicly available information. Documents origins, design, capabilities, ecosystem, and safety features. Key findings:
- 24/30 agents were released or received major agentic updates in 2024-2025
- Autonomy split: Chat agents at Level 1-3; browser agents at Level 4-5; enterprise agents from Level 1-2 (design) to Level 3-5 (deployment)
- Transparency gap: Of 13 frontier-autonomy agents, only 4 disclose any agentic safety evaluations. Developers share far more about capabilities than safety.
- Foundation model concentration: Nearly all depend on GPT, Claude, or Gemini — structural ecosystem risk
- No web conduct standards: Some agents explicitly designed to bypass anti-bot protections
- Geographic divergence: 21/30 US-based, 5/30 Chinese, markedly different safety frameworks
Companion paper: arXiv:2602.17753
Safety — prompt injection, trust hierarchies, reversibility, the transparency gap — became a first-class concern as agents took real-world actions. Full coverage in Safety & Alignment →
METR: Task Time Horizons as the New Benchmark
METR (Model Evaluation & Threat Research) · metr.org/time-horizons
METR proposed measuring agents not by task accuracy but by task time horizon — the longest task an agent can complete with 50% reliability. This reframing captures something accuracy-based benchmarks miss: can agents sustain coherent action over minutes, hours, or days?
Key findings (March 2025):
- Claude 3.7 Sonnet: ~50-minute time horizon (50%-task-completion time horizon, per arXiv:2503.14499)
- GPT-5 (later 2025): ~2 hours 17 minutes
- The time horizon has been doubling approximately every 7 months since 2019
- Extrapolation: agents capable of multi-day tasks within a few years
This metric directly maps to real-world utility. A 1-hour agent can do a coding task or research summary. A 24-hour agent could run a full experiment pipeline or manage a complex project.
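The doubling claim turns into simple arithmetic. Assuming the ~7-month doubling period and the ~137-minute (2h17m) GPT-5 horizon quoted above, a rough sketch of the extrapolation:

```python
import math

# Back-of-envelope extrapolation of METR's time-horizon trend, using the
# ~7-month doubling period and the ~137-minute GPT-5 horizon cited above.

DOUBLING_MONTHS = 7.0

def horizon_after(months: float, baseline_minutes: float) -> float:
    """Projected 50%-reliability horizon after `months` of the trend."""
    return baseline_minutes * 2 ** (months / DOUBLING_MONTHS)

def months_to_reach(target_minutes: float, baseline_minutes: float) -> float:
    """Months of doubling needed to grow baseline to target."""
    return DOUBLING_MONTHS * math.log2(target_minutes / baseline_minutes)

# From a ~137-minute horizon to a 24-hour task: about two years if the
# trend holds (a big "if" — this is extrapolation, not prediction).
print(round(months_to_reach(24 * 60, 137), 1), "months")
```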
See Science & Research Agents → for more.
OpenAI Harness Engineering & Long-Horizon Codex (2025)
OpenAI Harness Engineering
OpenAI · openai.com/index/harness-engineering
OpenAI’s own internal experience with agentic software development. A team of 3 (growing to 7) engineers built a product with zero manually written lines of code and over 1M lines of generated code in ~5 months, with agents handling architecture, implementation, testing, and PR review.
Documents the workflow, failure modes, and lessons learned from living at the frontier of AI-assisted development. Key insight: at this level, the human’s job shifts from writing code to writing specs and reviewing agent outputs.
Running Long-Horizon Tasks with Codex
OpenAI · developers.openai.com/blog/run-long-horizon-tasks-with-codex
Guidance on using OpenAI Codex for extended, multi-step software tasks. Covers context management, checkpointing, failure recovery, and task decomposition for tasks that span hours rather than minutes.
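The checkpointing and failure-recovery advice can be sketched generically (this is not Codex-specific tooling, just the pattern): persist each completed step so a crashed or restarted run resumes where it left off rather than redoing work.

```python
import json
import os
import tempfile

# A generic checkpoint/resume loop in the spirit of the guidance above.
# Step names and the JSON progress file are invented for illustration.

def run_with_checkpoints(steps, checkpoint_path):
    """Run (name, fn) steps in order, persisting progress after each one."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)
    for name, fn in steps:
        if name in done:
            continue  # finished in an earlier run; skip on resume
        fn()
        done.append(name)
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)  # persist progress after every step
    return done

log = []
steps = [("fetch", lambda: log.append("fetch")),
         ("build", lambda: log.append("build")),
         ("test", lambda: log.append("test"))]

path = os.path.join(tempfile.mkdtemp(), "progress.json")
run_with_checkpoints(steps, path)          # first run: all three steps execute
log.clear()
print(run_with_checkpoints(steps, path))   # resumed run: nothing re-executes
```

For real agent tasks the "steps" come from task decomposition, and the checkpoint would also capture enough context (files touched, decisions made) to rebuild the agent's working state, not just a list of names.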
From Vibe Coding to Agentic Engineering
GLM-5: From Vibe Coding to Agentic Engineering (2026)
GLM-5-Team (Zhipu AI & Tsinghua University) · arXiv:2602.15763
The GLM team (China’s Zhipu AI) traces the transition from informal prompt-based “vibe coding” to systematic agentic software engineering — with structured verification loops, multi-agent coordination, and principled task decomposition. Their GLM series is competitive with international frontier models on coding benchmarks, making this a significant non-Western perspective on the agentic coding frontier.
What’s Next
The open questions for 2026 and beyond:
- Long-horizon reliability — agents still fail on tasks requiring 100+ sequential steps
- Trust & verification — how do you audit what an autonomous agent did?
- Cost — complex agent workflows burn through API credits fast
- Benchmark saturation — as SWE-bench fills up, what’s the next hard benchmark?
- Agent interoperability at scale — MCP + A2A adoption race
- Safety for real-world deployment — reversibility, sandboxing, human oversight patterns
References
Papers & Benchmarks
- SWE-bench: Resolving Real-World GitHub Issues (Jimenez et al., 2023) — arXiv:2310.06770
- SWE-agent: Agent-Computer Interface Enables Automated Software Engineering (Yang et al., 2024, Princeton NLP) — arXiv:2405.15793
- Agentic AI: A Comprehensive Survey of Architectures, Applications, and Future Directions (Abou Ali & Dornaika, 2025) — arXiv:2510.25445
- Agentic AI Frameworks: Architectures, Protocols, and Design Challenges (Derouiche et al., 2025) — arXiv:2508.10146
- Architectures for Building Agentic AI (2025) — arXiv:2512.09458
- AI Agents vs. Agentic AI: A Conceptual Taxonomy (2025) — arXiv:2505.10468
- Towards a Science of Scaling Agent Systems (Kim et al., 2025) — arXiv:2512.08296
- The 2025 AI Agent Index (2026) — arXiv:2602.17753
- AI Agent Systems: Architectures, Applications, and Evaluation (2026) — arXiv:2601.01743
- GLM-5: From Vibe Coding to Agentic Engineering (GLM-5-Team, Zhipu AI & Tsinghua University, 2026) — arXiv:2602.15763
- From Mind to Machine: The Rise of Manus AI as a Fully Autonomous Digital Agent (Yang et al., 2025; independent academic overview) — arXiv:2505.02024
Open-Source Projects & Frameworks
- SWE-agent — GitHub: princeton-nlp/SWE-agent
- OpenHands (formerly OpenDevin) — GitHub: All-Hands-AI/OpenHands
- browser-use — GitHub: browser-use/browser-use
- Skyvern — GitHub: Skyvern-AI/skyvern
- Aider — aider.chat · GitHub: paul-gauthier/aider
- Goose — GitHub: block/goose
- LangGraph — GitHub: langchain-ai/langgraph
- Google ADK (Agent Development Kit) — google.github.io/adk-docs · GitHub: google/adk-python
Official Framework & Protocol Documentation
- Model Context Protocol (MCP) — modelcontextprotocol.io
- Agent2Agent Protocol (A2A) — developers.googleblog.com · Donated to Linux Foundation Jun 2025
- Microsoft Agent Framework — azure.microsoft.com/blog/introducing-microsoft-agent-framework
- Claude Agent SDK — anthropic.com/engineering/building-agents-with-the-claude-agent-sdk
- OpenAI Operator & Deep Research — Introducing Operator · Introducing Deep Research · Introducing ChatGPT Agent
- Anthropic Computer Use — anthropic.com
- GitHub Copilot Coding Agent — github.blog
Key Blog Posts & Practitioner Resources
- Building Effective Agents (Erik Schluntz & Barry Zhang, Anthropic, Dec 2024) — anthropic.com/research/building-effective-agents
- Simon Willison’s summary: simonwillison.net/2024/Dec/20/building-effective-agents/
- Reference Implementations for Agent Patterns (Anthropic Cookbook) — github.com/anthropics/anthropic-cookbook
- Lessons from 2025 on Agents and Trust (Google Cloud CTO Office, Dec 2025) — cloud.google.com
- Agentic AI, MCP, and Spec-Driven Development (GitHub Blog, Jan 2026) — github.blog
- What We Learned from a Year of Building with LLMs (Eugene Yan et al., 2024) — applied-llms.org
- Model Context Protocol Overview (Simon Willison, Nov 2024) — simonwillison.net/2024/Nov/25/model-context-protocol/
- AI in Production: Frameworks, Protocols, and What Actually Works in 2026 (47 Billion) — 47billion.com/blog/ai-agents-in-production-frameworks-protocols-and-what-actually-works-in-2026/
- Agents 101: How to Work with Coding Agents (Cognition AI) — devin.ai/agents101
Enterprise Products & Evaluations
- MIT AI Agent Index (2025) — aiagentindex.mit.edu
- Companion paper: The 2025 AI Agent Index (Staufer et al., 2026) — arXiv:2602.17753
- Devin — cognition.ai
- Manus AI — manus.im
- Aider Coding Leaderboard — aider.chat/docs/leaderboards
Measurement & Research
- METR: Task Time Horizons as a Measure of Agent Capability (METR, 2025) — metr.org/time-horizons
- OpenAI Harness Engineering — openai.com/index/harness-engineering
- Running Long-Horizon Tasks with Codex (OpenAI) — developers.openai.com/blog/run-long-horizon-tasks-with-codex
Back to Overview → · See Resources → for a curated reading list