Field Timeline
Key milestones in LLM agent research and deployment, 2021–2026
A consolidated chronology of the major papers, releases, frameworks, benchmarks, products, and protocols that shaped the field.
2021
| Date | Event | Significance |
|---|---|---|
| Dec 2021 | WebGPT (OpenAI) | First major browser-using LLM agent; RLHF for browsing behavior |
2022
| Date | Event | Significance |
|---|---|---|
| Feb 2022 | Chain-of-Thought Prompting (Wei et al.) | Foundation of much later LLM reasoning work; step-by-step exemplars elicit multi-step reasoning in sufficiently large models |
| Apr 2022 | SayCan (Google) | Embodied agent: LLM proposes actions, affordance model filters by physical feasibility |
| May 2022 | MRKL Systems (Karpas et al.) | LLM as orchestrator of specialized symbolic modules — ancestor of tool-calling |
| May 2022 | Zero-Shot CoT (“Let’s think step by step”) | Remarkably simple — adding one sentence enables multi-step reasoning |
| Jun 2022 | Self-Consistency (Wang et al.) | Sample many reasoning paths, majority vote; significant reliability boost |
| Jul 2022 | Inner Monologue (Huang et al.) | Robots form natural-language inner monologues for planning with environment feedback |
| Oct 2022 | ReAct (Yao et al.) | The canonical agent loop: interleaved Thought / Action / Observation |
| Oct 2022 | Decomposed Prompting (Khot et al.) | Modular sub-prompt decomposition |
2023 (Early)
| Date | Event | Significance |
|---|---|---|
| Jan 2023 | Least-to-Most Prompting (Zhou et al.) | Sequential subproblem decomposition; better generalization than CoT |
| Feb 2023 | Toolformer (Schick et al.) | Self-supervised tool learning — LLMs teach themselves to call APIs |
| Mar 2023 | CAMEL (Li et al.) | First major role-playing communicative multi-agent paper |
| Mar 2023 | HuggingGPT / JARVIS (Shen et al.) | ChatGPT orchestrates hundreds of Hugging Face models |
| Mar 2023 | AutoGPT goes viral | Explosive public interest in autonomous agents; exposed failure modes |
| Mar 2023 | GPT-4 released | Capability threshold enabling reliable agent reasoning |
| Apr 2023 | Generative Agents (Park et al.) | 25 agents in a simulated town; foundational episodic memory architecture |
| Apr 2023 | BabyAGI (Nakajima) | Minimalist task management agent; viral open-source project |
| Apr 2023 | Reflexion (Shinn et al.) | Verbal RL: agents reflect on failures; HumanEval 80%→91% (GPT-4 baseline to Reflexion) |
| May 2023 | Tree of Thoughts (Yao et al.) | Branching search over reasoning steps; Game of 24: 4%→74% |
| May 2023 | Reasoning via Planning / RAP (Hao et al.) | LLM as world model for MCTS-based planning |
| May 2023 | Voyager (Wang et al.) | GPT-4 in Minecraft with lifelong skill library accumulation |
| May 2023 | Self-Refine (Madaan et al.) | Iterative self-critique without additional training |
| May 2023 | Gorilla LLM (Patil et al.) | Fine-tuned LLM for reliable API calling |
| May 2023 | Large Language Models as Tool Makers (Cai et al.) | Agents create reusable tools |
2023 (Mid–Late)
| Date | Event | Significance |
|---|---|---|
| Jun 2023 | OpenAI function calling | Industry-standard tool use API; enabled the framework ecosystem |
| Jun 2023 | LLM Powered Autonomous Agents — Lilian Weng | Most-cited blog post introduction to the field |
| Jun 2023 | CRITIC (Gou et al.) | Tool-grounded self-critique — verification via web search and code execution |
| Jun 2023 | RestGPT | REST API orchestration |
| Jul 2023 | ChatDev (Qian et al.) | Multi-agent software development team |
| Jul 2023 | ToolLLM (Qin et al.) | 16,000+ REST API coverage |
| Jul 2023 | WebArena (Zhou et al.) | 5-site web benchmark (+ Wikipedia); GPT-4 baseline ~14% |
| Aug 2023 | MetaGPT (Hong et al.) | SOP-driven multi-agent software engineering |
| Aug 2023 | AutoGen (Wu et al., Microsoft) | Conversable agents framework |
| Aug 2023 | AgentVerse (Chen et al.) | Multi-agent dynamics study; first systematic failure mode analysis |
| Aug 2023 | AgentBench (Liu et al.) | 8-task multi-domain agent benchmark |
| Aug 2023 | A Survey on Large Language Model based Autonomous Agents (Wang et al., 2308.11432) | First comprehensive survey; 200+ papers, Profile/Memory/Planning/Action framework |
| Sep 2023 | The Rise and Potential of LLM-Based Agents (Xi et al., 2309.07864) | 86-page survey; agent societies vision |
| Sep 2023 | Cognitive Architectures for Language Agents (Sumers et al.) | ACT-R/SOAR grounding for agent memory |
| Oct 2023 | Reflexion final paper (NeurIPS) | |
| Oct 2023 | MemGPT (Packer et al.) | LLM as OS: hierarchical memory paging (virtual context management) gives the appearance of unbounded context |
| Oct 2023 | Step-Back Prompting | Abstract before solving; +7% MMLU Physics, +27% TimeQA (PaLM-2L) |
| Oct 2023 | Self-RAG (Asai et al.) | Selective retrieval with self-critique |
| Oct 2023 | SWE-bench (Jimenez et al.) | GitHub issue resolution benchmark; defines the coding agent race |
| Oct 2023 | GAIA: A Benchmark for General AI Assistants (Mialon et al.) | General AI assistant tasks; humans 92%, GPT-4 with plugins ~15% |
| Nov 2023 | DyLAN | Dynamic agent selection per reasoning step |
| Dec 2023 | CogAgent, AppAgent | Specialist GUI models for mobile/desktop |
2024 (Early)
| Date | Event | Significance |
|---|---|---|
| Jan 2024 | AgentScope (Alibaba) | Production-oriented multi-agent platform |
| Feb 2024 | CrewAI released | Role-based agent framework; high-level crew abstraction |
| Feb 2024 | ReadAgent (Google) | Gist memory extending effective context 3.5–20× |
| Feb 2024 | CodeAct (Wang et al.) | Code as unified action space; more expressive than discrete tool calls |
| Feb 2024 | AnyTool | Hierarchical API selection from thousands of options |
| Feb 2024 | OS-Copilot | General computer control agent |
| Mar 2024 | Devin (Cognition AI) | “First AI software engineer”; validates commercial demand; initial SWE-bench claims later revised |
| Apr 2024 | OSWorld (Xie et al.) | Desktop GUI tasks across real apps; best baseline agent ~12% vs. humans ~72% |
| Apr 2024 | SWE-bench Pro (Scale) | Harder private-codebase variant to combat benchmark overfitting |
| May 2024 | SWE-agent (Yang et al., Princeton) | Open-source coding agent; ~12% on SWE-bench; ACI design |
2024 (Mid–Late)
| Date | Event | Significance |
|---|---|---|
| Jun 2024 | LangGraph gains production adoption | Graph-based stateful agent workflows |
| Aug 2024 | Plaat et al. survey starts circulating | Reason–Act–Interact taxonomy; published JAIR Dec 2025 |
| Aug 2024 | Scaling LLM Test-Time Compute (Snell et al.) | Targeted compute scaling can beat a larger model |
| Oct 2024 | Anthropic Computer Use | First major LLM provider with native computer control |
| Oct 2024 | OpenAI Swarm (experimental) | Minimal multi-agent handoff framework |
| Nov 2024 | MCP (Model Context Protocol) (Anthropic) | Open standard: universal tool/data connection for agents |
| Dec 2024 | Building Effective Agents (Anthropic) | Defining practitioner post; workflows vs. agents distinction; 5 workflow patterns |
2025
| Date | Event | Significance |
|---|---|---|
| Jan 2025 | DeepSeek-R1 | Open RL-trained reasoning model comparable to OpenAI o1 on math and reasoning benchmarks; “aha moments” emerge from GRPO training (arXiv:2501.12948) |
| Jan 2025 | OpenAI Operator launch | Browser-use agent; CUA (Computer-Using Agent) model |
| Jan 2025 | Goose (Block / Jack Dorsey) | Open-source agent framework; extensible, LLM-agnostic |
| Feb 2025 | OpenAI Deep Research | o3-powered web research agent; hours of research in minutes |
| Mar 2025 | Manus AI launch | General-purpose autonomous agent; described as “turning point”; acquired by Meta Dec 2025 |
| Mar 2025 | METR time horizons paper | 50%-success task time horizon doubling roughly every 7 months; Claude 3.7 Sonnet at ~1hr horizon |
| Apr 2025 | Google A2A (Agent2Agent) protocol | Agent-to-agent communication standard; donated to Linux Foundation Jun 2025 |
| Apr 2025 | PaperCoder (Seo et al., ICLR 2026) | Multi-agent framework: ML papers → working code repositories |
| May 2025 | GitHub Copilot Coding Agent | Autonomous coding in VS Code, Xcode, JetBrains, Eclipse |
| Jun 2025 | A2A → Linux Foundation | Neutral governance for agent interoperability |
| Jul 2025 | ChatGPT Agent | Operator + Deep Research merged into unified general-purpose agent |
| Sep 2025 | Claude Agent SDK | Claude Code generalized to full agent harness |
| Oct 2025 | Microsoft Agent Framework | AutoGen + Semantic Kernel merged; production-ready |
| Oct 2025 | GitHub Agent HQ (Universe 2025) | Unified agent orchestration within GitHub |
| Nov 2025 | LangGraph v1.0 | Production-ready release; enterprise adoption accelerates |
| Nov 2025 | Gemini 3 Pro + Live SWE-agent: 77.4% | Major SWE-bench Verified milestone |
| Dec 2025 | Meta acquires Manus AI | |
| Dec 2025 | Plaat et al. published in JAIR | Peer-reviewed survey; Reason–Act–Interact taxonomy |
| Dec 2025 | Google Cloud “Lessons from 2025” | Production retrospective; agent undo stacks, reversibility as design principle |
2026 (to March)
| Date | Event | Significance |
|---|---|---|
| Jan 2026 | Multiple 2025 retrospectives | Industry-wide reflection on agent deployment learnings |
| Feb 2026 | MIT AI Agent Index published | 30 agents documented; transparency gap exposed; autonomy levels mapped |
| Mar 2026 | Anthropic Code Review for Claude Code | Parallel multi-agent PR review; multi-agent coding goes mainstream |
| Mar 2026 | This survey compiled | |
References
Foundational Papers (2021–2022)
- WebGPT: Browser-assisted question-answering with human feedback (Nakano et al., 2021) — arXiv:2112.09332
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022) — arXiv:2201.11903
- MRKL Systems: A modular, neuro-symbolic architecture (Karpas et al., 2022) — arXiv:2205.00445
- Large Language Models are Zero-Shot Reasoners (Kojima et al., 2022) — arXiv:2205.11916
- Self-Consistency Improves Chain of Thought Reasoning in Language Models (Wang et al., 2022) — arXiv:2203.11171
- SayCan: Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (Ahn et al., 2022) — arXiv:2204.01691
- Inner Monologue: Embodied Reasoning through Planning with Language Models (Huang et al., 2022) — arXiv:2207.05608
- ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2023) — arXiv:2210.03629
- Decomposed Prompting: A Modular Approach for Solving Complex Tasks (Khot et al., 2022) — arXiv:2210.02406
Early 2023 Breakthroughs
- Least-to-Most Prompting Enables Complex Reasoning in Large Language Models (Zhou et al., 2023) — arXiv:2205.10625
- Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al., 2023) — arXiv:2302.04761
- CAMEL: Communicative Agents for “Mind” Exploration of Large Language Model Society (Li et al., 2023) — arXiv:2303.17760
- HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face (Shen et al., 2023) — arXiv:2303.17580
- Reflexion: Language Agents with Verbal Reinforcement Learning (Shinn et al., 2023) — arXiv:2303.11366
- Generative Agents: Interactive Simulacra of Human Behavior (Park et al., 2023) — arXiv:2304.03442
- BabyAGI (Nakajima, 2023) — GitHub Repository
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models (Yao et al., 2023) — arXiv:2305.10601
- Reasoning with Language Model is Planning with World Model (RAP; Hao et al., 2023) — arXiv:2305.14992
- Voyager: An Open-Ended Embodied Agent with Large Language Models (Wang et al., 2023) — arXiv:2305.16291
- Self-Refine: Iterative Refinement with Self-Feedback (Madaan et al., 2023) — arXiv:2303.17651
- Gorilla: Large Language Model Connected with Massive APIs (Patil et al., 2023) — arXiv:2305.15334
- Large Language Models as Tool Makers (Cai et al., 2023) — arXiv:2305.17126
Mid–Late 2023 Frameworks & Systems
- CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing (Gou et al., 2023) — arXiv:2305.11738
- RestGPT: Connecting Large Language Models with Real-World RESTful APIs (Song et al., 2023) — arXiv:2306.06624
- ChatDev: Communicative Agents for Software Development (Qian et al., 2023) — arXiv:2307.07924
- ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs (Qin et al., 2023) — arXiv:2307.16789
- WebArena: A Realistic Web Environment for Building Autonomous Agents (Zhou et al., 2023) — arXiv:2307.13854
- MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework (Hong et al., 2023) — arXiv:2308.00352
- AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation (Wu et al., 2023) — arXiv:2308.08155
- AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors in Agents (Chen et al., 2023) — arXiv:2308.10848
- A Survey on Large Language Model based Autonomous Agents (Wang et al., 2023) — arXiv:2308.11432
- AgentBench: Evaluating LLMs as Agents (Liu et al., 2023) — arXiv:2308.03688
- The Rise and Potential of Large Language Model Based Agents: A Survey (Xi et al., 2023) — arXiv:2309.07864
- Cognitive Architectures for Language Agents (Sumers et al., 2023) — arXiv:2309.02427
- MemGPT: Towards LLMs as Operating Systems (Packer et al., 2023) — arXiv:2310.08560
- Step-Back Prompting: Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models (Zheng et al., 2023) — arXiv:2310.06117
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (Asai et al., 2023) — arXiv:2310.11511
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (Jimenez et al., 2023) — arXiv:2310.06770
- GAIA: A Benchmark for General AI Assistants (Mialon et al., 2023) — arXiv:2311.12983
- DyLAN: Dynamic LLM-Agent Network (Liu et al., 2023) — arXiv:2310.02779
- CogAgent: A Visual Language Model for GUI Agents (Hong et al., 2023) — arXiv:2312.08914
- AppAgent: Multimodal Agents as Smartphone Users (Zhang et al., 2023) — arXiv:2312.13771
2024 Production & Scale
- AgentScope: A Flexible yet Robust Multi-Agent Platform (Alibaba, 2024) — arXiv:2402.14034
- CrewAI (2024) — GitHub
- ReadAgent: A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts (Google Research, 2024) — arXiv:2402.09727
- CodeAct: Executable Code Actions Elicit Better LLM Agents (Wang et al., 2024) — arXiv:2402.01030
- AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls (Du et al., 2024) — arXiv:2402.04253
- OS-Copilot: Towards Generalist Computer Agents with Self-Improvement (Wu et al., 2024) — arXiv:2402.07456
- Devin: AI Software Engineer (Cognition AI, 2024) — Website
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (Xie et al., 2024) — arXiv:2404.07972
- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering (Yang et al., 2024) — arXiv:2405.15793
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (Snell et al., 2024) — arXiv:2408.03314
2024–2025 Frameworks & Standards
- LangGraph (LangChain, 2024) — Documentation
- Anthropic Computer Use (October 2024) — Research Post
- OpenAI Swarm (October 2024) — GitHub Repository
- Model Context Protocol (MCP) (Anthropic, November 2024) — Website
- Building Effective Agents (Anthropic, December 2024) — Blog Post
- Agentic Large Language Models, a survey (Plaat et al., 2025, JAIR) — arXiv:2503.23037
2025 Launches & Milestones
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek, January 2025) — arXiv:2501.12948
- OpenAI Operator (January 2025) — Website
- Goose (Block / Jack Dorsey, January 2025) — GitHub
- Google A2A (Agent-to-Agent) Protocol (April 2025) — Google Cloud Blog
- Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning (PaperCoder; Seo et al., ICLR 2026) — arXiv:2504.17192
- GitHub Copilot Coding Agent (May 2025) — Blog Post
- GitHub Agent HQ (October 2025, Universe 2025) — Blog Post
- LangGraph v1.0 (November 2025) — Documentation
Blog Posts & Resources
- LLM Powered Autonomous Agents — Lilian Weng, 2023 — Blog
- METR’s Time Horizons of AI Agents — March 2025 — Blog
- MIT AI Agent Index — February 2026 — Website
For a conceptual map of how these fit together, see the Taxonomy.