Multi-Agent Systems & Frameworks
Multiple LLMs collaborating — and the tools to build them
Overview
Multi-agent systems represent a fundamental shift in AI architecture: instead of one LLM doing everything, specialized agents collaborate, each contributing expertise. This area exploded in 2023-2024, moving from academic curiosity to production frameworks used by thousands of developers.
Key themes:
- Role specialization: Planner, executor, critic, coder, tester — each agent excels at one thing
- Communication & coordination: How agents talk to each other, structure messages, and avoid confusion
- Emergent behavior: What happens when agents interact — including both cooperation and failure modes
- Framework maturation: From research prototypes (AutoGPT) to production systems (LangGraph, AutoGen)
Foundational Multi-Agent Papers
CAMEL: Communicative Agents for “Mind” Exploration (2023)
Li et al. · arXiv:2303.17760 · NeurIPS 2023
One of the first systematic studies of autonomous multi-agent cooperation. Introduces inception prompting to guide two agents (one playing a user, one an assistant) to collaboratively complete tasks without human intervention. Also serves as a framework for generating conversational datasets.
- Key ideas: Role-playing enables autonomous cooperation; inception prompting specifies agent personas; emergent task completion without human oversight
- GitHub: camel-ai/camel
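The two-agent role-play loop can be sketched in a few lines. This is a framework-agnostic illustration, not the camel-ai API; `scripted_llm` is a hypothetical stub standing in for a real chat-model call with the inception prompt prepended.

```python
# Framework-agnostic sketch of CAMEL-style role-play (not the camel-ai API).
# A "user" agent issues instructions and an "assistant" agent answers;
# they alternate turns until the user agent emits a termination token.
# `scripted_llm` is a stand-in for a real LLM call conditioned on the
# inception prompt (persona + task) plus the running history.

def scripted_llm(role: str, history: list[str]) -> str:
    turn = len(history)
    if role == "user":
        return "Instruction: write step %d" % (turn // 2 + 1) if turn < 4 else "<TASK_DONE>"
    return "Solution: step %d done" % (turn // 2 + 1)

def role_play(max_turns: int = 10) -> list[str]:
    history: list[str] = []
    for _ in range(max_turns):
        msg = scripted_llm("user", history)
        history.append(msg)
        if "<TASK_DONE>" in msg:          # user agent signals completion
            break
        history.append(scripted_llm("assistant", history))
    return history

transcript = role_play()
```

The key property is that no human intervenes inside the loop; the inception prompt alone steers both personas.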
Generative Agents: Interactive Simulacra of Human Behavior (2023)
Park et al. · arXiv:2304.03442 · UIST 2023 Best Paper
25 LLM-powered agents living in a simulated town (“Smallville”). Agents plan daily schedules, form relationships, remember experiences, and exhibit emergent social behaviors. Landmark demonstration of agent societies.
- Key ideas: Hierarchical memory (observation → reflection → planning); agents spontaneously organize events (Valentine’s Day party emerged from initial prompt about one agent); believable human-like behavior
- Influence: Every subsequent work on agent memory architecture cites this paper
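The observation → reflection → planning stack can be sketched with a scored memory stream. A toy sketch, not the paper's implementation: the retrieval score here is a simple recency-plus-importance sum, whereas the paper also weights semantic relevance via embeddings, and reflection is done by an LLM rather than string-joining.

```python
# Minimal sketch of the Generative Agents memory stack (observation ->
# reflection -> planning). Scoring is a toy recency+importance sum; the
# paper additionally weights relevance with embeddings.
import time
from dataclasses import dataclass, field

@dataclass
class Memory:
    text: str
    importance: float            # 1-10 in the paper, rated by the LLM
    t: float = field(default_factory=time.monotonic)

class MemoryStream:
    def __init__(self):
        self.entries: list[Memory] = []

    def observe(self, text: str, importance: float) -> None:
        self.entries.append(Memory(text, importance))

    def retrieve(self, k: int = 3) -> list[Memory]:
        now = time.monotonic()
        # recency decays with age; importance is intrinsic to the memory
        score = lambda m: m.importance - 0.1 * (now - m.t)
        return sorted(self.entries, key=score, reverse=True)[:k]

    def reflect(self) -> Memory:
        # A real agent asks the LLM to distill salient memories into a
        # higher-level insight; here we just join the top entries.
        top = self.retrieve()
        insight = "Reflection: " + "; ".join(m.text for m in top)
        m = Memory(insight, importance=8.0)
        self.entries.append(m)
        return m

stream = MemoryStream()
stream.observe("saw Klaus at the cafe", 4)
stream.observe("party planned for Feb 14", 7)
r = stream.reflect()
```

Reflections are stored back into the same stream, so later retrievals can surface abstractions instead of raw observations, which is what makes the hierarchy compound over time.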
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework (2023)
Hong et al. · arXiv:2308.00352
Software development with LLM agents assigned roles mirroring a real software company: Product Manager, Architect, Engineer, QA. Agents share structured documents (PRD, architecture spec) rather than just conversations. Reduces hallucination and improves code quality.
- Key ideas: Standard Operating Procedures (SOPs) constrain agent workflows; structured outputs (not just text); document-driven coordination; role-based execution pipeline
- GitHub: geekan/MetaGPT — 45k+ stars
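Document-driven coordination can be sketched as a pipeline where each role consumes the previous role's structured artifact and emits its own. This is an illustration of the idea, not the MetaGPT API; the role functions and document schemas are hypothetical stand-ins for LLM calls constrained to output those schemas.

```python
# Sketch of MetaGPT-style document-driven coordination (not the MetaGPT
# API): each role consumes the previous role's structured document and
# emits its own, so downstream agents see specs, not free-form chat.

def product_manager(idea: str) -> dict:
    # A real role would call an LLM constrained to a PRD schema.
    return {"doc": "PRD", "goal": idea, "requirements": ["CLI", "tests"]}

def architect(prd: dict) -> dict:
    return {"doc": "ARCH", "modules": ["cli.py"], "satisfies": prd["goal"]}

def engineer(arch: dict) -> dict:
    files = {m: f"# implements {arch['satisfies']}" for m in arch["modules"]}
    return {"doc": "CODE", "files": files}

pipeline = [product_manager, architect, engineer]
artifact: object = "todo-list app"
for role in pipeline:
    artifact = role(artifact)
```

Because every handoff is a typed document rather than a transcript, each agent's input is bounded and checkable, which is the mechanism behind the reduced hallucination the paper reports.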
ChatDev: Communicative Agents for Software Development (2023)
Qian et al. · arXiv:2307.07924 · ACL 2024
Complete software development lifecycle with specialized roles: CEO, CTO, Programmer, Reviewer, Tester. Agents use both natural language and programming language to communicate. Produces functional software from text descriptions in minutes.
- Key ideas: SOP-constrained agent behavior; mixed natural/programming language communication; role specialization across SDLC phases
- GitHub: OpenBMB/ChatDev
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation (2023)
Wu et al. (Microsoft) · arXiv:2308.08155
Framework for building conversable agents that can interact with each other and humans. Agents are configurable (roles, tools, human-input modes). Enables complex workflows: code generation → execution → debugging → verification, all between agents.
- Key ideas: Conversable agent abstraction; human-in-the-loop anywhere; flexible conversation patterns (sequential, group chat, nested); code execution sandbox
- GitHub: microsoft/autogen — 40k+ stars
AgentVerse: Facilitating Multi-Agent Collaboration (2023)
Chen et al. · arXiv:2308.10848
Platform for orchestrating agents in diverse scenarios and studying emergent behaviors — both positive (division of labor, error correction) and negative (groupthink, cascade failures). An early systematic study of both cooperative and failure modes in multi-agent systems.
- GitHub: OpenBMB/AgentVerse
DyLAN: Dynamic LLM-Powered Agent Network (2023)
Liu et al. · arXiv:2310.02170
Dynamic agent selection for each reasoning step — only the agents most relevant to the current sub-task are activated. More efficient than static all-agent participation.
- Key ideas: Dynamic agent recruitment; task-adaptive networks; reduces cost vs. full-team approaches
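Dynamic recruitment can be illustrated with a toy relevance ranking. A sketch only: the keyword-overlap score below is a placeholder, whereas DyLAN derives agent importance from prior interaction performance.

```python
# Toy sketch of DyLAN-style dynamic recruitment: before each reasoning
# step, score every agent's relevance to the sub-task and activate only
# the top-k, instead of running the whole team. Keyword overlap stands
# in for DyLAN's learned agent-importance scores.

AGENTS = {
    "coder": {"python", "bug", "function"},
    "mathematician": {"proof", "integral", "algebra"},
    "writer": {"essay", "tone", "summary"},
}

def recruit(subtask: str, k: int = 1) -> list[str]:
    words = set(subtask.lower().split())
    ranked = sorted(AGENTS, key=lambda a: len(AGENTS[a] & words), reverse=True)
    return ranked[:k]

active = recruit("fix the python function bug", k=2)
```

With k agents active per step instead of the full roster, token cost scales with task difficulty rather than team size.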
Solo Performance Prompting (SPP) (2023)
Wang et al. · arXiv:2307.05300
Single LLM plays multiple expert personas sequentially — extracting multi-agent benefits from one model. Improves factuality and reasoning without requiring multiple model instances.
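The SPP idea reduces to one model call-site reused under different persona prefixes, with a final synthesis pass. A minimal sketch with a stub model; the persona names and prompt format are illustrative, not the paper's exact templates.

```python
# Sketch of Solo Performance Prompting: a single model is invoked
# repeatedly with different persona prefixes, then once more to
# synthesize. `fake_model` stands in for one LLM; only the prompt
# changes per persona -- no extra model instances are needed.

def fake_model(prompt: str) -> str:
    persona = prompt.split(":", 1)[0]
    return f"[{persona}] draft answer"

def spp(question: str, personas: list[str]) -> str:
    drafts = [fake_model(f"{p}: {question}") for p in personas]
    return fake_model("Synthesizer: " + " | ".join(drafts))

answer = spp("Who wrote the score?", ["Historian", "Musicologist"])
```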
AgentScope: A Flexible yet Robust Multi-Agent Platform (2024)
Alibaba · arXiv:2402.14034
Production-oriented multi-agent platform with fault tolerance, distributed execution, and operator/developer separation. Built for real-world scale.
TapeAgents: Tape-Centric Agent Framework (2024)
Bahdanau, Gontier, Huang et al. (ServiceNow) · arXiv:2412.08445
Agent framework built around a structured log tape — the tape is simultaneously the session log, resumable state, and development artifact. Agents append thought/action steps to the tape; the environment appends observations back. This tape-centric design supports the full agent lifecycle: development (debugging, auditing), post-deployment (evaluation, fine-tuning), and cross-agent knowledge transfer (adapt tapes from other agents).
Has echoes of classical blackboard systems in AI — shared structured state that multiple agents read and write. The tape makes agent behavior transparent, reproducible, and reusable in ways that ad-hoc logging doesn’t.
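The tape-as-shared-state idea can be sketched as an append-only list of typed steps. An illustration of the concept, not the TapeAgents API; the step schema (`kind`, `tool`, etc.) is hypothetical.

```python
# Sketch of a tape as described above (not the TapeAgents API): an
# append-only list of typed steps that is simultaneously the log and
# the resumable state. The agent appends thoughts/actions; the
# environment appends observations; resuming means replaying the tail.
import json

tape: list[dict] = []

def agent_step(tape):
    # A real agent would condition an LLM on the whole tape so far.
    tape.append({"kind": "thought", "text": "need the file size"})
    tape.append({"kind": "action", "tool": "stat", "arg": "data.csv"})

def env_step(tape):
    if tape and tape[-1]["kind"] == "action":
        tape.append({"kind": "observation", "result": "1024 bytes"})

agent_step(tape)
env_step(tape)

# The tape serializes directly: session log, evaluation data, and
# fine-tuning traces are all the same artifact.
serialized = json.dumps(tape)
```

Because the tape is plain structured data, the same object serves debugging, audit, and cross-agent reuse without separate logging machinery.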
Open-Source Frameworks
These are the tools practitioners actually build with:
LangChain
github.com/langchain-ai/langchain — 95k+ stars (as of early 2025)
The most widely used LLM application framework. Provides chains, agents, tools, and memory as composable building blocks. The Agents module supports ReAct, structured output, and tool-calling agents.
- Philosophy: Composability — mix and match chains, tools, memory
- Best for: Rapid prototyping, broad ecosystem of integrations (100+ tools)
- Limitation: Can be opaque; complex chains are hard to debug
LangGraph
github.com/langchain-ai/langgraph — 12k+ stars (as of early 2025)
Graph-based framework for stateful, multi-actor workflows. Nodes are agents/functions; edges are transitions. Supports cycles (loops), branching, and human-in-the-loop checkpoints. Production-ready with persistent checkpoints.
- Philosophy: Explicit state machines — you control the flow
- Best for: Complex multi-agent workflows, production deployments, long-running agents
- Key feature: Persistent checkpoints enable fault recovery
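The nodes-edges-cycles model can be sketched in plain Python. This is the underlying state-machine idea only, not LangGraph's actual API: nodes are functions over a shared state, edges are routing functions, and a cycle runs until a node routes to an end sentinel.

```python
# Plain-Python sketch of the LangGraph idea (not LangGraph's API):
# nodes are functions over a shared state dict, edges are routing
# functions, and cycles are allowed until a node routes to END.

END = "__end__"

def draft(state):      # worker node: extend the draft
    state["text"] = state.get("text", "") + "x"
    return state

def review(state):     # critic node: loop back until long enough
    state["ok"] = len(state["text"]) >= 3
    return state

NODES = {"draft": draft, "review": review}
EDGES = {"draft": lambda s: "review",
         "review": lambda s: END if s["ok"] else "draft"}

def run(entry: str, state: dict) -> dict:
    node = entry
    while node != END:
        state = NODES[node](state)   # a checkpoint here enables fault recovery
        node = EDGES[node](state)
    return state

final = run("draft", {})
```

Persisting `state` after each node call is exactly where checkpoint-based fault recovery slots in: a crashed run resumes from the last saved state and node name.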
AutoGen (Microsoft)
github.com/microsoft/autogen — 40k+ stars (as of early 2025)
Conversable agent framework for multi-agent conversations. AutoGen v0.4 (2024) is a complete rewrite with async-first, event-driven architecture.
- Philosophy: Agents as participants in a conversation
- Best for: Code generation + execution workflows, research, human-in-the-loop systems
- Notable: AutoGen Studio provides a no-code GUI for building agent workflows
CrewAI
github.com/crewAIInc/crewAI — 40k+ stars (as of early 2025)
Role-based agent framework with a high-level “crew” abstraction. Define agents with roles, goals, and backstories; assign tasks; the crew coordinates to complete the objective.
- Philosophy: Agents as a team with defined roles
- Best for: Business process automation, structured workflows with clear role boundaries
- Simplicity: Lower learning curve than LangGraph for basic use cases
DSPy (Stanford)
github.com/stanfordnlp/dspy — 22k+ stars (as of early 2025)
Programming language models via declarative modules and automatic optimization. Instead of hand-crafting prompts, you define a program and DSPy optimizes the prompts automatically.
- Philosophy: LLM programs, not prompts; optimize systematically
- Best for: Applications where you want systematic prompt optimization, research
Semantic Kernel (Microsoft)
github.com/microsoft/semantic-kernel — 23k+ stars (as of early 2025)
Enterprise-focused SDK for AI-powered applications. Strong .NET/C# support alongside Python. Integrates with Azure AI, OpenAI, and Hugging Face.
- Philosophy: Plugins and planners; enterprise integration
- Best for: Enterprise deployments, .NET applications
Haystack (deepset)
github.com/deepset-ai/haystack — 18k+ stars (as of early 2025)
Pipeline-based framework for NLP applications with strong RAG and document processing support. v2.0 (2024) redesigned as composable components.
Agno
agno-agi · github.com/agno-agi/agno
Full-stack platform for agentic software: framework + production runtime (FastAPI-based, session-scoped) + control plane (AgentOS UI). Lets you build stateful agents with memory, 100+ tool integrations, guardrails, and MCP support in roughly 20 lines of code. Model-agnostic. See Community Agents → for full coverage.
OpenAI Swarm (2024)
Experimental (educational) framework from OpenAI for multi-agent orchestration. Lightweight and minimal. Demonstrates handoffs — one agent transferring control to another based on context.
- Philosophy: Minimalism; focus on handoffs and routines
- Status: Experimental/educational, not production-intended
Production Case Study: Anthropic’s Multi-Agent Research System (2025)
Anthropic Engineering · anthropic.com/engineering/multi-agent-research-system
One of the most detailed public engineering accounts of building a real production multi-agent system — Anthropic’s own Claude Research feature. Essential reading.
Architecture: Orchestrator-Worker
The Research system uses an orchestrator Claude (Opus 4) that plans a research strategy, then spawns parallel worker agents (Sonnet 4) to pursue different sub-questions simultaneously. Each worker has its own context window and exploration trajectory. Results are compressed back to the orchestrator for synthesis.
Why Multi-Agent? The Core Argument
“The essence of search is compression: distilling insights from a vast corpus. Subagents facilitate compression by operating in parallel with their own context windows, exploring different aspects of the question simultaneously before condensing the most important tokens for the lead research agent.”
Three reasons multi-agent excels for research:
1. Breadth-first parallelism — multiple threads explored simultaneously
2. Context window scaling — total context across agents >> single agent window
3. Path independence — separate explorations avoid getting stuck in the same dead-ends
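The orchestrator-worker shape described above reduces to a fan-out/fan-in. A sketch with stub workers: in the real system each worker is an LLM with its own context window doing search and compression, and the join is a synthesis prompt rather than string concatenation.

```python
# Sketch of the orchestrator-worker pattern: the lead agent fans
# sub-questions out to parallel workers (each with its own "context"),
# then synthesizes the compressed results. Worker bodies are stubs
# standing in for LLM + search calls.
from concurrent.futures import ThreadPoolExecutor

def worker(subquestion: str) -> str:
    # Real workers browse/search independently, then compress their
    # findings before reporting back to the orchestrator.
    return f"summary({subquestion})"

def orchestrate(question: str) -> str:
    subqs = [f"{question} / aspect {i}" for i in range(3)]   # plan
    with ThreadPoolExecutor(max_workers=3) as pool:
        findings = list(pool.map(worker, subqs))             # parallel fan-out
    return " + ".join(findings)                              # synthesis

report = orchestrate("market size")
```

The compression happens at the worker boundary: only each worker's distilled summary, not its full exploration trace, crosses back into the orchestrator's context.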
Quantified Results
| Metric | Value |
|---|---|
| Multi-agent (Opus 4 + Sonnet 4 workers) vs. single Opus 4 | +90.2% on internal research eval |
| Token use: single agent vs. chat | ~4× |
| Token use: multi-agent vs. chat | ~15× |
| BrowseComp variance explained by token count alone | 80% |
The 80% figure is striking: raw token budget, not clever prompting or architecture, explains most of the variance in hard web-research performance. Upgrading from Sonnet 3.7 to Sonnet 4 yields a larger gain than doubling the token budget on 3.7.
When Multi-Agent Works (and Doesn’t)
Best for:
- Heavy parallelization — many independent sub-queries
- Information that exceeds single context windows
- Interfacing with numerous complex tools

Poor fit:
- Many cross-agent dependencies (agents need to share context frequently)
- Real-time coordination between agents
- Most coding tasks (fewer truly parallelizable sub-tasks than research)
Engineering Lessons
- Tool design matters as much as prompting — poorly designed tools cause cascading failures
- Evaluation is uniquely hard — open-ended research quality is difficult to measure automatically
- Economic viability gate — multi-agent is only justified for high-value tasks (15× token cost)
- Subagent separation of concerns — distinct prompts and tools per worker, not a monolith
Orchestration Patterns
The literature has converged on several reusable patterns:
Pattern 1: Planner → Executor
A planner agent generates a structured plan; executor agents carry it out. Clean separation of concerns, easy to monitor.
Pattern 2: Planner → Executor → Critic
Adds a critic/reviewer agent that evaluates outputs and sends feedback. Used in MetaGPT, ChatDev.
Pattern 3: Hub and Spoke
One orchestrator agent routes tasks to specialist agents. The hub maintains context; spokes are stateless workers. Used in HuggingGPT, many LangGraph workflows.
Pattern 4: Peer Debate
Multiple agents argue different positions and converge on a consensus. Improves factual accuracy. Used in debates for math, science questions.
Pattern 5: Hierarchical Teams
Teams of teams. Each sub-team has an internal coordinator; sub-teams report to a top-level orchestrator. Scales to complex, long-horizon tasks.
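Pattern 2 above is the one most frameworks implement first, and its control flow fits in a short sketch. The agent bodies are stubs; in practice each would be an LLM call, and the critic's feedback string would be natural-language review comments.

```python
# Minimal sketch of Pattern 2 (planner -> executor -> critic) with stub
# agents: the critic gates each output and feeds rejections back to the
# executor until it passes or the revision budget runs out.

def planner(goal: str) -> list[str]:
    return [f"{goal}: step {i}" for i in (1, 2)]

def executor(step: str, feedback: str = "") -> str:
    return f"done[{step}]{'+fixed' if feedback else ''}"

def critic(output: str) -> str:
    # Return "" to accept, or feedback requesting a revision.
    return "" if "+fixed" in output or "step 1" in output else "revise"

def run(goal: str, max_revisions: int = 2) -> list[str]:
    results = []
    for step in planner(goal):
        out = executor(step)
        for _ in range(max_revisions):
            fb = critic(out)
            if not fb:
                break
            out = executor(step, feedback=fb)
        results.append(out)
    return results

outputs = run("ship feature")
```

Bounding the revision loop (`max_revisions`) is the standard guard against the critic and executor cycling forever, a failure mode several of the papers above report.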
Model Context Protocol (MCP)
Anthropic · Announced November 2024
MCP is an open standard for how AI agents communicate with external tools and data sources. Rather than each agent/framework implementing custom tool integrations, MCP provides a universal interface: servers expose tools/resources; agents connect as clients.
- Analogy: MCP is to agents what HTTP is to web browsers
- Impact: By early 2025, hundreds of MCP servers existed (GitHub, Slack, databases, file systems)
- Adoption: OpenAI, Google, and major framework builders endorsed MCP
- Docs: modelcontextprotocol.io
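On the wire, MCP messages are JSON-RPC 2.0. The sketch below shows the shape of a tool discovery and invocation exchange; the `tools/list` and `tools/call` method names follow the published spec, but this is an illustrative client-side sketch (with a hypothetical `get_weather` tool), not SDK code.

```python
# Shape of an MCP exchange: MCP messages are JSON-RPC 2.0. The client
# first lists the server's tools, then calls one by name. `get_weather`
# is a hypothetical tool used for illustration.
import json

list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# After the server replies with its tool catalog, the agent invokes one:
call_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {"name": "get_weather", "arguments": {"city": "Paris"}},
}

wire = json.dumps(call_request)    # what actually crosses the transport
decoded = json.loads(wire)
```

Because every framework speaks this same envelope, a tool server written once is usable from any MCP-capable agent, which is the N×M-to-N+M integration collapse the standard is after.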
Evaluation & Benchmarks
AgentBench (2023)
Liu et al. · arXiv:2308.03688
8-task benchmark spanning operating systems, databases, web browsers, card games, and more. Revealed a large gap between GPT-4 and open-source models on agent tasks.
WebArena (2023)
Zhou et al. · arXiv:2307.13854
Realistic web environment with 5 functional websites (e-commerce, Reddit, GitLab, CMS, map) plus Wikipedia as a reference resource. 812 long-horizon tasks. Baseline GPT-4 success: ~14%.
GAIA: A Benchmark for General AI Assistants (2023)
Mialon et al. · arXiv:2311.12983 · ICLR 2024
Tasks requiring multi-step reasoning, tool use, and common sense. Humans achieve 92%; GPT-4 with plugins achieved ~15% at release. Tests true general-purpose assistant ability.
OSWorld (2024)
Xie et al. · arXiv:2404.07972
Desktop GUI tasks across Windows/macOS/Linux. 369 real computer tasks. Baseline models achieved <10% at release; by late 2024 Claude's computer-use model reached 22%+, still well below the ~72% human success rate.
ST-WebAgentBench (2024)
Safety and trustworthiness evaluation for web agents. Tests whether agents follow safety instructions and avoid harmful actions.
References
Papers
Multi-Agent Systems & Frameworks
- CAMEL: Communicative Agents for “Mind” Exploration (Li et al., 2023) — arXiv:2303.17760
- Generative Agents: Interactive Simulacra of Human Behavior (Park et al., 2023) — arXiv:2304.03442
- MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework (Hong et al., 2023) — arXiv:2308.00352
- ChatDev: Communicative Agents for Software Development (Qian et al., 2023) — arXiv:2307.07924
- AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation (Wu et al., 2023) — arXiv:2308.08155
- AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors (Chen et al., 2023) — arXiv:2308.10848
- A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration (Liu et al., 2023) — arXiv:2310.02170 — Published COLM 2024
- Solo Performance Prompting (Wang et al., 2023) — arXiv:2307.05300
- AgentScope: A Flexible yet Robust Multi-Agent Platform (Alibaba, 2024) — arXiv:2402.14034
- TapeAgents: Tape-Centric Agent Framework (Bahdanau et al., 2024) — arXiv:2412.08445
Evaluation & Benchmarks
- AgentBench: Evaluating LLMs as Agents (Liu et al., 2023) — arXiv:2308.03688
- WebArena: A Realistic Web Environment for Building Autonomous Agents (Zhou et al., 2023) — arXiv:2307.13854
- GAIA: A Benchmark for General AI Assistants (Mialon et al., 2023) — arXiv:2311.12983
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (Xie et al., 2024) — arXiv:2404.07972
- ST-WebAgentBench: Towards Evaluating Safety and Trustworthiness for Autonomous Web Agents (2024) (source needed)
Blog Posts & Articles
- Anthropic: How We Built Our Multi-Agent Research System — anthropic.com/engineering/multi-agent-research-system
Frameworks & Open-Source Projects
- LangChain — github.com/langchain-ai/langchain
- LangGraph — github.com/langchain-ai/langgraph
- AutoGen (Microsoft) — github.com/microsoft/autogen
- CrewAI — github.com/crewAIInc/crewAI
- DSPy (Stanford) — github.com/stanfordnlp/dspy
- Semantic Kernel (Microsoft) — github.com/microsoft/semantic-kernel
- Haystack (deepset) — github.com/deepset-ai/haystack
- Agno — github.com/agno-agi/agno
- OpenAI Swarm — github.com/openai/swarm
- CAMEL — github.com/camel-ai/camel
- MetaGPT — github.com/geekan/MetaGPT
- ChatDev — github.com/OpenBMB/ChatDev
- AgentVerse — github.com/OpenBMB/AgentVerse
Standards & Specifications
- Model Context Protocol (MCP) (Anthropic, 2024) — modelcontextprotocol.io
Full chronology in the Timeline →. Continue to Memory, Tools & Actions →