Multi-Agent Systems & Frameworks

Multiple LLMs collaborating — and the tools to build them

Overview

Multi-agent systems represent a fundamental shift in AI architecture: instead of one LLM doing everything, specialized agents collaborate, each contributing expertise. This area exploded in 2023-2024, moving from academic curiosity to production frameworks used by thousands of developers.

Key themes:

  • Role specialization: Planner, executor, critic, coder, tester — each agent excels at one thing
  • Communication & coordination: How agents talk to each other, structure messages, and avoid confusion
  • Emergent behavior: What happens when agents interact — including both cooperation and failure modes
  • Framework maturation: From research prototypes (AutoGPT) to production systems (LangGraph, AutoGen)

Foundational Multi-Agent Papers

CAMEL: Communicative Agents for “Mind” Exploration (2023)

Li et al. · arXiv:2303.17760 · NeurIPS 2023

One of the first systematic studies of autonomous multi-agent cooperation. Introduces inception prompting to guide two agents (one playing a user, one an assistant) to collaboratively complete tasks without human intervention. Also a framework for generating conversational datasets.

  • Key ideas: Role-playing enables autonomous cooperation; inception prompting specifies agent personas; emergent task completion without human oversight
  • GitHub: camel-ai/camel
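The core loop is simple to sketch. Below is an illustrative, framework-agnostic version; the `llm` stub, the `<TASK_DONE>` marker, and the prompt wording are assumptions for demonstration, not CAMEL's actual implementation:

```python
# CAMEL-style role-play sketch: a "user" agent issues instructions and an
# "assistant" agent solves them, alternating turns until the task is done.

def llm(system_prompt: str, history: list[str]) -> str:
    # Stub: a real implementation would call a chat model here.
    turn = len(history)
    return "<TASK_DONE>" if turn >= 4 else f"step {turn + 1}"

def role_play(task: str, max_turns: int = 10) -> list[str]:
    # Inception prompts pin each agent to its persona and the shared task.
    user_sys = f"You are the user. Give one instruction at a time for: {task}"
    asst_sys = f"You are the assistant. Carry out instructions for: {task}"
    history: list[str] = []
    for _ in range(max_turns):
        instruction = llm(user_sys, history)
        history.append(instruction)
        if "<TASK_DONE>" in instruction:
            break
        history.append(llm(asst_sys, history))
    return history

transcript = role_play("write a sorting function")
```

The key point is that no human appears in the loop: one model plays the instructor, the other the solver, and a termination marker ends the exchange.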

Generative Agents: Interactive Simulacra of Human Behavior (2023)

Park et al. · arXiv:2304.03442 · UIST 2023 Best Paper

25 LLM-powered agents living in a simulated town (“Smallville”). Agents plan daily schedules, form relationships, remember experiences, and exhibit emergent social behaviors. Landmark demonstration of agent societies.

  • Key ideas: Hierarchical memory (observation → reflection → planning); agents spontaneously organize events (Valentine’s Day party emerged from initial prompt about one agent); believable human-like behavior
  • Influence: Every subsequent work on agent memory architecture cites this paper
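The observation → reflection → planning memory can be approximated with a scored record store. This is an assumed simplification: the paper's retrieval combines recency, importance, and embedding relevance, while this sketch drops the relevance term:

```python
# Generative-Agents-style memory stream (simplified): each record carries
# an importance score and a timestamp; retrieval ranks by recency + importance.

class MemoryStream:
    def __init__(self, decay: float = 0.995):
        self.records = []            # (timestamp, importance, text)
        self.decay = decay

    def observe(self, text: str, importance: float, t: float):
        self.records.append((t, importance, text))

    def retrieve(self, now: float, k: int = 3):
        def score(rec):
            t, importance, _ = rec
            recency = self.decay ** (now - t)   # exponential decay over time
            return recency + importance          # paper also adds relevance
        return [r[2] for r in sorted(self.records, key=score, reverse=True)[:k]]

mem = MemoryStream()
mem.observe("saw Isabella at the cafe", importance=0.3, t=0)
mem.observe("planned Valentine's Day party", importance=0.9, t=1)
mem.observe("ate breakfast", importance=0.1, t=2)
top = mem.retrieve(now=2, k=2)
```

High-importance memories survive even as they age, which is what lets reflections and plans keep resurfacing while routine observations fade.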

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework (2023)

Hong et al. · arXiv:2308.00352

Software development with LLM agents assigned roles mirroring a real software company: Product Manager, Architect, Engineer, QA. Agents share structured documents (PRD, architecture spec) rather than just conversations. Reduces hallucination and improves code quality.

  • Key ideas: Standard Operating Procedures (SOPs) constrain agent workflows; structured outputs (not just text); document-driven coordination; role-based execution pipeline
  • GitHub: geekan/MetaGPT — 45k+ stars

ChatDev: Communicative Agents for Software Development (2023)

Qian et al. · arXiv:2307.07924 · ACL 2024

Complete software development lifecycle with specialized roles: CEO, CTO, Programmer, Reviewer, Tester. Agents use both natural language and programming language to communicate. Produces functional software from text descriptions in minutes.

  • Key ideas: SOP-constrained agent behavior; mixed natural/programming language communication; role specialization across SDLC phases
  • GitHub: OpenBMB/ChatDev

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation (2023)

Wu et al. (Microsoft) · arXiv:2308.08155

Framework for building conversable agents that can interact with each other and with humans. Agents are configurable (roles, tools, human-input modes). Enables complex workflows carried out entirely between agents: code generation → execution → debugging → verification.

  • Key ideas: Conversable agent abstraction; human-in-the-loop anywhere; flexible conversation patterns (sequential, group chat, nested); code execution sandbox
  • GitHub: microsoft/autogen — 40k+ stars
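The conversable-agent abstraction reduces to a uniform reply interface that any participant, LLM-backed or code-executing, can implement. A plain-Python sketch follows; all class and function names are illustrative, not AutoGen's actual API:

```python
# Conversable-agent pattern sketch: agents expose generate_reply(), so an
# LLM-backed assistant and a code-executing proxy compose interchangeably.

class ConversableAgent:
    def __init__(self, name, reply_fn):
        self.name = name
        self.reply_fn = reply_fn

    def generate_reply(self, message: str) -> str:
        return self.reply_fn(message)

def run_chat(a, b, opening: str, max_rounds: int = 4):
    transcript, msg = [], opening
    for _, receiver in [(a, b), (b, a)] * max_rounds:
        msg = receiver.generate_reply(msg)
        transcript.append((receiver.name, msg))
        if "TERMINATE" in msg:
            break
    return transcript

assistant = ConversableAgent("assistant", lambda m: "print(2 + 2)")  # stub LLM
executor = ConversableAgent(
    "executor",
    lambda m: "output: 4\nTERMINATE" if "print" in m else "no code found")

log = run_chat(executor, assistant, "please compute 2 + 2")
```

In AutoGen proper, the executor side runs generated code in a sandbox and feeds results back, closing the generate → execute → debug loop without a human.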

AgentVerse: Facilitating Multi-Agent Collaboration (2023)

Chen et al. · arXiv:2308.10848

Platform for orchestrating agents in diverse scenarios and studying emergent behaviors — both positive (division of labor, error correction) and negative (groupthink, cascade failures). Among the early systematic studies of cooperative and failure modes in multi-agent systems.

DyLAN: Dynamic LLM-Powered Agent Network (2023)

Liu et al. · arXiv:2310.02170

Dynamic agent selection for each reasoning step — only the agents most relevant to the current sub-task are activated. More efficient than static all-agent participation.

  • Key ideas: Dynamic agent recruitment; task-adaptive networks; reduces cost vs. full-team approaches
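Dynamic recruitment can be sketched as scoring each agent against the current sub-task and activating only the best matches. The keyword-overlap scoring below is a stand-in for DyLAN's learned selection; the agent roster and names are illustrative:

```python
# Dynamic agent recruitment sketch: activate only the agents whose
# expertise best matches the current sub-task, not the whole team.

AGENTS = {
    "coder": {"code", "debug", "implement"},
    "mathematician": {"prove", "integral", "algebra"},
    "critic": {"review", "verify", "check"},
}

def recruit(subtask: str, top_k: int = 1) -> list[str]:
    words = set(subtask.lower().split())
    scored = [(len(words & keywords), name) for name, keywords in AGENTS.items()]
    scored.sort(reverse=True)                      # highest overlap first
    return [name for score, name in scored[:top_k] if score > 0]

active = recruit("implement and debug the parser")
```

Each step pays only for the recruited agents' inference, which is the source of the cost savings over full-team participation.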

Solo Performance Prompting (SPP) (2023)

Wang et al. · arXiv:2307.05300

Single LLM plays multiple expert personas sequentially — extracting multi-agent benefits from one model. Improves factuality and reasoning without requiring multiple model instances.
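A hedged sketch of the SPP control flow, with a stubbed model call in place of a real LLM (persona names, prompt wording, and canned answers are illustrative):

```python
# Solo Performance Prompting sketch: one model is prompted as several
# expert personas in turn, then asked to synthesize its own notes.

def llm(prompt: str) -> str:
    # Stub standing in for a single chat model; returns canned text.
    if prompt.startswith("Synthesize"):
        return "final answer: 1648"
    if "Historian" in prompt:
        return "the treaty was signed in 1648"
    return "confirmed: 1648 (Peace of Westphalia)"

def spp(question: str, personas: list[str]) -> str:
    notes = []
    for persona in personas:        # same model, a different persona per turn
        notes.append(f"{persona}: " + llm(f"As a {persona}, answer: {question}"))
    return llm("Synthesize these notes:\n" + "\n".join(notes))

answer = spp("When was the Peace of Westphalia signed?",
             ["Historian", "Fact-checker"])
```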

AgentScope: A Flexible yet Robust Multi-Agent Platform (2024)

Alibaba · arXiv:2402.14034

Production-oriented multi-agent platform with fault tolerance, distributed execution, and operator/developer separation. Built for real-world scale.

TapeAgents: Tape-Centric Agent Framework (2024)

Bahdanau, Gontier, Huang et al. (ServiceNow) · arXiv:2412.08445

Agent framework built around a structured log tape — the tape is simultaneously the session log, resumable state, and development artifact. Agents append thought/action steps to the tape; the environment appends observations back. This tape-centric design supports the full agent lifecycle: development (debugging, auditing), post-deployment (evaluation, fine-tuning), and cross-agent knowledge transfer (adapt tapes from other agents).

Has echoes of classical blackboard systems in AI — shared structured state that multiple agents read and write. The tape makes agent behavior transparent, reproducible, and reusable in ways that ad-hoc logging doesn’t.
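A minimal sketch of the tape idea (a simplification, not TapeAgents' actual schema): an append-only list of typed steps that serves as log, state, and artifact at once:

```python
import json

# Tape-centric sketch: agents append thought/action steps; the environment
# appends observations. Plain data, so the tape serializes and resumes freely.

def append_step(tape: list, author: str, kind: str, content: str) -> list:
    tape.append({"index": len(tape), "author": author,
                 "kind": kind, "content": content})
    return tape

tape: list = []
append_step(tape, "agent", "thought", "I should search for the answer")
append_step(tape, "agent", "action", "search('capital of France')")
append_step(tape, "environment", "observation", "Paris")
append_step(tape, "agent", "answer", "Paris")

# Serialize for auditing, evaluation, fine-tuning data, or resumption.
serialized = json.dumps(tape)
resumed = json.loads(serialized)
```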


Open-Source Frameworks

These are the tools practitioners actually build with:

LangChain

github.com/langchain-ai/langchain — 95k+ stars (as of early 2025)

The most widely used LLM application framework. Provides chains, agents, tools, and memory as composable building blocks. The Agents module supports ReAct, structured output, and tool-calling agents.

  • Philosophy: Composability — mix and match chains, tools, memory
  • Best for: Rapid prototyping, broad ecosystem of integrations (100+ tools)
  • Limitation: Can be opaque; complex chains are hard to debug

LangGraph

github.com/langchain-ai/langgraph — 12k+ stars (as of early 2025)

Graph-based framework for stateful, multi-actor workflows. Nodes are agents/functions; edges are transitions. Supports cycles (loops), branching, and human-in-the-loop checkpoints. Production-ready with persistent checkpoints.

  • Philosophy: Explicit state machines — you control the flow
  • Best for: Complex multi-agent workflows, production deployments, long-running agents
  • Key feature: Persistent checkpoints enable fault recovery
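The explicit state-machine idea can be sketched in plain Python (illustrative only, not LangGraph's actual API): nodes are functions over shared state, and a router picks the next edge, including cycles back to earlier nodes:

```python
# Explicit state-machine sketch: a draft node and a review node, with a
# cycle back to drafting whenever the review rejects the output.

def draft(state):
    state["attempts"] += 1
    state["text"] = f"draft v{state['attempts']}"
    return state

def review(state):
    state["approved"] = state["attempts"] >= 2   # contrived: reject v1 once
    return state

NODES = {"draft": draft, "review": review}

def route(node, state):
    if node == "draft":
        return "review"
    return "END" if state["approved"] else "draft"   # loop on rejection

def run(entry, state):
    node = entry
    while node != "END":
        state = NODES[node](state)   # a checkpoint could be persisted here
        node = route(node, state)
    return state

final = run("draft", {"attempts": 0})
```

Persisting `state` after each node is exactly where LangGraph's checkpointing hooks in, which is what makes long-running and fault-recoverable agents possible.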

AutoGen (Microsoft)

github.com/microsoft/autogen — 40k+ stars (as of early 2025)

Conversable agent framework for multi-agent conversations. AutoGen v0.4 (2024) is a complete rewrite with async-first, event-driven architecture.

  • Philosophy: Agents as participants in a conversation
  • Best for: Code generation + execution workflows, research, human-in-the-loop systems
  • Notable: AutoGen Studio provides a no-code GUI for building agent workflows

CrewAI

github.com/crewAIInc/crewAI — 40k+ stars (as of early 2025)

Role-based agent framework with a high-level “crew” abstraction. Define agents with roles, goals, and backstories; assign tasks; the crew coordinates to complete the objective.

  • Philosophy: Agents as a team with defined roles
  • Best for: Business process automation, structured workflows with clear role boundaries
  • Simplicity: Lower learning curve than LangGraph for basic use cases

DSPy (Stanford)

github.com/stanfordnlp/dspy — 22k+ stars (as of early 2025)

Programming language models via declarative modules and automatic optimization. Instead of hand-crafting prompts, you define a program and DSPy optimizes the prompts automatically.

  • Philosophy: LLM programs, not prompts; optimize systematically
  • Best for: Applications where you want systematic prompt optimization, research

Semantic Kernel (Microsoft)

github.com/microsoft/semantic-kernel — 23k+ stars (as of early 2025)

Enterprise-focused SDK for AI-powered applications. Strong .NET/C# support alongside Python. Integrates with Azure AI, OpenAI, and Hugging Face.

  • Philosophy: Plugins and planners; enterprise integration
  • Best for: Enterprise deployments, .NET applications

Haystack (deepset)

github.com/deepset-ai/haystack — 18k+ stars (as of early 2025)

Pipeline-based framework for NLP applications with strong RAG and document processing support. v2.0 (2024) redesigned as composable components.

Agno

agno-agi · github.com/agno-agi/agno

Full-stack platform for agentic software: framework + production runtime (FastAPI-based, session-scoped) + control plane (AgentOS UI). Builds stateful agents with memory, 100+ tool integrations, guardrails, and MCP support in ~20 lines. Model-agnostic. See Community Agents → for full coverage.

OpenAI Swarm (2024)

github.com/openai/swarm

Experimental (educational) framework from OpenAI for multi-agent orchestration. Lightweight and minimal. Demonstrates handoffs — one agent transferring control to another based on context.

  • Philosophy: Minimalism; focus on handoffs and routines
  • Status: Experimental/educational, not production-intended

Production Case Study: Anthropic’s Multi-Agent Research System (2025)

Anthropic Engineering · anthropic.com/engineering/multi-agent-research-system

One of the most detailed public engineering accounts of building a real production multi-agent system — Anthropic’s own Claude Research feature. Essential reading.

Architecture: Orchestrator-Worker

The Research system uses an orchestrator Claude (Opus 4) that plans a research strategy, then spawns parallel worker agents (Sonnet 4) to pursue different sub-questions simultaneously. Each worker has its own context window and exploration trajectory. Results are compressed back to the orchestrator for synthesis.
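The orchestrator-worker shape can be sketched with stubbed planner and worker calls standing in for the LLMs; the decomposition, compression length, and function names are all assumptions for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

# Orchestrator-worker sketch: plan sub-questions, explore them in parallel
# (each worker with its own "context"), compress results for synthesis.

def plan(query: str) -> list[str]:
    return [f"{query}: angle {i}" for i in range(3)]        # stub planner

def worker(subquestion: str) -> str:
    findings = f"long notes about {subquestion} " * 5        # stub research
    return findings[:40]                                     # compress for lead

def research(query: str) -> str:
    subqs = plan(query)
    with ThreadPoolExecutor(max_workers=len(subqs)) as pool:
        compressed = list(pool.map(worker, subqs))           # parallel workers
    return " | ".join(compressed)                            # synthesis input

report = research("quantum error correction")
```

The compression step is the crux: each worker burns a full context window exploring, but hands the orchestrator only its most important tokens.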

Why Multi-Agent? The Core Argument

“The essence of search is compression: distilling insights from a vast corpus. Subagents facilitate compression by operating in parallel with their own context windows, exploring different aspects of the question simultaneously before condensing the most important tokens for the lead research agent.”

Three reasons multi-agent excels for research:

  1. Breadth-first parallelism — multiple threads explored simultaneously
  2. Context window scaling — total context across agents far exceeds any single agent's window
  3. Path independence — separate explorations avoid getting stuck in the same dead-ends

Quantified Results

  • Multi-agent (Opus 4 orchestrator + Sonnet 4 workers) vs. single-agent Opus 4: +90.2% on an internal research eval
  • Token use, single agent vs. chat: ~4×
  • Token use, multi-agent vs. chat: ~15×
  • Share of BrowseComp performance variance explained by token count alone: 80%

The 80% figure is striking: raw token budget, not clever prompting or architecture, explains most of the performance on hard web research. Upgrading from Sonnet 3.7 to Sonnet 4 yields a larger gain than doubling the token budget on 3.7.

When Multi-Agent Works (and Doesn’t)

Best for:

  • Heavy parallelization — many independent sub-queries
  • Information needs that exceed a single context window
  • Interfacing with numerous complex tools

Poor fit:

  • Many cross-agent dependencies (agents need to share context frequently)
  • Real-time coordination between agents
  • Most coding tasks (fewer truly parallelizable sub-tasks than research)

Engineering Lessons

  • Tool design matters as much as prompting — poorly designed tools cause cascading failures
  • Evaluation is uniquely hard — open-ended research quality is difficult to measure automatically
  • Economic viability gate — multi-agent is only justified for high-value tasks (15× token cost)
  • Subagent separation of concerns — distinct prompts and tools per worker, not a monolith

Orchestration Patterns

The literature has converged on several reusable patterns:

Pattern 1: Planner → Executor

A planner agent generates a structured plan; executor agents carry it out. Clean separation of concerns, easy to monitor.

Pattern 2: Planner → Executor → Critic

Adds a critic/reviewer agent that evaluates outputs and sends feedback. Used in MetaGPT, ChatDev.
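A sketch of the planner → executor → critic loop, with stubs in place of LLM calls (all behavior here is contrived to demonstrate the feedback cycle):

```python
# Planner -> executor -> critic sketch: the critic gates each output and
# routes feedback back to the executor until the work passes review.

def planner(task):
    return ["write function", "add docstring"]       # stub plan

def executor(step, feedback=None):
    out = f"did: {step}"
    return out + " (revised)" if feedback else out   # stub execution

def critic(output):
    # Contrived rule: reject first drafts of the docstring step once.
    ok = "docstring" not in output or "(revised)" in output
    return ok, None if ok else "please revise"

def run(task, max_revisions=3):
    results = []
    for step in planner(task):
        feedback = None
        for _ in range(max_revisions):
            out = executor(step, feedback)
            ok, feedback = critic(out)
            if ok:
                break
        results.append(out)
    return results

results = run("implement helper")
```

Capping revisions (`max_revisions`) matters in practice: without it, a stubborn critic and executor can loop indefinitely.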

Pattern 3: Hub and Spoke

One orchestrator agent routes tasks to specialist agents. The hub maintains context; spokes are stateless workers. Used in HuggingGPT, many LangGraph workflows.

Pattern 4: Peer Debate

Multiple agents argue different positions and converge on a consensus. Improves factual accuracy. Used in debates for math, science questions.

Pattern 5: Hierarchical Teams

Teams of teams. Each sub-team has an internal coordinator; sub-teams report to a top-level orchestrator. Scales to complex, long-horizon tasks.


Theory of Mind & Social Capabilities

Effective multi-agent interaction requires agents to model each other’s intentions, beliefs, and goals — what cognitive science calls Theory of Mind (ToM). Plaat et al. (2025) dedicate a full section to this, covering:

  • Strategic behavior — game-theoretic reasoning, negotiation, cooperative/competitive dynamics between LLM agents
  • Theory of Mind benchmarks — tests of whether LLMs can correctly model what another agent knows vs. doesn’t know
  • Negotiation agents — conversational agents that can bargain, persuade, and reach agreements

This is a frequently underemphasized dimension of multi-agent systems: it’s not just about task decomposition and tool calls — it’s about whether agents can genuinely reason about each other.

Emergent Social Behavior in Agent Societies

Beyond structured multi-agent workflows, a distinct research thread studies what happens when many agents interact freely — emergent social behavior.

  • Generative Agents (Park et al., 2023) — the foundational demonstration: 25 agents in a simulated town develop social relationships, organize events, and exhibit norms without explicit programming
  • Emergent social norms — agents interacting in open-ended environments can develop conventions, role divisions, and behavioral norms that weren’t programmed in
  • Large-scale simulation — agent societies can run social science experiments at scales and speeds impossible with human participants
  • New training data — agent-agent interactions generate training data that can feed back into better base models (the virtuous cycle; see Taxonomy →)

Research implications: The study of emergent norms in agent societies has implications for AI alignment (can we engineer beneficial norms?), social science (are LLM-simulated societies valid models of human society?), and AI safety (what norms emerge by default, and are they safe?).

Model Context Protocol (MCP)

Anthropic · Announced November 2024

MCP is an open standard for how AI agents communicate with external tools and data sources. Rather than each agent/framework implementing custom tool integrations, MCP provides a universal interface: servers expose tools/resources; agents connect as clients.

  • Analogy: MCP is to agents what HTTP is to web browsers
  • Impact: By early 2025, hundreds of MCP servers existed (GitHub, Slack, databases, file systems)
  • Adoption: OpenAI, Google, and major framework builders endorsed MCP
  • Docs: modelcontextprotocol.io
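The wire format is JSON-RPC 2.0; `tools/list` and `tools/call` are the methods for discovering and invoking tools. A simplified sketch of the client-side messages follows (the `get_weather` tool and its arguments are hypothetical; see modelcontextprotocol.io for the full schema):

```python
import json

# MCP client messages (simplified): a client first lists a server's tools,
# then calls one by name with structured arguments.

list_request = {
    "jsonrpc": "2.0", "id": 1,
    "method": "tools/list",
}
call_request = {
    "jsonrpc": "2.0", "id": 2,
    "method": "tools/call",
    "params": {
        "name": "get_weather",              # hypothetical tool name
        "arguments": {"city": "Paris"},
    },
}

wire = json.dumps(call_request)             # what actually crosses the wire
decoded = json.loads(wire)
```

Because every server speaks this same shape, an agent that can emit `tools/call` messages can use any MCP server without bespoke integration code.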

Evaluation & Benchmarks

AgentBench (2023)

Liu et al. · arXiv:2308.03688

8-task benchmark spanning operating systems, databases, web browsers, card games, and more. Revealed a large gap between GPT-4 and open-source models on agent tasks.

WebArena (2023)

Zhou et al. · arXiv:2307.13854

Realistic web environment with 5 functional websites (e-commerce, Reddit, GitLab, CMS, map) plus Wikipedia as a reference resource. 812 long-horizon tasks. Baseline GPT-4 success: ~14%.

GAIA: A Benchmark for General AI Assistants (2023)

Mialon et al. · arXiv:2311.12983 · ICLR 2024

Tasks requiring multi-step reasoning, tool use, and common sense. Humans achieve 92%; GPT-4 with plugins achieved ~15% at release. Tests true general-purpose assistant ability.

OSWorld (2024)

Xie et al. · arXiv:2404.07972

Desktop GUI tasks across Windows/macOS/Linux. 369 real computer tasks. Baseline models achieved <10% at release; by late 2024 Claude reached 22%+, and specialized models pushed toward 72%.

ST-WebAgentBench (2024)

Safety and trustworthiness evaluation for web agents. Tests whether agents follow safety instructions and avoid harmful actions.



References

Papers

Multi-Agent Systems & Frameworks

  • CAMEL: Communicative Agents for “Mind” Exploration (Li et al., 2023) — arXiv:2303.17760
  • Generative Agents: Interactive Simulacra of Human Behavior (Park et al., 2023) — arXiv:2304.03442
  • MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework (Hong et al., 2023) — arXiv:2308.00352
  • ChatDev: Communicative Agents for Software Development (Qian et al., 2023) — arXiv:2307.07924
  • AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation (Wu et al., 2023) — arXiv:2308.08155
  • AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors (Chen et al., 2023) — arXiv:2308.10848
  • A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration (Liu et al., 2023) — arXiv:2310.02170 · COLM 2024
  • Solo Performance Prompting (Wang et al., 2023) — arXiv:2307.05300
  • AgentScope: A Flexible yet Robust Multi-Agent Platform (Alibaba, 2024) — arXiv:2402.14034
  • TapeAgents: Tape-Centric Agent Framework (Bahdanau et al., 2024) — arXiv:2412.08445

Evaluation & Benchmarks

  • AgentBench: Evaluating LLMs as Agents (Liu et al., 2023) — arXiv:2308.03688
  • WebArena: A Realistic Web Environment for Building Autonomous Agents (Zhou et al., 2023) — arXiv:2307.13854
  • GAIA: A Benchmark for General AI Assistants (Mialon et al., 2023) — arXiv:2311.12983
  • OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (Xie et al., 2024) — arXiv:2404.07972
  • ST-WebAgentBench: Towards Evaluating Safety and Trustworthiness for Autonomous Web Agents (2024) (source needed)

Theory of Mind & Social Behavior

  • Towards a Unified Taxonomy of Multi-Agent LLMs: Survey, Evaluation, and Future Research Directions (Plaat et al., 2025) (cite as reference for theory of mind section and emergent social behavior)

Blog Posts & Articles

Frameworks & Open-Source Projects

Standards & Specifications


Full chronology in the Timeline →. Continue to Memory, Tools & Actions →