Multi-Agent Systems & Frameworks

Multiple LLMs collaborating — and the tools to build them

Overview

Multi-agent systems represent a fundamental shift in AI architecture: instead of one LLM doing everything, specialized agents collaborate, each contributing expertise. This area exploded in 2023-2024, moving from academic curiosity to production frameworks used by thousands of developers.

Key themes:

  • Role specialization: Planner, executor, critic, coder, tester — each agent excels at one thing
  • Communication & coordination: How agents talk to each other, structure messages, and avoid confusion
  • Emergent behavior: What happens when agents interact — including both cooperation and failure modes
  • Framework maturation: From research prototypes (AutoGPT) to production systems (LangGraph, AutoGen)

Foundational Multi-Agent Papers

CAMEL: Communicative Agents for “Mind” Exploration (2023)

Li et al. · arXiv:2303.17760 · NeurIPS 2023

One of the first systematic studies of autonomous multi-agent cooperation. Introduces inception prompting to guide two agents (one playing a user, one an assistant) to collaboratively complete tasks without human intervention. Also a framework for generating conversational datasets.

  • Key ideas: Role-playing enables autonomous cooperation; inception prompting specifies agent personas; emergent task completion without human oversight
  • GitHub: camel-ai/camel
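The core loop is simple to sketch. Below is an illustrative, framework-agnostic version; the `llm` stub, the `<TASK_DONE>` marker, and the prompt wording are assumptions for demonstration, not CAMEL's actual implementation:

```python
# CAMEL-style role-play sketch: a "user" agent issues instructions and an
# "assistant" agent solves them, alternating turns until the task is done.

def llm(system_prompt: str, history: list[str]) -> str:
    # Stub: a real implementation would call a chat model here.
    turn = len(history)
    return "<TASK_DONE>" if turn >= 4 else f"step {turn + 1}"

def role_play(task: str, max_turns: int = 10) -> list[str]:
    # Inception prompts pin each agent to its persona and the shared task.
    user_sys = f"You are the user. Give one instruction at a time for: {task}"
    asst_sys = f"You are the assistant. Carry out instructions for: {task}"
    history: list[str] = []
    for _ in range(max_turns):
        instruction = llm(user_sys, history)
        history.append(instruction)
        if "<TASK_DONE>" in instruction:
            break
        history.append(llm(asst_sys, history))
    return history

transcript = role_play("write a sorting function")
```

The key point is that no human appears in the loop: one model plays the instructor, the other the solver, and a termination marker ends the exchange.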

Generative Agents: Interactive Simulacra of Human Behavior (2023)

Park et al. · arXiv:2304.03442 · UIST 2023 Best Paper

25 LLM-powered agents living in a simulated town (“Smallville”). Agents plan daily schedules, form relationships, remember experiences, and exhibit emergent social behaviors. Landmark demonstration of agent societies.

  • Key ideas: Hierarchical memory (observation → reflection → planning); agents spontaneously organize events (Valentine’s Day party emerged from initial prompt about one agent); believable human-like behavior
  • Influence: Every subsequent work on agent memory architecture cites this paper
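The observation → reflection → planning memory can be approximated with a scored record store. This is an assumed simplification: the paper's retrieval combines recency, importance, and embedding relevance, while this sketch drops the relevance term:

```python
# Generative-Agents-style memory stream (simplified): each record carries
# an importance score and a timestamp; retrieval ranks by recency + importance.

class MemoryStream:
    def __init__(self, decay: float = 0.995):
        self.records = []            # (timestamp, importance, text)
        self.decay = decay

    def observe(self, text: str, importance: float, t: float):
        self.records.append((t, importance, text))

    def retrieve(self, now: float, k: int = 3):
        def score(rec):
            t, importance, _ = rec
            recency = self.decay ** (now - t)   # exponential decay over time
            return recency + importance          # paper also adds relevance
        return [r[2] for r in sorted(self.records, key=score, reverse=True)[:k]]

mem = MemoryStream()
mem.observe("saw Isabella at the cafe", importance=0.3, t=0)
mem.observe("planned Valentine's Day party", importance=0.9, t=1)
mem.observe("ate breakfast", importance=0.1, t=2)
top = mem.retrieve(now=2, k=2)
```

High-importance memories survive even as they age, which is what lets reflections and plans keep resurfacing while routine observations fade.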

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework (2023)

Hong et al. · arXiv:2308.00352

Software development with LLM agents assigned roles mirroring a real software company: Product Manager, Architect, Engineer, QA. Agents share structured documents (PRD, architecture spec) rather than just conversations. Reduces hallucination and improves code quality.

  • Key ideas: Standard Operating Procedures (SOPs) constrain agent workflows; structured outputs (not just text); document-driven coordination; role-based execution pipeline
  • GitHub: geekan/MetaGPT — 45k+ stars

ChatDev: Communicative Agents for Software Development (2023)

Qian et al. · arXiv:2307.07924 · ACL 2024

Complete software development lifecycle with specialized roles: CEO, CTO, Programmer, Reviewer, Tester. Agents use both natural language and programming language to communicate. Produces functional software from text descriptions in minutes.

  • Key ideas: SOP-constrained agent behavior; mixed natural/programming language communication; role specialization across SDLC phases
  • GitHub: OpenBMB/ChatDev

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation (2023)

Wu et al. (Microsoft) · arXiv:2308.08155

Framework for building conversable agents that can interact with each other and with humans. Agents are configurable (roles, tools, human-input modes). Enables complex workflows carried out entirely between agents: code generation → execution → debugging → verification.

  • Key ideas: Conversable agent abstraction; human-in-the-loop anywhere; flexible conversation patterns (sequential, group chat, nested); code execution sandbox
  • GitHub: microsoft/autogen — 40k+ stars
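The conversable-agent abstraction reduces to a uniform reply interface that any participant, LLM-backed or code-executing, can implement. A plain-Python sketch follows; all class and function names are illustrative, not AutoGen's actual API:

```python
# Conversable-agent pattern sketch: agents expose generate_reply(), so an
# LLM-backed assistant and a code-executing proxy compose interchangeably.

class ConversableAgent:
    def __init__(self, name, reply_fn):
        self.name = name
        self.reply_fn = reply_fn

    def generate_reply(self, message: str) -> str:
        return self.reply_fn(message)

def run_chat(a, b, opening: str, max_rounds: int = 4):
    transcript, msg = [], opening
    for _, receiver in [(a, b), (b, a)] * max_rounds:
        msg = receiver.generate_reply(msg)
        transcript.append((receiver.name, msg))
        if "TERMINATE" in msg:
            break
    return transcript

assistant = ConversableAgent("assistant", lambda m: "print(2 + 2)")  # stub LLM
executor = ConversableAgent(
    "executor",
    lambda m: "output: 4\nTERMINATE" if "print" in m else "no code found")

log = run_chat(executor, assistant, "please compute 2 + 2")
```

In AutoGen proper, the executor side runs generated code in a sandbox and feeds results back, closing the generate → execute → debug loop without a human.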

AgentVerse: Facilitating Multi-Agent Collaboration (2023)

Chen et al. · arXiv:2308.10848

Platform for orchestrating agents in diverse scenarios and studying emergent behaviors — both positive (division of labor, error correction) and negative (groupthink, cascade failures). Among the early systematic studies of cooperative and failure modes in multi-agent systems.

DyLAN: Dynamic LLM-Powered Agent Network (2023)

Liu et al. · arXiv:2310.02170

Dynamic agent selection for each reasoning step — only the agents most relevant to the current sub-task are activated. More efficient than static all-agent participation.

  • Key ideas: Dynamic agent recruitment; task-adaptive networks; reduces cost vs. full-team approaches
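Dynamic recruitment can be sketched as scoring each agent against the current sub-task and activating only the best matches. The keyword-overlap scoring below is a stand-in for DyLAN's learned selection; the agent roster and names are illustrative:

```python
# Dynamic agent recruitment sketch: activate only the agents whose
# expertise best matches the current sub-task, not the whole team.

AGENTS = {
    "coder": {"code", "debug", "implement"},
    "mathematician": {"prove", "integral", "algebra"},
    "critic": {"review", "verify", "check"},
}

def recruit(subtask: str, top_k: int = 1) -> list[str]:
    words = set(subtask.lower().split())
    scored = [(len(words & keywords), name) for name, keywords in AGENTS.items()]
    scored.sort(reverse=True)                      # highest overlap first
    return [name for score, name in scored[:top_k] if score > 0]

active = recruit("implement and debug the parser")
```

Each step pays only for the recruited agents' inference, which is the source of the cost savings over full-team participation.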

Solo Performance Prompting (SPP) (2023)

Wang et al. · arXiv:2307.05300

Single LLM plays multiple expert personas sequentially — extracting multi-agent benefits from one model. Improves factuality and reasoning without requiring multiple model instances.
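A hedged sketch of the SPP control flow, with a stubbed model call in place of a real LLM (persona names, prompt wording, and canned answers are illustrative):

```python
# Solo Performance Prompting sketch: one model is prompted as several
# expert personas in turn, then asked to synthesize its own notes.

def llm(prompt: str) -> str:
    # Stub standing in for a single chat model; returns canned text.
    if prompt.startswith("Synthesize"):
        return "final answer: 1648"
    if "Historian" in prompt:
        return "the treaty was signed in 1648"
    return "confirmed: 1648 (Peace of Westphalia)"

def spp(question: str, personas: list[str]) -> str:
    notes = []
    for persona in personas:        # same model, a different persona per turn
        notes.append(f"{persona}: " + llm(f"As a {persona}, answer: {question}"))
    return llm("Synthesize these notes:\n" + "\n".join(notes))

answer = spp("When was the Peace of Westphalia signed?",
             ["Historian", "Fact-checker"])
```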

AgentScope: A Flexible yet Robust Multi-Agent Platform (2024)

Alibaba · arXiv:2402.14034

Production-oriented multi-agent platform with fault tolerance, distributed execution, and operator/developer separation. Built for real-world scale.

TapeAgents: Tape-Centric Agent Framework (2024)

Bahdanau, Gontier, Huang et al. (ServiceNow) · arXiv:2412.08445

Agent framework built around a structured log tape — the tape is simultaneously the session log, resumable state, and development artifact. Agents append thought/action steps to the tape; the environment appends observations back. This tape-centric design supports the full agent lifecycle: development (debugging, auditing), post-deployment (evaluation, fine-tuning), and cross-agent knowledge transfer (adapt tapes from other agents).

Has echoes of classical blackboard systems in AI — shared structured state that multiple agents read and write. The tape makes agent behavior transparent, reproducible, and reusable in ways that ad-hoc logging doesn’t.
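A minimal sketch of the tape idea (a simplification, not TapeAgents' actual schema): an append-only list of typed steps that serves as log, state, and artifact at once:

```python
import json

# Tape-centric sketch: agents append thought/action steps; the environment
# appends observations. Plain data, so the tape serializes and resumes freely.

def append_step(tape: list, author: str, kind: str, content: str) -> list:
    tape.append({"index": len(tape), "author": author,
                 "kind": kind, "content": content})
    return tape

tape: list = []
append_step(tape, "agent", "thought", "I should search for the answer")
append_step(tape, "agent", "action", "search('capital of France')")
append_step(tape, "environment", "observation", "Paris")
append_step(tape, "agent", "answer", "Paris")

# Serialize for auditing, evaluation, fine-tuning data, or resumption.
serialized = json.dumps(tape)
resumed = json.loads(serialized)
```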


Open-Source Frameworks

These are the tools practitioners actually build with:

LangChain

github.com/langchain-ai/langchain — 95k+ stars (as of early 2025)

The most widely used LLM application framework. Provides chains, agents, tools, and memory as composable building blocks. The Agents module supports ReAct, structured output, and tool-calling agents.

  • Philosophy: Composability — mix and match chains, tools, memory
  • Best for: Rapid prototyping, broad ecosystem of integrations (100+ tools)
  • Limitation: Can be opaque; complex chains are hard to debug

LangGraph

github.com/langchain-ai/langgraph — 12k+ stars (as of early 2025)

Graph-based framework for stateful, multi-actor workflows. Nodes are agents/functions; edges are transitions. Supports cycles (loops), branching, and human-in-the-loop checkpoints. Production-ready with persistent checkpoints.

  • Philosophy: Explicit state machines — you control the flow
  • Best for: Complex multi-agent workflows, production deployments, long-running agents
  • Key feature: Persistent checkpoints enable fault recovery
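The explicit state-machine idea can be sketched in plain Python (illustrative only, not LangGraph's actual API): nodes are functions over shared state, and a router picks the next edge, including cycles back to earlier nodes:

```python
# Explicit state-machine sketch: a draft node and a review node, with a
# cycle back to drafting whenever the review rejects the output.

def draft(state):
    state["attempts"] += 1
    state["text"] = f"draft v{state['attempts']}"
    return state

def review(state):
    state["approved"] = state["attempts"] >= 2   # contrived: reject v1 once
    return state

NODES = {"draft": draft, "review": review}

def route(node, state):
    if node == "draft":
        return "review"
    return "END" if state["approved"] else "draft"   # loop on rejection

def run(entry, state):
    node = entry
    while node != "END":
        state = NODES[node](state)   # a checkpoint could be persisted here
        node = route(node, state)
    return state

final = run("draft", {"attempts": 0})
```

Persisting `state` after each node is exactly where LangGraph's checkpointing hooks in, which is what makes long-running and fault-recoverable agents possible.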

AutoGen (Microsoft)

github.com/microsoft/autogen — 40k+ stars (as of early 2025)

Conversable agent framework for multi-agent conversations. AutoGen v0.4 (2024) is a complete rewrite with async-first, event-driven architecture.

  • Philosophy: Agents as participants in a conversation
  • Best for: Code generation + execution workflows, research, human-in-the-loop systems
  • Notable: AutoGen Studio provides a no-code GUI for building agent workflows

CrewAI

github.com/crewAIInc/crewAI — 40k+ stars (as of early 2025)

Role-based agent framework with a high-level “crew” abstraction. Define agents with roles, goals, and backstories; assign tasks; the crew coordinates to complete the objective.

  • Philosophy: Agents as a team with defined roles
  • Best for: Business process automation, structured workflows with clear role boundaries
  • Simplicity: Lower learning curve than LangGraph for basic use cases

DSPy (Stanford)

github.com/stanfordnlp/dspy — 22k+ stars (as of early 2025)

Programming language models via declarative modules and automatic optimization. Instead of hand-crafting prompts, you define a program and DSPy optimizes the prompts automatically.

  • Philosophy: LLM programs, not prompts; optimize systematically
  • Best for: Applications where you want systematic prompt optimization, research

Semantic Kernel (Microsoft)

github.com/microsoft/semantic-kernel — 23k+ stars (as of early 2025)

Enterprise-focused SDK for AI-powered applications. Strong .NET/C# support alongside Python. Integrates with Azure AI, OpenAI, and Hugging Face.

  • Philosophy: Plugins and planners; enterprise integration
  • Best for: Enterprise deployments, .NET applications

Haystack (deepset)

github.com/deepset-ai/haystack — 18k+ stars (as of early 2025)

Pipeline-based framework for NLP applications with strong RAG and document processing support. v2.0 (2024) redesigned as composable components.

Agno

agno-agi · github.com/agno-agi/agno

Full-stack platform for agentic software: framework + production runtime (FastAPI-based, session-scoped) + control plane (AgentOS UI). Builds stateful agents with memory, 100+ tool integrations, guardrails, and MCP support in ~20 lines. Model-agnostic. See Community Agents → for full coverage.

OpenAI Swarm (2024)

github.com/openai/swarm

Experimental (educational) framework from OpenAI for multi-agent orchestration. Lightweight and minimal. Demonstrates handoffs — one agent transferring control to another based on context.

  • Philosophy: Minimalism; focus on handoffs and routines
  • Status: Experimental/educational, not production-intended

Production Case Study: Anthropic’s Multi-Agent Research System (2025)

Anthropic Engineering · anthropic.com/engineering/multi-agent-research-system

One of the most detailed public engineering accounts of building a real production multi-agent system — Anthropic’s own Claude Research feature. Essential reading.

Architecture: Orchestrator-Worker

The Research system uses an orchestrator Claude (Opus 4) that plans a research strategy, then spawns parallel worker agents (Sonnet 4) to pursue different sub-questions simultaneously. Each worker has its own context window and exploration trajectory. Results are compressed back to the orchestrator for synthesis.
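The orchestrator-worker shape can be sketched with stubbed planner and worker calls standing in for the LLMs; the decomposition, compression length, and function names are all assumptions for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

# Orchestrator-worker sketch: plan sub-questions, explore them in parallel
# (each worker with its own "context"), compress results for synthesis.

def plan(query: str) -> list[str]:
    return [f"{query}: angle {i}" for i in range(3)]        # stub planner

def worker(subquestion: str) -> str:
    findings = f"long notes about {subquestion} " * 5        # stub research
    return findings[:40]                                     # compress for lead

def research(query: str) -> str:
    subqs = plan(query)
    with ThreadPoolExecutor(max_workers=len(subqs)) as pool:
        compressed = list(pool.map(worker, subqs))           # parallel workers
    return " | ".join(compressed)                            # synthesis input

report = research("quantum error correction")
```

The compression step is the crux: each worker burns a full context window exploring, but hands the orchestrator only its most important tokens.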

Why Multi-Agent? The Core Argument

“The essence of search is compression: distilling insights from a vast corpus. Subagents facilitate compression by operating in parallel with their own context windows, exploring different aspects of the question simultaneously before condensing the most important tokens for the lead research agent.”

Three reasons multi-agent excels for research:

  1. Breadth-first parallelism — multiple threads explored simultaneously
  2. Context window scaling — total context across agents far exceeds any single agent's window
  3. Path independence — separate explorations avoid getting stuck in the same dead-ends

Quantified Results

  • Multi-agent (Opus 4 orchestrator + Sonnet 4 workers) vs. single-agent Opus 4: +90.2% on an internal research eval
  • Token use, single agent vs. chat: ~4×
  • Token use, multi-agent vs. chat: ~15×
  • Share of BrowseComp performance variance explained by token count alone: 80%

The 80% figure is striking: raw token budget, not clever prompting or architecture, explains most of the performance on hard web research. Upgrading from Sonnet 3.7 to Sonnet 4 yields a larger gain than doubling the token budget on 3.7.

When Multi-Agent Works (and Doesn’t)

Best for:

  • Heavy parallelization — many independent sub-queries
  • Information needs that exceed a single context window
  • Interfacing with numerous complex tools

Poor fit:

  • Many cross-agent dependencies (agents need to share context frequently)
  • Real-time coordination between agents
  • Most coding tasks (fewer truly parallelizable sub-tasks than research)

Engineering Lessons

  • Tool design matters as much as prompting — poorly designed tools cause cascading failures
  • Evaluation is uniquely hard — open-ended research quality is difficult to measure automatically
  • Economic viability gate — multi-agent is only justified for high-value tasks (15× token cost)
  • Subagent separation of concerns — distinct prompts and tools per worker, not a monolith

Orchestration Patterns

The literature has converged on several reusable patterns:

Pattern 1: Planner → Executor

A planner agent generates a structured plan; executor agents carry it out. Clean separation of concerns, easy to monitor.

Pattern 2: Planner → Executor → Critic

Adds a critic/reviewer agent that evaluates outputs and sends feedback. Used in MetaGPT, ChatDev.
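A sketch of the planner → executor → critic loop, with stubs in place of LLM calls (all behavior here is contrived to demonstrate the feedback cycle):

```python
# Planner -> executor -> critic sketch: the critic gates each output and
# routes feedback back to the executor until the work passes review.

def planner(task):
    return ["write function", "add docstring"]       # stub plan

def executor(step, feedback=None):
    out = f"did: {step}"
    return out + " (revised)" if feedback else out   # stub execution

def critic(output):
    # Contrived rule: reject first drafts of the docstring step once.
    ok = "docstring" not in output or "(revised)" in output
    return ok, None if ok else "please revise"

def run(task, max_revisions=3):
    results = []
    for step in planner(task):
        feedback = None
        for _ in range(max_revisions):
            out = executor(step, feedback)
            ok, feedback = critic(out)
            if ok:
                break
        results.append(out)
    return results

results = run("implement helper")
```

Capping revisions (`max_revisions`) matters in practice: without it, a stubborn critic and executor can loop indefinitely.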

Pattern 3: Hub and Spoke

One orchestrator agent routes tasks to specialist agents. The hub maintains context; spokes are stateless workers. Used in HuggingGPT, many LangGraph workflows.

Pattern 4: Peer Debate

Multiple agents argue different positions and converge on a consensus. Improves factual accuracy. Used in debates for math, science questions.

Pattern 5: Hierarchical Teams

Teams of teams. Each sub-team has an internal coordinator; sub-teams report to a top-level orchestrator. Scales to complex, long-horizon tasks.


Theory of Mind & Social Capabilities

Effective multi-agent interaction requires agents to model each other’s intentions, beliefs, and goals — what cognitive science calls Theory of Mind (ToM). Plaat et al. (2025) dedicate a full section to this, covering:

  • Strategic behavior — game-theoretic reasoning, negotiation, cooperative/competitive dynamics between LLM agents
  • Theory of Mind benchmarks — tests of whether LLMs can correctly model what another agent knows vs. doesn’t know
  • Negotiation agents — conversational agents that can bargain, persuade, and reach agreements

This is a frequently underemphasized dimension of multi-agent systems: it’s not just about task decomposition and tool calls — it’s about whether agents can genuinely reason about each other.

Emergent Social Behavior in Agent Societies

Beyond structured multi-agent workflows, a distinct research thread studies what happens when many agents interact freely — emergent social behavior.

  • Generative Agents (Park et al., 2023) — the foundational demonstration: 25 agents in a simulated town develop social relationships, organize events, and exhibit norms without explicit programming
  • Emergent social norms — agents interacting in open-ended environments can develop conventions, role divisions, and behavioral norms that weren’t programmed in
  • Large-scale simulation — agent societies can run social science experiments at scales and speeds impossible with human participants
  • New training data — agent-agent interactions generate training data that can feed back into better base models (the virtuous cycle; see Taxonomy →)

Research implications: The study of emergent norms in agent societies has implications for AI alignment (can we engineer beneficial norms?), social science (are LLM-simulated societies valid models of human society?), and AI safety (what norms emerge by default, and are they safe?).

Model Context Protocol (MCP)

Anthropic · Announced November 2024

MCP is an open standard for how AI agents communicate with external tools and data sources. Rather than each agent/framework implementing custom tool integrations, MCP provides a universal interface: servers expose tools/resources; agents connect as clients.

  • Analogy: MCP is to agents what HTTP is to web browsers
  • Impact: By early 2025, hundreds of MCP servers existed (GitHub, Slack, databases, file systems)
  • Adoption: OpenAI, Google, and major framework builders endorsed MCP
  • Docs: modelcontextprotocol.io
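The wire format is JSON-RPC 2.0; `tools/list` and `tools/call` are the methods for discovering and invoking tools. A simplified sketch of the client-side messages follows (the `get_weather` tool and its arguments are hypothetical; see modelcontextprotocol.io for the full schema):

```python
import json

# MCP client messages (simplified): a client first lists a server's tools,
# then calls one by name with structured arguments.

list_request = {
    "jsonrpc": "2.0", "id": 1,
    "method": "tools/list",
}
call_request = {
    "jsonrpc": "2.0", "id": 2,
    "method": "tools/call",
    "params": {
        "name": "get_weather",              # hypothetical tool name
        "arguments": {"city": "Paris"},
    },
}

wire = json.dumps(call_request)             # what actually crosses the wire
decoded = json.loads(wire)
```

Because every server speaks this same shape, an agent that can emit `tools/call` messages can use any MCP server without bespoke integration code.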

Evaluation & Benchmarks

AgentBench (2023)

Liu et al. · arXiv:2308.03688

8-task benchmark spanning operating systems, databases, web browsers, card games, and more. Revealed a large gap between GPT-4 and open-source models on agent tasks.

WebArena (2023)

Zhou et al. · arXiv:2307.13854

Realistic web environment with 5 functional websites (e-commerce, Reddit, GitLab, CMS, map) plus Wikipedia as a reference resource. 812 long-horizon tasks. Baseline GPT-4 success: ~14%.

GAIA: A Benchmark for General AI Assistants (2023)

Mialon et al. · arXiv:2311.12983 · ICLR 2024

Tasks requiring multi-step reasoning, tool use, and common sense. Humans achieve 92%; GPT-4 with plugins achieved ~15% at release. Tests true general-purpose assistant ability.

OSWorld (2024)

Xie et al. · arXiv:2404.07972

Desktop GUI tasks across Windows/macOS/Linux. 369 real computer tasks. Baseline models achieved <10% at release; by late 2024 Claude reached 22%+, and specialized models pushed toward 72%.

ST-WebAgentBench (2024)

Safety and trustworthiness evaluation for web agents. Tests whether agents follow safety instructions and avoid harmful actions.



References

Papers

Multi-Agent Systems & Frameworks

  • CAMEL: Communicative Agents for “Mind” Exploration (Li et al., 2023) — arXiv:2303.17760
  • Generative Agents: Interactive Simulacra of Human Behavior (Park et al., 2023) — arXiv:2304.03442
  • MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework (Hong et al., 2023) — arXiv:2308.00352
  • ChatDev: Communicative Agents for Software Development (Qian et al., 2023) — arXiv:2307.07924
  • AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation (Wu et al., 2023) — arXiv:2308.08155
  • AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors (Chen et al., 2023) — arXiv:2308.10848
  • A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration (Liu et al., 2023) — arXiv:2310.02170 · COLM 2024
  • Solo Performance Prompting (Wang et al., 2023) — arXiv:2307.05300
  • AgentScope: A Flexible yet Robust Multi-Agent Platform (Alibaba, 2024) — arXiv:2402.14034
  • TapeAgents: Tape-Centric Agent Framework (Bahdanau et al., 2024) — arXiv:2412.08445

Evaluation & Benchmarks

  • AgentBench: Evaluating LLMs as Agents (Liu et al., 2023) — arXiv:2308.03688
  • WebArena: A Realistic Web Environment for Building Autonomous Agents (Zhou et al., 2023) — arXiv:2307.13854
  • GAIA: A Benchmark for General AI Assistants (Mialon et al., 2023) — arXiv:2311.12983
  • OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (Xie et al., 2024) — arXiv:2404.07972
  • ST-WebAgentBench: Towards Evaluating Safety and Trustworthiness for Autonomous Web Agents (2024) (source needed)

Theory of Mind & Social Behavior

  • Towards a Unified Taxonomy of Multi-Agent LLMs: Survey, Evaluation, and Future Research Directions (Plaat et al., 2025) (cite as reference for theory of mind section and emergent social behavior)

Blog Posts & Articles

Frameworks & Open-Source Projects

Standards & Specifications


Full chronology in the Timeline →. Continue to Memory, Tools & Actions →