Agent Composition Patterns

Architectural patterns for combining agents — orchestration, pipelines, and multi-agent topologies

Overview

A single LLM call, however capable, has limits. It can’t parallelize reasoning across many dimensions simultaneously, maintain persistent memory across weeks, or specialize in twenty domains at once. Agent composition is the practice of combining multiple agents — each with its own role, context, and capabilities — into systems that exceed what any one agent could do alone.

This page focuses on architectural topology: the structural patterns for how agents connect, delegate, and communicate. It is deliberately distinct from the Multi-Agent Systems → page, which covers coordination protocols, communication standards, and emergent collective behavior. Here, the question is simpler and more engineering-oriented: what shape should a multi-agent system take?

The analogy to software engineering is instructive. The Gang of Four’s Design Patterns (1994) gave software developers a shared vocabulary — Factory, Observer, Decorator — for recurring structural problems. Agent systems are now reaching a similar inflection point: the same architectural problems keep reappearing (task decomposition, parallel execution, shared state, event-driven triggering), and a pattern vocabulary is emerging. Unlike GoF patterns, however, agent patterns come with probabilistic semantics: outputs are not deterministic, errors propagate non-trivially, and “correctness” is harder to define.

Anthropic’s widely-cited guide Building Effective Agents (2024) distills experience from working with dozens of production agent teams. Its central thesis: the most successful implementations use simple, composable patterns rather than heavyweight frameworks. That principle animates this page.


Core Topologies

1. Orchestrator–Worker (Hub-and-Spoke)

Structure: A central orchestrator agent receives a task, decomposes it dynamically, and dispatches subtasks to specialized worker agents. Workers return results; the orchestrator synthesizes and decides next steps.

         ┌──────────────┐
         │ Orchestrator │
         └──────┬───────┘
        ┌───────┼───────┐
        ▼       ▼       ▼
   [Worker A] [Worker B] [Worker C]
   (search)   (code)     (write)

This is the dominant pattern in production agentic systems. Anthropic’s guide describes it as the go-to topology “for complex tasks where you can’t predict the subtasks needed” — the orchestrator only decides what to delegate once it sees the full problem.

Key examples:

  • MetaGPT (Hong et al., 2023) — Encodes human software-development workflows (Standardized Operating Procedures) into a structured pipeline where a Product Manager agent decomposes requirements, an Architect designs the system, Engineers implement code, and QA agents verify output. Notably, MetaGPT uses an “assembly line paradigm” with structured shared artifacts (design docs, API specs) rather than purely conversational handoffs, reducing cascading hallucinations compared to chat-based approaches.

  • AutoGen (Wu et al., 2023) — Microsoft’s open-source framework explicitly supports a “GroupChat with Manager” pattern where a manager agent routes conversational turns among specialized agents. Both natural language and code can define the conversation protocol.

  • LangGraph Supervisor pattern — LangGraph’s StateGraph allows a supervisor node to conditionally route to worker subgraphs based on accumulated state, with explicit human-in-the-loop checkpoints.

  • CrewAI — A role-based orchestration framework where a “Manager” agent assigns tasks to crew members with defined roles, goals, and backstories. Supports both sequential and hierarchical task processing.

When to use: Tasks with clear decomposition into heterogeneous subtasks; specialized sub-agents that must operate with domain-specific context; work that benefits from parallelism across independent dimensions.

Failure modes:

  • Orchestrator bottleneck: All context flows through one agent, creating a single point of failure and a potential context-window limit.
  • Hallucination propagation: If a worker produces a flawed intermediate result and the orchestrator doesn’t validate it, errors compound downstream; MetaGPT directly addresses this by requiring structured intermediate artifacts.
  • Coordination overhead: Each dispatch incurs latency and cost. Hierarchies that are too deep can become slower than a single agent.
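As a minimal sketch of the dispatch-and-synthesize shape (with plain functions standing in for LLM-backed workers, and a hard-coded plan where a real orchestrator would decompose the task at runtime):

```python
# Orchestrator-worker sketch. Worker names and the decomposition are
# hypothetical; a real system would replace these functions with model calls.
from typing import Callable

def search_worker(subtask: str) -> str:
    return f"search results for {subtask!r}"

def code_worker(subtask: str) -> str:
    return f"code for {subtask!r}"

WORKERS: dict[str, Callable[[str], str]] = {
    "search": search_worker,
    "code": code_worker,
}

def orchestrate(task: str) -> str:
    # A real orchestrator LLM decides this plan dynamically once it sees
    # the full problem; we fix it here to show the structure.
    plan = [("search", f"background for {task}"),
            ("code", f"prototype of {task}")]
    results = [WORKERS[role](subtask) for role, subtask in plan]
    return "\n".join(results)  # synthesis step (normally another LLM call)
```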


2. Sequential Pipeline

Structure: Agents arranged in a fixed chain, each processing the output of the previous: A → B → C → result.

[Agent A] ──► [Agent B] ──► [Agent C] ──► Output
(extract)     (transform)    (format)

The sequential pipeline is the simplest multi-agent topology and maps directly to prompt chaining — a core workflow described in Anthropic’s building guide. Each step is small, verifiable, and specialized.

Key examples:

  • Document processing pipelines: OCR agent → extraction agent → summarization agent → quality-check agent.
  • Code review chains: static analysis agent → security review agent → style check agent → synthesis agent.
  • Data transformation workflows in ML pipelines, where each agent handles a distinct data-cleaning or enrichment stage.
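Structurally, a pipeline is just function composition — each stage is a hypothetical single-purpose agent, which is what makes every step individually verifiable:

```python
from functools import reduce

# Sequential pipeline sketch: stages stand in for specialized agents.
def extract(doc: str) -> str:
    return doc.strip()

def transform(text: str) -> str:
    return text.upper()

def format_output(text: str) -> str:
    return f"[{text}]"

STAGES = [extract, transform, format_output]

def run_pipeline(doc: str) -> str:
    # Thread the document through each stage in fixed order: A -> B -> C.
    return reduce(lambda acc, stage: stage(acc), STAGES, doc)

run_pipeline("  hello world ")  # -> "[HELLO WORLD]"
```

Because the flow is fixed, any stage can be swapped or unit-tested in isolation without touching the rest of the chain.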

Tradeoffs vs. orchestrator pattern:

Dimension        Sequential Pipeline             Orchestrator–Worker
Predictability   High (fixed flow)               Lower (dynamic)
Latency          Additive (serial)               Potentially parallel
Flexibility      Low                             High
Debugging        Easier (inspect each step)      Harder
Cost             Proportional to chain length    Variable

When to use: Deterministic, well-understood workflows where each transformation stage is well-defined and order matters. The key signal: if you can draw the flowchart before the task starts, a pipeline is appropriate. If the task structure is only knowable at runtime, use an orchestrator.


3. Parallel Fan-Out / Fan-In

Structure: A dispatcher agent fans out the same (or related) work to multiple agents in parallel; a reducer agent aggregates the results.

              ┌─► [Agent A] ─┐
[Dispatcher] ─┼─► [Agent B] ─┼─► [Aggregator] ──► Output
              └─► [Agent C] ─┘

This is the agent analog of MapReduce: parallelize when work can be divided along independent dimensions, then consolidate.

Key examples:

  • Improving Factuality via Multiagent Debate (Du et al., 2023) — Multiple LLM instances independently generate answers and reasoning, then debate over multiple rounds to converge on a final answer. The paper reports significant improvements in mathematical and strategic reasoning and reduced hallucinations, using a pure fan-out/fan-in topology applied iteratively.

  • Mixture-of-Agents (MoA) (Wang et al., 2024) — Constructs a layered fan-out architecture where each layer contains multiple LLM agents. Agents in each layer take all outputs from the previous layer as auxiliary context. MoA using only open-source LLMs achieved a score of 65.1% on AlpacaEval 2.0, surpassing GPT-4 Omni’s 57.5% at the time of publication.

  • Multi-perspective research: STORM (Shao et al., 2024) uses multiple simulated “writer” agents with distinct perspectives to independently question a topic expert, then aggregates findings into a Wikipedia-like outline. Compared to a single-agent outline-driven baseline, STORM-generated articles were deemed more organized (+25% absolute) and broader in coverage (+10%) by human evaluators.

  • LATS (Language Agent Tree Search) (Zhou et al., 2023) — Integrates Monte Carlo Tree Search with LLM agents, running multiple parallel rollouts (branches) of the reasoning tree simultaneously. Fan-out here is over possible reasoning trajectories rather than problem dimensions. LATS achieves 92.7% pass@1 on HumanEval with GPT-4.

When to use: Tasks that benefit from diverse perspectives; independent evaluations that catch errors single-agent chains miss; situations where variance in outputs is a problem and ensemble averaging helps. Also valuable for speculative execution (see Emerging Patterns below).

Cost warning: Parallelism multiplies API calls. A depth-3 MoA with 5 agents per layer costs ~16× the API calls of a single agent for the same task (3 layers × 5 agents + 1 final aggregator = 16 calls).
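The layered fan-out and its call count can be sketched with asyncio (synchronous stub agents with a call counter in place of real model calls; the layer/width numbers mirror the arithmetic above):

```python
import asyncio

CALLS = 0

async def agent(name: str, prompt: str) -> str:
    # Stand-in for an async LLM call; the counter tracks total API calls.
    global CALLS
    CALLS += 1
    await asyncio.sleep(0)
    return f"{name}: view on {prompt!r}"

async def fan_out_in(prompt: str, layers: int = 3, width: int = 5) -> str:
    context = prompt
    for layer in range(layers):
        # Fan-out: run the whole layer in parallel.
        outs = await asyncio.gather(
            *(agent(f"L{layer}-a{i}", context) for i in range(width)))
        context = " | ".join(outs)           # previous layer feeds the next
    return await agent("aggregator", context)  # final fan-in

asyncio.run(fan_out_in("q"))
# CALLS is now 16: 3 layers x 5 agents + 1 aggregator
```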


4. Blackboard Architecture

Structure: A shared, mutable blackboard (data store) that all agents read from and write to. Agents are triggered when the blackboard state matches their competency, update it with their contributions, and yield to others.

     ┌──────────────────────────────┐
     │         BLACKBOARD           │
     │  {task, partial_results,     │
     │   current_plan, artifacts}   │
     └──┬──────┬──────┬─────────────┘
        │      │      │
   [Agent A] [Agent B] [Agent C]
   (reads &  (reads &  (reads &
    writes)   writes)   writes)

The blackboard pattern originates in classical AI (the HEARSAY-II speech recognition system, ca. 1980), and maps naturally to LLM agent systems via shared state. LangGraph’s StateGraph is essentially a typed blackboard: nodes read from and write to a shared State object, and edges conditionally route based on its content.

Differences from pipeline: Pipelines are push-based (output of A goes to B). Blackboards are pull-based: agents consult shared state and contribute when ready. This makes them better suited to heterogeneous work where the order of contributions is not predetermined.

When to use: Complex, long-horizon tasks where multiple agents contribute asynchronously; tasks requiring shared context that grows over time (research projects, multi-session workflows); when agents have heterogeneous specializations that are hard to order in a fixed pipeline.

Challenges: Shared mutable state introduces write-conflict risks. Schema evolution (the blackboard’s structure changing mid-task) is tricky. Debugging requires inspecting state at each transition.
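A minimal blackboard loop can be sketched as agents paired with trigger predicates over shared state — an agent fires only when the state matches its competency (the roles and state keys here are illustrative):

```python
# Blackboard sketch: each agent is a (trigger, action) pair over shared state.
def drafter(bb: dict) -> None:
    bb["draft"] = f"draft covering {bb['task']}"

def reviewer(bb: dict) -> None:
    bb["reviewed"] = True

AGENTS = [
    (lambda bb: bb.get("draft") is None, drafter),
    (lambda bb: bb.get("draft") is not None and not bb.get("reviewed"), reviewer),
]

def run_blackboard(bb: dict, max_rounds: int = 10) -> dict:
    for _ in range(max_rounds):
        for trigger, act in AGENTS:
            if trigger(bb):
                act(bb)            # one contribution, then re-scan the board
                break
        else:
            break                  # quiescent: no agent can contribute
    return bb

run_blackboard({"task": "survey agent patterns"})
```

Note that contribution order emerges from the state, not from a fixed wiring — the pull-based property described above.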


5. Event-Driven / Reactive

Structure: Agents are not called explicitly — they subscribe to event streams and trigger when matching events arrive.

[Email arrives] ──► [Email agent]
[Calendar event] ──► [Schedule agent]  ──► [Synthesis agent] ──► Action
[Sensor reading] ──► [Monitoring agent]

Event-driven architectures use message queues (Kafka, RabbitMQ, AWS SQS) or pub-sub systems to decouple producers from consumers. This pattern underlies most ambient or background agents — systems that operate continuously, waiting for the world to change rather than processing a single query.

Key examples:

  • An email-watching agent that triggers a calendar scheduling agent when meeting requests arrive.
  • Monitoring agents that observe system metrics and spawn remediation agents when thresholds are crossed.
  • Customer service pipelines where incoming tickets trigger triage, routing, and response agents.

Connection to ambient agents: The event-driven pattern is the natural habitat for agents that operate autonomously over extended periods — a major research focus in long-horizon task completion. The challenge is state management: agents need durable state that survives between events, which is where platforms like Temporal become relevant (see Frameworks and Implementation below).

When to use: Background automation; integrations with external systems that push events; when agents need to react to the real world continuously rather than in discrete sessions.
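The decoupling can be sketched with an in-process pub-sub bus (event names and handlers are hypothetical; production systems would back this with Kafka, RabbitMQ, or SQS):

```python
from collections import defaultdict

# Pub-sub sketch: agents never call each other directly; they subscribe to
# event types and the bus dispatches matching events to them.
_subscribers = defaultdict(list)

def subscribe(event_type: str, handler) -> None:
    _subscribers[event_type].append(handler)

def publish(event_type: str, payload: dict) -> list:
    # Fan the event out to every subscribed agent; collect their reactions.
    return [handler(payload) for handler in _subscribers[event_type]]

subscribe("email.received",
          lambda e: f"scheduling agent: propose slot for {e['subject']!r}")
subscribe("metric.threshold",
          lambda e: f"remediation agent: investigate {e['metric']}")

publish("email.received", {"subject": "quarterly sync"})
```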


6. Hierarchical Decomposition

Structure: Agents recursively spawn sub-agents to handle sub-problems, creating a tree of delegations.

[Root Agent]
├── [Sub-Agent A]
│   ├── [Sub-Sub-Agent A1]
│   └── [Sub-Sub-Agent A2]
└── [Sub-Agent B]
    └── [Sub-Sub-Agent B1]

This pattern enables recursive problem-solving: an agent decomposes its goal into sub-goals, spawns agents to solve them, and synthesizes results — with each sub-agent potentially doing the same.

Key examples:

  • STORM (Shao et al., 2024) — Spawns multiple perspective-specific research agents, each of which conducts its own multi-turn interviews, then passes results to an outline-generation agent, then a writing agent. The hierarchy is shallow (3 levels) but explicit.

  • LATS (Zhou et al., 2023) — The tree search creates a dynamic hierarchy of reasoning agents, where each node in the MCTS tree corresponds to an agent state. Value functions (also LLM-based) prune branches.

  • Complex research tasks in agentic coding systems: a lead agent spawns a requirements analyst, who spawns a test writer, who spawns an edge-case generator.

Depth limits and costs: Hierarchical decomposition has sharp cost implications. A tree of depth d with branching factor b costs O(b^d) API calls. Practical systems limit depth to 2–3 levels and prune aggressively. LATS finds its compute-performance sweet spot around 25–50 MCTS rollouts per problem (per empirical evaluation in the paper).

When to use: Tasks that are naturally recursive (research, code generation with tests); when sub-problems require specialized context that the parent agent can’t hold; when depth of exploration (not just breadth) matters.
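The recursive spawn-and-merge shape, with the depth cap that practical systems impose, can be sketched as follows (each call stands in for one agent invocation; the counter makes the O(b^d) growth concrete):

```python
# Hierarchical decomposition sketch with an explicit depth limit.
CALLS = 0

def solve(task: str, depth: int = 0,
          branching: int = 2, max_depth: int = 2) -> str:
    global CALLS
    CALLS += 1                              # one agent invocation per node
    if depth == max_depth:
        return f"leaf({task})"              # deep enough: solve directly
    children = [solve(f"{task}.{i}", depth + 1, branching, max_depth)
                for i in range(branching)]
    return f"merge({', '.join(children)})"  # synthesize child results

solve("root")
# a full tree with b=2, depth 2 makes 1 + 2 + 4 = 7 agent calls
```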


Design Principles

Separation of Concerns

Each agent should do one thing well. The value of composition comes from specialization: an agent with a focused role, a tightly scoped system prompt, and a small tool set is more reliable than a generalist agent with a large context and many tools. MetaGPT operationalizes this through its SOP-based role definitions; CrewAI through explicit role, goal, and backstory fields that shape each agent’s behavior.

Idempotency

Where possible, agent operations should produce the same output given the same input. This supports retries (a failed call can be rerun without side effects), caching (memoize deterministic sub-computations), and testability (unit-test individual agents in isolation). In practice, LLMs are non-deterministic — design around this by making downstream agents tolerant of variation in upstream outputs, or by post-processing to normalize formats.
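For the deterministic sub-computations, memoization is the simplest payoff of idempotency — retries and repeated orchestrator passes become free:

```python
import functools

# Caching sketch: a deterministic normalization step, memoized so that a
# retried or re-dispatched subtask never pays for the same work twice.
@functools.lru_cache(maxsize=None)
def normalize(raw: str) -> str:
    return " ".join(raw.split()).lower()

normalize("  Hello   World ")   # computed once
normalize("  Hello   World ")   # served from cache on the repeat call
```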

Checkpointing

Long multi-agent workflows should save state at meaningful boundaries. If a 20-minute research pipeline fails at step 18, it should resume from step 17, not restart from scratch. LangGraph provides first-class checkpointing via its persistence layer; Temporal offers durable execution that survives process crashes by persisting workflow history. Anthropic’s guide recommends building checkpoints at every meaningful stage for long-horizon agentic tasks.
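A minimal version of this resume-from-checkpoint behavior (a hand-rolled sketch, not how LangGraph or Temporal implement it) persists the step index and state after every stage:

```python
import json
import os

# Checkpointing sketch: save (step, state) after each stage so a rerun
# resumes from the last completed step instead of restarting the workflow.
def run_with_checkpoints(stages, state, path):
    start = 0
    if os.path.exists(path):              # resume from a saved checkpoint
        with open(path) as f:
            saved = json.load(f)
        start, state = saved["step"], saved["state"]
    for i in range(start, len(stages)):
        state = stages[i](state)
        with open(path, "w") as f:        # checkpoint after each stage
            json.dump({"step": i + 1, "state": state}, f)
    return state
```

A rerun after a crash between stages skips everything the checkpoint already covers and replays only the remaining stages.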

Graceful Degradation

Composed systems fail in complex ways. Design for partial failure: if a worker agent times out, can the orchestrator proceed with partial results? Can it retry with a different strategy? Can it fall back to a simpler approach? AutoGen’s conversation framework includes termination conditions and fallback patterns. The key insight: fail loudly at boundaries, not silently inside agents.

Context Management

Context is the connective tissue of agent composition. How information flows between agents determines both quality and cost. Three strategies:

  1. Full context passing — pass all context at each hand-off. High quality, high cost.
  2. Summary-based — each agent summarizes its contribution; downstream agents receive summaries. Lossy but scalable.
  3. Shared store — blackboard/shared state that agents query selectively. Efficient but requires schema discipline.

The right choice depends on the topology: pipelines favor full passing (each step builds on the previous), blackboards favor selective querying, orchestrators often use summaries to avoid context explosion.
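Strategy 2 can be sketched as a bounded handoff: each agent returns both its full output and a summary, and only the summary crosses the boundary (summarize() here is a truncation stand-in for an LLM summarization call; the agent names are illustrative):

```python
# Summary-based handoff sketch: downstream cost is bounded by summary size.
def summarize(text: str, limit: int = 80) -> str:
    return text[:limit]            # real systems would call an LLM here

def research_agent(topic: str):
    full_notes = f"detailed notes on {topic}. " * 50   # large local context
    return full_notes, summarize(full_notes)

def writing_agent(summary: str) -> str:
    return f"outline based on: {summary}"

notes, handoff = research_agent("agent topologies")
writing_agent(handoff)             # downstream sees only the summary
```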


When to Use Which Pattern

Choosing a composition pattern is primarily a function of task structure:

Task Structure                                    Recommended Pattern
Clear fixed stages, order matters                 Sequential Pipeline
Subtasks are independent, can parallelize         Parallel Fan-Out / Fan-In
Subtask structure unknown until runtime           Orchestrator–Worker
Agents contribute asynchronously to shared work   Blackboard
Agents react to external events                   Event-Driven
Problem is recursively decomposable               Hierarchical Decomposition
Multiple approaches, use best result              Speculative Execution

The simplicity principle: Anthropic’s guide explicitly recommends starting with the simplest topology that works and only adding complexity when performance demands it. Many tasks that seem to need orchestration are better solved with a well-designed single-agent loop.

Cost implications:

  • Sequential pipelines scale linearly with chain length.
  • Fan-out scales with the number of parallel agents × layers.
  • Hierarchical decomposition scales exponentially with depth — budget carefully.
  • Blackboards and event-driven patterns scale with activity frequency, not task complexity.

LangGraph’s documentation offers a useful heuristic: prefer StateGraph (explicit state machine) over dynamic agent loops when the workflow can be pre-specified; prefer dynamic agents when branching logic must emerge from model reasoning.


Frameworks and Implementation

LangGraph

LangGraph — LangChain’s low-level agent orchestration framework — expresses agent composition as a typed state graph. Nodes are agent functions or subgraphs; edges are conditional routing functions over state. This makes LangGraph a near-universal pattern language: sequential pipelines, orchestrator-worker, fan-out, and hierarchical patterns all map to StateGraph configurations. Key features include built-in persistence (checkpointing), streaming, human-in-the-loop interrupt points, and a cloud deployment target (LangGraph Platform).

AutoGen

AutoGen (Microsoft, 2023) takes a conversation-first view of composition: agents are participants in a multi-turn dialogue, and patterns emerge from who speaks when. Its GroupChat with a GroupChatManager implements orchestrator-worker; its two-agent UserProxy/AssistantAgent pair implements pipeline; custom reply functions implement event-driven triggers. AutoGen v0.4 redesigned the core architecture around event-driven, message-passing agents (similar to the actor model) for better scalability and cross-language support.

CrewAI

CrewAI is purpose-built for role-based orchestration. Agents are defined with roles, goals, and backstories; tasks are assigned to agents with expected outputs; a Crew coordinates execution. It natively supports both sequential and hierarchical process modes. The role-playing framing has proven effective for structured professional workflows (research, content, software teams).

Semantic Kernel

Microsoft’s Semantic Kernel provides agent composition through its Process Framework — a workflow engine where agents are steps in a stateful process, with events triggering state transitions. Useful for enterprise-grade, long-running workflows with strict control-flow requirements.

Durable Execution: Temporal and Inngest

Pure LLM frameworks handle composition at the prompt level but often lack infrastructure-level durability. Temporal provides durable execution: workflows are code that runs to completion despite process crashes, network failures, or service restarts, by persisting execution history. This is critical for long-running agent pipelines (hours or days). Inngest offers a similar model optimized for serverless/edge environments.

The key insight: for multi-agent workflows that span minutes or hours, you need both agent-level composition (who talks to whom) and infrastructure-level durability (how execution survives failures).


Emerging Patterns

Speculative Execution

Run multiple agents on the same task in parallel; use the first to succeed (or the best result). This trades API cost for latency and reliability. Useful when individual agent runs have non-trivial failure rates — a single slow or hallucinating agent doesn’t block the whole pipeline. Related to the Mixture-of-Agents (MoA) pattern but with early-termination semantics.
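The first-to-succeed semantics map directly onto asyncio's FIRST_COMPLETED wait (the staggered delays here stand in for variable agent latency and failure):

```python
import asyncio

# Speculative execution sketch: launch several attempts, keep the first
# to finish, cancel the rest.
async def attempt(name: str, delay: float, task: str) -> str:
    await asyncio.sleep(delay)               # stand-in for agent latency
    return f"{name} solved {task!r}"

async def speculative(task: str) -> str:
    runs = [asyncio.create_task(attempt(f"agent-{i}", 0.05 * i, task))
            for i in range(3)]
    done, pending = await asyncio.wait(runs,
                                       return_when=asyncio.FIRST_COMPLETED)
    for t in pending:
        t.cancel()                           # drop the slower attempts
    return done.pop().result()

asyncio.run(speculative("parse log"))
```

A production version would also treat exceptions from the winning task as a signal to fall back to the next finisher rather than failing outright.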

Critic–Generator Loops

One agent generates a candidate output; a separate critic agent evaluates it against a rubric and provides structured feedback; the generator revises. This loop continues until the critic is satisfied or a budget is exhausted. Natively supported in AutoGen’s two-agent feedback loop pattern (e.g., code execution driving iterative correction) and LangGraph’s evaluator-optimizer workflow. Conceptually related to self-play in RL: the critic provides a training signal the generator optimizes against.

The debate pattern in Du et al. (2023) is a multi-agent variant: multiple generators + implicit mutual critique → convergence.

Constitutional Agents

Inspired by Constitutional AI (Bai et al., 2022) — a technique where an AI evaluates and revises its own outputs according to a set of principles — constitutional agents embed a lightweight evaluator into the agent loop itself. Rather than an external critic, the agent carries its own constitution and self-edits before returning results. This pattern fuses the generator and critic into a single agent, at the cost of transparency (the critique is internal and may not be inspectable).

Agent-as-Tool

Treating an entire agent — with its own tool access, memory, and reasoning — as a callable function from another agent. This enables clean hierarchical composition: the orchestrating agent doesn’t need to know how its “tools” are implemented; one of those tools might itself be a full reasoning agent. Anthropic’s multi-agent guide shows concretely how subagents can be spawned as async tool calls, with the orchestrator aggregating results via asyncio.gather. This is the composition primitive underlying all hierarchical and orchestrator-worker topologies.

Routing / Classifier-First

A lightweight router agent classifies an incoming request and dispatches it to the most appropriate specialized downstream agent — rather than a full orchestrator that synthesizes output. The router adds minimal latency but dramatically improves downstream agent quality by ensuring each agent only sees tasks it’s good at. Anthropic’s guide calls this “routing” and identifies it as one of the five core workflow patterns.
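The pattern reduces to classify-then-dispatch, with no synthesis step. A sketch where keyword matching stands in for the small classifier model, and the specialist names are hypothetical:

```python
# Classifier-first routing sketch: each specialist only ever sees the task
# type it is good at.
SPECIALISTS = {
    "billing": lambda q: f"billing agent answers: {q}",
    "technical": lambda q: f"technical agent answers: {q}",
}

def route(query: str) -> str:
    # A real router would be a cheap classifier call; keyword match here.
    label = "billing" if "invoice" in query.lower() else "technical"
    return SPECIALISTS[label](query)     # dispatch only, no synthesis

route("Where is my invoice?")
```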

Open Problems

No Standard Pattern Language

Unlike software design patterns, agent composition patterns lack a formal, shared vocabulary. Different frameworks use different terminology (LangGraph says “subgraph,” AutoGen says “nested chat,” CrewAI says “hierarchical process”). There is no agent equivalent of the GoF catalog — no agreed-upon names, diagrams, or implementation contracts.

Debugging Is Harder

In composed systems, a wrong final answer may originate from any node in a potentially large graph. Stack traces don’t exist; tool calls are probabilistic; intermediate agent outputs may look correct while subtly corrupting downstream state. The Observability → page covers tooling for tracing multi-agent execution, but the fundamental difficulty remains: emergent failures in stochastic systems are hard to reproduce.

Cost Explosion

Multi-agent composition multiplies API costs. A 5-layer MoA with 5 agents per layer and 2000-token responses generates ~50,000 output tokens per inference pass (25 agent calls × ~2000 tokens each, not counting input context accumulation across layers — total token consumption is considerably higher). Production systems must carefully budget, prune, and cache. The relationship between agent count, depth, and task performance is poorly characterized empirically.

Error Propagation Across Agent Boundaries

Errors in composed systems propagate and amplify. MetaGPT explicitly identified “cascading hallucinations caused by naively chaining LLMs” as the primary challenge and addressed it with structured intermediate artifacts. But the general solution — validation at every boundary — adds latency and requires human-designed schemas that may not generalize.

Emergence of Agent Personalities

When agents are given strong role definitions, they may develop behaviors unexpected by the system designer — not from capability failures but from role-playing dynamics. Multi-agent systems may need personality auditing as a design practice.


References

Code & Projects

  • LangGraph — Low-level agent orchestration framework (LangChain). StateGraph-based pattern language for agents.

  • AutoGen — Microsoft’s multi-agent conversation framework. Supports diverse agent topologies via customizable conversation patterns.

  • CrewAI — Role-playing multi-agent orchestration framework. Sequential and hierarchical process modes.

  • MetaGPT — SOP-driven software-engineering multi-agent system.

  • STORM — Stanford’s LLM knowledge curation system; hierarchical multi-agent article generation.

  • LATS — Language Agent Tree Search; MCTS-based hierarchical reasoning.

  • AFlow — Automated agentic workflow generation via MCTS over workflow operators. ICLR 2025.

  • Temporal — Durable execution engine for long-running agent workflows.

  • Inngest — Event-driven durable workflow platform for serverless environments.

  • Semantic Kernel — Microsoft’s agent SDK with a Process Framework for structured workflow composition.


Back to Deep Dives → · See also: Multi-Agent Systems → · Infrastructure & Protocols → · Observability →