Agent Societies & Simulation
Multi-agent worlds, social simulation, and emergent collective behavior
Overview
The dominant framing of LLM agents positions them as tools — systems invoked to complete discrete tasks: write this code, search that document, schedule that meeting. But an equally compelling paradigm treats agents not as tools but as inhabitants: autonomous entities with memories, goals, and social ties, populating simulated worlds that unfold over time.
This framing has roots in classical agent-based modeling (ABM), a social-simulation technique used for decades in economics, epidemiology, and political science. ABM researchers populate virtual worlds with rule-following agents and watch macro-phenomena emerge from micro-interactions — segregation patterns, market crashes, disease spread. The critical limitation was always the agents themselves: they followed rigid rules, not the flexible, context-sensitive reasoning of real humans.
Large language models change that equation. LLM-powered agents can communicate in natural language, reason about social context, form opinions, remember past events, and act in ways that are at least plausible analogues to human behavior. The result is a new class of Generative Agent-Based Models (GABMs) — simulations where agents reason, converse, and adapt rather than follow predetermined scripts.
Why It Matters
- Social science research: complex social phenomena (information cascades, norm emergence, cooperation dynamics) can be studied in controlled, repeatable simulations without recruiting human participants.
- Policy testing: policymakers can probe how populations might respond to interventions before deploying them in the real world.
- Game design: NPCs can behave as persistent characters with memories and motivations rather than scripted automatons.
- AI alignment: studying many-agent societies reveals how AI systems interact at scale, with implications for safety and the governance of collective behavior.
- Synthetic data generation: simulated social interactions produce conversational data, behavioral traces, and interaction logs that can train or evaluate other AI systems.
While related to multi-agent systems, the distinctive emphasis here is on social dynamics — norm formation, cultural transmission, cooperation and defection — and on simulation as a research method for understanding complex systems. For a comprehensive survey of the field, see Mou et al. (2024), From Individual to Society: A Survey on Social Simulation Driven by Large Language Model-based Agents.
Foundational Architectures
Generative Agents (Park et al., 2023)
The work that catalyzed the field is Generative Agents: Interactive Simulacra of Human Behavior (Park, O’Brien, Cai, Morris, Liang, Bernstein; Stanford/Google, 2023). The paper introduces an architecture for agents that wake up, cook breakfast, and head to work — computational characters with three interacting modules:
- Observation: agents perceive their environment and record events in a natural-language memory stream
- Reflection: agents periodically synthesize memories into higher-level insights (“Isabella seems to be a kind person who enjoys social activities”)
- Planning: agents use reflections to produce day-level plans that constrain moment-to-moment action
These agents are deployed in Smallville, a sprite-based sandbox world reminiscent of The Sims, populated by 25 agents. Starting from a single user-specified seed — one agent wants to throw a Valentine’s Day party — agents autonomously spread invitations, make new acquaintances, ask each other on dates, and coordinate attendance, all through natural language dialogue. Ablation studies confirm that removing any of the three architectural components measurably degrades behavioral believability. Code: github.com/joonspk-research/generative_agents.
This paper established the canonical observe–reflect–plan architecture that most subsequent social simulation work builds on or departs from.
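The observe–reflect–plan loop can be made concrete with a toy memory stream. The scoring weights, decay constant, and keyword-overlap relevance below are illustrative simplifications: in the paper, importance is LLM-rated and relevance is embedding-based.

```python
class MemoryStream:
    """Toy version of the Generative Agents memory stream.

    Retrieval scores each memory by recency, importance, and relevance,
    mirroring the paper's weighted-sum retrieval.
    """

    def __init__(self, decay=0.995):
        self.decay = decay
        self.memories = []  # (step, importance on a 1-10 scale, text)

    def observe(self, step, importance, text):
        self.memories.append((step, importance, text))

    def retrieve(self, now, query_words, k=3):
        def score(memory):
            step, importance, text = memory
            recency = self.decay ** (now - step)  # exponential decay
            relevance = len(query_words & set(text.lower().split()))
            return recency + importance / 10 + relevance
        return sorted(self.memories, key=score, reverse=True)[:k]

stream = MemoryStream()
stream.observe(0, 3, "Isabella watered the plants")
stream.observe(5, 8, "Isabella is planning a Valentine's Day party")
stream.observe(9, 2, "Isabella ate breakfast")
top = stream.retrieve(now=10, query_words={"party", "valentine's"}, k=1)
```

Reflection and planning would then be further LLM calls over retrieved memories: summarize them into higher-level insights, and expand those insights into a day plan.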
CAMEL: Role-Playing as a Research Method
CAMEL: Communicative Agents for “Mind” Exploration of Large Language Model Society (Li et al., NeurIPS 2023) takes a more minimalist approach: two LLM agents are assigned complementary roles (AI assistant, AI user) and left to cooperate on tasks via role-playing. The key mechanism is inception prompting — system prompts that guide agents toward task completion while preserving role consistency and preventing one agent from assuming the other’s identity. CAMEL demonstrates that role-playing can generate rich datasets for studying the behavioral and cognitive dynamics of agent societies without requiring elaborate simulation infrastructure. Library: github.com/camel-ai/camel.
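The role-playing loop can be sketched with a stubbed model; `stub_llm` and the prompt wording below paraphrase the inception-prompting idea and are not CAMEL's actual API.

```python
def stub_llm(system_prompt, history):
    """Stand-in for a chat-model call; replies with a role-tagged string."""
    speaker = "user" if "instruct me" in system_prompt else "assistant"
    return f"<{speaker} turn {len(history) // 2}>"

# Inception prompts: each agent is told its role, the other's role, and is
# explicitly forbidden from flipping roles, which keeps both models in
# character over long dialogues.
USER_SYS = (
    "Never forget you are a Stock Trader and I am a Python Programmer. "
    "Never flip roles! You must instruct me, one instruction at a time, "
    "until the task is done."
)
ASSISTANT_SYS = (
    "Never forget you are a Python Programmer and I am a Stock Trader. "
    "Never flip roles! You must write one solution per instruction I give."
)

def role_play(turns=3):
    history = []
    for _ in range(turns):
        instruction = stub_llm(USER_SYS, history)    # task-specifier turn
        history.append(("user", instruction))
        solution = stub_llm(ASSISTANT_SYS, history)  # solver turn
        history.append(("assistant", solution))
    return history

transcript = role_play()
```

With a real chat model in place of `stub_llm`, the transcript itself becomes the research artifact: a dataset of role-consistent, task-directed dialogue.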
MetaGPT: The Software Company
MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework (Hong, Zhuge, Chen et al., ICLR 2024) simulates a software company populated by specialized LLM agents — a product manager, architect, project manager, engineer, and QA engineer. Each role maps to a real organizational function, and interaction is structured by Standardized Operating Procedures (SOPs) encoded into prompt sequences. MetaGPT outperforms prior chat-based multi-agent systems on collaborative software engineering benchmarks by reducing cascading hallucinations through structured inter-agent verification. Beyond its engineering utility, MetaGPT illustrates how organizational structure and division of labor can be faithfully encoded in agent-based systems. Framework: github.com/FoundationAgents/MetaGPT.
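The SOP idea can be sketched as a fixed pipeline of role functions, each consuming the previous role's artifact. The role functions below are hypothetical stand-ins for LLM-backed roles, not MetaGPT's API.

```python
def product_manager(idea):
    # Role 1: turn a one-line idea into a requirements document (PRD).
    return {"prd": f"Requirements for: {idea}"}

def architect(artifact):
    # Role 2: derive a system design from the PRD.
    return {**artifact, "design": "module layout derived from PRD"}

def engineer(artifact):
    # Role 3: implement the design.
    return {**artifact, "code": "implementation derived from design"}

def qa_engineer(artifact):
    # Role 4: structured verification. Rejecting artifacts that lack
    # upstream outputs is the kind of check that limits cascading errors.
    assert {"prd", "design", "code"} <= artifact.keys()
    return {**artifact, "tests": "passing"}

SOP = [product_manager, architect, engineer, qa_engineer]  # fixed role order

def run_company(idea):
    artifact = idea
    for role in SOP:
        artifact = role(artifact)
    return artifact

result = run_company("a todo-list CLI")
```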
ChatDev: Communication as the Unifying Substrate
Communicative Agents for Software Development (Qian, Liu, Liu et al., ACL 2024) introduces ChatDev, where agents take on organizational roles (CEO, CTO, programmer, reviewer, tester) and collaborate through a structured chat chain. A key contribution is communicative dehallucination — mechanisms that prevent agents from accepting hallucinated code artifacts during review. ChatDev demonstrates that natural language can serve as a universal substrate for multi-agent coordination: system design is conducted in prose, while code artifacts are exchanged and debugged programmatically. Code: github.com/OpenBMB/ChatDev.
Concordia (Google DeepMind)
Generative agent-based modeling with actions grounded in physical, social, or digital space using Concordia (Vezhnevets, Agapiou, Aharon et al., Google DeepMind, 2023) introduces Concordia, the most general-purpose GABM library currently available. Key design decisions:
- A Game Master (GM) agent — inspired by tabletop role-playing games — simulates the environment, adjudicates agent actions, and translates natural-language intentions into physical or digital consequences.
- A component system mediating between LLM calls and associative memory retrieval, giving researchers fine-grained control over what agents know and remember.
- Support for digital environments where the GM makes API calls to external tools (calendar, email, search), enabling simulation of human-software interaction at scale.
Concordia is explicitly positioned as a scientific instrument: its design supports both basic social-science research and applied evaluation of real digital services through synthetic user simulation. Code: github.com/google-deepmind/concordia.
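A minimal Game Master loop in this spirit (not Concordia's actual API) might look like the following; in the real library, both the agents and the GM are LLM-backed and mediated by components.

```python
class Agent:
    def __init__(self, name):
        self.name = name
        self.memory = []

    def act(self):
        # A real agent would call an LLM conditioned on retrieved memories.
        return f"{self.name} greets the room"

    def observe(self, event):
        self.memory.append(event)

class GameMaster:
    """Adjudicates natural-language intentions into world events."""

    def __init__(self, agents):
        self.agents = agents
        self.log = []

    def adjudicate(self, intention):
        # A real GM would consult environment state (and an LLM) to decide
        # what actually happens; here every intention simply succeeds.
        return f"[event] {intention}"

    def step(self):
        for agent in self.agents:
            event = self.adjudicate(agent.act())
            self.log.append(event)
            for observer in self.agents:      # broadcast the outcome
                observer.observe(event)

agents = [Agent("Ada"), Agent("Bo")]
gm = GameMaster(agents)
gm.step()
```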
AgentSims
AgentSims: An Open-Source Sandbox for Large Language Model Evaluation (Lin, Zhao, Zhang et al., 2023) approaches social simulation from an evaluation angle. Rather than studying emergent social phenomena, AgentSims provides an interactive GUI-based environment where researchers can add agents and buildings, then test custom memory, planning, and tool-use systems with minimal code. The authors argue that task-based evaluation within simulated social environments is a more robust alternative to static benchmarks, which are vulnerable to contamination and constrained in the abilities they can assess. Code: github.com/py499372727/AgentSims.
Project Sid: Toward AI Civilization
Project Sid: Many-agent simulations toward AI civilization (Altera.AL et al., 2024) is the largest-scale social simulation to date: 10 to 1000+ AI agents placed inside a Minecraft environment and evaluated on civilizational benchmarks inspired by human history. The paper introduces the PIANO (Parallel Information Aggregation via Neural Orchestration) architecture, enabling agents to interact with both humans and other agents in real time while maintaining coherent behavior across multiple simultaneous output streams.
Without explicit instructions, agents autonomously developed specialized economic roles, formed and changed collective rules, and engaged in cultural and religious transmission across agent generations. These results suggest that LLM agents placed in rich, open-ended environments can produce civilizational-scale social dynamics rather than just completing predefined tasks. Code: github.com/altera-al/project-sid.
Economic & Market Simulations
EconAgent: Macroeconomic ABM
EconAgent: Large Language Model-Empowered Agents for Simulating Macroeconomic Activities (Li et al., ACL 2024) introduces the first LLM-based macroeconomic ABM. Traditional macroeconomic ABMs use rule-based or neural-network agents for household and firm decision-making; EconAgent replaces these with LLM agents equipped with:
- A perception module creating heterogeneous agents with distinct decision profiles (risk tolerance, employment status, savings habits)
- A memory module allowing agents to reflect on personal economic history and market trends before making consumption and labor decisions
Simulation experiments show EconAgent agents producing more realistic macroeconomic phenomena — inflation dynamics, labor market fluctuations — than rule-based baselines. The paper represents a first step toward LLM-based macroeconomic simulation as a research tool. Code: github.com/tsinghua-fib-lab/ACL24-EconAgent.
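An illustrative household agent combining the two modules, with a hand-written decision rule standing in for the LLM call; all names and coefficients below are assumptions, not EconAgent's code.

```python
from dataclasses import dataclass, field

@dataclass
class Household:
    risk_tolerance: float      # 0 = very cautious, 1 = free-spending
    employed: bool
    inflation_memory: list = field(default_factory=list)

    def remember(self, inflation):
        """Record one period of perceived inflation."""
        self.inflation_memory.append(inflation)

    def consumption_share(self):
        """Fraction of income spent this period (clamped to [0, 1])."""
        base = 0.4 + 0.4 * self.risk_tolerance
        recent = self.inflation_memory[-4:]
        expected = sum(recent) / len(recent) if recent else 0.0
        # Expected inflation encourages spending now; joblessness curbs it.
        share = base + 0.5 * expected - (0.0 if self.employed else 0.2)
        return min(max(share, 0.0), 1.0)

cautious = Household(risk_tolerance=0.1, employed=False)
spender = Household(risk_tolerance=0.9, employed=True)
spender.remember(0.1)  # one remembered period of 10% inflation
```

The perception module corresponds to the heterogeneous constructor arguments; the memory module corresponds to `remember` plus the reflection inside `consumption_share`.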
Trading and Market Dynamics
Beyond macro models, LLM agents are being studied in financial market simulations. Horton (2023, arXiv:2301.07543) proposes treating LLMs as homo silicus: implicit computational models of humans that can be given endowments, preferences, and information, then probed through simulated experiments. In such settings, LLMs replicate classic behavioral economics findings (anchoring effects, loss aversion, herding). LLM-based trading simulations probe whether agents exhibit rational expectations, respond to information asymmetries, and generate realistic price discovery. Results both validate and challenge standard economic theory, suggesting that LLM agents capture some aspects of human economic irrationality that classical models miss, while introducing new failure modes rooted in training-data biases.
Sandbox Environments & Platforms
General-Purpose Simulation Libraries
| Framework | Description | Code |
|---|---|---|
| Concordia | Google DeepMind GABM library with GM agent, component system, and associative memory | github |
| AgentSims | GUI-based sandbox for LLM evaluation via social simulation | github |
| CAMEL | Role-playing framework for communicative agent research | github |
| Generative Agents | Original Smallville codebase: 25-agent social simulation | github |
Applications
Policy Testing and Scenario Planning
Multi-agent simulations provide a sandbox for testing interventions before real-world deployment: how do communities respond to public health messaging? How does a regulatory change propagate through social networks? The S³ framework demonstrated this by simulating information diffusion on social networks with quantitative accuracy against real-world data. Concordia’s design explicitly supports policy-testing workflows by enabling researchers to operationalize social-scientific constructs — identity, group membership, institutional authority — as modular agent components.
Game NPCs and Interactive Storytelling
The Generative Agents paper noted its direct relevance to interactive entertainment — NPCs with persistent memories and social relationships that evolve in response to player actions. Rather than scripted branching dialogue, LLM-powered characters can remember past player interactions, form opinions about other characters, and generate contextually appropriate responses in real time. Industry interest is accelerating, with multiple studios exploring LLM-powered character systems for open-world games and narrative experiences.
Training Data Generation
Agent-based social simulations generate synthetic conversational data at scale: multi-turn dialogues, social interaction sequences, decision traces. Concordia’s architecture is explicitly designed to support this application — simulated users interacting with digital services provide ground truth for evaluating those services at scale. Synthetic training data from social simulations offers a way to generate diverse, annotated interaction data without the cost and logistics of human-participant studies.
Synthetic User Research
Rather than recruiting human participants for user research, organizations can deploy LLM agent societies to stress-test products, interfaces, or policies across a wide range of synthetic personas — exploring a much larger behavior space than traditional methods permit. This application is nascent but growing, particularly in product teams that want to probe edge cases before launch.
Limitations & Open Problems
The Validity Problem
The central methodological challenge for generative social simulation is validation: do LLM agent societies actually model human behavior, or do they model LLMs modeling human behavior? At worst, validation consists of asking an LLM to evaluate the plausibility of its own output — a strategy that raises obvious concerns about circularity and self-favoring bias. Park et al.’s original Generative Agents paper used human judge evaluations and behavioral prediction accuracy as proxies for validity; these methods scale poorly and remain contested as adequate ground truth. Rigorous validation methodology for GABMs remains an open research problem.
The Homogeneity Problem
All agents in most current simulations share the same base model, meaning they share the same training distribution, cultural assumptions, and implicit values. A society of GPT-4-based agents does not model human diversity — it models one model’s compressed representation of human diversity, with all the associated biases and stereotypes. Empirical evidence shows LLMs producing a much narrower range of behavioral outcomes than humans exhibit in equivalent situations. This homogeneity significantly limits the capacity of current simulations to reproduce social dynamics that depend on genuine demographic or ideological variation.
Solutions being explored include using multiple base models in a single simulation, applying persona-conditioning to create behavioral diversity, and hybrid approaches where LLM agents interact with classical ABM agents with explicit heterogeneity.
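Persona-conditioning can be sketched as sampling distinct system prompts from attribute pools; the attribute lists and template below are illustrative, not taken from any of the cited papers.

```python
import itertools
import random

AGES = ["22", "45", "70"]
OCCUPATIONS = ["nurse", "farmer", "software engineer"]
STANCES = ["risk-averse", "risk-seeking"]

def sample_personas(n, seed=0):
    """Draw n distinct personas from the cross-product of attributes."""
    rng = random.Random(seed)
    pool = list(itertools.product(AGES, OCCUPATIONS, STANCES))
    return [
        f"You are a {age}-year-old {job} who is {stance}. "
        "Stay in character in every reply."
        for age, job, stance in rng.sample(pool, n)
    ]

personas = sample_personas(5)
```

Each persona string becomes one agent's system prompt; the open question is whether prompt-level diversity actually induces the behavioral diversity the simulation needs, or merely restyles the same underlying distribution.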
Scalability Costs
Running 25 agents in Smallville is a research project. Running 1000 agents — as in Project Sid — requires significant compute infrastructure. The PIANO architecture was built specifically to address real-time multi-agent orchestration at scale.
The most ambitious scaling effort to date is Open Agent Social Interaction Simulations with One Million Agents (Yang et al., 2024), which demonstrates LLM-based social simulations at the scale of one million agents using a hierarchical architecture — several orders of magnitude beyond Smallville. This scale enables study of large-scale social phenomena (information diffusion, opinion dynamics, collective behavior) that are simply invisible at 25- or 1000-agent scales.
As simulations grow toward the millions of agents needed for realistic social phenomena, cost becomes a fundamental constraint. Future work will likely require new architectures: smaller specialized models, caching strategies, or hybrid designs where most agents use lightweight heuristics and LLM calls are reserved for critical decision points.
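That hybrid design can be sketched as a stakes threshold that gates which decisions reach the (stubbed) expensive model; the function names and threshold are assumptions for illustration.

```python
llm_calls = 0  # track how often the expensive path is taken

def llm_decision(context):
    # Stand-in for an expensive LLM call.
    global llm_calls
    llm_calls += 1
    return f"deliberated({context})"

def heuristic_decision(context):
    # Cheap default behavior for routine steps.
    return f"default({context})"

def decide(context, stakes, threshold=0.8):
    """Route high-stakes decisions to the model, everything else to rules."""
    if stakes >= threshold:
        return llm_decision(context)
    return heuristic_decision(context)

decisions = [decide(f"step{i}", stakes=i / 10) for i in range(10)]
```

In a million-agent simulation, keeping the expensive path to a small fraction of decisions is the difference between a feasible run and an infeasible one.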
Evaluation of Emergent Phenomena
Emergent social phenomena — norms, culture, cooperation regimes, language evolution — are inherently difficult to evaluate quantitatively. What does it mean for a social simulation to be correct? The finding in Cultural Evolution of Cooperation among LLM Agents (Hughes et al., 2024) that cooperation outcomes are sensitive to random initialization underscores that emergent phenomena may not be robust properties of the system, but artifacts of particular initial conditions. This raises deep questions about reproducibility and external validity for any claim derived from agent society simulations.
Alignment of Collective Behavior
Perhaps most consequentially: societies of agents optimizing for individual-level objectives can produce harmful collective outcomes — information cascades, collusion, price manipulation, social polarization — even without any individual agent being “misaligned.” Understanding how to specify and constrain collective behavior in multi-agent societies is an open problem with direct relevance to AI safety. The findings of Hughes et al. (2024) about model-dependent cooperation regimes suggest that which AI models are deployed at scale may have large downstream effects on the cooperative infrastructure of society.
References
Papers
- Park, J. S., O’Brien, J., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). Generative Agents: Interactive Simulacra of Human Behavior. arXiv:2304.03442.
- Li, G., Hammoud, H. A. A. K., Itani, H., Khizbullin, D., & Ghanem, B. (2023). CAMEL: Communicative Agents for “Mind” Exploration of Large Language Model Society. NeurIPS 2023. arXiv:2303.17760.
- Hong, S., Zhuge, M., Chen, J., et al. (2023). MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework. ICLR 2024. arXiv:2308.00352.
- Qian, C., Liu, W., Liu, H., et al. (2023). Communicative Agents for Software Development (ChatDev). ACL 2024. arXiv:2307.07924.
- Vezhnevets, A. S., Agapiou, J. P., Aharon, A., et al. (2023). Generative agent-based modeling with actions grounded in physical, social, or digital space using Concordia. arXiv:2312.03664.
- Lin, J., Zhao, H., Zhang, A., Wu, Y., Ping, H., & Chen, Q. (2023). AgentSims: An Open-Source Sandbox for Large Language Model Evaluation. arXiv:2308.04026.
- Altera.AL (2024). Project Sid: Many-agent simulations toward AI civilization. arXiv:2411.00114.
- Yang, Z., Zhang, Z., Zheng, Z., et al. (2024). Open Agent Social Interaction Simulations with One Million Agents. arXiv:2411.11581. (Hierarchical architecture enabling million-agent social simulation; studies information diffusion and opinion dynamics at societal scale)
- Wang, G., Xie, Y., Jiang, Y., et al. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv:2305.16291.
- Lan, X., et al. (2023). S³: Social-network Simulation System with Large Language Model-Empowered Agents. arXiv:2307.14984.
- Horiguchi, I., et al. (2024). Evolution of Social Norms in LLM Agents using Natural Language. arXiv:2409.00993.
- Hughes, E., et al. (2024). Cultural Evolution of Cooperation among LLM Agents. arXiv:2412.10270.
- Li, N., et al. (2024). EconAgent: Large Language Model-Empowered Agents for Simulating Macroeconomic Activities. ACL 2024. arXiv:2310.10436.
- Horton, J. J. (2023). Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus?. arXiv:2301.07543.
- Huynh, T.-K., et al. (2025). Understanding LLM Agent Behaviours via Game Theory: Strategy Recognition, Biases and Multi-Agent Dynamics. arXiv:2512.07462.
- Mou, X., Ding, X., He, Q., et al. (2024). From Individual to Society: A Survey on Social Simulation Driven by Large Language Model-based Agents. arXiv:2412.03563.
- Anthis, J. R., Liu, R., Richardson, S. M., et al. (2025). LLM Social Simulations Are a Promising Research Method. ICML 2025. arXiv:2504.02234.
Code & Projects
- joonspk-research/generative_agents — Park et al. Smallville simulation codebase
- google-deepmind/concordia — Concordia GABM library (Google DeepMind)
- FoundationAgents/MetaGPT — MetaGPT multi-agent framework
- OpenBMB/ChatDev — ChatDev software development simulation
- camel-ai/camel — CAMEL communicative agents library
- py499372727/AgentSims — AgentSims open-source sandbox
- altera-al/project-sid — Project Sid many-agent simulation
- MineDojo/Voyager — Voyager Minecraft lifelong learning agent
- tsinghua-fib-lab/ACL24-EconAgent — EconAgent macroeconomic simulation
Social Norm Emergence
Evolution of Social Norms in LLM Agents using Natural Language (Horiguchi et al., 2024) builds on Axelrod’s classic metanorm games to test whether LLM agents can spontaneously develop norm-enforcement strategies. Experiments show that agents form metanorms through dialogue — norms that enforce punishment of those who fail to punish cheaters — purely through natural language interaction, without explicit instruction to do so. This mirrors theoretical predictions about human cooperation and suggests that language is a sufficient substrate for the transmission and evolution of social norms in artificial populations.
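The metanorm game underlying these experiments can be sketched in a few lines. The payoff constants follow Axelrod's convention (temptation, hurt to bystanders, enforcement cost, punishment), though the exact numbers here are illustrative rather than quoted from the paper.

```python
import random

# Payoffs in the spirit of Axelrod (1986): temptation to defect, hurt to
# each bystander, enforcement cost to the punisher, punishment received.
T, H, E, P = 3, -1, -2, -9

def play_round(agents, rng):
    """One round: defect, punish defectors, meta-punish non-punishers."""
    scores = {a["name"]: 0 for a in agents}
    for a in agents:
        if rng.random() >= a["boldness"]:
            continue                       # a stays honest this round
        scores[a["name"]] += T             # a defects
        for b in agents:
            if b is a:
                continue
            scores[b["name"]] += H         # b is hurt by the defection
            if rng.random() < b["vengefulness"]:
                scores[a["name"]] += P     # b punishes a ...
                scores[b["name"]] += E     # ... at a cost to itself
            else:
                # Metanorm: others may punish b for failing to punish a.
                for c in agents:
                    if c is not a and c is not b and rng.random() < c["vengefulness"]:
                        scores[b["name"]] += P
                        scores[c["name"]] += E
    return scores

agents = [
    {"name": "A", "boldness": 1.0, "vengefulness": 0.0},
    {"name": "B", "boldness": 0.0, "vengefulness": 1.0},
    {"name": "C", "boldness": 0.0, "vengefulness": 1.0},
]
scores = play_round(agents, random.Random(0))
```

The Horiguchi et al. result is that LLM agents reach metanorm-like enforcement through dialogue alone, rather than through the numeric boldness/vengefulness parameters hard-coded here.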