Evaluation & Benchmarks

How do we measure whether an agent is actually good?

The Evaluation Problem

Evaluating LLM agents is harder than evaluating LLMs. A chatbot can be evaluated on held-out Q&A pairs. An agent takes sequences of actions, operates in non-deterministic environments, and pursues goals that may have multiple valid solutions. Several key challenges:

  • Process vs. outcome: A wrong final answer might come from correct reasoning; a right answer might come from lucky guessing
  • Benchmark contamination: Training data may include solutions; agents may overfit to specific environments
  • Cost: Evaluation requires running the full agent loop — expensive at scale
  • Open-ended tasks: Some tasks (research, creative work) have no clear ground truth
  • Adversarial environments: Real-world deployment involves hostile inputs, not clean benchmarks
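The outcome-vs-process and cost points are easiest to see in code. A minimal outcome-only harness (a sketch; the task format and `agent` callable are illustrative, not any benchmark's actual interface) scores only the final answer:

```python
def evaluate_outcomes(tasks, agent):
    """Outcome-only scoring: run the full agent loop once per task and
    compare the final answer against ground truth. The reasoning trace is
    discarded, so a lucky guess scores the same as sound reasoning, and a
    near-miss with a correct process scores zero. Every scored task pays
    for a complete (expensive) agent run."""
    solved = 0
    for task in tasks:
        answer = agent(task["input"])  # full agent loop: LLM calls, tools, retries
        if answer == task["expected"]:
            solved += 1
    return solved / len(tasks)
```

Process-aware evaluation would additionally inspect the trace, but that requires judging every intermediate step, which multiplies cost further.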

The field has iterated through several generations of benchmarks, each trying to close gaps from the previous generation.


Coding Benchmarks

SWE-bench (2023)

Jimenez et al. · arXiv:2310.06770 · swebench.com

The dominant coding agent benchmark. Agents resolve real GitHub issues from 12 popular Python repositories (Django, scikit-learn, Flask, etc.). Tested against actual unit tests from the repos.

Variants:

  • SWE-bench Full — 2,294 issues; the original benchmark
  • SWE-bench Verified — 500 manually verified high-quality issues; the standard leaderboard
  • SWE-bench Lite — 300 issues; faster/cheaper evaluation

Progress over time (early scores are on SWE-bench Full/subsets; SWE-bench Verified was released Aug 2024 and became the standard leaderboard):

  Date        System                                Benchmark              Score
  Mar 2024    Devin (Cognition AI)                  Full subset (570)      13.86%
  May 2024    SWE-agent + GPT-4 Turbo (Princeton)   Full (2,294 tasks)     12.47%
  Late 2024   Claude 3.5 Sonnet + scaffolding       Verified (500 tasks)   ~49%
  Oct 2025    Claude Sonnet 4.5                     Verified               77.2% (82.0% with parallel compute)
  Nov 2025    Gemini 3 Pro + Live-SWE-agent         Verified               77.4%

Note on Devin’s 13.86%: Cognition AI ran Devin on a 570-task subset drawn from the SWE-bench Full set — not SWE-bench Lite (which is the officially defined 300-task subset released separately). Devin resolved 79 of 570 issues (13.86%).

Contamination concern: By late 2025, SWE-bench Verified showed signs of saturation and training data leakage. Models scoring 70%+ on the public Verified set drop to ~15-25% on SWE-bench Pro (Scale AI’s private-codebase variant, released 2025).

SWE-bench Pro (Sep 2025)

Scale AI · Novel private-codebase variant

Created specifically to address contamination. Issues come from private repositories never seen in training data. Performance drops dramatically across all models:

  • GPT-5: ~23% (public) → ~15% (private) (approximate; source needed)
  • Claude Opus 4.1: ~22.7% (approximate; source needed)

Reveals the extent to which SWE-bench public scores reflect memorization vs. generalization.

HumanEval (2021)

Chen et al. (OpenAI) · github.com/openai/human-eval · Python function completion from docstrings. Early coding benchmark, largely saturated by 2024 but still used for baseline comparisons.
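HumanEval results are conventionally reported as pass@k, using the unbiased estimator from the Chen et al. paper: draw k of the n generated samples and ask whether at least one passes the unit tests. A Python sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): given n samples for a
    problem, of which c pass the unit tests, the probability that at least
    one of k randomly drawn samples passes is 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k failing samples: any k-subset must contain a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=2 samples of which c=1 passes, the pass@1 estimate is 0.5.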

Aider Coding Leaderboard

aider.chat/docs/leaderboards

Pragmatic leaderboard maintained by the Aider project, ranking models by real-world coding utility (polyglot, code editing, refactoring). Considered more representative of practical use than HumanEval.


Web & Computer Use Benchmarks

WebArena (2023)

Zhou et al. · arXiv:2307.13854

Five functional websites (e-commerce, Reddit, GitLab, CMS, map) plus Wikipedia as a reference resource; 812 long-horizon tasks. Baseline GPT-4: 14.41%; human performance: 78.24%. Tasks are realistic web workflows requiring multi-step navigation, form filling, and information retrieval.

Mind2Web (2023)

Deng et al. · arXiv:2306.06070

Dataset of roughly 2,350 tasks across 137 real websites spanning 31 domains. Tests generalization to unseen sites. More diverse than WebArena but less dynamic (static task descriptions).

OSWorld (2024)

Xie et al. · arXiv:2404.07972

Desktop GUI tasks across real applications on Ubuntu, Windows, and macOS. 369 tasks requiring interaction with actual installed software (spreadsheets, code editors, file managers).

Human expert baseline: 72.36%

  Date         Best Agent Score
  Apr 2024     ~12% (GPT-4V baseline)
  Oct 2024     ~22% (Claude 3.5 Sonnet with Computer Use)
  Early 2025   ~25% (OSCAR scaffold + GPT-4o)
  2025         76%+ (specialized GUI agents, surpassing the human baseline)

ScreenSpot-Pro (2025)

Li et al. · arXiv:2504.07981

High-resolution GUI grounding benchmark for professional applications, spanning 23 applications across five industries and three operating systems. Targets tiny UI elements in complex, high-resolution professional displays. At time of publication, the best model (without search augmentation) achieved only 18.9%. OmniParser V2 + GPT-4o reaches 39.6% on the benchmark leaderboard; raw GPT-4o achieves 0.8% — showing the critical importance of specialized screen parsing. The authors’ ScreenSeekeR method achieves 48.1%.


General Capability Benchmarks

GAIA (2023)

Mialon et al. (Meta AI) · arXiv:2311.12983 · ICLR 2024

General AI assistant tasks requiring multi-step reasoning, tool use, and common sense, across three difficulty levels. Human baseline: 92%; GPT-4 with plugins at release: ~15%. Designed to be “easy for humans, hard for AI.”

AgentBench (2023)

Liu et al. · arXiv:2308.03688

8-environment benchmark: operating system, database, knowledge graph, digital card game, lateral thinking puzzles, house-holding (ALFWorld), web shopping, web browsing. First systematic multi-domain agent evaluation. Revealed large gap between GPT-4 and open-source models.

tau-bench (2024)

Yao et al. · arXiv:2406.12045

Tests agent reliability in retail and airline customer service domains via dynamic simulated conversations. Focuses on multi-turn conversation with tool use and policy compliance. Introduces the pass^k metric to measure behavioral consistency across repeated trials. Even state-of-the-art agents (e.g. GPT-4o) succeed on fewer than 50% of tasks and show high inconsistency (pass^8 < 25% in retail).
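The pass^k idea inverts pass@k: instead of asking whether at least one of k trials succeeds, it asks whether all k do. A per-task estimator consistent with that definition can be sketched as follows (an illustration, not tau-bench's reference code):

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """pass^k for one task: with n independent trials, c of which succeed,
    the probability that k randomly chosen trials ALL succeed is
    C(c, k) / C(n, k). Flaky behavior that barely dents pass@k collapses
    pass^k toward zero as k grows."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)
```

An agent that succeeds on 4 of 8 trials has pass^1 = 0.5 but pass^8 = 0; consistency, not occasional success, is what the metric rewards.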

WorkArena (2024)

Drouin et al. (ServiceNow) · arXiv:2403.07718

Enterprise software agent benchmark using the ServiceNow platform. Tests agents on realistic business workflows covering knowledge work tasks. Also introduces BrowserGym, an environment for designing and evaluating web agents.


Long-Horizon & Time-Based Measurement

METR: Task Time Horizons (2025)

METR (Model Evaluation & Threat Research) · metr.org/time-horizons · metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks

A fundamentally different approach: instead of accuracy on fixed tasks, measure the maximum task duration an agent can handle with 50% success rate.

Key findings:

  • The time horizon has doubled approximately every 7 months since 2019
  • Claude 3.7 Sonnet (Mar 2025): ~1 hour horizon
  • GPT-5 (late 2025): ~2 hours 17 minutes
  • Extrapolation: multi-day task capability within a few years

Why this matters: Time horizon maps directly to real-world utility. A 1-hour agent can write a blog post or fix a bug. A 24-hour agent could run a full experiment pipeline. The doubling rate is one of the most striking quantitative facts about agent progress.
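The doubling claim lends itself to back-of-envelope extrapolation. A sketch (the function and its anchor values are illustrative; METR's actual horizon estimates come from fitting success rates across many timed tasks):

```python
from datetime import date

def projected_horizon_minutes(anchor_date: date, anchor_minutes: float,
                              target_date: date,
                              doubling_months: float = 7.0) -> float:
    """Extrapolate the 50%-success task time horizon assuming it doubles
    every `doubling_months` months (METR's observed trend since 2019)."""
    months = (target_date.year - anchor_date.year) * 12 \
        + (target_date.month - anchor_date.month)
    return anchor_minutes * 2 ** (months / doubling_months)
```

Anchored at a ~60-minute horizon in March 2025, 14 months later (two doublings) projects ~240 minutes, i.e. a ~4-hour horizon by mid-2026.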


Science-Specific Evaluation

PaperBench (2025)

Tests ability to reproduce results from ML papers. PaperCoder (Seo et al., ICLR 2026) demonstrates SOTA on this benchmark by transforming papers into working code repositories.

LabBench (Biology, 2024)

FutureHouse · Measures capabilities for biology research tasks. FutureHouse’s agents outperform frontier models on retrieval precision and accuracy.

BixBench (Bioinformatics, 2025)

FutureHouse · Bioinformatics-specific benchmark for scientific agents.


Safety & Adversarial Evaluation

ST-WebAgentBench (2024)

Safety and trustworthiness evaluation for web agents. Tests whether agents correctly follow safety instructions and avoid harmful actions.

AgentDojo (2024)

Security benchmark specifically for tool-calling agents under prompt injection attacks.

AgentHarm (2024)

Evaluates agent tendency toward harmful behaviors across multiple categories.


The Benchmark Lifecycle Problem

A pattern has emerged in agent benchmarking:

  1. Release — New benchmark with human baseline; agent baseline far below
  2. Progress — Models improve dramatically over 12-24 months
  3. Saturation — Top models approach human level; variance dominates
  4. Contamination — Training data likely includes benchmark solutions
  5. Replacement — A new, harder benchmark is released

SWE-bench is currently in stage 3-4. WebArena and GAIA are in stage 2-3. OSWorld is in stage 1-2.

The deeper issue: Every benchmark embeds assumptions about what “good” means. SWE-bench assumes GitHub issues are representative of real coding work. GAIA assumes its tasks represent general intelligence. As the field matures, evaluation methodology itself is becoming a research area.

Key Open Problems in Agent Evaluation

  • How do you evaluate open-ended research tasks?
  • How do you measure reliability over long task horizons without enormous cost?
  • How do you prevent benchmark contamination in an era of internet-scale training?
  • How do you evaluate multi-agent systems where individual performance is not the right unit?
  • How do you measure safety without being able to enumerate all failure modes?


References

Core Benchmarks & Papers

  • SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (Jimenez et al., 2023) — arXiv:2310.06770 · swebench.com · ICLR 2024
  • WebArena: A Realistic Web Environment for Building Autonomous Agents (Zhou et al., 2023) — arXiv:2307.13854
  • Mind2Web: Towards a Generalist Agent for the Web (Deng et al., 2023) — arXiv:2306.06070 · NeurIPS 2023
  • OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (Xie et al., 2024) — arXiv:2404.07972
  • GAIA: A Benchmark for General AI Assistants (Mialon et al., 2023, Meta AI) — arXiv:2311.12983 · ICLR 2024
  • AgentBench: Evaluating LLMs as Agents (Liu et al., 2023) — arXiv:2308.03688 · ICLR 2024
  • ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use (Li et al., 2025) — arXiv:2504.07981

Long-Horizon & Time-Based Evaluation

  • Measuring AI Ability to Complete Long Tasks (METR, 2025) — metr.org/time-horizons · metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks

Science-Specific Benchmarks

  • PaperBench — Tests ability to reproduce results from ML papers
  • PaperCoder: Automated Code Synthesis for Scientific Papers (Seo et al., ICLR 2026)
  • LabBench (FutureHouse, 2024) — Biology research tasks
  • BixBench (FutureHouse, 2025) — Bioinformatics benchmark

Safety & Adversarial Evaluation

  • ST-WebAgentBench: Evaluating Safety and Trustworthiness of Web Agents (2024)
  • AgentDojo: A Security Benchmark for LLM Agents (2024) — Tests robustness to prompt injection attacks
  • AgentHarm: Evaluating the Harmful Behaviors of LLM Agents (2024)

Domain-Specific Benchmarks

  • τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains (Yao et al., 2024) — arXiv:2406.12045 — Retail and airline customer service domains
  • WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? (Drouin et al., ServiceNow, 2024) — arXiv:2403.07718

Practitioner Resources

  • Aider Coding Leaderboard — aider.chat/docs/leaderboards
  • HumanEval Benchmark (Chen et al., OpenAI, 2021) — Python function completion baseline

See also: METR time horizons · SWE-bench in context · Safety benchmarks