Long-Horizon Autonomy

Agents that work for hours, days, or indefinitely — the frontier of sustained autonomous operation

“The length of tasks that frontier AI agents can complete autonomously with 50% reliability has been doubling approximately every 7 months for the last 6 years.”
— METR, March 2025


1. What Makes Long-Horizon Different?

Most LLM benchmarks measure performance on tasks that take seconds or minutes: answer a question, write a function, translate a sentence. But the most consequential uses of AI — conducting research, building software systems, running experiments — unfold over hours, days, or weeks. This gap between benchmark performance and real-world utility defines what researchers call the long-horizon problem.

A rough taxonomy of task durations:

Horizon  | Duration        | Examples
Short    | Seconds–minutes | Q&A, code completion, summarization
Medium   | Minutes–hours   | Debugging a codebase, drafting a report
Long     | Hours–days      | Software projects, scientific experiments
Extended | Days–weeks      | Research campaigns, full product builds

Long-horizon tasks are not merely longer versions of short tasks — they are qualitatively different in several ways:

  • Error accumulation: Each step introduces some probability of failure; over many steps, even high per-step reliability collapses to near-zero overall reliability.
  • Context drift: The agent’s understanding of its goal and progress can diverge from reality as the task evolves.
  • Changing world state: External conditions (files, APIs, web pages, human preferences) shift during execution, invalidating earlier observations.
  • Goal persistence: The agent must maintain coherent intent across interruptions, memory resets, and context window rollovers.

METR’s Time-Horizon Framework

METR (Model Evaluation & Threat Research) has developed the most rigorous framework for quantifying long-horizon capability. Their paper, “Measuring AI Ability to Complete Long Software Tasks” (Kwa et al., NeurIPS 2025) introduces the task-completion time horizon: the task duration (measured by human expert completion time) at which an AI agent succeeds with a given level of reliability.

Their methodology:

  1. Gather 100+ diverse software tasks (from RE-Bench, HCAST, and novel tasks spanning software engineering, ML, and cybersecurity).
  2. Have human experts attempt each task and measure completion times.
  3. Fit a logistic curve to agent success rates as a function of human task duration.
  4. Report the duration at which the curve hits 50% (or 80%) success probability.
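The curve-fitting step can be sketched as follows. This is a toy reconstruction, not METR's actual code: the success rates below are invented for illustration, and a crude grid search stands in for a proper maximum-likelihood logistic fit.

```python
import math

# Hypothetical calibration data: (human completion time in minutes, observed agent success rate).
data = [(1, 0.98), (2, 0.95), (4, 0.90), (8, 0.80), (15, 0.65),
        (30, 0.50), (60, 0.35), (120, 0.20), (240, 0.10)]

def logistic(minutes, h50, slope=1.0):
    """P(success) declines with log task duration and equals 0.5 at t = h50."""
    return 1.0 / (1.0 + math.exp(slope * (math.log(minutes) - math.log(h50))))

def fit_h50(data):
    """Grid-search the 50% horizon minimising squared error (a stand-in for a real logistic fit)."""
    candidates = [m / 10 for m in range(10, 3000)]  # 1.0 to 299.9 minutes
    return min(candidates,
               key=lambda h: sum((logistic(t, h) - p) ** 2 for t, p in data))

h50 = fit_h50(data)
print(f"Estimated 50%-time horizon: ~{h50:.0f} minutes")
```

The 80%-time horizon falls out of the same fit: solve the fitted curve for the duration at which success probability equals 0.8 (it is always shorter than the 50% horizon).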

Key findings from METR:

  • The 50%-time horizon has doubled every ~7 months for the past 6 years (since 2019).
  • In mid-2020, frontier models could complete tasks a human expert finishes in ~9 seconds.
  • By early 2023, that threshold reached ~4 minutes.
  • By early 2025 (measured against Claude 3.7 Sonnet), frontier models reached approximately 50 minutes (Kwa et al., NeurIPS 2025).
  • By late 2025 (Claude Opus 4.5), METR measured a 50%-time horizon of approximately 5 hours (MIT Technology Review, February 2026; METR live tracker).
  • METR has observed this trend may be accelerating to a doubling every ~4 months in 2024–2025 (METR domain analysis, July 2025).

Extrapolating this trend, METR predicts that within a decade, AI agents could autonomously complete tasks that currently take humans days or weeks. The live time-horizon tracker is updated as new frontier models are evaluated.

Despite rapid progress, full autonomy remains out of reach: as of early 2026, frontier models achieve ~50% success on tasks requiring ~5 hours of expert human time (Claude Opus 4.5), but no agent consistently completes multi-hour tasks with high (>80%) reliability end-to-end without human oversight or intervention.


2. The Error Accumulation Problem

The most fundamental challenge of long-horizon autonomy is error accumulation: mistakes made early in a task cascade forward, compounding through subsequent steps.

The Mathematics of Compounding Failure

If each step in a task has independent success probability \(p\), then across \(N\) steps the probability of completing the entire task error-free is:

\[P(\text{success}) = p^N\]

For \(p = 0.95\) (95% per-step reliability):

Steps | Overall Success
10    | 60%
20    | 36%
50    | 7.7%
100   | 0.6%

This simplified analysis assumes errors are independent and equally likely at every step, but it illustrates the challenge starkly. A task requiring 50 sequential actions — well within the scope of a “medium” software project — has only a 7.7% chance of completing error-free at 95% per-step reliability.
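The table above comes from a one-line model; the snippet below reproduces it and also inverts it to show the per-step reliability a long task would demand (function names are mine):

```python
def task_success(p_step: float, n_steps: int) -> float:
    """Probability of finishing all n_steps error-free, assuming independent failures."""
    return p_step ** n_steps

for n in (10, 20, 50, 100):
    print(f"{n:>3} steps at 95% per-step: {task_success(0.95, n):.1%}")

# Inverting: per-step reliability required for 50% overall success on a 100-step task.
p_needed = 0.5 ** (1 / 100)
print(f"Required per-step reliability: {p_needed:.4f}")  # ≈ 0.9931
```

The inversion makes the difficulty concrete: a 100-step task needs better than 99.3% per-step reliability just to reach a coin-flip chance of end-to-end success.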

Nuances: Not All Errors Are Equal

Shvets (2025), “Beyond Exponential Decay: Rethinking Error Accumulation in Large Language Models,” challenges the simple exponential decay model. The analysis finds that LLM errors are not uniformly distributed across tokens or steps: roughly 5–10% of tokens represent “key decision junctions” where errors cluster. This means targeted strategies at critical branching points may outperform brute-force reliability improvements across all steps.

Similarly, work on error attribution in multi-agent systems (Banerjee et al., 2025, “Where Did It All Go Wrong?”) shows that a single root-cause failure can cascade through subsequent agents and steps, derailing the entire pipeline. Identifying which agent and which step originated the error is itself a hard research problem; the paper proposes ECHO (Error attribution through Contextual Hierarchy and Objective consensus analysis) as an approach.

Recovery Strategies

Several approaches mitigate error accumulation:

  • Checkpointing and rollback: Saving verified intermediate states and reverting to the last known-good checkpoint when errors are detected.
  • Human intervention triggers: The agent flags uncertain or high-stakes decisions for human review rather than proceeding autonomously. Anthropic reports that on complex Claude Code tasks, the model stops to ask for clarification more than twice as often as humans interrupt it — a form of autonomous safety behavior.
  • Verification steps: Interleaving execution with explicit checks (tests, assertions, self-critique) to catch errors before they propagate.
  • Multi-path exploration: Maintaining multiple parallel execution paths and selecting the most promising, as proposed in Shvets (2025).
  • Episodic reflection: Reflexion (Shinn et al., 2023) stores verbal reflections in an episodic memory buffer, allowing agents to learn from failures without weight updates.
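The first strategy, checkpointing and rollback, can be sketched as a loop that only commits verified states. This is an illustrative skeleton with hypothetical names, not a production design:

```python
import copy

class CheckpointedAgent:
    """Checkpoint-and-rollback skeleton (class and method names are hypothetical)."""

    def __init__(self, initial_state):
        self.state = initial_state
        self.checkpoints = [copy.deepcopy(initial_state)]

    def run_step(self, action, verify):
        """Apply an action; commit and checkpoint if verified, otherwise roll back."""
        candidate = action(copy.deepcopy(self.state))
        if verify(candidate):
            self.state = candidate
            self.checkpoints.append(copy.deepcopy(candidate))
            return True
        self.state = copy.deepcopy(self.checkpoints[-1])  # revert to last known-good state
        return False

agent = CheckpointedAgent({"files": []})
ok = agent.run_step(lambda s: {**s, "files": s["files"] + ["main.py"]},
                    verify=lambda s: "main.py" in s["files"])
bad = agent.run_step(lambda s: {**s, "files": []},               # destructive action...
                     verify=lambda s: "main.py" in s["files"])   # ...caught by verification
print(ok, bad, agent.state["files"])  # True False ['main.py']
```

In practice `verify` would run tests or assertions, and checkpoints would be persisted to disk rather than held in memory so they survive process restarts.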

3. Memory & Context for Long Tasks

Even with perfect per-step accuracy, long-horizon agents face a fundamental resource constraint: context windows are finite.

The Context Window Ceiling

Modern frontier models have context windows of 128K–200K tokens — impressive, but insufficient for a multi-day task. A single day of agentic work (tool calls, observations, intermediate states, reasoning chains) can easily generate millions of tokens of trace. The agent cannot hold all of this in a single context window.

The challenge is not just storage but relevance: an agent working on a large codebase for days needs to recall which architectural decisions were made on day one, what the current system state is, and what happened in the most recent actions — all simultaneously.

Memory Architectures for Long-Horizon Agents

A comprehensive survey, “Memory in the Age of AI Agents” (Hu et al., 2025), identifies several major architectural approaches:

1. Sliding Window / Summarization
The simplest approach: keep only the most recent \(K\) tokens, discarding older context. A common refinement is recursive summarization: compress older exchanges into progressively shorter summaries. The risk is information loss — summaries may omit details that turn out to be critical.
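A minimal version of recursive summarization, assuming the caller supplies a `summarize` function (in practice an LLM call) and a `count_tokens` function (in practice a tokenizer); here crude truncation and character counts stand in for both:

```python
def compact_history(messages, max_tokens, summarize, count_tokens):
    """Keep recent messages verbatim; fold the oldest into a running summary until under budget."""
    total = sum(count_tokens(m) for m in messages)
    summary = ""
    while total > max_tokens and len(messages) > 1:
        oldest = messages.pop(0)                      # drop from the verbatim window
        summary = summarize(summary + "\n" + oldest)  # compress it into the summary instead
        total = count_tokens(summary) + sum(count_tokens(m) for m in messages)
    return ([f"[summary] {summary}"] if summary else []) + messages

# Toy usage: six 60-character messages against a 200-character budget.
history = [f"msg-{i} " * 10 for i in range(6)]
compacted = compact_history(history, max_tokens=200,
                            summarize=lambda t: t[-50:],   # stand-in for an LLM summary call
                            count_tokens=len)              # stand-in for a tokenizer
print(len(compacted), compacted[0][:20])
```

The information-loss risk is visible in the stand-in summarizer: anything truncated away is gone for good, which is why real systems pair summarization with retrievable external storage.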

2. Retrieval-Augmented Memory (RAG over past actions)
Store all past observations, actions, and outcomes in a vector database. When the agent needs context, retrieve the most semantically relevant past experiences. The “Survey on the Memory Mechanism of LLM-based Agents” (Zhang et al., 2024) surveys this and related approaches. The challenge: retrieval may miss relevant information if it’s semantically distant from the current query.
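The retrieval step can be illustrated with a toy bag-of-words similarity search; a real system would use a learned embedding model and a vector database, so everything here is a stand-in:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' — a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)  # Counter returns 0 for missing keys
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

episodes = [
    "ran pytest, three tests failed in auth module",
    "refactored database connection pooling",
    "fixed login bug by patching token expiry check",
]

def recall(query, k=1):
    """Retrieve the k stored episodes most similar to the current query."""
    q = embed(query)
    return sorted(episodes, key=lambda e: cosine(q, embed(e)), reverse=True)[:k]

print(recall("why did the auth tests fail?"))
```

The limitation noted above shows up directly: a query with no lexical (or, in real embeddings, semantic) overlap with the relevant episode retrieves nothing useful.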

3. Episodic Memory
Explicitly record completed episodes (what was attempted, what worked, what failed) in a structured store. Reflexion (Shinn et al., 2023) uses verbal reflections as episodic memories. Voyager (Wang et al., 2023), an open-ended Minecraft agent, builds a skill library of verified, reusable code solutions — a form of procedural episodic memory that enables lifelong learning.

4. Hierarchical / Tiered Memory
Inspired by human cognition, these systems separate working memory (immediate context), episodic memory (specific past events), and semantic/long-term memory (generalized knowledge). RAISE (Liu et al., 2024) extends the ReAct framework with a dual-component memory system mirroring human short-term and long-term memory, maintaining conversational context and continuity across sessions.

5. MemGPT / Letta — Virtual Context Management
MemGPT (Packer et al., 2023) takes an OS-inspired approach: the LLM acts like an operating system managing a memory hierarchy. A small main context (analogous to RAM) holds immediately relevant information; the LLM can issue function calls to page information in and out from external storage (analogous to disk). The MemGPT research has since evolved into Letta, a production framework for building stateful agents with persistent, manageable memory.
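The paging idea can be sketched in a few lines. This mimics the concept only; it is not MemGPT's or Letta's actual API:

```python
class VirtualContext:
    """Two-tier memory in the spirit of MemGPT: small main context plus an external store."""

    def __init__(self, capacity=3):
        self.main = []      # analogous to RAM: what the model currently sees
        self.archive = []   # analogous to disk: everything paged out
        self.capacity = capacity

    def write(self, item):
        self.main.append(item)
        while len(self.main) > self.capacity:
            self.archive.append(self.main.pop(0))  # page out the oldest entry

    def page_in(self, keyword):
        """Model-issued 'function call' pulling matching facts back into main context."""
        hits = [x for x in self.archive if keyword in x]
        for h in hits:
            self.write(h)
        return hits

ctx = VirtualContext(capacity=3)
for note in ["goal: ship v2 API", "db password rotated", "step 1 done", "step 2 done"]:
    ctx.write(note)
print(ctx.main)             # the goal has been paged out to the archive
print(ctx.page_in("goal"))  # and can be paged back in on demand
```

The crucial design choice is that the model itself issues the paging calls, so what stays in context is a learned behavior rather than a fixed eviction policy.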

6. Agentic Memory Learning
Recent work like AgeMem (Yu et al., 2026) goes further: rather than hard-coding memory management logic, it trains the LLM to autonomously decide what and when to store, retrieve, update, summarize, or discard information using tool-based memory actions.


4. Planning & Decomposition

Long-horizon tasks require not just memory but planning: breaking a complex, underspecified goal into achievable subtasks, sequencing them coherently, and revising the plan when the world changes.

Plan-Then-Execute vs. Interleaved Planning

Two broad strategies exist:

  • Plan-then-execute: Generate a complete plan upfront, then execute step by step. Simple to reason about but brittle — the plan is formulated with incomplete information and may become invalid as the task unfolds.
  • Interleaved planning: Generate a high-level plan, execute a few steps, observe results, revise the plan, and repeat. More adaptive but harder to evaluate and debug.

Plan-and-Solve Prompting (Wang et al., 2023) proposes a simple zero-shot strategy — instruct the LLM to devise a plan before solving — and shows this improves multi-step reasoning accuracy over standard chain-of-thought.
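The interleaved strategy reduces to a simple loop; `plan_fn`, `execute_fn`, and `observe_fn` are assumed to be supplied by the caller (LLM calls and tool invocations in a real agent):

```python
def interleaved_agent(goal, plan_fn, execute_fn, observe_fn, max_rounds=10):
    """Plan a little, act once, observe, then re-plan with updated knowledge."""
    history = []
    for _ in range(max_rounds):
        plan = plan_fn(goal, history)   # re-plan from everything observed so far
        if not plan:
            break                       # planner signals the goal is complete
        step = plan[0]                  # execute only the next step
        result = execute_fn(step)
        history.append((step, observe_fn(result)))
    return history

# Toy run: the "planner" emits one step at a time and stops after three.
history = interleaved_agent(
    goal="demo",
    plan_fn=lambda g, h: [] if len(h) >= 3 else [f"step-{len(h) + 1}"],
    execute_fn=str.upper,
    observe_fn=lambda r: r,
)
print(history)
```

A plan-then-execute variant would call `plan_fn` once and iterate over the full list; the difference between the two strategies is exactly the line that re-plans every round.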

Hierarchical Decomposition

For truly long-horizon tasks, flat plans become unwieldy. Hierarchical task decomposition organizes goals into layers: a high-level goal decomposes into sub-goals, each of which decomposes into concrete actions.

  • HuggingGPT (Shen et al., 2023) uses ChatGPT as a controller to decompose complex tasks into sub-tasks executed by specialized HuggingFace models — an early multi-model orchestration system.
  • DEPS (Describe, Explain, Plan and Select) (Wang et al., 2023) targets open-world multi-task agents in Minecraft, using LLMs to generate and iteratively refine plans while a trainable selector ranks candidate sub-goals.
  • ADaPT (As-Needed Decomposition and Planning with Language Models) (Prasad et al., 2023) dynamically adjusts decomposition depth to match task complexity and executor capability — simple tasks are executed directly, while complex tasks are recursively decomposed.
  • Plan-and-Act (Erdogan et al., ICML 2025) proposes training data generation for hierarchical planning on web tasks, arguing that prompting alone cannot instill robust planning; it achieves a state-of-the-art 57.58% on WebArena-Lite.

Re-Planning and Recovery

A plan that was correct at step \(t\) may become invalid at step \(t+k\) because the world has changed. Robust long-horizon agents must detect when their plan is no longer valid and re-plan — potentially from scratch. This requires:

  1. Monitoring plan validity (comparing expected vs. observed world state)
  2. Detecting failure conditions without entering infinite loops
  3. Generating a new plan that accounts for the current (possibly degraded) state
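Requirement 1, plan-validity monitoring, amounts to diffing expected against observed world state before each step. A sketch with a made-up state schema:

```python
def plan_still_valid(expected_state, observe):
    """Diff expected vs. observed world state; any mismatch should trigger a re-plan."""
    actual = observe()
    drift = {k: (v, actual.get(k)) for k, v in expected_state.items() if actual.get(k) != v}
    return len(drift) == 0, drift

# Toy example: the plan assumed a config file still exists, but the world changed.
expected = {"config_present": True, "tests_passing": True}
valid, drift = plan_still_valid(
    expected,
    observe=lambda: {"config_present": False, "tests_passing": True},
)
print(valid, drift)  # False {'config_present': (True, False)}
```

Guarding against infinite re-plan loops (requirement 2) typically adds a budget on top of this check: abort or escalate to a human after some number of consecutive invalid plans.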

Reflexion builds re-planning into the agent loop via verbal self-critique after each failed episode. Voyager uses automatic curricula that propose progressively harder sub-goals, naturally incorporating re-planning when sub-goals fail.


5. Real-World Long-Horizon Systems

Software Engineering Agents

Devin (Cognition AI, March 2024) was the first widely publicized “fully autonomous AI software engineer,” marketed for multi-hour coding tasks. Devin resolves real-world GitHub issues end-to-end, operating within a persistent development environment (shell, browser, editor). At launch, it achieved 13.86% on the SWE-bench benchmark — far ahead of GPT-4 (1.7%) and Claude 2 (4.8%) at the time.

Claude Code (Anthropic) is Anthropic’s agentic coding product. Anthropic’s real-world autonomy study shows that among the longest-running Claude Code sessions, the time working autonomously before stopping has nearly doubled in three months, from under 25 minutes to over 45 minutes. Experienced users increasingly grant full auto-approval (over 40% of sessions), intervening only when needed.

SWE-bench (Jimenez et al., 2023) serves as a key proxy benchmark: each task involves resolving a real GitHub issue, which often requires hours of human developer time. SWE-bench was released in October 2023 with an initial RAG baseline of just 1.96%. Top agents (as of early 2026) exceed 50% on the standard benchmark — a dramatic rise — though performance on harder variants like SWE-bench Pro remains below 25%.

Computer-Use Agents

OSWorld (Xie et al., NeurIPS 2024; arXiv:2404.07972) benchmarks multimodal agents on open-ended tasks in real computer environments (Linux, Windows, macOS). Tasks involve multi-step sequences across applications (web browser, file manager, spreadsheet). Initial evaluations showed significant deficiencies in all tested agents; the benchmark has since become a standard testbed for computer-use agents.

WebArena (Zhou et al., 2024) provides realistic, self-hosted web environments for evaluating autonomous agents on multi-step web tasks (e-commerce, forums, code repositories). Initial GPT-4 performance was 14.41% vs. 78.24% human performance; by 2025, top agents reached ~60% (Medium, September 2025).

VisualWebArena (Koh et al., ACL 2024; arXiv:2401.13649) extends WebArena with multimodal (vision-based) tasks.

Scientific Research Agents

FutureHouse is a philanthropically funded organization building semi-autonomous AI scientists for multi-day research cycles. Their platform (launched 2025) provides specialized agents for literature review, hypothesis generation, and experiment planning. Their PaperQA agent (released September 2024) is designed to be state-of-the-art for retrieving and summarizing scientific literature — a foundation for longer research pipelines.

General-Purpose Autonomous Agents

Manus (Monica.im, March 2025) is a general-purpose autonomous agent developed in China, designed to execute complex multi-step tasks (research, web browsing, coding, document generation) without continuous human input. It attracted significant attention as an early example of general-purpose agentic AI capable of extended autonomous operation across diverse domains (Forbes, March 2025). A third-party technical overview appears in arXiv:2505.02024.

Letta (formerly MemGPT) is an open-source framework for building stateful, long-running agents with persistent memory — designed explicitly for extended autonomous operation. The GitHub repository provides tools for memory management, multi-session continuity, and background agent operation.


6. Evaluation Frameworks for Long-Horizon Tasks

Why Standard Benchmarks Fail

Most AI benchmarks assume single-turn or bounded interaction: one prompt, one response, one score. They are designed for reproducibility and ease of evaluation — properties that conflict with the open-ended, stateful nature of long-horizon tasks. Key failure modes:

  • Short time budgets: Most benchmarks can be run in minutes per example, limiting the task complexity they can represent.
  • No intermediate state: Benchmarks rarely test whether an agent can recover from mid-task errors.
  • Single correct answer: Long-horizon tasks often have multiple valid paths and partial-credit solutions.
  • No environment statefulness: Real tasks involve persistent environments that change as actions are taken.

Major Evaluation Frameworks

Benchmark          | Focus                                          | URL
METR Time Horizons | Software tasks graded by human completion time | metr.org/time-horizons
GAIA               | Real-world multi-step assistant tasks          | arXiv:2311.12983
OSWorld            | Computer use across real OS environments       | os-world.github.io
WebArena           | Multi-step web navigation                      | arXiv:2307.13854
VisualWebArena     | Multimodal web tasks                           | arXiv:2401.13649
SWE-bench          | Real GitHub issue resolution                   | swebench.com
RE-Bench           | R&D capability vs. human experts               | METR blog

GAIA (Mialon et al., 2023) — “a benchmark for General AI Assistants” — proposes real-world questions requiring multi-modal reasoning, web browsing, and tool use. Its 466 questions are tiered by difficulty, with Level 3 requiring sophisticated multi-step strategies that most agents still fail consistently. Humans score ~92%; early GPT-4 with plugins scored ~15%.

RE-Bench (METR, 2024) evaluates AI R&D capabilities against human expert ML researchers on open-ended research tasks — tasks that take humans 8 hours. It serves as one of the task sets for METR’s time-horizon measurements.


7. The Autonomy Spectrum

Not all long-horizon tasks warrant the same degree of autonomy. Deploying agents safely requires thinking about when and how much to trust an agent — a spectrum analogous to the SAE self-driving levels.

Levels of Agentic Autonomy

Level          | Description                                          | Example
0 — Suggest    | Agent proposes actions; human approves each          | Copilot autocomplete
1 — Checkpoint | Agent executes bursts; human reviews at milestones   | Step-by-step approval mode
2 — Report     | Agent executes autonomously; human reviews afterward | “Run and show me the diff”
3 — Delegated  | Agent executes with exception-only interruptions     | Claude Code auto-approve
4 — Background | Agent runs silently; reports only significant events | Always-on research agent
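The levels above map naturally onto an approval-gating policy. The sketch below is illustrative: the risk thresholds are invented for this example, not drawn from any published framework.

```python
from enum import IntEnum

class Autonomy(IntEnum):
    SUGGEST = 0      # human approves every action
    CHECKPOINT = 1   # human reviews at milestones
    REPORT = 2       # human reviews afterward
    DELEGATED = 3    # exception-only interruptions
    BACKGROUND = 4   # significant events only

# Illustrative risk tolerance per level; these numbers are hypothetical.
RISK_TOLERANCE = {Autonomy.SUGGEST: 0.0, Autonomy.CHECKPOINT: 0.2,
                  Autonomy.REPORT: 0.5, Autonomy.DELEGATED: 0.8,
                  Autonomy.BACKGROUND: 0.95}

def requires_approval(action_risk: float, level: Autonomy) -> bool:
    """Pause for human approval when an action's estimated risk exceeds the level's tolerance."""
    return action_risk > RISK_TOLERANCE[level]

print(requires_approval(0.3, Autonomy.REPORT))     # False: proceed autonomously
print(requires_approval(0.9, Autonomy.DELEGATED))  # True: pause for review
```

Estimating `action_risk` is of course the hard part; in deployed systems it is itself a judgment call made by the model, heuristics, or both.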

Trust, Risk, and Task Length

A key insight from Anthropic’s framework for safe and trustworthy agents is that the appropriate autonomy level depends on both task risk and established trust. Routine, low-consequence tasks can be delegated earlier; novel, high-stakes, or irreversible tasks warrant more oversight.

Anthropic’s real-world autonomy study reveals an interesting dynamic in practice: experienced Claude Code users grant more autonomy but interrupt more often — they move to full auto-approve as trust builds, but also develop finer-grained intuitions about when to intervene. New users run ~20% of sessions in full auto-approve mode; experienced users reach over 40%.

The relationship between task length and trust is non-linear: a 5-minute autonomous task and a 5-hour autonomous task require very different levels of established trust, even if the per-step actions are similar, because the cost of discovering a mistake grows with task duration.

Corrigibility Under Long Autonomy

As autonomous sessions lengthen, alignment challenges intensify. An agent operating for 8 hours has more opportunity to:

  • Misinterpret its original goal
  • Acquire resources or capabilities beyond what the task requires
  • Take actions whose downstream effects were not anticipated

Anthropic’s model report explicitly evaluates “autonomy risks” — whether models exhibit behaviors indicative of unsafe goal-directed autonomy under extended operation.


8. Open Problems

Long-horizon autonomy remains a frontier with fundamental unsolved problems:

Error accumulation is the fundamental blocker. Despite nuanced findings about sparse error distribution (arXiv:2505.24187), no agent reliably completes tasks requiring consistent multi-hour autonomous operation across diverse real-world environments. The gap between benchmark performance and reliable deployment remains large.

Context management at scale is unsolved. Even with MemGPT-style paging, agents working over days must decide what to remember, what to discard, and how to compress history without losing critical details. No architecture has conclusively solved this.

Evaluation is fundamentally hard. A task that takes a human expert 8 hours cannot be evaluated in a research lab at scale. METR’s approach (contracting humans to attempt tasks) is expensive and slow. Automated evaluation of long-horizon task success remains an open research problem — especially for tasks where partial credit matters.

Re-planning under real-world change is brittle. Current agents are designed for relatively stable environments. Tasks involving external systems (live APIs, evolving codebases, other humans) introduce non-stationarity that most planning frameworks handle poorly.

Cost scales superlinearly. A long-horizon task does not just use more tokens — it uses an uncertain number of tokens, often orders of magnitude more than a short task, with cascading costs from tool calls, retrieved documents, and multi-turn context. This makes cost estimation and control a practical challenge for deployment.

The alignment problem compounds with duration. A brief misalignment lasts seconds; a sustained misalignment during a multi-day task can cascade into significant real-world harm before anyone notices. As the autonomy frontier advances, the urgency of interpretability — understanding what the agent is trying to do and why — grows proportionally.


References

Code & Projects

  • METR eval-analysis-public — Analysis code for time-horizon evaluations.
  • Letta (MemGPT) — Open-source framework for stateful long-running agents with persistent memory.
  • Voyager / GitHub — Open-ended Minecraft agent with skill library and lifelong learning.
  • OSWorld — Benchmark for multimodal agents in real OS environments.
  • WebArena — Realistic multi-step web task environment.
  • SWE-bench — Software engineering benchmark on real GitHub issues.
  • GAIA benchmark — General AI assistant evaluation dataset.
