Reasoning & Planning
How LLM agents think, plan, and improve themselves
Overview
Reasoning and planning are the cognitive core of LLM agents. This area has evolved dramatically: from simple chain-of-thought prompting (2022) to structured tree search (2023), to RL-trained reasoning models (2024-2025) that discover reasoning strategies from scratch.
The central questions:
- How can LLMs be prompted to reason step-by-step?
- How can reasoning be structured (trees, graphs, hierarchies)?
- How can agents self-improve via feedback?
- How do we scale reasoning with test-time compute?
- How can agents plan over long horizons?
Plaat et al. (2025) draw an explicit connection to Kahneman’s dual-process theory: standard LLM generation resembles System 1 (fast, intuitive, associative), while deliberate chain-of-thought and tree-search reasoning resembles System 2 (slow, deliberate, effortful). RL-trained reasoning models like o1/o3 and DeepSeek-R1 represent an attempt to move LLMs further into System 2 territory — not through prompting, but through training. The implication: prompting tricks approximate deliberation; only training internalizes it.
Chain-of-Thought: The Foundation
Chain-of-Thought Prompting (2022)
Wei et al. · arXiv:2201.11903 · NeurIPS 2022
The root of all LLM reasoning work. Providing step-by-step reasoning examples in the prompt dramatically improves performance on arithmetic, commonsense, and symbolic tasks. An emergent capability — most powerful in models ≥100B parameters.
- Result: 540B model with 8 CoT examples achieves SOTA on GSM8K math
Zero-Shot Chain-of-Thought (2022)
Kojima et al. · arXiv:2205.11916 · NeurIPS 2022
Astonishing finding: adding “Let’s think step by step” to a prompt, with no examples, elicits multi-step reasoning. Simpler than few-shot CoT but often nearly as effective.
Self-Consistency (2022)
Wang et al. · arXiv:2203.11171 · ICLR 2023
Instead of greedy decoding, sample multiple diverse reasoning paths and take a majority vote. Multiple paths to the same answer increase confidence. Significant gains on arithmetic and commonsense tasks.
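The voting mechanics fit in a few lines. A minimal sketch, assuming some `sampler` callable that returns the final answer of one temperature-sampled reasoning path (stubbed here with a fixed answer sequence rather than a real model):

```python
from collections import Counter
from itertools import cycle

def self_consistency(question, sampler, n=5):
    """Sample n reasoning paths and majority-vote their final answers."""
    answers = [sampler(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Stub standing in for temperature-sampled CoT generations: most paths
# reach the right answer, a few do not.
_fake_answers = cycle(["18", "17", "18", "18", "20"])
def stub_sampler(question):
    return next(_fake_answers)

print(self_consistency("How many apples remain?", stub_sampler, n=5))  # -> 18
```

In practice the only change from standard CoT is sampling with temperature > 0 and parsing each path's final answer before voting.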
Least-to-Most Prompting (2022)
Zhou et al. · arXiv:2205.10625
Decompose a hard problem into easier sub-problems, then solve them sequentially, each building on the previous. Better generalization than standard CoT on compositional tasks.
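The control flow can be sketched with two hypothetical callables — `decompose` (one LLM call producing the ordered sub-problems) and `solve` (one call per sub-problem, conditioned on earlier answers) — stubbed here with a toy arithmetic example:

```python
def least_to_most(problem, decompose, solve):
    """Solve sub-problems in order; each sees all earlier answers."""
    answers = []
    for sub in decompose(problem):
        answers.append(solve(sub, answers))
    return answers[-1]

# Stubs standing in for LLM calls: decompose splits a nested arithmetic
# problem; solve substitutes the previous answer for the ANS placeholder.
def stub_decompose(problem):
    return ["2 + 3", "ANS * 4"]

def stub_solve(sub, answers):
    expr = sub.replace("ANS", str(answers[-1])) if answers else sub
    return eval(expr)

print(least_to_most("(2 + 3) * 4", stub_decompose, stub_solve))  # -> 20
```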
Structured Reasoning: Trees, Graphs, and Beyond
Tree of Thoughts (2023)
Yao et al. · arXiv:2305.10601 · NeurIPS 2023
Major advance over CoT. Treats problem-solving as tree search. The model generates multiple candidate “thoughts” (intermediate reasoning steps), evaluates each, and uses BFS or DFS to explore promising branches. Enables backtracking and lookahead.
- Game of 24: 4% (CoT) → 74% (ToT)
- Key ideas: Thought decomposition; self-evaluation of intermediate states; deliberate exploration
- GitHub: princeton-nlp/tree-of-thought-llm
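The BFS variant can be sketched generically, assuming a `propose` callable (one LLM call generating candidate next thoughts) and a `score` callable (the paper uses LLM self-evaluation of intermediate states); the demo uses toy numeric states in place of real thoughts:

```python
def tot_bfs(root, propose, score, beam=3, depth=3):
    """Tree-of-Thoughts BFS: expand every frontier state into candidate
    thoughts, score each candidate, and keep only the `beam` most
    promising branches at every depth."""
    frontier = [root]
    for _ in range(depth):
        candidates = [t for state in frontier for t in propose(state)]
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam]
    return max(frontier, key=score)

# Toy stubs: states are numbers, a "thought" either increments or
# doubles, and the value itself is the score.
print(tot_bfs(1, lambda s: [s + 1, s * 2], lambda s: s))  # -> 8
```

Backtracking falls out of the beam: a branch whose score drops simply stops being expanded, unlike linear CoT which is committed to its single trace.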
Graph of Thoughts (2023)
Besta et al. · arXiv:2308.09687 · AAAI 2024
Extends Tree of Thoughts to arbitrary graph structures: thoughts can merge (aggregate), loop, or branch in non-tree patterns. More expressive; outperforms ToT on sorting tasks (+62%) with lower compute.
Algorithm of Thoughts (2023)
Sel et al. · arXiv:2308.10379
Encodes classic algorithms (DFS, BFS) directly into the reasoning trace. The LLM follows a structured algorithmic process, improving reliability on search and optimization problems.
ReAct: Synergizing Reasoning and Acting (2023)
Yao et al. · arXiv:2210.03629 · ICLR 2023
Interleaves reasoning traces with tool-use actions. The model thinks about what to do, acts, observes results, and thinks again. The canonical agent reasoning loop.
(See Foundations for full entry)
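The loop itself is compact. A minimal sketch, assuming an `llm` callable that emits either an `Action: tool[arg]` step or a final `Answer:` line (scripted below in place of a real model), and a dict of tool functions:

```python
def react_loop(task, llm, tools, max_steps=5):
    """Thought/Action/Observation loop: the model reasons, calls a tool,
    sees the observation appended to the transcript, and reasons again
    until it emits a final answer."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        if step.startswith("Answer:"):
            return step[len("Answer:"):].strip()
        # Assumes actions are formatted "Action: tool[argument]".
        action = step.split("Action:")[1].strip()
        tool, arg = action.split("[", 1)
        transcript += f"Observation: {tools[tool](arg.rstrip(']'))}\n"
    return None

# Scripted model responses standing in for real LLM calls.
_script = iter([
    "Thought: I need the population of France.\nAction: search[France population]",
    "Answer: about 68 million",
])
result = react_loop("Population of France?", lambda t: next(_script),
                    {"search": lambda q: "France has about 68 million people."})
print(result)  # -> about 68 million
```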
Planning Architectures
LLM+P: Combining LLMs with Classical Planning (2023)
Liu et al. · arXiv:2304.11477 · ICML 2024
LLMs translate natural language problem descriptions into PDDL (a formal planning language); a classical planner then solves the formal problem, with soundness and (when an optimal planner is used) optimality guarantees. Combines LLM flexibility with planning rigor.
- Key insight: LLMs are good at understanding; classical planners are good at optimizing
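The division of labor can be sketched as a two-stage pipeline. Both stages are stubbed: the hypothetical `stub_translator` returns a hand-written Blocksworld problem (what the LLM would be prompted to produce), and `stub_planner` stands in for a real solver such as Fast Downward:

```python
def llm_plus_p(nl_problem, llm_to_pddl, planner, domain_pddl):
    """LLM+P pipeline sketch: the LLM translates the NL description into
    a PDDL problem file; a classical planner solves domain + problem."""
    problem_pddl = llm_to_pddl(nl_problem, domain_pddl)
    return planner(domain_pddl, problem_pddl)

# Stub translation for "put block b on block a" in a Blocksworld domain.
def stub_translator(nl_problem, domain_pddl):
    return """(define (problem stack-b-on-a) (:domain blocksworld)
      (:objects a b)
      (:init (clear a) (clear b) (ontable a) (ontable b) (handempty))
      (:goal (on b a)))"""

# Stub planner returning the optimal two-action plan.
def stub_planner(domain, problem):
    return ["(pick-up b)", "(stack b a)"]

plan = llm_plus_p("Put block b on block a.", stub_translator, stub_planner,
                  domain_pddl="blocksworld.pddl")
print(plan)  # -> ['(pick-up b)', '(stack b a)']
```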
Plan-and-Execute Agents (2023)
Chase / LangChain · Blog post
Separate the planner (generates a high-level multi-step plan) from the executor (carries out each step). Cleaner architecture than monolithic ReAct; easier to monitor and debug.
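The separation is just two callables with a results buffer between them; a sketch with a stubbed planner (returns the full step list) and executor (runs one step with earlier results in view):

```python
def plan_and_execute(goal, planner, executor):
    """Planner drafts the whole multi-step plan up front; the executor
    then carries out each step, seeing the results of earlier steps."""
    plan = planner(goal)
    results = []
    for step in plan:
        results.append(executor(step, results))
    return results

# Stubs standing in for the planner and executor LLM calls.
steps = plan_and_execute(
    "summarize then translate",
    planner=lambda goal: ["summarize the document", "translate the summary"],
    executor=lambda step, prior: f"done: {step}",
)
print(steps[-1])  # -> done: translate the summary
```

Because the plan is explicit data rather than implicit in a ReAct transcript, each step can be logged, retried, or approved independently.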
DEPS: Describe, Explain, Plan and Select (2023)
Wang et al. · arXiv:2302.01560
Interactive planning framework for open-world agents (Minecraft). Uses natural language descriptions and explanations to guide selection of sub-tasks, with error recovery.
Inner Monologue: Embodied Reasoning via Language Feedback (2022)
Huang et al. · arXiv:2207.05608 · CoRL 2022
Robots form “inner monologues” — natural language reasoning about failures, informed by environment feedback. Iterative plan refinement based on real-world observations.
Reasoning via Planning (RAP) (2023)
Hao et al. · arXiv:2305.14992
Uses the LLM as both a world model and a reasoning agent. Monte Carlo Tree Search over a space of reasoning actions, with the LLM evaluating states. Outperforms CoT and ToT on mathematical reasoning.
ADaPT: As-Needed Decomposition and Planning (2023)
Trivedi et al. · arXiv:2311.05772 · NAACL 2024
Recursive decomposition that re-decomposes sub-tasks if they fail. Addresses brittleness of fixed decomposition plans.
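The as-needed recursion can be sketched with two hypothetical callables, `execute` (attempt the task, return success) and `decompose` (split it); the toy stubs treat a task as executable only when it is short enough:

```python
def adapt(task, execute, decompose, depth=0, max_depth=3):
    """ADaPT-style control: attempt the task directly; only when
    execution fails, decompose it and recurse on each sub-task."""
    if execute(task):
        return True
    if depth >= max_depth:
        return False
    return all(adapt(sub, execute, decompose, depth + 1, max_depth)
               for sub in decompose(task))

# Toy stubs: execution succeeds on tasks of <= 3 chars; decomposition
# splits the task string in half.
can_do = lambda t: len(t) <= 3
split = lambda t: [t[:len(t) // 2], t[len(t) // 2:]]
print(adapt("abcdefgh", can_do, split))  # -> True
```

The contrast with fixed decomposition is that sub-tasks that succeed directly are never split, so plan depth adapts to actual difficulty.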
Step-Back Prompting (2023)
Zheng et al. · arXiv:2310.06117 · ICLR 2024
Derives high-level abstractions and first principles before solving specific instances. Improves PaLM-2L on MMLU Physics by 7%, Chemistry by 11%, TimeQA by 27%, and multi-hop reasoning (MuSiQue) by 7%. “Step back, think at the abstract level, then solve.”
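The technique is just two chained calls; a sketch with scripted responses standing in for the model (prompt wordings are illustrative, not the paper's exact templates):

```python
def step_back(question, llm):
    """Step-Back Prompting: first elicit the governing principle, then
    answer the concrete question conditioned on that abstraction."""
    principle = llm(f"What general principle is behind this question?\n{question}")
    return llm(f"Principle: {principle}\nApply it to answer: {question}")

# Scripted responses standing in for a real model.
def stub_llm(prompt):
    if prompt.startswith("What general principle"):
        return "Ideal gas law: PV = nRT"
    return "Pressure increases 8-fold."

print(step_back("Temperature doubles and volume drops 4x; what happens to P?",
                stub_llm))  # -> Pressure increases 8-fold.
```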
Decomposed Prompting (2022)
Khot et al. · arXiv:2210.02406 · ICLR 2023
Breaks complex tasks into modular sub-prompts, each solving a specific sub-task. Enables debugging and composition.
Reflection & Self-Improvement
A major theme of 2023: agents that evaluate and improve their own outputs.
Reflexion: Language Agents with Verbal Reinforcement Learning (2023)
Shinn et al. · arXiv:2303.11366 · NeurIPS 2023
Agents generate verbal reflections on their failures, stored in an episodic memory buffer. On the next attempt, the agent incorporates its own critique. No gradient updates — reinforcement via language alone.
- Results: HumanEval coding: GPT-4 baseline 80% → 91% with Reflexion; significant improvements on decision-making tasks
- Key ideas: Verbal reinforcement; episodic memory of failures; trial-and-learn loop
- GitHub: noahshinn/reflexion
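The trial loop can be sketched with three hypothetical callables — `actor` (attempts the task, conditioned on stored reflections), `evaluator` (pass/fail plus feedback), and `reflector` (turns a failure into a verbal lesson) — stubbed so the actor succeeds only once it can read a reflection:

```python
def reflexion_loop(task, actor, evaluator, reflector, max_trials=3):
    """Reflexion trial loop: act, evaluate; on failure, store a verbal
    self-reflection that conditions the next attempt. No gradient updates."""
    memory = []                        # episodic buffer of reflections
    attempt = None
    for _ in range(max_trials):
        attempt = actor(task, memory)
        ok, feedback = evaluator(attempt)
        if ok:
            return attempt
        memory.append(reflector(task, attempt, feedback))
    return attempt

# Stubs: the actor fails until memory contains a reflection to learn from.
actor = lambda task, mem: "fixed solution" if mem else "buggy solution"
evaluator = lambda a: (a == "fixed solution", "test_foo failed")
reflector = lambda task, a, fb: f"Last time '{a}' failed with: {fb}. Avoid it."
print(reflexion_loop("write foo()", actor, evaluator, reflector))  # -> fixed solution
```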
Self-Refine: Iterative Refinement with Self-Feedback (2023)
Madaan et al. · arXiv:2303.17651 · NeurIPS 2023
Single model generates output, critiques it, then revises it — iteratively. Works across code, dialogue, math, essay writing. No additional training data needed.
- Key insight: Same model can generate and critique; iteration improves quality
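Unlike Reflexion, there is no cross-trial memory: one output is refined in place. A sketch, with `generate`, `critique`, and `revise` standing in for three prompts to the same model (stubbed so the critic objects until the draft contains an example):

```python
def self_refine(prompt, generate, critique, revise, max_iters=3):
    """Self-Refine loop: one model drafts, critiques its own draft, and
    revises, stopping once the critique finds nothing to fix."""
    output = generate(prompt)
    for _ in range(max_iters):
        feedback = critique(prompt, output)
        if feedback is None:           # critic is satisfied
            return output
        output = revise(prompt, output, feedback)
    return output

# Stubs standing in for three prompts to the same model.
generate = lambda p: "Recursion is a function calling itself."
critique = lambda p, o: None if "E.g." in o else "Add a concrete example."
revise = lambda p, o, fb: o + " E.g., factorial(n) = n * factorial(n-1)."
print(self_refine("Explain recursion", generate, critique, revise))
```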
CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing (2023)
Gou et al. · arXiv:2305.11738 · ICLR 2024
Extends self-critique by grounding verification in tool use — uses web search, code execution, or calculators to verify claims, then corrects based on evidence.
- Key insight: Tool-grounded critique is more reliable than pure self-evaluation
Constitutional AI / Self-Critique (2022)
Bai et al. (Anthropic) · arXiv:2212.08073
Uses a set of principles (“constitution”) to guide the model in critiquing and revising its outputs. Foundation for AI alignment work with implications for agent safety.
Search-Based Planning
Monte Carlo Tree Search for LLM Reasoning (2024)
Multiple papers explore MCTS over reasoning traces:
- Scaling LLM Test-Time Compute (Snell et al., 2024) · arXiv:2408.03314 — Shows that adaptively allocating test-time compute can outperform using a much larger model
- RAP with Monte Carlo (Hao et al., 2023) — MCTS over world-model states
- MCTS+Reflexion combinations explored in 2024
Large Language Monkeys: Sampling and Majority Vote (2024)
Brown et al. · arXiv:2407.21787
Repeated sampling reveals that coverage — the fraction of problems for which any sampled solution is correct — scales smoothly with the number of samples, even when each individual attempt has low success probability. Majority voting, by contrast, plateaus. Implications for test-time compute allocation.
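Coverage here is pass@k. The standard unbiased estimator (introduced for HumanEval by Chen et al., 2021, and used in this line of work) computes, from c correct completions out of n drawn, the probability that at least one of k random samples is correct:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: P(>= 1 of k samples correct), given c of n correct."""
    if n - c < k:
        return 1.0   # too few incorrect samples to fill all k slots
    return 1.0 - comb(n - c, k) / comb(n, k)

# A sampler that is right only 10% of the time already covers roughly
# two-thirds of problems at k=10 — the scaling the paper measures.
print(round(pass_at_k(10, 1, 1), 3))   # -> 0.1
print(round(pass_at_k(100, 10, 10), 2))
```

Computing `1 - C(n-c, k) / C(n, k)` instead of `1 - (1 - c/n)**k` avoids the bias that the naive formula incurs when sampling without replacement from the n drawn completions.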
Reasoning Models: RL for Reasoning (2024-2025)
A paradigm shift: instead of prompting strategies, train the model to reason using reinforcement learning.
OpenAI o1 / o3 (2024-2025)
OpenAI · Technical report
Models trained with RL to produce long internal “chains of thought” before answering. Dramatically improves performance on math, coding, and scientific reasoning. Marks the transition from “prompting for reasoning” to “models that reason natively.”
- o1 benchmark: Near-human performance on Olympiad math; 89th percentile on competitive programming
- o3: Solves 25% of FrontierMath problems; scores 75.7% on ARC-AGI-1 (87.5% at high compute)
- Implication for agents: Reasoning capability is now a first-class model feature
DeepSeek-R1 (2025)
DeepSeek AI · arXiv:2501.12948
Open-source reasoning model trained with GRPO (Group Relative Policy Optimization). The R1-Zero variant shows that strong reasoning can emerge from pure RL, without supervised reasoning traces — the model discovers reasoning strategies from scratch.
- Key insight: Reasoning emerges from RL reward; “aha moments” observed in training
- Impact: First open-weight model matching o1-level reasoning on math
s1: Simple Test-Time Scaling (2025)
Muennighoff et al. · arXiv:2501.19393
Shows that a small, fine-tuned open model can approach o1-preview performance on math reasoning by training on a curated set of 1,000 difficult reasoning problems, then controlling test-time compute with budget forcing (appending "Wait" to extend the thinking trace).
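The budget-forcing mechanic is tiny. A sketch, assuming a `generate` callable that produces thinking tokens until the model tries to stop (stubbed with fixed-size bursts) and a token counter:

```python
def budget_forcing(question, generate, count_tokens, min_tokens=256):
    """s1-style budget forcing: whenever the model ends its thinking
    before the budget is spent, append 'Wait' so it keeps reasoning."""
    trace = ""
    while True:
        trace += generate(question, trace)
        if count_tokens(trace) >= min_tokens:
            return trace
        trace += "\nWait, "   # suppress the stop; nudge further thinking

# Stubs: each generation burst is 100 whitespace-delimited tokens, and
# counting just splits on whitespace.
burst = lambda q, t: "think " * 100
n_tokens = lambda t: len(t.split())
trace = budget_forcing("hard problem", burst, n_tokens, min_tokens=256)
print(n_tokens(trace) >= 256)  # -> True
```

In the real system the same idea is applied at the decoding level: the end-of-thinking delimiter is suppressed and "Wait" is appended, which frequently leads the model to re-check and correct its own reasoning.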
World Models for Planning
Language Models as World Models (2023)
Research thread exploring LLMs as simulators of environment dynamics for planning purposes.
- Reasoning via Planning (RAP) — explicit world model use for tree search
- WorldGPT — generating world simulations for agent planning
- Long-horizon planning remains an open problem; world models help bridge the gap
Benchmarks for Reasoning & Planning
| Benchmark | Focus | Key Papers |
|---|---|---|
| GSM8K | Grade school math | Cobbe et al. (2021) |
| MATH | Competition math | Hendrycks et al. (2021) |
| HumanEval | Code generation | Chen et al. (2021) |
| Game of 24 | Combinatorial reasoning | Tree of Thoughts (2023) |
| ALFWorld | Embodied planning | Shridhar et al. (2021) |
| WebShop | Web navigation + purchasing | Yao et al. (2022) |
| MuSiQue | Multi-hop QA | Trivedi et al. (2022) |
| FrontierMath | Research-level math | Glazer et al. (2024) |
| ARC-AGI | Abstract reasoning | Chollet (2019) |
Key Concepts & Taxonomy
Reasoning Paradigms
| Paradigm | Description | Strength |
|---|---|---|
| Chain-of-Thought | Linear reasoning trace | Simple, widely applicable |
| Self-Consistency | Majority vote over multiple paths | Robustness, reliability |
| Tree of Thoughts | Branching search over thoughts | Complex problems, backtracking |
| ReAct | Interleaved reasoning + action | Tool use, grounded reasoning |
| Reflexion | Verbal RL from failure | Iterative improvement |
| RL Reasoning (o1) | Trained chain-of-thought | Deep, complex reasoning |
Planning Strategies
| Strategy | When to Use |
|---|---|
| Direct | Simple tasks, clear actions |
| Plan-and-Execute | Multi-step tasks, need structure |
| Hierarchical | Long-horizon tasks, sub-goal decomposition |
| Reflective/Iterative | Tasks with verifiable outcomes |
| Search-Based | Complex optimization, multiple valid paths |
References
Chain-of-Thought & Reasoning Foundations
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022) — arXiv:2201.11903 — NeurIPS 2022 — Root work showing step-by-step reasoning dramatically improves performance
- Large Language Models are Zero-Shot Reasoners (Kojima et al., 2022) — arXiv:2205.11916 — NeurIPS 2022 — Zero-shot CoT: adding “Let’s think step by step” elicits reasoning without examples
- Self-Consistency Improves Chain of Thought Reasoning in Language Models (Wang et al., 2022) — arXiv:2203.11171 — ICLR 2023 — Majority voting over multiple reasoning paths improves robustness
- Least-to-Most Prompting Enables Complex Reasoning in Large Language Models (Zhou et al., 2022) — arXiv:2205.10625 — Decompose hard problems into easier sub-problems solved sequentially
Structured Reasoning & Tree Search
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models (Yao et al., 2023) — arXiv:2305.10601 — NeurIPS 2023 — Major advance: treats reasoning as tree search with backtracking — GitHub: princeton-nlp/tree-of-thought-llm
- Graph of Thoughts: Solving Elaborate Problems with Large Language Models (Besta et al., 2023) — arXiv:2308.09687 — AAAI 2024 — Extends tree search to arbitrary graph structures
- Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models (Sel et al., 2023) — arXiv:2308.10379 — Encodes classic algorithms (DFS, BFS) directly into reasoning traces
Agent Reasoning Loop
- ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2023) — arXiv:2210.03629 — ICLR 2023 — Canonical agent reasoning loop: interleaves thinking, action, and observation
Planning Architectures
- LLM+P: Empowering Large Language Models with Optimal Planning Proficiency (Liu et al., 2023) — arXiv:2304.11477 — ICML 2024 — Translates NL to PDDL for classical planners
- Plan-and-Execute Agents (Chase / LangChain, 2023) — Blog post — Separates planner from executor for cleaner architecture
- Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents (Wang et al., 2023) — arXiv:2302.01560 — Interactive planning with error recovery for Minecraft
- Inner Monologue: Embodied Reasoning through Planning with Language Models (Huang et al., 2022) — arXiv:2207.05608 — CoRL 2022 — Natural language reasoning about failures for robot planning
- Reasoning with Language Model is Planning with World Model (RAP) (Hao et al., 2023) — arXiv:2305.14992 — LLM as world model + reasoning agent with MCTS
- ADaPT: As-Needed Decomposition and Planning with Language Models (Trivedi et al., 2023) — arXiv:2311.05772 — NAACL 2024 — Recursive decomposition with re-decomposition on failure
- Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models (Zheng et al., 2023) — arXiv:2310.06117 — ICLR 2024 — Derive high-level abstractions before solving instances
- Decomposed Prompting: A Modular Approach for Solving Complex Tasks (Khot et al., 2022) — arXiv:2210.02406 — ICLR 2023 — Breaks tasks into modular sub-prompts
Reflection & Self-Improvement
- Reflexion: Language Agents with Verbal Reinforcement Learning (Shinn et al., 2023) — arXiv:2303.11366 — NeurIPS 2023 — Agents critique failures and improve via episodic memory — GitHub: noahshinn/reflexion
- Self-Refine: Iterative Refinement with Self-Feedback (Madaan et al., 2023) — arXiv:2303.17651 — NeurIPS 2023 — Single model generates, critiques, revises iteratively
- CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing (Gou et al., 2023) — arXiv:2305.11738 — ICLR 2024 — Grounds verification in tool use for correction
- Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022) — arXiv:2212.08073 — Uses principles to guide model critique and revision
Test-Time Compute & Search
- Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters (Snell et al., 2024) — arXiv:2408.03314 — Adaptive test-time scaling outperforms larger models
- Large Language Monkeys: Scaling Inference Compute with Repeated Sampling (Brown et al., 2024) — arXiv:2407.21787 — Coverage (fraction of problems solved by any sample) scales over four orders of magnitude with repeated sampling; majority voting plateaus in domains without automatic verifiers
RL-Trained Reasoning Models
- OpenAI o1 & o3 (OpenAI, 2024-2025) — Technical report — Models trained with RL to produce long internal reasoning chains; near-human Olympiad math performance
- DeepSeek-R1 (DeepSeek AI, 2025) — arXiv:2501.12948 — Open-source reasoning model with GRPO training; discovers reasoning from scratch
- s1: Simple Test-Time Scaling (Muennighoff et al., 2025) — arXiv:2501.19393 — Small fine-tuned models approach o1-preview performance via curated hard problems and budget forcing
World Models for Planning
- Reasoning with Language Model is Planning with World Model (RAP) (Hao et al., 2023) — arXiv:2305.14992 — Explicit world models for tree search in planning
Benchmarks Referenced
- Training Verifiers to Solve Math Word Problems (Cobbe et al., 2021) — GSM8K benchmark
- Measuring Mathematical Problem Solving With the MATH Dataset (Hendrycks et al., 2021) — MATH benchmark
- Evaluating Large Language Models Trained on Code (Chen et al., 2021) — HumanEval benchmark
- ALFWorld: Aligning Text and Embodied Environments for Interactive Learning (Shridhar et al., 2021)
- WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents (Yao et al., 2022)
- MuSiQue: Multihop Questions via Single-hop Question Composition (Trivedi et al., 2022) — MuSiQue benchmark
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them (Suzgun et al., 2022)
- FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning (Glazer et al., 2024)
- The Measure of Intelligence (François Chollet, 2019) — ARC-AGI benchmark
Theoretical & Cognitive Foundations
- Agentic Large Language Models: A Survey (Plaat et al., 2025) — arXiv:2503.23037 — References Kahneman’s Thinking, Fast and Slow dual-process theory connection to System 1 (fast) vs System 2 (slow) reasoning
Continue to Multi-Agent Systems →