Memory, Tools & Actions

How agents remember, act, and interact with the world

Overview

Three capabilities separate LLM agents from simple chatbots:

  1. Memory — persisting and retrieving information beyond the context window
  2. Tool use — calling external APIs, search, code execution
  3. Actions — operating in digital environments (browsers, GUIs, terminals, robots)

Together, these form the “body” of an agent — the mechanisms through which an LLM interacts with the world outside of pure text generation.


Memory Systems

The Memory Taxonomy

Drawing from cognitive science, the agent memory literature converges on four types:

Type | Description | Example
Working memory | Current context window | The agent's active prompt
Episodic memory | Timestamped records of experiences | "On Tuesday I searched for X and found Y"
Semantic memory | Facts about the world | Knowledge bases, entity stores
Procedural memory | Skills, how-to knowledge | Voyager's skill library; code functions

The challenge: LLM context windows are finite. All four types must be managed, compressed, and retrieved efficiently.

MemGPT: Towards LLMs as Operating Systems (2023)

Packer et al. · arXiv:2310.08560

The seminal memory management paper. Inspired by OS virtual memory: defines fast (in-context) and slow (external storage) memory tiers. The LLM manages its own memory via function calls — writing to external storage, loading relevant pages back into context. Enables effectively unlimited context for long-form documents and multi-session conversations.

  • Key ideas: Interrupt-driven memory management; tiered storage; self-managed context; conversation state persistence
  • GitHub: cpacker/MemGPT → evolved into Letta (production memory agent platform)
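
The paging idea can be illustrated with a minimal sketch (not MemGPT's actual API): a bounded "core" buffer standing in for the context window, with overflow evicted to an archive that the agent can query back on demand.

```python
# Minimal sketch of MemGPT-style tiered memory (illustrative, not the real API).
# "Core" memory lives in the prompt and has a hard capacity; overflow is
# evicted to archival storage, and entries can be recalled on demand.

class TieredMemory:
    def __init__(self, core_capacity: int = 3):
        self.core: list[str] = []        # fast tier: stays in the context window
        self.archive: list[str] = []     # slow tier: external storage
        self.core_capacity = core_capacity

    def write(self, entry: str) -> None:
        """Append to core memory, paging out the oldest entry if full."""
        self.core.append(entry)
        while len(self.core) > self.core_capacity:
            self.archive.append(self.core.pop(0))  # evict oldest to archive

    def recall(self, keyword: str) -> list[str]:
        """Page archived entries matching a keyword back into view."""
        return [e for e in self.archive if keyword.lower() in e.lower()]

mem = TieredMemory(core_capacity=2)
for note in ["user prefers dark mode", "user lives in Berlin", "user speaks German"]:
    mem.write(note)

print(mem.core)                 # two most recent notes remain in context
print(mem.recall("dark mode"))  # evicted note recovered from the archive
```

In the real system the LLM itself issues the write/recall operations as function calls, interrupt-driven by context pressure.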

Generative Agents: Memory Architecture (2023)

Park et al. · arXiv:2304.03442

The Smallville paper introduced the most influential episodic memory architecture. Agents record all observations as natural language memories; a reflection mechanism periodically synthesizes these into higher-level insights; a planning module uses reflections to generate daily schedules.

  • Retrieval: Scores each memory by a weighted combination of recency, importance, and relevance, returning the top-ranked memories
  • Reflection: “What 5 high-level insights can I infer from my recent memories?” — creates semantic memories from episodic ones
  • Impact: This memory architecture is widely used/adapted in subsequent agent papers
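
The retrieval rule can be sketched directly. The paper uses equal weights over three normalized scores with exponential recency decay; the decay constant and example numbers below are illustrative.

```python
# Sketch of Generative Agents-style memory retrieval. Each memory gets a
# recency, importance, and relevance score in [0, 1]; retrieval ranks by
# their (equally weighted) sum. Numbers here are made up for illustration.

def score_memory(hours_since_access: float, importance: float, relevance: float,
                 decay: float = 0.995) -> float:
    recency = decay ** hours_since_access   # exponential decay since last access
    return recency + importance + relevance

memories = [
    # (description, hours since last access, importance, relevance-to-query)
    ("ate breakfast", 1.0, 0.1, 0.0),
    ("discussed election with Sam", 30.0, 0.8, 0.9),
    ("saw a dog in the park", 2.0, 0.3, 0.1),
]

ranked = sorted(memories, key=lambda m: score_memory(m[1], m[2], m[3]), reverse=True)
print(ranked[0][0])  # "discussed election with Sam" wins on importance + relevance
```

Note how a stale but important, relevant memory outranks fresher trivia: recency alone never dominates the other two terms.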

ReadAgent: Gist Memory for Long Documents (2024)

Lee et al. (Google) · arXiv:2402.09727

Extends effective context window 3.5–20× by mimicking human reading. Divides documents into episodes → compresses each into a “gist memory” → retrieves raw episode content on demand for specific questions.

  • Outperforms retrieval baselines on long-document QA tasks
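
The gist-memory loop reduces to a simple data flow, sketched below. A real system uses an LLM both to write each gist and to decide which episode to re-read; here the "gist" is just the episode's first sentence and lookup is word overlap, purely to show the mechanics.

```python
# Sketch of ReadAgent's episode → gist → on-demand expansion loop.

def make_gists(episodes: list[str]) -> list[str]:
    # Stand-in for LLM compression: keep only the first sentence.
    return [ep.split(".")[0] + "." for ep in episodes]

def lookup(question: str, episodes: list[str], gists: list[str]) -> str:
    """Pick the gist sharing the most words with the question, then
    return the full (uncompressed) episode behind it."""
    q = set(question.lower().split())
    overlaps = [len(q & set(g.lower().split())) for g in gists]
    return episodes[overlaps.index(max(overlaps))]

episodes = [
    "The treaty was signed in 1648. It ended decades of war across Europe.",
    "Trade routes expanded afterwards. Merchants prospered in the new peace.",
]
gists = make_gists(episodes)
print(gists)                                                  # compressed view kept in context
print(lookup("when was the treaty signed", episodes, gists))  # raw episode fetched on demand
```

Only the short gists occupy the context window between questions, which is where the 3.5–20× effective-context gain comes from.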

Cognitive Architectures for Language Agents (2023)

Sumers et al. · arXiv:2309.02427

Grounds agent memory in cognitive science (ACT-R, SOAR). Proposes the four-memory framework. Essential reading for understanding memory design from first principles.

mem0: Universal Memory Layer (2024-2025)

mem0ai · github.com/mem0ai/mem0 · mem0.ai · Y Combinator-backed

The leading open-source production memory layer for AI agents. Provides multi-level memory (User, Session, Agent state) with adaptive personalization. Key benchmarks vs. full-context and OpenAI Memory:

  • +26% accuracy over OpenAI Memory on the LOCOMO benchmark (source needed)
  • 91% faster responses than full-context approaches (source needed)
  • 90% lower token usage than full-context (critical cost reduction) (source needed)

Core capabilities: remembers user preferences, adapts to individual needs over time, provides cross-platform SDKs (Python, TypeScript) and a managed cloud service. v1.0.0 release (2025) added improved vector store support and enhanced GCP integration.

GitHub: mem0ai/mem0 — 27k+ stars (as of early 2025)

Letta (formerly MemGPT) (2024-2025)

letta-ai · github.com/letta-ai/letta · letta.com

MemGPT evolved into Letta — a full platform for building stateful agents with advanced memory. Where the original MemGPT was a research prototype demonstrating OS-inspired memory paging, Letta is a production-ready platform:

  • Letta Code — coding CLI agent with memory, skills, and subagents (open-source)
  • Letta API — stateful agent API for building agents into applications (Python + TypeScript SDKs)
  • Memory architecture — tiered memory blocks that are explicitly managed, inspectable, and modifiable

The key advance over vanilla MemGPT: Letta’s memory blocks are explicitly represented, not just paged in/out — agents can inspect and modify their own memory programmatically. This enables genuinely stateful agents that improve over time. Letta also maintains a model leaderboard for agent capability evaluation.
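
The idea of explicitly represented memory blocks can be sketched as follows (this illustrates the concept, not Letta's actual SDK): named, size-bounded regions of the prompt that the agent reads and rewrites via tool calls.

```python
# Illustrative sketch of self-editable memory blocks (not Letta's real API).

class MemoryBlock:
    def __init__(self, label: str, value: str, limit: int = 200):
        self.label, self.value, self.limit = label, value, limit

    def rewrite(self, new_value: str) -> None:
        """Agent-invoked edit; the size limit keeps the block in budget."""
        if len(new_value) > self.limit:
            raise ValueError(f"block '{self.label}' exceeds {self.limit} chars")
        self.value = new_value

def render(blocks: dict[str, "MemoryBlock"]) -> str:
    """Compile all blocks into the prompt section the model actually sees."""
    return "\n".join(f"<{b.label}>\n{b.value}\n</{b.label}>" for b in blocks.values())

blocks = {
    "persona": MemoryBlock("persona", "Helpful research assistant."),
    "human": MemoryBlock("human", "Name unknown."),
}
blocks["human"].rewrite("Name: Ada. Prefers concise answers.")  # agent self-edit
print(render(blocks))
```

Because the blocks are plain, labeled state rather than opaque pages, they can be inspected, diffed, and persisted across sessions.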

GitHub: letta-ai/letta — 14k+ stars (as of early 2025)

ENGRAM (2025)

arXiv:2511.12960

Lightweight orchestration for conversational agents managing multiple memory types with normalized schemas. Production-oriented: focuses on memory governance and constraints.

Episodic Memory is the Missing Piece (2025)

arXiv:2502.06975

Argues episodic memory (timestamped experience records) is the most important and neglected component for long-term agents. Proposes consolidation of episodes into parametric/semantic knowledge.

Long-Context Models & Memory (2024-2025)

Models with 100k–2M token contexts (Llama 3.1 at 128k, Claude 3.5 at 200k, Gemini 1.5 Pro at 2M) reduce but don’t eliminate memory management needs. Challenges remain:

  • Lost in the middle: Models attend poorly to information in the middle of long contexts
  • Cost: 1M token contexts are expensive per call
  • Episodic consolidation: Still needed for sessions spanning days/weeks

Tool Use & Function Calling

Toolformer (2023)

Schick et al. (Meta) · arXiv:2302.04761

Trained models to insert API calls into their own text generation, self-supervised. The model learns to decide which tool to call, when, and with what arguments — from minimal human annotation.

  • Tools: Calculator, Wikipedia search, Q&A, translation, calendar
  • Legacy: Prefigures the function-calling paradigm now standard everywhere
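
The execution side of this scheme is simple to sketch: the model emits inline markers in its text, and a post-processor finds them, runs the tool, and splices the result back. The marker syntax below mirrors the paper's style but is illustrative.

```python
import re

# Sketch of Toolformer-style inline API calls. The model's output contains
# markers like [Calculator(6 * 7)]; a post-processor executes each call and
# substitutes the result into the text stream.

TOOLS = {
    # Toy calculator: arithmetic only, builtins disabled.
    "Calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def execute_inline_calls(text: str) -> str:
    def run(match: re.Match) -> str:
        tool, arg = match.group(1), match.group(2)
        return TOOLS[tool](arg) if tool in TOOLS else match.group(0)
    return re.sub(r"\[(\w+)\((.*?)\)\]", run, text)

print(execute_inline_calls("The answer is [Calculator(6 * 7)]."))  # The answer is 42.
```

Toolformer's training contribution was teaching the model *where* to emit such calls self-supervised; the executor itself is this simple.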

OpenAI Function Calling (Jun 2023)

OpenAI

The most consequential practical milestone in agent tooling. Released June 2023, it gave developers a reliable, structured way to connect LLMs to tools via JSON schemas. Became the de facto standard — adopted by Anthropic (tool use), Google (function calling), and all major frameworks.

  • Enabled the entire LangChain/AutoGen/CrewAI ecosystem to mature rapidly
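
The contract is worth seeing concretely: a tool is declared as a JSON schema, and the model returns only the chosen function name plus JSON arguments; the application performs the actual call. The tool and dispatch helper below are illustrative.

```python
import json

# The JSON-schema tool shape popularized by OpenAI function calling.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

# The model never executes anything; it returns the call as structured data:
model_call = {"name": "get_weather", "arguments": '{"city": "Paris", "unit": "celsius"}'}

def dispatch(call: dict, registry: dict) -> str:
    """Application-side execution of the model's requested call."""
    return registry[call["name"]](**json.loads(call["arguments"]))

result = dispatch(model_call, {"get_weather": lambda city, unit="celsius":
                               f"18°{unit[0].upper()} in {city}"})
print(result)  # 18°C in Paris
```

This inversion — model proposes, application executes — is what made tool use safe and reliable enough to standardize across providers.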

Gorilla: Large Language Model Connected with Massive APIs (2023)

Patil et al. · arXiv:2305.15334

Fine-tuned LLMs specifically for API calling with reduced hallucination. Introduced APIBench for evaluation. Showed that fine-tuning for tool use substantially outperforms prompting alone.

ToolLLM (2023)

Qin et al. · arXiv:2307.16789

Framework for training LLMs to use 16,000+ real-world REST APIs. Uses DFSDT (Depth-First Search-based Decision Tree) for multi-step tool selection. ToolBench dataset and ToolEval benchmark.

HuggingGPT / JARVIS (2023)

Shen et al. · arXiv:2303.17580

LLM as orchestrator of hundreds of Hugging Face models — treating each ML model as a “tool.” Demonstrated multi-modal capability through composition.

AnyTool: Self-Reflective, Hierarchical API Selection (2024)

arXiv:2402.04253

Hierarchical retrieval + self-reflection for selecting the right API from thousands of candidates.

CodeAct: Code as Actions (2024)

Wang et al. · arXiv:2402.01030 · ICML 2024

Unifies tool use and reasoning by making code the action space. Agents write Python/bash scripts to accomplish tasks; execution feedback enables iterative correction. More expressive than discrete tool calls.
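
A minimal CodeAct-style executor shows why execution feedback matters: the agent's "action" is a code snippet, and either its stdout or its traceback becomes the observation for the next turn. This is a sketch of the loop, not the paper's implementation.

```python
import io, contextlib, traceback

# Sketch of a CodeAct-style action executor: run the agent's Python snippet,
# return stdout on success or the traceback on failure as the observation.

def run_action(code: str) -> str:
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
        return "OK:\n" + buf.getvalue()
    except Exception:
        return "ERROR:\n" + traceback.format_exc(limit=1)

# First attempt has a bug; the error observation would be fed back to the
# model, which then emits a corrected action.
print(run_action("print(sum([1, 2, '3']))"))   # observation: a TypeError
print(run_action("print(sum([1, 2, 3]))"))     # corrected action succeeds
```

Because arbitrary Python is far more expressive than a fixed tool menu, one action can compose loops, conditionals, and multiple library calls.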

Large Language Models as Tool Makers (2023)

Cai et al. · arXiv:2305.17126

Agents don’t just use tools — they create new tools as reusable functions for future use. Tool creation compounds capability over time.


Action Spaces

Web Browsing Agents

WebGPT (2021) · arXiv:2112.09332
First major browser-using agent. GPT-3 fine-tuned with RLHF to browse and answer questions.

Mind2Web (2023) · arXiv:2306.06070
Dataset of 2,350 web tasks across 137 websites. Introduced generalist web agent challenge.

WebArena (2023) · arXiv:2307.13854
5 functional websites (e-commerce/OneStopShop, Reddit, GitLab, Map, Wikipedia), 812 realistic tasks. Baseline: ~14% for GPT-4.

WebAgent (Google, 2023) · arXiv:2307.12856
Modular approach: HTML summarization + planning + grounded execution.

GUI & Computer Use Agents

CogAgent (2024) · arXiv:2312.08914
Visual language model fine-tuned for GUI understanding and navigation.

AppAgent (2023) · arXiv:2312.13771
Smartphone GUI agent using screenshots + XML accessibility trees.

UFO / UFO² (Microsoft) · arXiv:2402.07939
Windows desktop agent using UI Automation framework.

OS-Copilot (2024) · arXiv:2402.07456
General computer control agent.

OSWorld (2024) · arXiv:2404.07972
Benchmark for desktop GUI tasks across real apps. Agents started <10%; specialized models reached 72.5% by late 2024.

Anthropic Computer Use (Oct 2024) — anthropic.com/news/introducing-computer-use
First major LLM provider to offer native computer control. Spawned the browser-use ecosystem.

Code Execution Agents

OpenHands / OpenDevin — Full dev environment with terminal, browser, editor
SWE-agent — GitHub issue resolution with purpose-built ACI
Aider — Human-in-the-loop coding with git integration

Embodied / Robotics Agents

SayCan (2022) · arXiv:2204.01691
LLM proposes actions; learned affordance function scores physical feasibility. “What is both useful AND doable?”
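
SayCan's selection rule is a product of two probabilities: the LLM's estimate that an action is useful for the instruction, times a learned affordance estimate that the action is feasible in the current state. The numbers below are invented for illustration.

```python
# Sketch of SayCan's scoring rule: pick the action maximizing
# p_useful(action | instruction) * p_feasible(action | state).

llm_usefulness = {        # LLM: does this help with "clean up the spill"?
    "find a sponge": 0.6,
    "find a vacuum": 0.3,
    "go to the moon": 0.1,
}
affordance = {            # Value function: can the robot actually do it now?
    "find a sponge": 0.9,
    "find a vacuum": 0.2,  # no vacuum in sight
    "go to the moon": 0.0,
}

best = max(llm_usefulness, key=lambda a: llm_usefulness[a] * affordance[a])
print(best)  # "find a sponge" — useful AND doable
```

The product form means an action scoring zero on either axis is never chosen, which is how physically impossible but linguistically plausible plans get filtered out.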

PaLM-E (2023) · arXiv:2303.03378
562B parameter embodied multimodal language model. Directly ingests sensor data.

RT-2 (Google, 2023) · arXiv:2307.15818
Vision-language-action model. Web-scale training + robot data; emergent robot generalization.

AgentFS: The Filesystem for Agents (2025)

Turso · github.com/tursodatabase/agentfs

A SQLite-based filesystem explicitly designed for agent state management. See Community & Independent Agents → for full coverage.

  • Every file operation recorded in SQLite → full audit trail
  • Snapshot and restore agent state at any point
  • Single-file portability (the entire agent runtime is one .db file)
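
The audit-trail idea can be illustrated with plain sqlite3 (the schema below is invented for illustration, not AgentFS's actual design): every file operation is a row, so current state is a query over history and the whole runtime ships as one .db file.

```python
import sqlite3

# Illustrative append-only file log in SQLite: writes are rows, reads
# reconstruct current content from the most recent write to a path.

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE ops (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT, path TEXT, content TEXT)""")

def write_file(path: str, content: str) -> None:
    db.execute("INSERT INTO ops (op, path, content) VALUES ('write', ?, ?)",
               (path, content))

def read_file(path: str) -> str:
    """Current content = most recent write to that path."""
    row = db.execute("SELECT content FROM ops WHERE op='write' AND path=? "
                     "ORDER BY seq DESC LIMIT 1", (path,)).fetchone()
    return row[0] if row else ""

write_file("/notes.txt", "draft 1")
write_file("/notes.txt", "draft 2")
print(read_file("/notes.txt"))                               # latest state
print(db.execute("SELECT COUNT(*) FROM ops").fetchone()[0])  # full history preserved
```

Snapshot/restore falls out for free: restoring to any point is replaying the log up to a chosen `seq`.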

Retrieval-Augmented Agents

RAG is the baseline approach for giving agents access to large knowledge bases.

Self-RAG (2023)

Asai et al. · arXiv:2310.11511

Agent decides when to retrieve (not every query needs retrieval), generates reflective tokens to evaluate retrieved passages, and critiques its own outputs. More efficient and accurate than naive RAG.


Agentic RAG Patterns

As agent frameworks matured in 2024, several RAG patterns emerged:

  • Query planning: Decompose a complex query into sub-queries before retrieval
  • Iterative retrieval: Retrieve → read → identify gaps → retrieve again
  • HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer, embed it, retrieve similar documents
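
HyDE is the least intuitive of these, so a toy version helps. The hypothetical answer below is hand-written as a stand-in for an LLM generation, and a bag-of-words Jaccard score stands in for dense-embedding cosine similarity; only the data flow matches the real pattern.

```python
import re

# Toy HyDE: retrieve by similarity to a *hypothetical answer*, not the query.

def embed(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))  # stand-in for an embedding

def similarity(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b)                      # Jaccard instead of cosine

docs = [
    "Photosynthesis converts sunlight, water, and CO2 into glucose and oxygen.",
    "The stock market closed higher on strong earnings reports.",
]

query = "how do plants make food"
# Step 1: generate a hypothetical answer (hand-written stand-in for the LLM).
hypothetical = ("Plants make food through photosynthesis, using sunlight, "
                "water, and CO2 to produce glucose.")
# Step 2: embed the hypothetical answer and retrieve the most similar document.
best = max(docs, key=lambda d: similarity(embed(hypothetical), embed(d)))
print(best)  # the photosynthesis passage
```

The point: the raw query shares no vocabulary with the correct passage, but the hypothetical answer does — answer-shaped text lands closer to answer-shaped documents in embedding space.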

References

Memory Systems Papers

  • MemGPT: Towards LLMs as Operating Systems (Packer et al., 2023) — arXiv:2310.08560
  • Generative Agents: Interactive Simulacra of Human Behavior (Park et al., 2023) — arXiv:2304.03442
  • Cognitive Architectures for Language Agents (Sumers et al., 2023) — arXiv:2309.02427
  • ReadAgent: Gist Memory for Long Documents (Lee et al., 2024) — arXiv:2402.09727
  • ENGRAM: Efficient Orchestration of LLM Agents with Multi-Memory Types (2025) — arXiv:2511.12960
  • Episodic Memory is the Missing Piece in Long-Context Language Agents (2025) — arXiv:2502.06975

Tool Use & Function Calling Papers

  • Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al., 2023) — arXiv:2302.04761
  • Gorilla: Large Language Model Connected with Massive APIs (Patil et al., 2023) — arXiv:2305.15334
  • ToolLLM: Facilitating Large Language Models to Master 16,000+ Real-world APIs (Qin et al., 2023) — arXiv:2307.16789
  • HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face (Shen et al., 2023) — arXiv:2303.17580
  • AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls (2024) — arXiv:2402.04253
  • Executable Code Actions Elicit Better LLM Agents (Wang et al., 2024) — arXiv:2402.01030
  • Large Language Models as Tool Makers (Cai et al., 2023) — arXiv:2305.17126

Web Browsing Agents

  • WebGPT: Browser-Assisted Question-Answering with Human Feedback (2021) — arXiv:2112.09332
  • Mind2Web: Towards a Generalist Agent for the Web (2023) — arXiv:2306.06070
  • WebArena: A Realistic Web Environment for Building Autonomous Agents (Zhou et al., 2023) — arXiv:2307.13854
  • A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis (Gur et al., Google, 2023) — arXiv:2307.12856

GUI, Desktop & Computer Use Agents

  • CogAgent: A Visual Language Model for GUI Agents (2024) — arXiv:2312.08914
  • AppAgent: Multimodal Agents as Smartphone Users (2023) — arXiv:2312.13771
  • UFO: A UI-Focused Agent for Windows OS Interaction (Microsoft, 2024) — arXiv:2402.07939
  • OS-Copilot: Towards Generalist Computer Agents with Open-Ended Goals and Open-Platform Actions (2024) — arXiv:2402.07456
  • OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (Xie et al., 2024) — arXiv:2404.07972

Embodied & Robotics Agents

  • Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (Ahn et al., 2022) — arXiv:2204.01691
  • PaLM-E: An Embodied Multimodal Language Model (2023) — arXiv:2303.03378
  • RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (Google, 2023) — arXiv:2307.15818

Retrieval-Augmented Agents

  • Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (Asai et al., 2023) — arXiv:2310.11511


Full chronology in the Timeline →. Continue to 2024–2026 Frontier →