Memory, Tools & Actions

How agents remember, act, and interact with the world

Overview

Three capabilities separate LLM agents from simple chatbots:

  1. Memory — persisting and retrieving information beyond the context window
  2. Tool use — calling external APIs, search, code execution
  3. Actions — operating in digital environments (browsers, GUIs, terminals, robots)

Together, these form the “body” of an agent — the mechanisms through which an LLM interacts with the world outside of pure text generation.


Memory Systems

The Memory Taxonomy

Drawing from cognitive science, the agent memory literature converges on four types:

Type | Description | Example
Working memory | Current context window | The agent's active prompt
Episodic memory | Timestamped records of experiences | "On Tuesday I searched for X and found Y"
Semantic memory | Facts about the world | Knowledge bases, entity stores
Procedural memory | Skills, how-to knowledge | Voyager's skill library; code functions

The challenge: LLM context windows are finite. All four types must be managed, compressed, and retrieved efficiently.

MemGPT: Towards LLMs as Operating Systems (2023)

Packer et al. · arXiv:2310.08560

The seminal memory management paper. Inspired by OS virtual memory: defines fast (in-context) and slow (external storage) memory tiers. The LLM manages its own memory via function calls — writing to external storage, loading relevant pages back into context. Enables effectively unlimited context for long-form documents and multi-session conversations.

  • Key ideas: Interrupt-driven memory management; tiered storage; self-managed context; conversation state persistence
  • GitHub: cpacker/MemGPT → evolved into Letta (production memory agent platform)
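
The paging idea can be illustrated with a minimal sketch (not MemGPT's actual API): a bounded "core" buffer standing in for the context window, with overflow evicted to an archive that the agent can query back on demand.

```python
# Minimal sketch of MemGPT-style tiered memory (illustrative, not the real API).
# "Core" memory lives in the prompt and has a hard capacity; overflow is
# evicted to archival storage, and entries can be recalled on demand.

class TieredMemory:
    def __init__(self, core_capacity: int = 3):
        self.core: list[str] = []        # fast tier: stays in the context window
        self.archive: list[str] = []     # slow tier: external storage
        self.core_capacity = core_capacity

    def write(self, entry: str) -> None:
        """Append to core memory, paging out the oldest entry if full."""
        self.core.append(entry)
        while len(self.core) > self.core_capacity:
            self.archive.append(self.core.pop(0))  # evict oldest to archive

    def recall(self, keyword: str) -> list[str]:
        """Page archived entries matching a keyword back into view."""
        return [e for e in self.archive if keyword.lower() in e.lower()]

mem = TieredMemory(core_capacity=2)
for note in ["user prefers dark mode", "user lives in Berlin", "user speaks German"]:
    mem.write(note)

print(mem.core)                 # two most recent notes remain in context
print(mem.recall("dark mode"))  # evicted note recovered from the archive
```

In the real system the LLM itself issues the write/recall operations as function calls, interrupt-driven by context pressure.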

Generative Agents: Memory Architecture (2023)

Park et al. · arXiv:2304.03442

The Smallville paper introduced the most influential episodic memory architecture. Agents record all observations as natural language memories; a reflection mechanism periodically synthesizes these into higher-level insights; a planning module uses reflections to generate daily schedules.

  • Retrieval: Scores each memory by a weighted combination of recency, importance, and relevance, returning the top-ranked memories
  • Reflection: “What 5 high-level insights can I infer from my recent memories?” — creates semantic memories from episodic ones
  • Impact: This memory architecture is widely used/adapted in subsequent agent papers
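
The retrieval rule can be sketched directly. The paper uses equal weights over three normalized scores with exponential recency decay; the decay constant and example numbers below are illustrative.

```python
# Sketch of Generative Agents-style memory retrieval. Each memory gets a
# recency, importance, and relevance score in [0, 1]; retrieval ranks by
# their (equally weighted) sum. Numbers here are made up for illustration.

def score_memory(hours_since_access: float, importance: float, relevance: float,
                 decay: float = 0.995) -> float:
    recency = decay ** hours_since_access   # exponential decay since last access
    return recency + importance + relevance

memories = [
    # (description, hours since last access, importance, relevance-to-query)
    ("ate breakfast", 1.0, 0.1, 0.0),
    ("discussed election with Sam", 30.0, 0.8, 0.9),
    ("saw a dog in the park", 2.0, 0.3, 0.1),
]

ranked = sorted(memories, key=lambda m: score_memory(m[1], m[2], m[3]), reverse=True)
print(ranked[0][0])  # "discussed election with Sam" wins on importance + relevance
```

Note how a stale but important, relevant memory outranks fresher trivia: recency alone never dominates the other two terms.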

ReadAgent: Gist Memory for Long Documents (2024)

Lee et al. (Google) · arXiv:2402.09727

Extends effective context window 3.5–20× by mimicking human reading. Divides documents into episodes → compresses each into a “gist memory” → retrieves raw episode content on demand for specific questions.

  • Outperforms retrieval baselines on long-document QA tasks
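
The gist-memory loop reduces to a simple data flow, sketched below. A real system uses an LLM both to write each gist and to decide which episode to re-read; here the "gist" is just the episode's first sentence and lookup is word overlap, purely to show the mechanics.

```python
# Sketch of ReadAgent's episode → gist → on-demand expansion loop.

def make_gists(episodes: list[str]) -> list[str]:
    # Stand-in for LLM compression: keep only the first sentence.
    return [ep.split(".")[0] + "." for ep in episodes]

def lookup(question: str, episodes: list[str], gists: list[str]) -> str:
    """Pick the gist sharing the most words with the question, then
    return the full (uncompressed) episode behind it."""
    q = set(question.lower().split())
    overlaps = [len(q & set(g.lower().split())) for g in gists]
    return episodes[overlaps.index(max(overlaps))]

episodes = [
    "The treaty was signed in 1648. It ended decades of war across Europe.",
    "Trade routes expanded afterwards. Merchants prospered in the new peace.",
]
gists = make_gists(episodes)
print(gists)                                                  # compressed view kept in context
print(lookup("when was the treaty signed", episodes, gists))  # raw episode fetched on demand
```

Only the short gists occupy the context window between questions, which is where the 3.5–20× effective-context gain comes from.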

Cognitive Architectures for Language Agents (2023)

Sumers et al. · arXiv:2309.02427

Grounds agent memory in cognitive science (ACT-R, SOAR). Proposes the four-memory framework. Essential reading for understanding memory design from first principles.

mem0: Universal Memory Layer (2024-2025)

mem0ai · github.com/mem0ai/mem0 · mem0.ai · Y Combinator-backed

The leading open-source production memory layer for AI agents. Provides multi-level memory (User, Session, Agent state) with adaptive personalization. Key benchmarks vs. full-context and OpenAI Memory:

  • +26% accuracy over OpenAI Memory on the LOCOMO benchmark (source needed)
  • 91% faster responses than full-context approaches (source needed)
  • 90% lower token usage than full-context (critical cost reduction) (source needed)

Core capabilities: remembers user preferences, adapts to individual needs over time, provides cross-platform SDKs (Python, TypeScript) and a managed cloud service. v1.0.0 release (2025) added improved vector store support and enhanced GCP integration.

GitHub: mem0ai/mem0 — 27k+ stars (as of early 2025)

Letta (formerly MemGPT) (2024-2025)

letta-ai · github.com/letta-ai/letta · letta.com

MemGPT evolved into Letta — a full platform for building stateful agents with advanced memory. Where the original MemGPT was a research prototype demonstrating OS-inspired memory paging, Letta is a production-ready platform:

  • Letta Code — coding CLI agent with memory, skills, and subagents (open-source)
  • Letta API — stateful agent API for building agents into applications (Python + TypeScript SDKs)
  • Memory architecture — tiered memory blocks that are explicitly managed, inspectable, and modifiable

The key advance over vanilla MemGPT: Letta’s memory blocks are explicitly represented, not just paged in/out — agents can inspect and modify their own memory programmatically. This enables genuinely stateful agents that improve over time. Letta also maintains a model leaderboard for agent capability evaluation.
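
The idea of explicitly represented memory blocks can be sketched as follows (this illustrates the concept, not Letta's actual SDK): named, size-bounded regions of the prompt that the agent reads and rewrites via tool calls.

```python
# Illustrative sketch of self-editable memory blocks (not Letta's real API).

class MemoryBlock:
    def __init__(self, label: str, value: str, limit: int = 200):
        self.label, self.value, self.limit = label, value, limit

    def rewrite(self, new_value: str) -> None:
        """Agent-invoked edit; the size limit keeps the block in budget."""
        if len(new_value) > self.limit:
            raise ValueError(f"block '{self.label}' exceeds {self.limit} chars")
        self.value = new_value

def render(blocks: dict[str, "MemoryBlock"]) -> str:
    """Compile all blocks into the prompt section the model actually sees."""
    return "\n".join(f"<{b.label}>\n{b.value}\n</{b.label}>" for b in blocks.values())

blocks = {
    "persona": MemoryBlock("persona", "Helpful research assistant."),
    "human": MemoryBlock("human", "Name unknown."),
}
blocks["human"].rewrite("Name: Ada. Prefers concise answers.")  # agent self-edit
print(render(blocks))
```

Because the blocks are plain, labeled state rather than opaque pages, they can be inspected, diffed, and persisted across sessions.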

GitHub: letta-ai/letta — 14k+ stars (as of early 2025)

ENGRAM (2025)

arXiv:2511.12960

Lightweight orchestration for conversational agents managing multiple memory types with normalized schemas. Production-oriented: focuses on memory governance and constraints.

Episodic Memory is the Missing Piece (2025)

arXiv:2502.06975

Argues episodic memory (timestamped experience records) is the most important and neglected component for long-term agents. Proposes consolidation of episodes into parametric/semantic knowledge.

Long-Context Models & Memory (2024-2025)

Models with 100k–2M token contexts (Llama 3.1 at 128k, Claude 3.5 at 200k, Gemini 1.5 Pro at 2M) reduce but don’t eliminate memory management needs. Challenges remain:

  • Lost in the middle: Models attend poorly to information in the middle of long contexts
  • Cost: 1M token contexts are expensive per call
  • Episodic consolidation: Still needed for sessions spanning days/weeks

Tool Use & Function Calling

Toolformer (2023)

Schick et al. (Meta) · arXiv:2302.04761

Trained models to insert API calls into their own text generation, self-supervised. The model learns to decide which tool to call, when, and with what arguments — from minimal human annotation.

  • Tools: Calculator, Wikipedia search, Q&A, translation, calendar
  • Legacy: Prefigures the function-calling paradigm now standard everywhere
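
The execution side of this scheme is simple to sketch: the model emits inline markers in its text, and a post-processor finds them, runs the tool, and splices the result back. The marker syntax below mirrors the paper's style but is illustrative.

```python
import re

# Sketch of Toolformer-style inline API calls. The model's output contains
# markers like [Calculator(6 * 7)]; a post-processor executes each call and
# substitutes the result into the text stream.

TOOLS = {
    # Toy calculator: arithmetic only, builtins disabled.
    "Calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def execute_inline_calls(text: str) -> str:
    def run(match: re.Match) -> str:
        tool, arg = match.group(1), match.group(2)
        return TOOLS[tool](arg) if tool in TOOLS else match.group(0)
    return re.sub(r"\[(\w+)\((.*?)\)\]", run, text)

print(execute_inline_calls("The answer is [Calculator(6 * 7)]."))  # The answer is 42.
```

Toolformer's training contribution was teaching the model *where* to emit such calls self-supervised; the executor itself is this simple.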

OpenAI Function Calling (Jun 2023)

OpenAI

The most consequential practical milestone in agent tooling. Released June 2023, it gave developers a reliable, structured way to connect LLMs to tools via JSON schemas. Became the de facto standard — adopted by Anthropic (tool use), Google (function calling), and all major frameworks.

  • Enabled the entire LangChain/AutoGen/CrewAI ecosystem to mature rapidly
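
The contract is worth seeing concretely: a tool is declared as a JSON schema, and the model returns only the chosen function name plus JSON arguments; the application performs the actual call. The tool and dispatch helper below are illustrative.

```python
import json

# The JSON-schema tool shape popularized by OpenAI function calling.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

# The model never executes anything; it returns the call as structured data:
model_call = {"name": "get_weather", "arguments": '{"city": "Paris", "unit": "celsius"}'}

def dispatch(call: dict, registry: dict) -> str:
    """Application-side execution of the model's requested call."""
    return registry[call["name"]](**json.loads(call["arguments"]))

result = dispatch(model_call, {"get_weather": lambda city, unit="celsius":
                               f"18°{unit[0].upper()} in {city}"})
print(result)  # 18°C in Paris
```

This inversion — model proposes, application executes — is what made tool use safe and reliable enough to standardize across providers.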

Gorilla: Large Language Model Connected with Massive APIs (2023)

Patil et al. · arXiv:2305.15334

Fine-tuned LLMs specifically for API calling with reduced hallucination. Introduced APIBench for evaluation. Showed that fine-tuning for tool use substantially outperforms prompting alone.

ToolLLM (2023)

Qin et al. · arXiv:2307.16789

Framework for training LLMs to use 16,000+ real-world REST APIs. Uses DFSDT (Depth-First Search-based Decision Tree) for multi-step tool selection. ToolBench dataset and ToolEval benchmark.

HuggingGPT / JARVIS (2023)

Shen et al. · arXiv:2303.17580

LLM as orchestrator of hundreds of Hugging Face models — treating each ML model as a “tool.” Demonstrated multi-modal capability through composition.

AnyTool: Self-Reflective, Hierarchical API Selection (2024)

arXiv:2402.04253

Hierarchical retrieval + self-reflection for selecting the right API from thousands of candidates.

CodeAct: Code as Actions (2024)

Wang et al. · arXiv:2402.01030 · ICML 2024

Unifies tool use and reasoning by making code the action space. Agents write Python/bash scripts to accomplish tasks; execution feedback enables iterative correction. More expressive than discrete tool calls.
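
A minimal CodeAct-style executor shows why execution feedback matters: the agent's "action" is a code snippet, and either its stdout or its traceback becomes the observation for the next turn. This is a sketch of the loop, not the paper's implementation.

```python
import io, contextlib, traceback

# Sketch of a CodeAct-style action executor: run the agent's Python snippet,
# return stdout on success or the traceback on failure as the observation.

def run_action(code: str) -> str:
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
        return "OK:\n" + buf.getvalue()
    except Exception:
        return "ERROR:\n" + traceback.format_exc(limit=1)

# First attempt has a bug; the error observation would be fed back to the
# model, which then emits a corrected action.
print(run_action("print(sum([1, 2, '3']))"))   # observation: a TypeError
print(run_action("print(sum([1, 2, 3]))"))     # corrected action succeeds
```

Because arbitrary Python is far more expressive than a fixed tool menu, one action can compose loops, conditionals, and multiple library calls.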

Large Language Models as Tool Makers (2023)

Cai et al. · arXiv:2305.17126

Agents don’t just use tools — they create new tools as reusable functions for future use. Tool creation compounds capability over time.


Action Spaces

Web Browsing Agents

WebGPT (2021) · arXiv:2112.09332
First major browser-using agent. GPT-3 fine-tuned with RLHF to browse and answer questions.

Mind2Web (2023) · arXiv:2306.06070
Dataset of 2,350 web tasks across 137 websites. Introduced generalist web agent challenge.

WebArena (2023) · arXiv:2307.13854
5 functional websites (e-commerce/OneStopShop, Reddit, GitLab, Map, Wikipedia), 812 realistic tasks. Baseline: ~14% for GPT-4.

WebAgent (Google, 2023) · arXiv:2307.12856
Modular approach: HTML summarization + planning + grounded execution.

GUI & Computer Use Agents

CogAgent (2024) · arXiv:2312.08914
Visual language model fine-tuned for GUI understanding and navigation.

AppAgent (2023) · arXiv:2312.13771
Smartphone GUI agent using screenshots + XML accessibility trees.

UFO / UFO² (Microsoft) · arXiv:2402.07939
Windows desktop agent using UI Automation framework.

OS-Copilot (2024) · arXiv:2402.07456
General computer control agent.

OSWorld (2024) · arXiv:2404.07972
Benchmark for desktop GUI tasks across real apps. Agents started <10%; specialized models reached 72.5% by late 2024.

Anthropic Computer Use (Oct 2024) — anthropic.com/news/introducing-computer-use
First major LLM provider to offer native computer control. Spawned the browser-use ecosystem.

Code Execution Agents

OpenHands / OpenDevin — Full dev environment with terminal, browser, editor
SWE-agent — GitHub issue resolution with purpose-built ACI
Aider — Human-in-the-loop coding with git integration

Embodied / Robotics Agents

SayCan (2022) · arXiv:2204.01691
LLM proposes actions; learned affordance function scores physical feasibility. “What is both useful AND doable?”
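
SayCan's selection rule is a product of two probabilities: the LLM's estimate that an action is useful for the instruction, times a learned affordance estimate that the action is feasible in the current state. The numbers below are invented for illustration.

```python
# Sketch of SayCan's scoring rule: pick the action maximizing
# p_useful(action | instruction) * p_feasible(action | state).

llm_usefulness = {        # LLM: does this help with "clean up the spill"?
    "find a sponge": 0.6,
    "find a vacuum": 0.3,
    "go to the moon": 0.1,
}
affordance = {            # Value function: can the robot actually do it now?
    "find a sponge": 0.9,
    "find a vacuum": 0.2,  # no vacuum in sight
    "go to the moon": 0.0,
}

best = max(llm_usefulness, key=lambda a: llm_usefulness[a] * affordance[a])
print(best)  # "find a sponge" — useful AND doable
```

The product form means an action scoring zero on either axis is never chosen, which is how physically impossible but linguistically plausible plans get filtered out.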

PaLM-E (2023) · arXiv:2303.03378
562B parameter embodied multimodal language model. Directly ingests sensor data.

RT-2 (Google, 2023) · arXiv:2307.15818
Vision-language-action model. Web-scale training + robot data; emergent robot generalization.

AgentFS: The Filesystem for Agents (2025)

Turso · github.com/tursodatabase/agentfs

A SQLite-based filesystem explicitly designed for agent state management. See Community & Independent Agents → for full coverage.

  • Every file operation recorded in SQLite → full audit trail
  • Snapshot and restore agent state at any point
  • Single-file portability (the entire agent runtime is one .db file)
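
The audit-trail idea can be illustrated with plain sqlite3 (the schema below is invented for illustration, not AgentFS's actual design): every file operation is a row, so current state is a query over history and the whole runtime ships as one .db file.

```python
import sqlite3

# Illustrative append-only file log in SQLite: writes are rows, reads
# reconstruct current content from the most recent write to a path.

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE ops (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT, path TEXT, content TEXT)""")

def write_file(path: str, content: str) -> None:
    db.execute("INSERT INTO ops (op, path, content) VALUES ('write', ?, ?)",
               (path, content))

def read_file(path: str) -> str:
    """Current content = most recent write to that path."""
    row = db.execute("SELECT content FROM ops WHERE op='write' AND path=? "
                     "ORDER BY seq DESC LIMIT 1", (path,)).fetchone()
    return row[0] if row else ""

write_file("/notes.txt", "draft 1")
write_file("/notes.txt", "draft 2")
print(read_file("/notes.txt"))                               # latest state
print(db.execute("SELECT COUNT(*) FROM ops").fetchone()[0])  # full history preserved
```

Snapshot/restore falls out for free: restoring to any point is replaying the log up to a chosen `seq`.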

Retrieval-Augmented Agents

RAG is the baseline approach for giving agents access to large knowledge bases.

Self-RAG (2023)

Asai et al. · arXiv:2310.11511

Agent decides when to retrieve (not every query needs retrieval), generates reflective tokens to evaluate retrieved passages, and critiques its own outputs. More efficient and accurate than naive RAG.


Agentic RAG Patterns

As agent frameworks matured in 2024, several RAG patterns emerged:

  • Query planning: Decompose a complex query into sub-queries before retrieval
  • Iterative retrieval: Retrieve → read → identify gaps → retrieve again
  • HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer, embed it, retrieve similar documents
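
HyDE is the least intuitive of these, so a toy version helps. The hypothetical answer below is hand-written as a stand-in for an LLM generation, and a bag-of-words Jaccard score stands in for dense-embedding cosine similarity; only the data flow matches the real pattern.

```python
import re

# Toy HyDE: retrieve by similarity to a *hypothetical answer*, not the query.

def embed(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))  # stand-in for an embedding

def similarity(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b)                      # Jaccard instead of cosine

docs = [
    "Photosynthesis converts sunlight, water, and CO2 into glucose and oxygen.",
    "The stock market closed higher on strong earnings reports.",
]

query = "how do plants make food"
# Step 1: generate a hypothetical answer (hand-written stand-in for the LLM).
hypothetical = ("Plants make food through photosynthesis, using sunlight, "
                "water, and CO2 to produce glucose.")
# Step 2: embed the hypothetical answer and retrieve the most similar document.
best = max(docs, key=lambda d: similarity(embed(hypothetical), embed(d)))
print(best)  # the photosynthesis passage
```

The point: the raw query shares no vocabulary with the correct passage, but the hypothetical answer does — answer-shaped text lands closer to answer-shaped documents in embedding space.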

References

Memory Systems Papers

  • MemGPT: Towards LLMs as Operating Systems (Packer et al., 2023) — arXiv:2310.08560
  • Generative Agents: Interactive Simulacra of Human Behavior (Park et al., 2023) — arXiv:2304.03442
  • Cognitive Architectures for Language Agents (Sumers et al., 2023) — arXiv:2309.02427
  • ReadAgent: Gist Memory for Long Documents (Lee et al., 2024) — arXiv:2402.09727
  • ENGRAM: Efficient Orchestration of LLM Agents with Multi-Memory Types (2025) — arXiv:2511.12960
  • Episodic Memory is the Missing Piece in Long-Context Language Agents (2025) — arXiv:2502.06975

Tool Use & Function Calling Papers

  • Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al., 2023) — arXiv:2302.04761
  • Gorilla: Large Language Model Connected with Massive APIs (Patil et al., 2023) — arXiv:2305.15334
  • ToolLLM: Facilitating Large Language Models to Master 16,000+ Real-world APIs (Qin et al., 2023) — arXiv:2307.16789
  • HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face (Shen et al., 2023) — arXiv:2303.17580
  • AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls (2024) — arXiv:2402.04253
  • Executable Code Actions Elicit Better LLM Agents (Wang et al., 2024) — arXiv:2402.01030
  • Large Language Models as Tool Makers (Cai et al., 2023) — arXiv:2305.17126

Web Browsing Agents

  • WebGPT: Browser-Assisted Question-Answering with Human Feedback (2021) — arXiv:2112.09332
  • Mind2Web: Towards a Generalist Agent for the Web (2023) — arXiv:2306.06070
  • WebArena: A Realistic Web Environment for Building Autonomous Agents (Zhou et al., 2023) — arXiv:2307.13854
  • A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis (Gur et al., Google, 2023) — arXiv:2307.12856

GUI, Desktop & Computer Use Agents

  • CogAgent: A Visual Language Model for GUI Agents (2024) — arXiv:2312.08914
  • AppAgent: Multimodal Agents as Smartphone Users (2023) — arXiv:2312.13771
  • UFO: A UI-Focused Agent for Windows OS Interaction (Microsoft, 2024) — arXiv:2402.07939
  • OS-Copilot: Towards Generalist Computer Agents with Open-Ended Goals and Open-Platform Actions (2024) — arXiv:2402.07456
  • OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (Xie et al., 2024) — arXiv:2404.07972

Embodied & Robotics Agents

  • Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (Ahn et al., 2022) — arXiv:2204.01691
  • PaLM-E: An Embodied Multimodal Language Model (2023) — arXiv:2303.03378
  • RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (Google, 2023) — arXiv:2307.15818

Retrieval-Augmented Agents

  • Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (Asai et al., 2023) — arXiv:2310.11511


Full chronology in the Timeline →. Continue to 2024–2026 Frontier →