Topics

Deep dives into cross-cutting themes in LLM agent research

Several important themes run across the entire LLM agent field but don’t fit neatly into any single architectural category. This section gives them their own space.

📊

Evaluation & Benchmarks

How do we measure agent capability? SWE-bench, WebArena, GAIA, OSWorld, AgentBench, METR time horizons, and the ongoing challenge of benchmarks that don’t saturate or overfit.

🛡️

Safety & Alignment

Prompt injection, trust hierarchies, sandboxing, reversibility, the transparency gap, and what responsible agent deployment looks like in practice.

🔬

Science & Research Agents

FutureHouse Platform, Google AI Co-Scientist, PaperCoder, SkyRL, METR, and the frontier of agents built specifically for scientific discovery.

🪞

Personalization & Digital Twins

How agents learn about users, adopt personas, and represent identities — plus digital twins, evaluation benchmarks (LaMP, PersonaGym), and adversarial attacks including DAN jailbreaks, memory poisoning, and system prompt leakage.

🔧

Infrastructure & Protocols

MCP, A2A, ACP, OpenAI Agents SDK — the protocols and frameworks that connect agents to tools and each other. Plus tool registries, security considerations, and deployment patterns.

🤝

Human-Agent Interaction & Trust

How humans work with autonomous agents — trust calibration, delegation patterns, UX of agentic systems, human-in-the-loop design, and the emerging world of ambient background agents.

💰

Agent Economics

Cost per task, token efficiency, budget-constrained execution, model routing & cascading (FrugalGPT, RouteLLM), prompt compression, and enterprise ROI — the economics of making agents affordable at scale.
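
The routing-and-cascading idea behind systems like FrugalGPT can be sketched in a few lines: try a cheap model first and escalate to a stronger one only when confidence is low. Everything below is a hand-rolled illustration, not any library’s actual API; `call_model` and `score_confidence` are placeholder stand-ins for a real LLM call and a real verifier.

```python
# Hypothetical cost-saving cascade: cheap model first, escalate on low confidence.
# call_model and score_confidence are stand-ins for real components.

def call_model(name: str, prompt: str) -> str:
    # Placeholder for an LLM API call.
    return f"{name}:{prompt}"

def score_confidence(answer: str) -> float:
    # Placeholder scorer; real systems use a trained verifier
    # or the model's own logprobs.
    return 0.4 if "hard" in answer else 0.9

def cascade(prompt: str, models=("small", "large"), threshold=0.7) -> str:
    answer = ""
    for model in models:  # ordered cheapest to most expensive
        answer = call_model(model, prompt)
        if score_confidence(answer) >= threshold:
            return answer  # the cheap model was good enough
    return answer  # fall through to the strongest model's answer
```

Most queries stop at the cheap model, so average cost per task drops while hard queries still reach the strong model.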

💻

Coding Agents

From autocomplete to autonomous software engineering — SWE-agent, Devin, Claude Code, Cursor, OpenHands, benchmarks (SWE-bench, HumanEval, LiveCodeBench), agent-computer interfaces, and the architecture of edit-test-debug loops.

🔍

Observability & Robustness

Making agent behavior interpretable, debuggable, and reliable — tracing infrastructure (OpenTelemetry, Langfuse, LangSmith), failure analysis, chain-of-thought faithfulness, guardrails, and production monitoring.
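
Tracing an agent run usually means wrapping each step in a span that records its name and duration. The snippet below is a minimal hand-rolled sketch of that pattern, not the actual OpenTelemetry, Langfuse, or LangSmith API.

```python
import time
from contextlib import contextmanager

TRACE: list[dict] = []  # collected spans, in completion order

@contextmanager
def span(name: str):
    # Record wall-clock duration for one agent step.
    start = time.perf_counter()
    try:
        yield
    finally:
        TRACE.append({"name": name, "seconds": time.perf_counter() - start})

# Usage: wrap each stage of an agent loop so failures can be
# localized to a named step after the fact.
with span("plan"):
    time.sleep(0.01)
with span("tool_call"):
    time.sleep(0.01)
```

Real tracers add nesting, attributes, and export to a backend, but the core contract is the same: every step leaves a named, timed record behind.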

🏔️

Long-Horizon Autonomy

Agents that work for hours, days, or indefinitely — error accumulation, memory architectures for sustained operation, hierarchical planning, METR time-horizon evaluations, and the autonomy spectrum.


These topics are connected. Evaluation shapes what safety problems we can measure. Safety constrains what autonomy levels are responsible. Scientific agents push the boundary of what evaluation even means — when the “correct” answer is a novel hypothesis, how do you grade it?

Each page collects primary sources, key papers, and synthesis — use them as reference material rather than introductions.