Topics
Deep dives into cross-cutting themes in LLM agent research
Ten important themes run across the entire LLM agent field but don’t fit neatly into any single architectural category. This section gives each its own space.
Evaluation & Benchmarks
How do we measure agent capability? SWE-bench, WebArena, GAIA, OSWorld, AgentBench, METR time horizons, and the ongoing challenge of designing benchmarks that resist saturation and overfitting.
Safety & Alignment
Prompt injection, trust hierarchies, sandboxing, reversibility, the transparency gap, and what responsible agent deployment looks like in practice.
Science & Research Agents
FutureHouse Platform, Google AI Co-Scientist, PaperCoder, SkyRL, METR, and the frontier of agents built specifically for scientific discovery.
Personalization & Digital Twins
How agents learn about users, adopt personas, and represent identities — plus digital twins, evaluation benchmarks (LaMP, PersonaGym), and adversarial attacks including DAN jailbreaks, memory poisoning, and system prompt leakage.
Infrastructure & Protocols
MCP, A2A, ACP, OpenAI Agents SDK — the protocols and frameworks that connect agents to tools and each other. Plus tool registries, security considerations, and deployment patterns.
Human-Agent Interaction & Trust
How humans work with autonomous agents — trust calibration, delegation patterns, UX of agentic systems, human-in-the-loop design, and the emerging world of ambient background agents.
Agent Economics
Cost per task, token efficiency, budget-constrained execution, model routing & cascading (FrugalGPT, RouteLLM), prompt compression, and enterprise ROI — the economics of making agents affordable at scale.
Coding Agents
From autocomplete to autonomous software engineering — SWE-agent, Devin, Claude Code, Cursor, OpenHands, benchmarks (SWE-bench, HumanEval, LiveCodeBench), agent-computer interfaces, and the architecture of edit-test-debug loops.
Observability & Robustness
Making agent behavior interpretable, debuggable, and reliable — tracing infrastructure (OpenTelemetry, Langfuse, LangSmith), failure analysis, chain-of-thought faithfulness, guardrails, and production monitoring.
Long-Horizon Autonomy
Agents that work for hours, days, or indefinitely — error accumulation, memory architectures for sustained operation, hierarchical planning, METR time-horizon evaluations, and the autonomy spectrum.
These topics are connected. Evaluation shapes what safety problems we can measure. Safety constrains what autonomy levels are responsible. Scientific agents push the boundary of what evaluation even means — when the “correct” answer is a novel hypothesis, how do you grade it?
Each page collects primary sources, key papers, and synthesis — use them as reference material rather than introductions.