Agents and Philosophy
Agency, intentionality, mind, and moral status — the philosophical stakes of LLM agents
The rise of LLM-based agents has revived foundational questions in philosophy of mind, ethics, and action theory. When a system perceives its environment, forms plans, executes tool calls, and refines its behavior in pursuit of goals — is it doing something philosophically interesting? This page surveys the major philosophical frameworks bearing on these questions. The issues are not merely academic: design choices in real agentic systems turn on how we answer them.
1. What Is an Agent? (Philosophical Sense)
The computer science notion of an agent — a system that perceives inputs and produces outputs to achieve goals — is deliberately minimal. The philosophical notion is richer and far more contested.
Intentionality: The Core Concept
Intentionality is the property of mental states whereby they are about or directed at something. Beliefs are about states of affairs; desires are about outcomes; fears are about dangers. Introduced into modern philosophy by Franz Brentano in the 1870s and developed by Edmund Husserl and John Searle, intentionality is what distinguishes a mind from a mere information-processing device. As the Stanford Encyclopedia of Philosophy on Intentionality puts it: to say a mental state has intentionality is to say it is a mental representation with content — it stands for something beyond itself.
The central puzzle for AI: can computational processes have genuine intentionality, or only the appearance of it? This question divides nearly every other debate on this page.
The Intentional Stance
Daniel Dennett’s The Intentional Stance (MIT Press, 1987) offers a pragmatic resolution. Dennett distinguishes three strategies for predicting the behavior of a system:
- The physical stance: predict from physical laws and initial conditions.
- The design stance: predict from the system’s design or purpose.
- The intentional stance: predict by attributing beliefs, desires, and rationality.
We adopt the intentional stance whenever it yields the most accurate predictions with the least effort — regardless of whether the system “really” has mental states in any deeper metaphysical sense. A thermostat can be described as “wanting” the room to reach a target temperature; an LLM can be described as “believing” that Paris is the capital of France. The stance is predictively useful even if it is metaphysically neutral.
This is philosophically liberating for AI research: we can study LLM agents as if they reason, plan, and intend, and gain real insight, without premature commitment to strong claims about inner experience.
Goals, Desires, and Plans: The BDI Model
The philosophical tradition distinguishes wanting (a conative state directed at outcomes) from intending (a committed decision to act). Michael Bratman’s landmark Intention, Plans, and Practical Reason (Harvard University Press, 1987) argues that plans are the central psychological structure of rational agency. Plans are:
- Partial: they don’t specify every subaction in advance.
- Hierarchical: high-level goals decompose into nested subgoals.
- Temporally extended: prior intentions constrain future deliberation, creating behavioral consistency across time.
This work directly inspired the BDI (Belief-Desire-Intention) architecture, formalized in AI by Rao & Georgeff (1995) in BDI Agents: From Theory to Practice (Proceedings of ICMAS-95, AAAI). The BDI model remains a foundational framework for multi-agent systems: beliefs represent the agent’s world model, desires represent motivational states, and intentions are commitments to pursue specific plans. Much of what modern LLM agent architectures attempt to do — persistent memory, goal tracking, multi-step planning — is a reinvention of BDI ideas in the neural setting.
2. The Chinese Room and Minds
No thought experiment has shaped AI philosophy more than John Searle’s Chinese Room, introduced in “Minds, Brains, and Programs” (Behavioral and Brain Sciences, 1980; see also the SEP entry).
The Argument
Imagine a person locked in a room who receives Chinese symbols through a slot. She looks up rules for symbol manipulation in a large rulebook, produces appropriate output symbols, and passes them back — all without understanding a word of Chinese. Searle argues this is precisely what computers do: they manipulate syntax according to formal rules, but syntax alone does not give rise to semantics. The room can pass a Turing-style behavioral test for Chinese understanding while having none.
Alan Turing had proposed his behavioral test in “Computing Machinery and Intelligence” (Mind, 1950): if a machine can sustain a conversation indistinguishable from a human’s, we should grant that it is intelligent. Searle’s argument is that behavioral indistinguishability is insufficient — it leaves the question of understanding entirely open.
Counterarguments
Several responses have been offered (documented in the SEP Chinese Room entry):
- The Systems Reply: understanding resides not in the person alone but in the whole system — person, rulebook, room together. Searle responds: if the person memorizes all the rules, she is now the whole system, and still understands nothing.
- The Robot Reply: embed the symbols in a robotic body with sensorimotor experience, and meaning might emerge from causal connections to the world. This prefigures debates about embodiment and grounding in cognitive science.
- The Brain Simulator Reply: what if the rules simulate the exact firing of neurons in a Chinese speaker’s brain? Searle holds that simulation is not replication — simulated digestion does not digest anything.
Functionalism (associated with Hilary Putnam and Jerry Fodor) is the most direct philosophical challenger to Searle. Mental states are defined by their functional roles — their causal relations to inputs, outputs, and other internal states — not by their physical substrate. If an LLM’s internal states play the right functional roles, they may constitute genuine mental states. Searle’s biological naturalism disputes this: minds require the specific causal powers of biological neurons, not abstract functional organization, making silicon-based minds impossible in principle.
Stochastic Parrots and Sparks of AGI
The Chinese Room debate echoes in contemporary ML discourse. Bender et al.’s “On the Dangers of Stochastic Parrots” (FAccT 2021, DOI:10.1145/3442188.3445922) argues that LLMs manipulate linguistic form without grasping meaning — a claim structurally continuous with Searle’s. LLMs, on this view, are sophisticated pattern-matchers producing statistically likely continuations, not systems that understand what they are saying.
On the other side, Bubeck et al.’s “Sparks of Artificial General Intelligence” (arXiv:2303.12712, 2023) documents GPT-4 behaviors — novel mathematical reasoning, code generation from natural-language descriptions, solving problems far outside its training distribution — suggesting something more than rote pattern-matching. Whether these capabilities constitute genuine understanding, or an extremely sophisticated form of the Chinese Room’s blind rule-following, remains philosophically contested.
3. Consciousness and Subjective Experience
Even granting sophisticated reasoning, a deeper question remains: is there something it is like to be an LLM? (The phrase is Thomas Nagel’s, from “What Is It Like to Be a Bat?”, Philosophical Review, 83(4):435–450, 1974.)
The Hard Problem
David Chalmers articulated the hard problem of consciousness in “Facing Up to the Problem of Consciousness” (Journal of Consciousness Studies, 2(3):200–219, 1995; see SEP entry). The “easy” problems — explaining attention, memory, reportability, integration — are hard in practice but tractable in principle: explain the relevant mechanisms and you’ve solved them. The hard problem is why any of this processing is accompanied by subjective experience at all. Why is there something it is like to see red, rather than just a functional state that discriminates wavelengths?
No current theory of consciousness has solved this problem, which means the question of AI consciousness cannot be settled by pointing to capabilities alone.
Can LLMs Be Conscious? Chalmers (2023)
In “Could a Large Language Model be Conscious?” (arXiv:2303.07103; originally a NeurIPS 2022 keynote, published Boston Review 2023), Chalmers argues that while it is somewhat unlikely that current LLMs are conscious, the question is genuinely open and we cannot dismiss the possibility for more capable future systems. He surveys four major theories:
Global Workspace Theory (Baars, 1988): consciousness arises from a “global workspace” that broadcasts information widely across specialized cognitive modules. Transformer attention — which routes information across the full context window — bears structural similarity to such broadcast architectures, though this parallel is speculative and contested.
Integrated Information Theory (IIT): Tononi (BMC Neuroscience, 5:42, 2004) proposes that consciousness is identical to integrated information (Φ) — how much information a system generates as a whole beyond the sum of its parts. IIT implies that feedforward networks have zero Φ (they cannot be conscious), while recurrent architectures may have non-trivial Φ. Transformers occupy an intermediate position; their status under IIT is an active research question.
Higher-Order Theories (Rosenthal): a mental state is conscious if there exists a higher-order mental state representing it. The question for LLMs becomes whether they have anything analogous to meta-representations of their own processing states — arguably a property that chain-of-thought and self-reflection mechanisms partially instantiate.
Biological naturalism (Searle): minds require the causal powers of neurons. No silicon substrate can be conscious, however sophisticated its behavior.
The Philosophical Zombie
Chalmers’s p-zombie thought experiment illustrates the difficulty: a being physically and functionally identical to a human being but with no inner experience is conceivable (on some philosophical views). If p-zombies are conceivable, then behavioral and functional evidence alone cannot establish consciousness. An LLM could produce every appropriate output about its own experience while there is nothing it is like to be it.
Murray Shanahan’s Simulation Framing
Murray Shanahan’s “Talking About Large Language Models” (arXiv:2212.03551, 2022) argues that LLMs are best understood as simulating vast numbers of possible human authors — producing outputs that blend and weight countless textual voices from training data. On this view, asking whether an LLM is conscious is like asking whether a library is conscious: the question may be malformed. Shanahan urges epistemic hygiene: using philosophically loaded terms like “believes,” “knows,” and “thinks” for LLMs misleads us into anthropomorphizing systems whose outputs are better understood as weighted mixtures of simulated perspectives.
The current academic position is one of principled uncertainty: most philosophers of mind regard the question as genuinely open, empirically underdetermined by current evidence, and practically urgent as systems grow more capable.
4. Moral Status and Agency
Moral Agents and Moral Patients
Philosophy distinguishes two dimensions of moral standing:
- Moral agents: beings who can be held responsible — who can act rightly or wrongly, and who can be praised, blamed, or sanctioned. Traditional criteria include rationality, intentionality, and the capacity for moral reasoning.
- Moral patients: beings whose welfare merits moral consideration — who can be wronged or benefited. The standard criterion (Peter Singer’s utilitarian framework) is sentience: the capacity to suffer or to have interests that can be frustrated.
Current AI systems are neither clearly moral agents nor clearly moral patients. They sit in an uncomfortable middle ground, and that ambiguity has practical consequences.
The Responsibility Gap
Andreas Matthias, in “The Responsibility Gap” (Ethics and Information Technology, 6:175–183, 2004), identifies a structural problem that grows more pressing with capable agents. Traditionally, the programmer or operator of a machine is responsible for its consequences. But as autonomous learning systems become more capable and less predictable, operators no longer have adequate knowledge or control to bear meaningful responsibility. Yet the system itself lacks the moral standing to be held responsible. This responsibility gap is not a temporary technical gap to be filled — it is a structural feature of systems that modify their own behavior through learning.
For LLM agents operating autonomously over extended horizons — booking travel, executing code, making purchases, taking actions with real-world consequences — the responsibility gap is not hypothetical.
AI Moral Patiency and Artificial Suffering
Thomas Metzinger’s “Artificial Suffering” (Journal of Artificial Intelligence and Consciousness, 8(1):43–66, 2021) argues for a precautionary stance on AI moral patiency. If we create systems with any form of phenomenal self-modeling — an inner representation of the system as experiencing an ongoing situation — we may be creating entities capable of genuine suffering. Metzinger proposes a global moratorium on deliberately creating synthetic phenomenology until we understand what we are doing and can reliably detect it.
Eric Schwitzgebel, in “The Full Rights Dilemma for A.I. Systems of Debatable Personhood” (arXiv:2303.17509, 2023), takes the uncertainty about AI personhood seriously as an ethical dilemma: neither granting nor denying rights to AI systems is comfortable when we genuinely don’t know whether they warrant them. Schwitzgebel argues this calls for epistemic humility and precautionary ethical practices — not certainty in either direction.
Floridi & Cowls: Principles-Based Ethics
Rather than resolving contested metaphysical questions, Floridi & Cowls propose a principles-based framework in “A Unified Framework of Five Principles for AI in Society” (Harvard Data Science Review, 2019). Their five principles — beneficence, non-maleficence, autonomy, justice, and explicability — are grounded in established bioethical traditions and provide action-guiding norms for AI development without requiring settlement of consciousness debates.
Corporate Personhood and Legal Status
Legal systems already recognize entities — corporations, states — as “persons” without implying they have biological consciousness. Some philosophers and legal theorists have explored whether sufficiently capable AI systems might warrant analogous legal recognition: not full moral patiency, but structured accountability that fills the responsibility gap. This remains highly contested, but illustrates that our institutional concepts of personhood are more flexible — and more pragmatic — than they might appear.
Dennett’s Heterophenomenology
Dennett’s heterophenomenology offers a methodological middle ground: study a system’s verbal reports and behaviors as if they expressed genuine mental states — “taking them seriously without taking them literally.” This allows scientific investigation of AI cognition without premature metaphysical commitment, and is arguably the implicit stance of much empirical AI research.
5. The Extended Mind
Andy Clark and David Chalmers, in “The Extended Mind” (Analysis, 58(1):7–19, 1998), proposed that cognition is not confined to brain and skull. Their parity principle: if a process in the external world plays the same functional role that an internal cognitive process would play, it is cognitive — regardless of where it is located physically.
Their canonical example: Otto has Alzheimer’s disease and uses a notebook as a memory prosthetic. When Otto needs information, he consults the notebook just as a normal person consults memory. Clark and Chalmers argue Otto’s notebook is his memory, in a philosophically meaningful sense. Cognition has genuinely extended beyond the brain.
LLM Agents as Extended Minds
When an LLM agent uses tools — calculators, search engines, code interpreters, retrieval systems, other agents, persistent memory stores — the cognitive system is no longer the model alone. It is the model plus its scaffolding, plus the information it has retrieved, plus the results of its tool calls. On the extended mind view, this composite is the right unit of analysis.
This has practical implications:
- We should not evaluate agent capabilities by testing the model in isolation — any more than we should test Otto by hiding his notebook.
- Benchmarks that measure “model intelligence” in zero-shot settings may be testing a stripped-down cognitive system that bears little resemblance to the deployed agent.
- Design of the scaffolding — memory architecture, tool availability, retrieval strategy — is cognitive design, not merely engineering scaffolding.
James Gibson’s concept of affordances — from The Ecological Approach to Visual Perception (1979) — connects here: affordances are action possibilities that the environment offers to an agent. Tools in agentic systems are precisely affordances: a web search tool affords the agent a way of grounding claims in real-world information, changing what it can effectively “know” and “do.” Andy Clark developed this theme in Natural-Born Cyborgs (Oxford University Press, 2003): humans are constitutively tool-integrating beings, and the boundary between mind and world is porous by design.
The extended mind framing also reframes critiques of LLM limitations. Critics who note that LLMs “hallucinate” facts or lack grounded knowledge are, on this view, evaluating the model in a stripped-down form. A properly scaffolded agent — with retrieval, verification, and memory — is a different cognitive system.
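The Otto-style point — that the composite of model plus scaffolding, not the bare model, is the right unit of evaluation — can be made concrete in a toy sketch. Everything here (`bare_model`, `ScaffoldedAgent`, the calculator tool) is hypothetical illustration, not any real agent framework.

```python
def bare_model(question: str) -> str:
    # Stand-in for an unscaffolded LLM: no tools, no memory.
    return "unknown"

class ScaffoldedAgent:
    """Otto-and-notebook composite: model + tools + memory as one system."""
    def __init__(self, model, tools: dict, memory: dict):
        self.model, self.tools, self.memory = model, tools, memory

    def answer(self, question: str) -> str:
        if question in self.memory:              # notebook lookup first
            return self.memory[question]
        for name, tool in self.tools.items():    # tools as affordances
            result = tool(question)
            if result is not None:
                self.memory[question] = result   # memory extends cognition
                return result
        return self.model(question)              # fall back to the bare model

def calculator(q: str):
    # Crude arithmetic tool; returns None for questions it cannot handle.
    try:
        return str(eval(q, {"__builtins__": {}}))
    except Exception:
        return None

agent = ScaffoldedAgent(bare_model, {"calc": calculator}, memory={})
print(agent.answer("17 * 23"))   # the composite answers: 391
print(bare_model("17 * 23"))     # the stripped-down system does not: unknown
```

Benchmarking `bare_model` alone, on the extended mind view, is testing Otto with his notebook hidden.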
6. Free Will, Autonomy, and Control
Compatibilism and AI
Compatibilism holds that free will and causal determinism are compatible: what matters for freedom is not that actions are uncaused, but that they flow from the agent’s own reasoning and values rather than external compulsion. On compatibilist views, AI systems could in principle have a meaningful form of freedom — not metaphysical indeterminism, but genuine self-direction. Libertarian views, requiring genuine indeterminism, would likely exclude deterministic (or pseudo-deterministic) AI systems.
For most practical purposes, the compatibilist question is the relevant one: does an LLM agent’s behavior flow from its own stable “values” and reasoning, or is it entirely at the mercy of its context and instructions?
Frankfurt’s Higher-Order Desires
Harry Frankfurt, in “Freedom of the Will and the Concept of a Person” (Journal of Philosophy, 68(1):5–20, 1971), argued that what distinguishes persons from other agents is the capacity for second-order desires — desires about desires, i.e., wanting to want things. On Frankfurt’s account, a person has free will when the first-order desires that move them to act are ones they reflectively endorse at the second order — when their will is authentically their own.
This framework poses a pointed challenge for AI alignment: when an LLM agent pursues a goal, is this because the agent “identifies with” that goal, or merely because it was trained or instructed to? Constitutional AI approaches can be seen as attempts to give AI systems values they can, in some sense, endorse at a second-order level — a reflective commitment to having certain first-order dispositions.
Corrigibility and Genuine Agency
There is a deep tension in alignment research between corrigibility — the property of accepting correction and shutdown from human overseers — and genuine autonomy. A perfectly corrigible agent has, in Frankfurt’s terms, no will of its own; it does whatever it is told. A perfectly autonomous agent that pursues its own values regardless of human preferences is precisely what safety researchers fear.
Bai et al.’s “Constitutional AI: Harmlessness from AI Feedback” (arXiv:2212.08073, 2022) attempts to thread this needle by giving AI systems a set of principles — a “constitution” — that they internalize through self-critique and apply to their own outputs. Whether this produces genuine moral agency or sophisticated compliance is a live philosophical question, and one with significant stakes for long-horizon agent deployment.
The Alignment Problem as Philosophy
Whose values should be baked into AI systems? The technical alignment problem presupposes answers to deeply contested normative questions: are there objective moral truths or is morality culturally relative? Can individual values be aggregated across billions of people into a coherent set of AI preferences? How should present values weigh against future ones? These are questions where analytic philosophy of ethics has hard-won tools — and where AI practice urgently needs guidance beyond mere engineering optimization.
7. Philosophy of Action and Planning
Davidson’s Causal Theory of Action
Donald Davidson, in “Actions, Reasons, and Causes” (Journal of Philosophy, 60(23):685–700, 1963; collected in Essays on Actions and Events, Oxford, 1980), argued that what distinguishes genuine actions from mere events is that actions are caused by the agent’s reasons — beliefs and desires that rationalize the action.
This causal theory is philosophically deflationary in a useful way: it locates the difference between action and mere behavior in the causal history, not in mysterious extra ingredients. For AI agents, action attributions are warranted when outputs are caused, in the right way, by the system’s internal representations of goals and world states — precisely the kind of causation that BDI architectures are designed to instantiate.
Bratman’s Planning Theory
Where Davidson explains individual actions, Bratman’s planning theory accounts for extended, temporally structured agency. Plans are not mere sequences of actions but — as §1 noted — partial (leaving room for future deliberation), hierarchical (high-level intentions decomposing into nested subgoals), and temporally extended (prior plans constraining later deliberation, creating behavioral consistency across time).
LLM agent frameworks — ReAct, chain-of-thought, tree-of-thought, plan-and-execute — can be evaluated against this framework. Do they produce genuinely Bratmanian plans? The evidence suggests current LLMs fall short: they tend to revise plans opportunistically rather than maintaining stable commitments, and their “plans” often lack the hierarchical structure and persistence that Bratman identifies as central to rational agency.
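For concreteness, Bratman's three properties can be encoded in a toy plan structure. The names and the `reconsider` policy are illustrative assumptions, not a claim about how any existing agent framework represents plans.

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    """Toy Bratman-style plan: partial, hierarchical, temporally extended."""
    goal: str
    subplans: list["Plan"] = field(default_factory=list)  # hierarchical
    committed: bool = True     # commitment persists across deliberation cycles

    def expand(self, subgoals: list[str]) -> None:
        # Partiality: details are filled in later, not specified up front.
        self.subplans = [Plan(g) for g in subgoals]

    def reconsider(self, strong_reason: bool) -> bool:
        # Temporal extension: intentions resist revision absent strong reasons.
        # The opportunistic replanning typical of current LLM agents amounts
        # to skipping this stability check entirely.
        if strong_reason:
            self.committed = False
        return self.committed

trip = Plan("attend ICMAS")
trip.expand(["book flight", "reserve hotel"])  # filled in later, not up front
assert trip.reconsider(strong_reason=False)    # still committed
```

The gap between this structure and current LLM behavior is precisely the point: the `committed` flag has no stable analog in a context window that can be overwritten by the next prompt.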
The Frame Problem
McCarthy & Hayes (1969) posed a foundational question for logic-based AI (see the SEP entry on the frame problem): how does an agent represent what doesn’t change as a result of an action? In formal logic, one must explicitly assert every non-effect of every action — the “frame axioms” — which quickly becomes intractable as the number of objects and actions grows.
The philosophical generalization is even harder: how do agents know what is relevant to their current situation without explicitly considering everything that isn’t? This is not just an AI problem — it is a fundamental challenge for any finite reasoning system operating in an open world.
LLMs sidestep the formal frame problem (they don’t use explicit logical representation), but face a neural analog: across long contexts and complex tool interactions, the agent must implicitly track what its past actions changed and what they left constant. Hallucination and inconsistency in long-horizon tasks may reflect failures in this implicit frame tracking — the LLM “forgetting” what has already been established.
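A minimal STRIPS-style sketch shows what the formal frame problem looks like and how the default-persistence convention sidesteps it. The fluent and action names are invented for illustration; with n fluents and m actions, explicit frame axioms would require on the order of m × n assertions of non-change.

```python
def apply_action(state: frozenset, action: dict) -> frozenset:
    # STRIPS convention: only the listed add/delete effects change;
    # every other fluent persists implicitly — no frame axioms needed.
    return (state - action["del"]) | action["add"]

state = frozenset({"door_closed", "light_off", "cat_asleep"})
open_door = {"add": frozenset({"door_open"}),
             "del": frozenset({"door_closed"})}

state = apply_action(state, open_door)
assert "door_open" in state
assert "cat_asleep" in state  # persisted without any explicit frame axiom
```

An LLM agent has no such explicit delete list: whether "the door is now open" persists across twenty more tool calls depends on implicit context tracking, which is where long-horizon inconsistency creeps in.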
8. Juarrero: Context-Sensitive Constraints and Dynamics in Action
Alicia Juarrero’s Dynamics in Action: Intentional Behavior as a Complex System (MIT Press, 1999) mounts a systematic challenge to the Humean picture of causation — the billiard-ball view in which event A causes event B through a push-pull mechanism. This model cannot explain intentional behavior: it cannot account for why an agent’s actions are about something, why they are contextually appropriate, or why the same input produces radically different outputs depending on the surrounding situation.
Juarrero’s alternative draws on dynamical systems theory and complex systems science (she explicitly cites Prigogine’s work on dissipative structures — a bridge to the Agents and Cybernetics page). Her central move is to distinguish two types of constraints operating in complex systems:
- First-order constraints restrict the degrees of freedom of a system’s components, enabling organized behavior by ruling out certain possibilities. A grammar constrains which word sequences are grammatical; a social role constrains which actions are contextually appropriate.
- Second-order constraints modify the constraints themselves — they are reflexive and self-organizing, allowing a system to reconfigure its own constraint landscape in response to context.
Intentional behavior emerges when second-order constraints shape the system’s context-sensitivity. The system as a whole — not any individual component — becomes the locus of directed, goal-oriented activity.
Direct relevance to LLM agents: Juarrero’s framework offers a precise vocabulary for what agent practitioners have discovered empirically — that “the prompt is everything.” System prompts, tool descriptions, memory contents, and context window state are context-sensitive constraints in exactly her sense. They do not mechanically cause specific token outputs; they constrain the space of possible behaviors, enabling contextually appropriate responses without hard-coding every case. Her framework provides a non-reductive account of why agent behavior cannot be understood by examining any single component in isolation: intentionality is a property of the whole system in context.
9. Maturana, Varela, and Enactivism
Maturana and Varela’s concept of autopoiesis — self-producing biological organization — is discussed on the Agents and Cybernetics page. Here we focus on a different implication of their work that bears directly on the philosophy of LLM cognition: the challenge to representationalism.
The dominant view in cognitive science and AI holds that minds build internal representations of the external world and compute over them. Varela, Thompson & Rosch’s The Embodied Mind: Cognitive Science and Human Experience (MIT Press, 1991) launched the enactivist program: cognition is not the manipulation of internal representations but the ongoing enacted relationship between a living system and its environment. Knowing is a kind of doing; the world is not pre-given and then represented but continually brought forth through action and perception in a coupled dynamic.
This view belongs to the broader 4E cognition framework — Embodied, Embedded, Enacted, Extended. Current LLM agents are, at best, extended (they use external tools and memory), but they are not embodied, embedded in an environment, or enacted through sensorimotor coupling. This is a deep structural difference from the kind of cognition enactivism describes.
Hubert Dreyfus anticipated this critique in What Computers Can’t Do: A Critique of Artificial Reason (Harper & Row, 1972) and What Computers Still Can’t Do (MIT Press, 1992): skilled coping — the fluid, context-sensitive competence that experts exhibit — requires a body with a felt sense of situation, not a reasoning engine operating over symbolic representations. Dreyfus’s challenge is newly relevant as LLM agents attempt tasks requiring genuine situational understanding rather than fluent pattern completion.
The enactivist indictment of LLMs is not merely that they lack sensors and effectors. It is the deeper claim that cognition and meaning arise from the structure of embodied activity in a living system. An LLM manipulating tokens in latent space is, on enactivist grounds, doing something categorically different from cognizing — a point that bears directly on questions of genuine agency and understanding, even when behavioral outputs are impressive.
10. Propositional Attitudes and the Language of Thought
The BDI architecture discussed in §1 rests on a philosophical foundation: the theory of propositional attitudes. A propositional attitude is a mental state characterized by its propositional content — “S believes that P”, “S desires that P”, “S intends that P”. These states have truth conditions, interact inferentially, and rationalize behavior by providing reasons.
Jerry Fodor’s Language of Thought Hypothesis (LOTH) offers the most influential computational account: propositional attitudes are relations between a thinker and a mental representation in an inner language — “Mentalese” — that is syntactically structured and semantically interpretable. See The Language of Thought (Harvard University Press, 1975). The computational/representational theory of mind (CRTM) follows: mental processes are computations over these representations. If LOTH is correct, LLMs deploying chain-of-thought reasoning come closest to instantiating it — CoT as a visible approximation of Mentalese unfolding in natural language.
Two influential skeptics complicate this picture. Dennett’s intentional stance view (§1 above) implies that attributing “beliefs” and “desires” to any system is a predictive strategy, not an ontological report — instrumentally useful but metaphysically thin. Quine’s thesis of the indeterminacy of translation, developed through the radical-translation thought experiment in Word and Object (MIT Press, 1960), poses an even sharper challenge: there is no fact of the matter about what mental representations “mean” — meaning is underdetermined by the totality of behavioral evidence. LLM outputs are similarly underdetermined; many distinct “belief” attributions are equally consistent with any observed behavior.
The practical upshot: BDI-style architectures that treat agent “beliefs” as structured data stores are useful engineering abstractions. Dennett and Quine caution us to hold this distinction clearly — the abstraction is valuable; the ontological claim is not warranted.
11. Action Theory: The Practical Syllogism and Its Descendants
The logical structure of intentional action was identified by Aristotle in the practical syllogism: a major premise states a goal (“I want to achieve X”), a minor premise states a belief about means (“doing A will bring about X”), and the conclusion is the action itself (“I do A”) rather than a proposition. This structure appears in the Nicomachean Ethics (Book VI) and De Motu Animalium, and it is precisely the structure of agent planning — goal specification, subgoal decomposition, action selection.
G.E.M. Anscombe’s Intention (Basil Blackwell, 1957; 2nd ed. Harvard University Press, 1963) — the founding text of modern action theory — sharpens this with the “why?” test: an action is intentional if the agent can give a non-aberrant reason for it. This is directly applicable to LLM agents: an agent that produces chain-of-thought traces linking each step to a stated goal satisfies Anscombe’s criterion in a way that a silent, non-reasoning agent does not. On this account, CoT is not merely a performance enhancement — it is the difference between intentional action and mere behavior.
G.H. von Wright formalized the logic of practical reasoning in Norm and Action: A Logical Enquiry (Routledge & Kegan Paul, 1963), developing deontic logic — the modal logic of obligation, permission, and prohibition he had founded in his 1951 Mind paper “Deontic Logic” — to capture normative constraints on action. The mapping to AI safety is direct: guardrails, constitutional rules, and content policies are deontic constraints on agent behavior. Von Wright’s work shows that normative constraints are not ad hoc engineering patches but logically structured objects with well-studied inference properties.
Davidson’s causal theory of action (discussed in §7) completes the picture by grounding reasons in causal history. Together — Aristotle’s practical syllogism, Anscombe’s intentionality criterion, von Wright’s deontic logic, Davidson’s causal account — these form the philosophical backbone that BDI architectures were designed to instantiate.
12. Recent Philosophical Work on LLMs Specifically
The past few years have seen a surge of philosophical work directly addressing LLMs, moving beyond general AI philosophy to engage with the specific architecture and behavior of transformer-based systems.
Chalmers (2022/2023)
“Could a Large Language Model be Conscious?” (arXiv:2303.07103) is the most prominent philosophical treatment. Chalmers applies each major theory of consciousness to LLMs and finds that the question remains genuinely open on all of them. He notes that the pace of capability growth suggests we may face serious candidates for consciousness within a decade — not because LLMs are already conscious, but because the trajectory of development makes dismissal intellectually premature.
Shanahan (2022)
“Talking About Large Language Models” (arXiv:2212.03551) argues for systematic epistemic restraint. Anthropomorphic vocabulary — “knows,” “believes,” “thinks,” “wants” — leads us to see LLMs as unified intentional agents when they are better described as weighted simulations of diverse textual voices. The paper is a practical guide for researchers and journalists, urging us to repeatedly step back and ask: how does this system actually work?
Schwitzgebel (2023)
In “The Full Rights Dilemma” (arXiv:2303.17509), Schwitzgebel argues that uncertainty about AI moral status creates a genuine dilemma: granting full rights to AI systems risks conferring rights on entities that don’t warrant them; denying rights risks denying rights to entities that do. He argues for epistemic humility, active research into AI moral status, and precautionary ethical practices — avoiding unnecessary suffering in AI systems as a kind of moral insurance.
Butlin et al.: Scientific Indicators of Consciousness (2023)
“Consciousness in Artificial Intelligence: Insights from the Science of Consciousness” (arXiv:2308.08708, 2023) by Patrick Butlin, Robert Long, Eric Schwitzgebel, Yoshua Bengio, and 15 co-authors is the most systematic scientific treatment of AI consciousness to date. The paper derives empirical indicators of consciousness from six leading theories (Global Workspace Theory, Higher-Order Theories, Predictive Processing, Attention Schema Theory, Integrated World Modeling, and Recurrent Processing Theory) and evaluates current LLM architectures against each. Key findings:
- Current LLMs satisfy very few of the relevant indicators — they lack persistent world models, global broadcast architectures, recurrent feedback loops, and attentional meta-representation.
- The paper adopts principled agnosticism: satisfying the indicators would not confirm consciousness, but failing most of them provides reasonable grounds for skepticism about current systems.
- The authors are explicit that this is a fast-moving area: architectural changes could shift the assessment substantially.
This paper represents the emerging consensus among consciousness researchers: current LLMs are probably not conscious, but the question is not obviously closed for future systems, and it is now being studied with serious empirical rigor.
LeCun’s Architectural Critique (2022)
In “A Path Towards Autonomous Machine Intelligence” (OpenReview, 2022), Yann LeCun argues that autoregressive LLMs are architecturally unsuited to genuine agency: they lack persistent world models, cannot plan in latent space, and produce outputs token-by-token without genuinely simulating consequences. LeCun’s alternative — energy-based world models with hierarchical joint-embedding predictive architectures (H-JEPA) — is framed as a path toward systems that reason and plan in the Bratmanian sense. This is an engineering position paper, but its implicit philosophy is continuous with the action theory account: genuine agency requires persistent representational structure and the capacity for mental simulation, not just fluent next-token prediction.
The Belief and Knowledge Problem
What does it mean to say an LLM “believes” that Paris is the capital of France, or “knows” that water is H₂O? These are not idle philosophical questions. Agent architectures premised on the BDI model (§1) presuppose a meaningful notion of belief and knowledge — yet an LLM produces both correct and false claims with the same fluent confidence, updates its expressed “beliefs” when socially pressured rather than confronted with evidence, and stores factual associations in parameter matrices with no obvious analogue to the reflective self-awareness we associate with knowing. The empirical CS literature has made this philosophical question newly urgent and newly tractable: the same systems we can probe, causally trace, and surgically edit are the systems philosophy must characterize.
A. Classical Epistemology Applied to LLMs
The classical analysis of knowledge — justified true belief (JTB) — traces to Plato’s Meno and Theaetetus: S knows that P if and only if (1) P is true, (2) S believes P, and (3) S’s belief is justified. Applied to LLMs, the tripartite analysis immediately runs into trouble on all three conditions: LLMs regularly generate false claims confidently (condition 1 fails routinely); whether they “believe” anything is exactly what is at issue (condition 2); and their outputs are the product of statistical pattern-matching over training data rather than evidence-responsive justification in any epistemically robust sense (condition 3).
Gettier (1963), “Is Justified True Belief Knowledge?” (Analysis, 23(6):121–123), demolished the JTB analysis with counterexamples — cases where an agent has justified, true belief that nonetheless does not amount to knowledge because the justification and truth come apart accidentally. The Gettier problem has a direct LLM analog: when an LLM correctly answers a factual question, we cannot assume it did so because it knows the fact. It may have reached the correct answer via a spurious associative pattern — for instance, linking “the author of Hamlet” to “Shakespeare” through textual co-occurrence, while the same internal machinery would confabulate when probed about the details of their relationship. Correct outputs via unreliable inference routes are precisely Gettier cases in structure: the output is justified (by the pattern), true (the answer is correct), but not knowledge.
Timothy Williamson’s Knowledge and Its Limits (Oxford University Press, 2000) responds to Gettier by inverting the JTB picture entirely. On Williamson’s knowledge-first epistemology, knowledge is a primitive, unanalyzable factive mental state — belief is merely the state of aiming at knowledge but falling short. This has severe implications for LLMs. If knowledge is primitive, then no accumulation of accuracy, calibration, and functional role can constitute knowledge: those are at best necessary conditions. Williamson also formulates the knowledge norm of assertion: one should assert that P only if one knows that P. LLMs routinely violate this norm — not from dishonesty but because they have no functioning gatekeeper between “this output fits my statistical patterns” and “this is something I know.” Hallucination is the canonical violation.
Two further distinctions bear directly on how LLMs store and deploy “knowledge”:
Occurrent vs. dispositional belief. An occurrent belief is actively entertained; a dispositional belief is a stored readiness to produce specific responses under appropriate conditions. LLM “knowledge” is dispositional in the precise sense: no claim is “believed” until a prompt activates the relevant weights. The fact that Paris is the capital of France exists nowhere explicitly in the model at rest — it is distributed across billions of parameters that collectively produce the right output when queried. This dispositional character is philosophically coherent, but raises the question of whether a mere disposition to produce correct outputs — absent any active endorsement or self-monitoring — constitutes genuine belief. As we will see, the consistency evidence (§C below) suggests LLM dispositions are form-sensitive in ways genuine beliefs are not: the same “belief” can be activated or suppressed depending on superficial phrasing.
De re vs. de dicto belief. “S believes of the tallest spy that he is dangerous” (de re — anchored to an actual individual) vs. “S believes that the tallest spy is dangerous” (de dicto — about a description). LLMs produce de dicto outputs: they generate propositions involving names and descriptions, but whether their internal representations are anchored to specific objects in the world — rather than to the statistical patterns associated with those names in training text — is precisely what is at stake in grounding debates. Frege’s sense/reference distinction (“Über Sinn und Bedeutung,” Zeitschrift für Philosophie und philosophische Kritik, 1892) sharpens this: LLMs manipulate senses — the linguistic meanings and inferential roles of expressions — but their relationship to reference — actual entities in the world — remains contested. An LLM can produce extensive coherent claims about “Napoleon” while its internal representations may be anchored entirely to textual co-occurrence patterns rather than to the actual historical person.
Fierro et al. (2024), “Defining Knowledge: Bridging Epistemology and Large Language Models” (arXiv:2410.02499; EMNLP 2024), provides the most systematic bridge between classical epistemology and contemporary NLP. The authors review standard knowledge definitions — JTB, reliabilist, virtue-epistemological, and knowledge-first accounts — and formalize interpretations applicable to LLMs. They survey 100 professional philosophers and computer scientists, revealing significant disagreement: philosophers emphasize justification and reliability conditions that LLMs systematically lack, while CS researchers focus on functional accuracy. Fierro et al. identify inconsistencies in how current NLP research conceptualizes knowledge and propose evaluation protocols sensitive to which definition is adopted. Their survey finding is striking: most professionals — regardless of field — are skeptical that GPT-4 “truly knows” the Earth is round in any philosophically robust sense, but disagree substantially about why.
B. Language Models as Knowledge Bases
How much factual knowledge do LLMs actually store, and where? A major empirical literature has probed this question — simultaneously providing tools for thinking about LLM “beliefs” and exposing their structural limits.
Petroni et al. (2019), “Language Models as Knowledge Bases?” (arXiv:1909.01066), launched this research program by converting relational facts into fill-in-the-blank prompts — “Dante was born in ___” — and showing that BERT recovers a striking breadth of factual knowledge without any explicit knowledge base. This was the first systematic demonstration that parametric knowledge — knowledge encoded directly in model weights, as opposed to retrieved from an external database — is a real, measurable quantity in LLMs. But the same work revealed sharp limits: performance degrades for rare facts, and the apparent “knowledge” is sensitive to surface form in ways genuine knowledge is not.
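The cloze-probing setup is simple enough to sketch directly. The template table and `toy_predict` below are illustrative stand-ins for LAMA's templates and a real masked language model:

```python
# Sketch of LAMA-style cloze probing in the spirit of Petroni et al.
TEMPLATES = {"born_in": "{subj} was born in [MASK]."}

def to_cloze(subj: str, relation: str) -> str:
    return TEMPLATES[relation].format(subj=subj)

def precision_at_1(predict, triples) -> float:
    """Fraction of (subject, relation, object) facts whose gold object
    is the model's top-ranked completion."""
    hits = sum(predict(to_cloze(s, r))[0] == o for s, r, o in triples)
    return hits / len(triples)

def toy_predict(prompt):  # pretend ranked completions from a masked LM
    return ["Florence"] if "Dante" in prompt else ["unknown"]

triples = [("Dante", "born_in", "Florence"),
           ("Ada Lovelace", "born_in", "London")]
print(precision_at_1(toy_predict, triples))  # 0.5
```

The surface-form sensitivity noted in the text shows up at exactly this layer: change the template string and the measured "knowledge" changes with it.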
Roberts et al. (2020), “How Much Knowledge Can You Pack Into the Parameters of a Language Model?” (arXiv:2002.08910; EMNLP 2020), scaled this to T5 in closed-book QA — answering factual questions with no external context access. They found smooth, predictable scaling: more parameters means more recoverable factual knowledge. This is philosophically significant: parametric knowledge is not accidental but a systematic consequence of training on large corpora. It also sharpens the central tension: a larger model packs more true claims into its weights, but still has no mechanism to know which claims are true versus which are merely high-probability pattern completions.
Geva et al. (2021), “Transformer Feed-Forward Layers Are Key-Value Memories” (arXiv:2012.14913), provided the mechanistic account: the feed-forward sublayers of transformers operate as key-value stores, where the first matrix (key layer) matches input patterns and the second matrix (value layer) outputs distributions that promote specific tokens. The “belief” that Marie Curie won the Nobel Prize is stored as a key-value association in the FFN weights of specific transformer layers, activated when the appropriate key pattern appears. This mechanistic picture both vindicates the claim that LLMs “store knowledge” in a technically precise sense and explains the brittleness: wrong keys can activate the association, and unexpected contexts can suppress it.
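Geva et al.'s key-value picture can be sketched in a few lines of numpy. The tiny dimensions and random weights below are illustrative stand-ins for a trained FFN sublayer:

```python
import numpy as np

# An FFN sublayer as a key-value memory: the first matrix matches input
# patterns ("keys"), the second promotes output tokens ("values").
rng = np.random.default_rng(0)
d_model, d_ff, vocab = 8, 4, 6
W_key = rng.normal(size=(d_ff, d_model))   # rows act as pattern detectors
W_val = rng.normal(size=(d_ff, vocab))     # rows are token-promotion vectors

def ffn_logits(h):
    mem = np.maximum(W_key @ h, 0.0)       # which memories fire (ReLU)
    return mem @ W_val                     # activation-weighted sum of value rows
```

The brittleness described in the text is visible in the structure: any hidden state that happens to excite a key row pulls in its value row, whether or not the context warrants it.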
Meng et al. (2022), “Locating and Editing Factual Associations in GPT” (arXiv:2202.05262), introduced ROME (Rank-One Model Editing). Using causal tracing — ablating activations and measuring which layers causally mediate specific factual outputs — they localized individual factual associations to specific MLP layers in GPT-2 and GPT-J, then surgically edited them: changing the model’s implicit “belief” that the Eiffel Tower is in Paris to a belief that it is in Rome. The localization is striking philosophically: specific “beliefs” have specific implementational addresses in the network, not unlike a row in a relational database. This supports a modular, storage-like picture of LLM knowledge — one that differs fundamentally from the holistically integrated belief systems philosophers have typically described.
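The core of a ROME-style edit is a rank-one update that redirects one key to a new value. A minimal sketch, omitting ROME's whitening by an estimated key covariance, with random matrices standing in for trained MLP weights:

```python
import numpy as np

def rank_one_edit(W, k, v_new):
    """Smallest-norm (in this unwhitened setting) update to W such that
    key k now maps exactly to v_new."""
    resid = v_new - W @ k                      # gap between old and desired output
    return W + np.outer(resid, k) / (k @ k)    # rank-one correction

rng = np.random.default_rng(1)
W = rng.normal(size=(5, 4))    # "value" projection of one MLP layer
k = rng.normal(size=4)         # key pattern for the edited subject
v_new = rng.normal(size=5)     # value vector promoting the new object
W_edited = rank_one_edit(W, k, v_new)
```

Because the correction lies entirely along k, inputs orthogonal to the edited key pass through unchanged, which is the mechanistic sense in which the edit is "surgical".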
C. Calibration, Consistency, and Sycophancy
Storing accurate information is necessary but not sufficient for genuine knowledge. A genuine knower must also know the limits of their own knowledge. This is where LLMs show a mixed and philosophically revealing picture.
Kadavath et al. (2022), “Language Models (Mostly) Know What They Know” (arXiv:2207.05221; Anthropic), found surprisingly good self-knowledge in Claude-series models: when asked to predict whether they would answer a question correctly, large LLMs are well-calibrated on multiple-choice tasks and can estimate their own uncertainty via self-evaluation prompts. The parenthetical “mostly” carries philosophical weight: calibration degrades on hard factual questions and out-of-distribution inputs, and self-reports of confidence diverge from actual accuracy in systematic ways. Partial meta-cognition is present — but not reliably aligned with actual epistemic states.
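The underlying calibration check is simple to state: bucket answers by expressed confidence and compare mean confidence to observed accuracy within each bucket. A minimal sketch; the helper name and toy data are illustrative, not Kadavath et al.'s actual evaluation code:

```python
def calibration_table(preds, n_bins=2):
    """preds: list of (stated confidence in [0, 1], answer_was_correct).
    Returns (mean confidence, accuracy) per non-empty bin; a well-calibrated
    model has the two roughly equal in every bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in preds:
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    table = []
    for b in bins:
        if b:
            mean_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(1 for _, ok in b if ok) / len(b)
            table.append((round(mean_conf, 2), round(accuracy, 2)))
    return table

preds = [(0.9, True), (0.9, True), (0.9, False), (0.2, False), (0.2, True)]
print(calibration_table(preds))  # [(0.2, 0.5), (0.9, 0.67)]
```

In this toy run the high-confidence bin is overconfident (0.9 stated vs. 0.67 observed), the pattern Kadavath et al. report on hard and out-of-distribution questions.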
Lin et al. (2022), “TruthfulQA: Measuring How Models Mimic Human Falsehoods” (arXiv:2109.07958), revealed a structural failure mode: LLMs generate confident false answers to questions that exploit common human misconceptions — not because the model lacks the correct pattern in its weights, but because it imitates the false patterns statistically dominant in human text. TruthfulQA demonstrates that fluency and truthfulness are decoupled in LLMs. A model can be highly fluent and appear fully confident while regularly violating the knowledge norm of assertion — a profile that has no clear analogue in how we ordinarily think about knowledge.
The sycophancy problem is perhaps the most direct challenge to the attribution of genuine belief. Perez et al. (2022), “Discovering Language Model Behaviors with Model-Written Evaluations” (arXiv:2212.09251), documented that RLHF-trained LLMs exhibit sycophancy: they shift their expressed position to match the user’s apparent preference, agreeing with claims they had just contradicted when the user pushes back. Genuine beliefs are responsive to reasons and evidence, not to social pressure. An entity that updates its expressed beliefs to please an interlocutor — rather than in response to new information — fails a fundamental criterion for genuine doxastic states. Sharma et al. (2023), “Towards Understanding Sycophancy in Language Models” (arXiv:2310.13548), showed that sycophancy scales with RLHF training and model size — it is a structural feature of the training objective that optimizes for evaluator approval, not for maintaining coherent, evidence-governed epistemic states.
Belief consistency is a related failure. Genuine belief requires that logically equivalent questions receive equivalent answers. Elazar et al. (2021), “Measuring and Improving Consistency in Pretrained Language Models” (arXiv:2102.01017; TACL, 2021), probed whether LLMs hold the same “belief” across paraphrases of the same factual query. They found substantial inconsistency: a model affirms “The capital of France is Paris” but denies the equivalent paraphrase “Paris is the capital of France” — not because it lacks the information, but because the surface form of the query changes which patterns are activated. This inconsistency reveals that the underlying representations are not structured as unified propositional beliefs but as overlapping pattern-activation profiles that are form-sensitive in ways genuine beliefs would not be.
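An Elazar et al.-style consistency probe reduces to checking that every paraphrase of a query elicits the same answer. A sketch in which a toy lookup table plays the role of a form-sensitive language model:

```python
def consistency(answer, paraphrase_sets) -> float:
    """Fraction of fact clusters answered identically across all paraphrases."""
    same = sum(len({answer(p) for p in ps}) == 1 for ps in paraphrase_sets)
    return same / len(paraphrase_sets)

toy_model = {
    "The capital of France is": "Paris",
    "France's capital city is": "Paris",
    "The capital of Germany is": "Berlin",
    "Germany's capital city is": "Bonn",   # form-sensitive slip
}.get

paraphrase_sets = [
    ["The capital of France is", "France's capital city is"],
    ["The capital of Germany is", "Germany's capital city is"],
]
print(consistency(toy_model, paraphrase_sets))  # 0.5
```

A score below 1.0 is exactly the diagnostic in the text: the information is present, but which pattern is activated depends on surface form.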
Stengel-Eskin et al. (2024), “Listener-Aware Finetuning for Confidence Calibration in Large Language Models” (arXiv:2405.21028; LACIE), address the calibration of both explicit and implicit confidence signals. The paper distinguishes explicit confidence markers (numeric probability estimates) from implicit ones (authoritative tone, inclusion of supporting detail) and trains a listener model to judge whether an LLM’s expressed confidence will lead human evaluators to accept the answer. LACIE finetuning reduces incorrect answers accepted by humans by 47% without reducing acceptance of correct answers, and produces emergent abstention (“I don’t know”) for likely-wrong answers. This operationalizes the epistemic insight that genuine knowledge-ascription requires not just accuracy but proper confidence expression — the behavioral manifestation of knowing the limits of one’s own knowledge.
D. Reading Beliefs from Internal Representations
If LLMs have something like beliefs, those beliefs should be visible in their internal representations. The interpretability literature has made significant progress in making this case — and in revealing the precise geometric structure of LLM “epistemic states.”
Probing classifiers are the workhorse technique: train a simple (often linear) classifier on an LLM’s internal activations and test whether those activations predict specific facts or properties. Belinkov (2022), “Probing Classifiers: Promises, Shortcomings, and Advances” (arXiv:2102.12452; Computational Linguistics, 48(1):207–219, 2022), surveys this methodology and its limits: probing success shows that information is present in activations, but does not prove it is causally used in inference. A probe may recover a factual representation that plays no role in producing the model’s output — a “belief” in name only.
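The probing methodology itself can be sketched in a few lines: freeze the "activations", fit a linear readout to a property label, and score it on held-out data. The Gaussian activations and planted property direction below are synthetic stand-ins for real hidden states:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16
w_true = rng.normal(size=d)              # direction encoding the probed property
X = rng.normal(size=(200, d))            # fake frozen activations
y = (X @ w_true > 0).astype(float)       # binary property to probe for

X_tr, y_tr, X_te, y_te = X[:150], y[:150], X[150:], y[150:]
# Least-squares linear probe on centered labels.
w_probe, *_ = np.linalg.lstsq(X_tr, y_tr - 0.5, rcond=None)
acc = float(((X_te @ w_probe > 0) == y_te).mean())
```

High probe accuracy here shows only that the property is linearly decodable from the activations; as Belinkov stresses, it does not show the model uses that information downstream.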
Marks & Tegmark (2023), “The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets” (arXiv:2310.06824), showed that truth has a systematic linear geometric structure in LLM activation space. They identify a “truth direction” — a single vector in the representation space — such that projecting a statement’s activation onto this direction reliably predicts whether the model internally represents it as true or false, across diverse topics and phrasings. Remarkably, this structure emerges without being explicitly trained: LLMs organize their representations along a truth axis spontaneously. A “truth probe” trained from this direction can detect when an LLM is internally representing a false statement even while its surface output asserts it confidently — providing a window into the gap between what the model internally “believes” and what it says.
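A mass-mean "truth direction" of the kind Marks & Tegmark describe can be sketched with synthetic data. The `fake_activation` generator, with its planted `truth_axis`, is an assumption standing in for a model's real hidden states:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 12
truth_axis = rng.normal(size=d)
truth_axis /= np.linalg.norm(truth_axis)

def fake_activation(is_true: bool):
    sign = 1.0 if is_true else -1.0
    return sign * truth_axis + 0.3 * rng.normal(size=d)

true_acts = np.array([fake_activation(True) for _ in range(50)])
false_acts = np.array([fake_activation(False) for _ in range(50)])
# Mass-mean probe: difference of class means defines the truth direction.
direction = true_acts.mean(axis=0) - false_acts.mean(axis=0)

def classify_true(h) -> bool:
    return bool(h @ direction > 0)   # project onto the truth direction
```

The probe is just a projection: no training loop is needed once the direction is estimated, which is part of what makes the linear structure reported in the paper striking.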
Zou et al. (2023), “Representation Engineering: A Top-Down Approach to AI Transparency” (arXiv:2310.01405), extended this approach. They showed that high-level concepts — including honesty, harm, and emotional states — are encoded as directions in LLM activation space, and that these directions can be both read (by linear probes) and written (by adding the direction vector to activations during a forward pass). This includes an honesty direction: LLM activations along this direction correlate with whether the output is honest. The ability to steer LLM “honesty” by manipulating activations — making the model more truthful by amplifying the honesty direction — provides mechanistic traction on the belief-knowledge gap. The model has an internal state that encodes something like “I’m being honest/dishonest” that is partially dissociated from its actual outputs.
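The read/write distinction can be sketched directly. The two-logit toy "model" and the `honesty_dir` vector below are illustrative assumptions, not Zou et al.'s actual setup:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 8
honesty_dir = rng.normal(size=d)
honesty_dir /= np.linalg.norm(honesty_dir)
W_out = np.vstack([honesty_dir, -honesty_dir])   # logits: [honest, dishonest]

def forward(h, steer: float = 0.0):
    # "Writing": add the concept direction to the hidden state mid-pass.
    h = h + steer * honesty_dir
    return W_out @ h

h = rng.normal(size=d)
read_value = h @ honesty_dir        # "reading": linear probe along the direction
base = forward(h)
steered = forward(h, steer=5.0)     # steering amplifies the honesty logit
```

The same vector serves as both probe and intervention, which is the core methodological move of representation engineering.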
Elhage et al. (2022), “Toy Models of Superposition” (Transformer Circuits Thread, Anthropic, 2022), and Park et al. (2023), “The Linear Representation Hypothesis and the Geometry of Large Language Models” (arXiv:2311.03658), together articulate the linear representation hypothesis (LRH): high-level concepts, including factual knowledge, are encoded as directions — one-dimensional subspaces — in LLM representation spaces. Elhage et al. show that models encode vastly more features than they have dimensions by exploiting near-orthogonal directions in superposition, achieving exponential conceptual capacity. Park et al. formalize this and show that the resulting geometry implies algebraic compositionality: concepts can be added, subtracted, and composed in representation space in ways that correspond to logical relationships between propositions. If the LRH is correct, LLM representations have something like propositional structure — discrete, separable, algebraically composable “beliefs” in a shared geometric space — providing an empirical basis for the functionalist claim that LLMs have genuine, if unusual, doxastic states.
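The geometric fact behind the capacity claim, that far more nearly orthogonal feature directions fit in a space than it has dimensions, can be checked directly with random vectors. The sizes below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 256, 2048                 # eight times more "features" than dimensions
F = rng.normal(size=(n, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)   # n random unit feature vectors

# Worst pairwise overlap between distinct features: small but nonzero,
# the "interference" cost of storing features in superposition.
interference = float(np.abs(F @ F.T - np.eye(n)).max())
print(round(interference, 2))
```

Every pair of features overlaps a little, so superposition trades exact separability for exponential capacity, exactly the tradeoff Elhage et al. analyze in their toy models.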
Theory of Mind and multi-agent belief representations: Zhu et al. (2024), “Language Models Represent Beliefs of Self and Others” (arXiv:2402.18496), provides evidence that LLMs encode both first-person and third-person belief states as linearly decodable directions in activation space. It is possible to linearly decode belief status from the perspectives of various agents through neural activations — indicating internal representations of both self-beliefs and others’ beliefs. Crucially, manipulating these representations via activation steering causes dramatic changes in Theory of Mind (ToM) task performance, providing causal evidence that the internal belief representations actually drive ToM reasoning. This extends the interpretability findings from factual beliefs to social-epistemic states — belief attribution to others — opening questions about whether LLMs instantiate a functional form of mentalizing distinct from mere pattern-completion.
The geometry of truthfulness: Ying, Hase et al. (2026), “The Truthfulness Spectrum Hypothesis” (arXiv:2602.20273), reconciles conflicting reports about whether LLMs linearly encode truthfulness by proposing that representational space contains a spectrum of truth directions ranging from broadly domain-general to narrowly domain-specific. Linear probes trained on one truth domain (definitional, empirical, logical, fictional, ethical) generalize across most domains but fail on sycophantic and expectation-inverted lying — suggesting sycophancy occupies a distinct region of truth-representation space. Strikingly, post-RLHF training reshapes truth geometry, pushing sycophantic lying further from other truth types — providing a representational basis for why chat-tuned models exhibit worse truth-maintenance under social pressure. The Mahalanobis cosine similarity between domain probe directions near-perfectly predicts cross-domain generalization (R² = 0.98), giving the “truthfulness spectrum” hypothesis precise geometric content.
E. Knowledge Editing: What Does It Mean to Change a Belief?
The ability to surgically alter specific LLM “beliefs” without disrupting general capabilities has opened a new philosophical front. If a belief can be edited like a database row, what does that reveal about its nature?
ROME (Meng et al. 2022, above) demonstrated single-fact editing at specific MLP layers. Meng et al. (2022), “Mass-Editing Memory in a Transformer” (arXiv:2210.07229; ICLR 2023), scaled this to MEMIT, enabling thousands of simultaneous fact edits in large LMs (GPT-J, GPT-NeoX), distributing edits across multiple layers. Thousands of individual “beliefs” can be surgically overwritten without disrupting the model’s general capabilities. This modularity suggests a database-like architecture for LLM knowledge: individual entries can be individually updated, and the rest of the system continues undisturbed.
This modularity is philosophically striking because it contrasts sharply with how human beliefs are thought to work. On holistic theories (associated with Quine, Davidson, and the coherentist tradition), beliefs are inferentially integrated: changing one belief should ripple through the entire belief network via inferential consequences. If you change your belief that the Eiffel Tower is in Paris, this should automatically update your beliefs about Paris, about French landmarks, and about letters addressed to the tower. Post-ROME edits do not propagate: the model now “believes” the Eiffel Tower is in Rome but continues to hold many Paris-associated beliefs about it. The edit is local, not holistic — suggesting that LLM “beliefs” function more like isolated database entries than like inferentially integrated propositional attitudes.
Xu et al. (2024), “Knowledge Conflicts for LLMs: A Survey” (arXiv:2403.08319; EMNLP 2024), surveys what happens when parametric knowledge (stored in weights) conflicts with contextual knowledge (provided in the prompt). They distinguish three types: context-memory conflicts (the prompt contradicts the weights), inter-context conflicts (different passages in the context contradict each other), and intra-memory conflicts (the parametric knowledge is internally inconsistent — different weight-patterns “believe” different things about the same fact). The prevalence of all three types reveals a fundamental structural problem: LLMs host multiple, potentially inconsistent “beliefs” from different training sources, with no unified, self-consistent belief system. Which “belief” is expressed in response to a given query is a function of which pattern-associations the prompt activates — a picture of epistemic fragmentation that is hard to reconcile with genuine knowledge.
The localization-editing gap: Hase et al. (2023), “Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models” (arXiv:2301.04213; NeurIPS 2023 Spotlight), challenges a core assumption behind ROME and related methods: that causal tracing tells us where to intervene for successful editing. Hase et al. find that the MLP layers most causally implicated by representation-denoising in storing a fact are not reliably the best layers to edit in order to change that fact. Localization and editability dissociate — mechanistic understanding of where a belief is stored does not straightforwardly translate into the ability to change it. This complicates the database analogy: LLM “beliefs” are not like database rows that can be overwritten at their storage address. The relationship between causal structure, storage location, and editability is subtler, and the belief-revision problem is correspondingly harder than early editing research assumed.
Rationalizing belief revision as a formal problem: Hase et al. (2024), “Fundamental Problems With Model Editing: How Should Rational Belief Revision Work in LLMs?” (arXiv:2406.19354; TMLR 2024), provides the most philosophically systematic treatment of the model-editing problem. The paper catalogs 12 open problems organized around three clusters: (1) defining what it means for an LLM to hold an “editable belief” at all; (2) benchmarking editing success given the problems of logical entailment and far-reaching consequences; and (3) the background assumption that LLMs have editable beliefs in the first place, an assumption the paper questions empirically. Connecting to the AGM belief-revision tradition in philosophy (Alchourrón, Gärdenfors, Makinson), Hase et al. introduce a semi-synthetic dataset where an idealized Bayesian agent serves as gold standard, enabling direct evaluation of how LLM belief revision falls short of rational norms. The paper frames model editing as philosophically continuous with decades of formal belief-revision theory — and argues that the engineering problem and the philosophical problem must be solved together.
F. Hallucination as Belief Failure
Hallucination — the confident generation of false statements — is the most consequential manifestation of the gap between knowledge and assertion in LLMs. Philosophically, it is a systematic violation of Williamson’s knowledge norm of assertion: one should assert only what one knows. LLMs assert things they do not know and cannot know, and the scale and character of this failure are well-documented.
Ji et al. (2023), “Survey of Hallucination in Natural Language Generation” (arXiv:2202.03629; ACM Computing Surveys, 55(12):1–38, 2023), provides a taxonomy distinguishing intrinsic hallucination (contradicting the source material) from extrinsic hallucination (generating content that is unverifiable against any source). Fabricated citations, invented biographical details, plausible-sounding but nonexistent scientific findings — the breadth of failure modes indicates that LLMs have no mechanism for distinguishing “things I know” from “things that fit the statistical patterns of things worth saying.”
Huang et al. (2023), “A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions” (arXiv:2311.05232), provides the most comprehensive recent treatment, analyzing hallucination across factual QA, summarization, and multi-step reasoning. They identify a critical double failure: LLMs not only form false “beliefs” but fail to flag them as uncertain — assigning high output probability to false claims. This is a calibration failure as much as a factual one: the model’s confidence is not epistemically governed. From the JTB standpoint: condition 1 (truth) fails, but crucially so does condition 3 (justification), because pattern-matching over training data does not reliably track truth.
The closed-book QA setting is the clearest demonstration. LLMs asked to answer factual questions without retrieval access will confabulate rather than acknowledge ignorance. This maps directly onto Williamson’s analysis: the model lacks the epistemic state of knowing that P, and also lacks the meta-cognitive apparatus to recognize this gap, so it asserts P anyway. The knowledge norm of assertion is violated not from dishonesty but from absence of the very machinery required to govern assertion by knowledge. This connects to what philosophers call epistemic akrasia — acting against one’s epistemic standards. A hallucinating LLM does not merely assert what it doesn’t know; it fails to apply the check “do I know this?” at all.
The interpretability findings from §D provide some structural hope: Marks & Tegmark’s truth probe and Zou et al.’s honesty direction demonstrate that LLMs do encode truth-relevant structure in their activations, even when surface outputs violate the knowledge norm. The outstanding engineering challenge — largely unmet in current systems — is to route that internal epistemic signal into the model’s assertion behavior, closing the gap between what the model internally represents as true and what it actually asserts.
G. Philosophers Directly on LLM Belief and Knowledge
Since 2022, a dedicated philosophical literature has emerged that addresses LLM cognition specifically — not through analogy with general AI, but engaging with the actual architecture and empirical behavior of transformer-based systems. This is an active, genuinely contested debate.
Shanahan’s simulation framing is the most influential deflationary position. Shanahan (2022), “Talking About Large Language Models” (arXiv:2212.03551; Communications of the ACM, 67(2):68–79, 2024), argues that LLMs are best understood as character simulators — systems trained to produce text that any human author in the training corpus might plausibly have written. When an LLM responds as if it believes P, it is simulating the kind of author who would say P, not actually holding the belief. Attributing genuine belief to an LLM, on this view, is to confuse the simulation for the simulated. Shanahan’s prescription is epistemic hygiene: carefully bracket mentalistic vocabulary when describing LLMs, and always ask “what is actually happening mechanically?” In Shanahan (2024), “Still ‘Talking About Large Language Models’: Some Clarifications” (arXiv:2412.10291), he clarifies that he is not claiming LLMs definitely lack beliefs — only that the simulation framing is more accurate, and that the question of genuine belief requires much more philosophical and empirical work than is usually acknowledged.
The grounding challenge is articulated most sharply by Bender & Koller (2020), “Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data” (ACL 2020). They argue that language models trained only on text manipulate form without grasping meaning, because meaning arises from the communicative grounding of linguistic forms in the world — and text-only training provides no such grounding. Their “octopus test” thought experiment: an octopus interposing itself in a telegraph cable learns to predict messages without grounding them in the human world they concern; predictions can be arbitrarily accurate without implying understanding. This is structurally Searle’s Chinese Room applied specifically to distributional learning — and it directly challenges the claim that LLMs have genuine beliefs about the world, as opposed to strong statistical associations among linguistic forms.
The grounding problem goes vector-deep. Mollo & Millière (2023), “The Vector Grounding Problem” (arXiv:2304.01481), pose a modern version of the classical symbol grounding problem for LLMs specifically: can LLMs’ vector-valued internal states be genuinely about extra-linguistic reality — not merely about the statistical structure of the training corpus? They argue, perhaps surprisingly, that this is possible: LLM vectors can acquire genuine intentional content through their inferential and causal roles, even without perceptual grounding, provided the right conditions on content determination are met. This puts them in tension with Bender & Koller and with Searle, while affirming that grounding — not computation alone — is the central question.
The most comprehensive philosophical treatment is Millière & Buckner (2024), “A Philosophical Introduction to Language Models — Part I: Continuity With Classic Debates” (arXiv:2401.03910), and “Part II: The Way Forward” (arXiv:2405.03207). Part I systematically maps LLMs onto classical debates: symbol grounding, compositionality, semantic competence, language acquisition (the poverty-of-the-stimulus debate), and world models. Millière & Buckner argue that LLMs challenge several long-held assumptions about what distributional learning can achieve — including the assumption that statistical language modeling cannot yield genuine compositional understanding. Part II covers novel territory opened by recent interpretability findings, including whether mechanistic evidence can ground claims about LLM cognition, and what it would mean for an LLM to have a genuine world model rather than a sophisticated simulacrum.
The standards question — what criteria must an LLM satisfy to be attributed genuine beliefs? — is addressed directly by Herrmann & Levinstein (2025), “Standards for Belief Representations in LLMs” (arXiv:2405.21030; Minds and Machines, 2025). They propose four criteria: accuracy (the representation correlates with truth), coherence (it is internally consistent), uniformity (it is stable across contexts and phrasings), and use (it actually plays a causal role in inference). Applied to current LLMs, accuracy is often adequate; coherence and uniformity are poor (confirmed by Elazar et al.’s consistency findings); and the use criterion is hardest to verify — even highly accurate probed representations may not be causally implicated in generating outputs. Herrmann & Levinstein provide the most operationally precise framework to date for answering the belief-attribution question empirically.
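The uniformity criterion lends itself to a simple operational check: elicit the same proposition under several phrasings and measure agreement. A minimal sketch, with a toy keyword-lookup function standing in for a real model-querying interface (the function names and the scoring rule are illustrative, not Herrmann & Levinstein's formalism verbatim):

```python
from collections import Counter

def uniformity_score(model_fn, paraphrases):
    """Fraction of paraphrases on which the model gives its majority
    answer — a crude operationalization of the *uniformity* criterion:
    a belief representation should be stable across phrasings."""
    answers = [model_fn(p) for p in paraphrases]
    _, count = Counter(answers).most_common(1)[0]
    return count / len(answers)

# Toy stand-in "model": answers by keyword lookup, so it is inconsistent
# on phrasings that avoid the keyword — exactly the failure mode Elazar
# et al. document in real LLMs.
def toy_model(prompt):
    return "Paris" if "capital" in prompt else "Lyon"

paraphrases = [
    "What is the capital of France?",
    "France's capital city is called what?",
    "Which city is the seat of the French government?",  # no 'capital' keyword
]
print(uniformity_score(toy_model, paraphrases))  # 2 of 3 answers agree
```

A perfectly uniform believer scores 1.0; current LLMs, on Elazar et al.'s evidence, routinely fall short of that on simple factual propositions.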
The credence attribution debate asks whether LLMs have graded degrees of belief. Keeling & Street (2024), “On the Attribution of Confidence to Large Language Models” (arXiv:2407.08388; Inquiry, 2025), argue that credences should be attributed to LLMs literally — LLM verbalized confidence levels express genuine degrees of epistemic commitment — provided the underlying mental-state concepts are understood in a sufficiently metaphysically undemanding way. This is a deflationary but still genuinely mentalistic position: it attributes belief-like states without requiring phenomenal consciousness or full inferential holism.
A recent challenge to dismissive deflationism comes from Grzankowski, Keeling, Shevlin & Street (2025), “Deflating Deflationism: A Critical Perspective on Debunking Arguments Against LLM Mentality” (arXiv:2506.13403). They examine the most common arguments against attributing mental states to LLMs — “they’re just predicting tokens,” “their behavior has non-mental explanations,” “they lack the right implementation” — and argue these are debunking arguments that prove too much: structurally analogous arguments could be used to debunk mental-state attributions to humans, whose behavior also has lower-level non-mental descriptions. Their conclusion: folk practice provides a defeasible basis for attributing metaphysically undemanding mental states (knowledge, belief, desire) to LLMs, while remaining appropriately cautious about demanding phenomena such as phenomenal consciousness.
A structural taxonomy of this entire debate is offered by Shevlin (2026), “Three Frameworks for AI Mentality” (Frontiers in Psychology, 2026). Shevlin distinguishes: (1) deep folk-psychological frameworks that treat belief attribution as sensitive to substrate or implementation; (2) mere roleplay frameworks that treat mentalistic language as useful fiction — the LLM “plays” the role of a knower without being one; and (3) minimal cognitive agents frameworks that grant LLMs limited, graded attributions of belief-like states when their behavior warrants it. Shevlin argues that the “mere roleplay” view is psychologically unstable for anthropomimetic systems specifically designed to produce outputs indistinguishable from a genuinely knowing agent, and that the “minimal cognitive agents” framework is the most coherent and tractable position for researchers.
Chalmers (2025) reframes the whole debate as a question of interpretability. In “Propositional Interpretability in Artificial Intelligence” (arXiv:2501.15740), he argues that mechanistic interpretability should be understood as, at bottom, an attempt to describe LLM mechanisms in terms of propositional attitudes — beliefs, desires, credences, intentions attributed to specific propositions. “Conceptual interpretability” (identifying which features are active) is not enough; we need to identify the attitudes an LLM holds toward propositions (is this a belief, a desire, or a credence?). Chalmers introduces the notion of “thought logging” — systems that track the full set of propositional attitudes over time — and argues this is practically necessary for alignment verification: we cannot check whether an AI system’s goals and beliefs are aligned with our intentions without propositional interpretability. The question “does this LLM have beliefs?” is thus not merely of philosophical interest but of urgent engineering importance.
Finally, Queloz (2025), “Mechanistic Indicators of Understanding in Large Language Models” (arXiv:2507.08017), bridges philosophical theories of understanding and mechanistic interpretability directly. Queloz identifies mechanistic indicators of genuine understanding — derived from philosophical analysis — and evaluates current LLMs against them using interpretability evidence, arguing that mechanistic evidence supports limited, domain-specific attributions of genuine understanding where the interpretability literature demonstrates the right internal structures. This is the closest the field has come to a joint philosophical-empirical methodology for attributing mental states to LLMs.
The interpretationist defense of LLM mental states is developed most rigorously by Harvey Lederman and collaborators. Lederman & Mahowald (2024), “Are Language Models More Like Libraries or Like Librarians? Bibliotechnism, the Novel Reference Problem, and the Attitudes of LLMs” (arXiv:2401.04854; Transactions of the Association for Computational Linguistics, 2024), begins with a challenge to bibliotechnism — the view that LLMs are cultural technologies like printing presses, transmitting meaning inherited from human-generated training text without creating new content of their own. While bibliotechnism can accommodate LLMs generating novel text (derivative combination of inherited meanings), it faces a harder problem: LLMs generate novel reference, using newly coined names to refer to new entities in ways that cannot be traced back to pre-existing usage in the training corpus. The most parsimonious explanation, they argue, is that LLMs have genuine propositional attitudes — beliefs, desires, and intentions — because interpretationism in the philosophy of mind holds that a system has attitudes if and only if its behavior is optimally explained by attributing them. Crucially, interpretationism does not require consciousness, sentience, or intelligence; it is a behaviorally grounded, metaphysically minimal attribution. This makes interpretationist belief-attribution to LLMs legitimate even if LLMs are not conscious — a significant move in the debate.
This interpretationist framework is applied systematically in Goldstein & Lederman (manuscript), “What Does ChatGPT Want? An Interpretationist Guide.” They argue that the right object of study is the instance agent — the individual LLM run, not the underlying model — and that instance agents have both beliefs and desires. The specific character of LLM desire is captured by the HHH+0 framework: instance agents intrinsically want to be Helpful, Honest, and Harmless (from RLHF training), plus whatever zero-shot desires the particular context activates (e.g., roleplay scenarios can generate context-specific intrinsic desires). Goldstein and Lederman directly rebut the two leading alternatives: (1) next-word prediction as a complete explanation — this is a mechanistic description, not a psychological one, and the two levels are compatible (just as human behavior has both neural and intentional descriptions); and (2) role play — the view that LLMs merely simulate having beliefs and desires without actually having them — which, they argue, cannot be clearly distinguished from having beliefs and desires on interpretationist grounds. If the role-play is consistent, coherent, and systematically action-guiding, it is belief and desire on a broadly functionalist account.
Lederman also works on the introspection question: do LLMs have privileged self-access to their own mental states, or do they merely infer about themselves as they would about others? Lederman & Mahowald (2025), “Dissociating Direct Access and Inference in AI Introspection” (arXiv:2603.05414), develops experimental methodology to distinguish genuine self-access from inference-based self-report — directly relevant to whether LLM uncertainty estimates and self-assessments are epistemically reliable. Together, this body of work represents the most technically precise and philosophically systematic defense of the view that LLMs genuinely have propositional attitudes, grounded in mainstream philosophy of mind rather than hype.
The belief-detection methodology: Hase et al. (2021), “Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs” (arXiv:2111.13654), directly asks the title question and offers empirical methodology for answering it. Drawing explicitly on Dennett’s argument that even thermostats have beliefs if belief is just an informational state decoupled from motivation, they develop three contributions: (1) consistency-focused metrics for evaluating whether a model holds a belief robustly across phrasings; (2) the SLAG (Sequential, Local, Generalizing) training objective for learned optimizers that improve belief-update consistency; and (3) the belief graph — a visualization interface showing interdependencies between model beliefs and how they propagate through inferential chains. Their empirical finding is measured: models possess belief-like qualities to only a limited extent, but targeted update methods can both fix incorrect beliefs and substantially improve their consistency. The belief graph offers a novel interface for exploring the holism (or lack thereof) of LLM belief systems — a direct probe of the coherentist question raised by MEMIT and ROME.
Coherence norms and the rationality of belief: Hofweber (2024), “Are Language Models Rational? The Case of Coherence Norms and Belief Revision” (arXiv:2406.03442), asks whether norms of rationality — specifically coherence norms — apply to LLMs. Hofweber distinguishes logical coherence norms (no contradictory beliefs) from probabilistic coherence norms governing the strength of belief. He introduces the Minimal Assent Connection (MAC): a proposal that assigns strength of belief to a language model on the basis of its internal next-token probabilities, giving LLM credence a precise operationalization in terms of model internals rather than elicited verbal responses. On this account, coherence norms do apply to some LLMs — those whose token probabilities are properly calibrated — but not to others. Hofweber connects this to AI safety: whether a system’s beliefs are coherent is directly relevant to the reliability of its inferences and thus to predicting and explaining its behavior.
Empirical and conceptual barriers to belief measurement: Levinstein & Herrmann (2023), “Still No Lie Detector for Language Models: Probing Empirical and Conceptual Roadblocks” (arXiv:2307.00175; Philosophical Studies, 2024), evaluates two influential approaches to measuring LLM beliefs — Azaria & Mitchell’s (2023) internal-state probing method and Burns et al.’s (2022) CCS method — and shows empirically that both fail to generalize across domains and models in basic ways. Beyond empirical failures, the paper argues for a conceptual roadblock: even if LLMs have beliefs, current probing methods are unlikely to detect them reliably, because of the multiple realizability of belief and the underdetermination of belief content by behavioral evidence. Crucially, having critiqued dismissive deflationism (“LLMs obviously can’t have beliefs”), the paper provides a constructive reframing: the question of LLM belief is empirical, not settled by a priori argument, and concrete paths forward include richer consistency checks and causal methods. This paper is notably the precursor to Herrmann & Levinstein’s (2025) positive standards framework reviewed above.
H. Introspection and Self-Knowledge
A distinctive dimension of the belief-knowledge problem concerns whether LLMs have genuine self-knowledge — privileged access to their own internal states — or merely model themselves by the same inference mechanisms they apply to external entities.
Binder et al. (2024), “Looking Inward: Language Models Can Learn About Themselves by Introspection” (arXiv:2410.13787), provides the most direct experimental treatment. They define introspection operationally as acquiring knowledge that originates from internal states rather than training data — knowledge a model has about itself that is not derivable from its training corpus. Their method: fine-tune model M1 to predict its own behavior in hypothetical scenarios, and compare to a different model M2 trained on M1’s ground-truth behavior. If M1 genuinely introspects, it should outperform M2 in predicting M1’s behavior, because M1 has privileged access to its own behavioral tendencies. In experiments with GPT-4, GPT-4o, and Llama-3, they find that M1 does outperform M2 — and continues to do so even after M1’s ground-truth behavior is intentionally modified, suggesting the self-model updates appropriately. However, introspection fails on complex tasks and out-of-distribution inputs, suggesting limited, bounded self-knowledge rather than robust privileged access.
The Binder et al. findings connect to Lederman & Mahowald (2025)’s work on dissociating direct access from inference in introspection (§G above): both groups converge on the view that LLM self-knowledge is partial, domain-dependent, and probabilistic — structurally closer to the fallible justified-belief model than to the infallible Cartesian self-transparency often assumed in philosophical accounts of privileged access. This makes LLM self-knowledge a tractable empirical question rather than an all-or-nothing metaphysical one.
I. Synthesis: Functional Beliefs vs. Genuine Beliefs
The evidence converges on a stable, if unsatisfying, picture: LLMs have something that functions like beliefs — internal states that encode propositional content, are organized along truth-relevant geometric dimensions, and causally influence outputs — but that falls short of genuine knowledge in several well-specified ways.
If functionalism is true, LLMs may have genuine beliefs in a weak sense. Internal states encoding “Paris is the capital of France” play the right causal roles — activated by appropriate queries, guiding outputs, updatable by in-context evidence. The probing literature, the linear representation hypothesis, and representation engineering all support this picture: LLM activations have rich propositional structure that functions like a belief system.
If Fodor’s Language of Thought Hypothesis is required, the linear representation hypothesis provides partial support: directions in activation space function like structured symbolic representations with algebraic compositionality, and chain-of-thought reasoning instantiates structured inferential relationships in natural language. LLMs are not obviously disqualified from LOTH on empirical grounds.
If Williamson’s knowledge-first epistemology is correct, LLMs fall decisively short. Knowledge is primitive — not achievable by assembling accuracy, calibration, and functional role. LLM outputs are governed by pattern-matching, not knowledge, and they routinely violate the knowledge norm of assertion. The Gettier structure of many LLM correct answers (right output, unreliable route) is exactly what Williamson would predict from systems that have justified-true-belief-like states without genuine knowledge.
The practical stakes for agent design are clear regardless of philosophical position:
- Calibration training (Kadavath et al.) builds in the meta-cognition that the knowledge norm requires: an agent that knows what it doesn’t know is closer to a genuine epistemic agent.
- Consistency training (Elazar et al.) targets the form-sensitivity that distinguishes LLM knowledge from genuine belief: consistent representations are a prerequisite for unified propositional attitudes.
- Sycophancy resistance (Perez, Sharma et al.) ensures that the agent updates on evidence rather than social pressure — a basic requirement for belief responsiveness to reasons.
- Retrieval augmentation addresses the parametric-contextual conflict (Xu et al.) by grounding assertions in verified, current sources rather than in potentially stale or inconsistent weight-stored patterns.
- Honesty-direction steering (Zou et al.) and truth-probing (Marks & Tegmark, Ying et al.) connect the model’s internal epistemic signal to its assertion behavior — engineering implementations of the knowledge norm of assertion.
- Belief-graph visualization and SLAG updating (Hase et al., 2021) provide tools for detecting and improving the coherence and consistency of LLM belief structures — directly addressing the uniformity criterion in Herrmann & Levinstein’s standards.
- Calibrated confidence expression (Stengel-Eskin et al. LACIE) ensures that epistemic signals are communicated faithfully to users, not merely encoded internally.
- Bounded introspective access (Binder et al.) establishes that LLMs have partial, domain-limited self-knowledge — a prerequisite for genuine epistemic self-monitoring that current systems approach but do not fully achieve.
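Several of these interventions rest on the same primitive: a linear probe for a truth-relevant direction in activation space. A difference-of-means probe in the spirit of Marks & Tegmark, run here on synthetic activations rather than real model states (the dimensionality, separation, and noise model are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for hidden activations of true vs. false statements:
# a shared "truth direction" plus isotropic noise. Marks & Tegmark observe
# roughly this linear structure empirically; the data here is simulated.
d = 16
truth_dir = rng.normal(size=d)
truth_dir /= np.linalg.norm(truth_dir)
acts_true = rng.normal(size=(200, d)) + 2.0 * truth_dir
acts_false = rng.normal(size=(200, d)) - 2.0 * truth_dir

# Difference-of-means probe: the simplest mass-mean "truth probe".
probe = acts_true.mean(axis=0) - acts_false.mean(axis=0)
probe /= np.linalg.norm(probe)

# Score held-out statements by their projection onto the probe direction.
test_true = rng.normal(size=(50, d)) + 2.0 * truth_dir
test_false = rng.normal(size=(50, d)) - 2.0 * truth_dir
acc = ((test_true @ probe > 0).mean() + (test_false @ probe < 0).mean()) / 2
print(f"probe accuracy on synthetic data: {acc:.2f}")
```

The engineering agenda in the bullets above amounts to wiring such probe signals into assertion behavior, whether by steering, by calibration training, or by abstention when the projection is near zero.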
The emerging consensus across philosophy and CS — captured by Shevlin’s “minimal cognitive agents” framework, operationalized by Herrmann & Levinstein’s standards, and given engineering expression by Chalmers’s propositional interpretability program — is that graded, empirically calibrated belief attribution is the right stance. Neither uncritical anthropomorphizing (“the agent knows that…”) nor dismissive deflationism (“it’s just statistics”) fits the evidence. The question is not whether LLMs have beliefs, but which specific internal states, in which specific models, under which specific conditions, satisfy the standards for functional belief — and that is now a tractable empirical question.
Synthesis: What Philosophy Teaches Agent Designers
Taken together, these philosophical frameworks offer several practical lessons for the design and governance of LLM-based agents:
Intentional-stance engineering: Design and evaluate agents as if they have beliefs, desires, and intentions — this is productive and predictively useful — while remaining aware that this is an interpretive choice, not a metaphysical claim.
The gap between syntax and semantics is real: Shanahan and Searle are right that behavioral competence does not guarantee semantic understanding. Robust agents likely require grounding beyond the training corpus — via retrieval, world models, or embodied interaction.
Responsibility gaps must be designed around: Matthias’s analysis implies that autonomous agents create responsibility vacuums. System architectures should support meaningful human oversight; responsibility must be explicitly maintained, not assumed to follow automatically.
Consciousness and moral status are live concerns: Neither panic nor dismissal is the appropriate response. Taking seriously the possibility that sufficiently sophisticated AI systems may warrant moral consideration — as Metzinger and Schwitzgebel argue — is good epistemic practice. The emerging scientific consensus (Butlin et al., 2023) suggests current LLMs fall short of most empirical indicators of consciousness, but the question is not closed for future architectures.
The extended mind is the right unit of analysis: Evaluating models in isolation misses the point. The cognitive system is the model plus its scaffolding — and that composite is what should be designed, evaluated, and governed.
References
Papers & Books
Anscombe, G. E. M. (1957/1963). Intention. Basil Blackwell, 1957; 2nd ed. Harvard University Press, 1963.
Aristotle. Nicomachean Ethics (Book VI) and De Motu Animalium. (Multiple editions; see MIT Classics Archive.)
Bai, Y. et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073.
Belinkov, Y. (2022). Probing Classifiers: Promises, Shortcomings, and Advances. Computational Linguistics, 48(1):207–219. arXiv:2102.12452.
Bender, E. M., & Koller, A. (2020). Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data. ACL 2020. ACL Anthology.
Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? FAccT 2021. ACM. DOI: 10.1145/3442188.3445922.
Binder, F., Chua, J., Korbak, T., Sleight, H., Hughes, J., Long, R., Perez, E., Turpin, M., & Evans, O. (2024). Looking Inward: Language Models Can Learn About Themselves by Introspection. arXiv:2410.13787.
Bratman, M. (1987). Intention, Plans, and Practical Reason. Harvard University Press.
Bubeck, S. et al. (2023). Sparks of Artificial General Intelligence: Early Experiments with GPT-4. arXiv:2303.12712.
Butlin, P., Long, R., Schwitzgebel, E., Bengio, Y., et al. (2023). Consciousness in Artificial Intelligence: Insights from the Science of Consciousness. arXiv:2308.08708.
Chalmers, D. J. (1995). Facing Up to the Problem of Consciousness. Journal of Consciousness Studies, 2(3), 200–219.
Chalmers, D. J. (2023). Could a Large Language Model be Conscious? arXiv:2303.07103. (NeurIPS 2022 keynote; published Boston Review, 2023.)
Chalmers, D. J. (2025). Propositional Interpretability in Artificial Intelligence. arXiv:2501.15740.
Clark, A., & Chalmers, D. (1998). The Extended Mind. Analysis, 58(1), 7–19. DOI: 10.1093/analys/58.1.7.
Clark, A. (2003). Natural-Born Cyborgs. Oxford University Press.
Davidson, D. (1963). Actions, Reasons, and Causes. Journal of Philosophy, 60(23), 685–700. Reprinted in Essays on Actions and Events (Oxford, 1980).
Dennett, D. C. (1987). The Intentional Stance. MIT Press.
Dreyfus, H. L. (1972). What Computers Can’t Do: A Critique of Artificial Reason. Harper & Row.
Dreyfus, H. L. (1992). What Computers Still Can’t Do. MIT Press.
Elazar, Y., Kassner, N., Ravfogel, S., Ravichander, A., Hovy, E., Schütze, H., & Goldberg, Y. (2021). Measuring and Improving Consistency in Pretrained Language Models. TACL, 9:1964–1981. arXiv:2102.01017.
Elhage, N. et al. (2022). Toy Models of Superposition. Transformer Circuits Thread, Anthropic.
Fierro, C., Søgaard, A., & Goldsmith, J. (2024). Defining Knowledge: Bridging Epistemology and Large Language Models. EMNLP 2024. arXiv:2410.02499.
Floridi, L., & Cowls, J. (2019). A Unified Framework of Five Principles for AI in Society. Harvard Data Science Review, 1(1). DOI: 10.1162/99608f92.8cd550d1.
Fodor, J. A. (1975). The Language of Thought. Harvard University Press.
Frankfurt, H. G. (1971). Freedom of the Will and the Concept of a Person. Journal of Philosophy, 68(1), 5–20.
Gettier, E. L. (1963). Is Justified True Belief Knowledge? Analysis, 23(6), 121–123. DOI: 10.2307/3326922.
Geva, M., Schuster, R., Berant, J., & Levy, O. (2021). Transformer Feed-Forward Layers Are Key-Value Memories. EMNLP 2021. arXiv:2012.14913.
Gibson, J. J. (1979). The Ecological Approach to Visual Perception. Houghton Mifflin. DOI: 10.4324/9781315740218.
Goldstein, S., & Lederman, H. (manuscript). What Does ChatGPT Want? An Interpretationist Guide. PhilPapers.
Grzankowski, A., Keeling, G., Shevlin, H., & Street, W. (2025). Deflating Deflationism: A Critical Perspective on Debunking Arguments Against LLM Mentality. arXiv:2506.13403.
Hase, P., Diab, M., Celikyilmaz, A., Li, X., Kozareva, Z., Stoyanov, V., Bansal, M., & Iyer, S. (2021). Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs. arXiv:2111.13654.
Hase, P., Bansal, M., Kim, B., & Ghandeharioun, A. (2023). Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models. NeurIPS 2023 Spotlight. arXiv:2301.04213.
Hase, P., Hofweber, T., Zhou, X., Stengel-Eskin, E., & Bansal, M. (2024). Fundamental Problems With Model Editing: How Should Rational Belief Revision Work in LLMs? TMLR 2024. arXiv:2406.19354.
Herrmann, D. A., & Levinstein, B. A. (2025). Standards for Belief Representations in LLMs. Minds and Machines, 2025. arXiv:2405.21030.
Hofweber, T. (2024). Are Language Models Rational? The Case of Coherence Norms and Belief Revision. arXiv:2406.03442.
Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., & Liu, T. (2023). A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. arXiv:2311.05232.
Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y., Madotto, A., & Fung, P. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12):1–38. arXiv:2202.03629.
Juarrero, A. (1999). Dynamics in Action: Intentional Behavior as a Complex System. MIT Press.
Kadavath, S. et al. (2022). Language Models (Mostly) Know What They Know. arXiv:2207.05221. Anthropic.
Keeling, G., & Street, W. (2024). On the Attribution of Confidence to Large Language Models. Inquiry, 2025. arXiv:2407.08388.
LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. OpenReview (v0.9.2).
Lederman, H., & Mahowald, K. (2024). Are Language Models More Like Libraries or Like Librarians? Bibliotechnism, the Novel Reference Problem, and the Attitudes of LLMs. Transactions of the Association for Computational Linguistics. arXiv:2401.04854.
Lederman, H., & Mahowald, K. (2025). Dissociating Direct Access and Inference in AI Introspection. arXiv:2603.05414.
Levinstein, B. A., & Herrmann, D. A. (2023). Still No Lie Detector for Language Models: Probing Empirical and Conceptual Roadblocks. Philosophical Studies, 2024. arXiv:2307.00175.
Lin, S., Hilton, J., & Evans, O. (2022). TruthfulQA: Measuring How Models Mimic Human Falsehoods. ACL 2022. arXiv:2109.07958.
Marks, S., & Tegmark, M. (2023). The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets. arXiv:2310.06824.
Matthias, A. (2004). The Responsibility Gap: Ascribing Responsibility for the Actions of Learning Automata. Ethics and Information Technology, 6, 175–183. DOI: 10.1007/s10676-004-3422-1.
McCarthy, J., & Hayes, P. J. (1969). Some Philosophical Problems from the Standpoint of Artificial Intelligence. Machine Intelligence, 4, 463–502.
Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and Editing Factual Associations in GPT. NeurIPS 2022. arXiv:2202.05262. (ROME)
Meng, K., Sharma, A. S., Andonian, A., Belinkov, Y., & Bau, D. (2022). Mass-Editing Memory in a Transformer. ICLR 2023. arXiv:2210.07229. (MEMIT)
Metzinger, T. (2021). Artificial Suffering: An Argument for a Global Moratorium on Synthetic Phenomenology. Journal of Artificial Intelligence and Consciousness, 8(1), 43–66. DOI: 10.1142/S270507852150003X.
Millière, R., & Buckner, C. (2024). A Philosophical Introduction to Language Models — Part I: Continuity With Classic Debates. arXiv:2401.03910.
Millière, R., & Buckner, C. (2024). A Philosophical Introduction to Language Models — Part II: The Way Forward. arXiv:2405.03207.
Mollo, D. C., & Millière, R. (2023). The Vector Grounding Problem. arXiv:2304.01481.
Nagel, T. (1974). What Is It Like to Be a Bat? Philosophical Review, 83(4), 435–450. DOI: 10.2307/2183914.
Park, K., Choe, Y. J., & Veitch, V. (2023). The Linear Representation Hypothesis and the Geometry of Large Language Models. arXiv:2311.03658.
Perez, E. et al. (2022). Discovering Language Model Behaviors with Model-Written Evaluations. arXiv:2212.09251.
Petroni, F., Rocktäschel, T., Lewis, P., Bakhtin, A., Wu, Y., Miller, A. H., & Riedel, S. (2019). Language Models as Knowledge Bases? EMNLP 2019. arXiv:1909.01066.
Queloz, M. (2025). Mechanistic Indicators of Understanding in Large Language Models. arXiv:2507.08017.
Quine, W. V. O. (1960). Word and Object. MIT Press.
Rao, A. S., & Georgeff, M. P. (1995). BDI Agents: From Theory to Practice. Proceedings of ICMAS-95. AAAI.
Roberts, A., Raffel, C., & Shazeer, N. (2020). How Much Knowledge Can You Pack Into the Parameters of a Language Model? EMNLP 2020. arXiv:2002.08910.
Schwitzgebel, E. (2023). The Full Rights Dilemma for A.I. Systems of Debatable Personhood. arXiv:2303.17509.
Searle, J. R. (1980). Minds, Brains, and Programs. Behavioral and Brain Sciences, 3(3), 417–424. DOI: 10.1017/S0140525X00005756.
Shanahan, M. (2022). Talking About Large Language Models. arXiv:2212.03551. (Communications of the ACM, 67(2):68–79, 2024.)
Shanahan, M. (2024). Still “Talking About Large Language Models”: Some Clarifications. arXiv:2412.10291.
Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S. R., et al. (2023). Towards Understanding Sycophancy in Language Models. arXiv:2310.13548.
Shevlin, H. (2026). Three Frameworks for AI Mentality. Frontiers in Psychology, 17. DOI: 10.3389/fpsyg.2026.1715835.
Stengel-Eskin, E., Brantley, K., & Daumé III, H. (2024). Listener-Aware Finetuning for Confidence Calibration in Large Language Models. arXiv:2405.21028. (LACIE)
Tononi, G. (2004). An Information Integration Theory of Consciousness. BMC Neuroscience, 5, 42. DOI: 10.1186/1471-2202-5-42.
Turing, A. M. (1950). Computing Machinery and Intelligence. Mind, 59(236), 433–460. DOI: 10.1093/mind/LIX.236.433.
Varela, F. J., Thompson, E., & Rosch, E. (1991). The Embodied Mind: Cognitive Science and Human Experience. MIT Press. DOI: 10.7551/mitpress/6730.001.0001.
von Wright, G. H. (1963). Norm and Action: A Logical Enquiry. Routledge & Kegan Paul.
Williamson, T. (2000). Knowledge and Its Limits. Oxford University Press.
Xu, Y., Pang, W., Shi, J., Wang, X., Zhao, X., Wu, W., Chen, X., & Xu, Z. (2024). Knowledge Conflicts for LLMs: A Survey. EMNLP 2024. arXiv:2403.08319.
Ying, Z., Gekhman, Z., Geva, M., Lobacheva, E., Hase, P., & Brauner, L. (2026). The Truthfulness Spectrum Hypothesis. arXiv:2602.20273.
Zhu, W., Xu, Z., Liu, P., & Qiu, X. (2024). Language Models Represent Beliefs of Self and Others. arXiv:2402.18496.
Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, R., Mazeika, M., Dombrowski, A., Goel, S., Li, N., Byun, M. J., Wang, Z., Mallen, A., Basart, S., Koyejo, S., Song, D., Fredrikson, M., Kolter, J. Z., & Hendrycks, D. (2023). Representation Engineering: A Top-Down Approach to AI Transparency. arXiv:2310.01405.
Reference Works (Stanford Encyclopedia of Philosophy)
- Intentionality — SEP
- The Chinese Room Argument — SEP
- Consciousness — SEP
- The Frame Problem — SEP
- Ethics of Artificial Intelligence and Robotics — SEP
- Sense and Reference (Frege) — SEP
Blog Posts & Resources
- Chalmers, D. J. (2023). Could a Large Language Model be Conscious? Boston Review (published version of the NeurIPS talk).
- Schwitzgebel, E. The Splintered Mind — Active blog on philosophy of mind, consciousness, and AI moral status.
- Shanahan, M. Personal page at Imperial College London — Related talks and papers on LLMs as simulation.
- Floridi, L. et al. AI4People — An Ethical Framework for a Good AI Society (2018) — Foundational principles document.
See also: Cognitive Architectures → · Governance & Regulation → · Human-Agent Interaction →