Propositional Interpretability in Artificial Intelligence
David Chalmers
I will argue for the importance of a special sort of interpretability, which I call propositional interpretability. This involves interpreting a system’s mechanisms and behavior in terms of propositional attitudes (or generalizations thereof). In ordinary human psychology, propositional attitudes are attitudes (such as belief, desire, or subjective probability) to propositions (e.g. the proposition It is hot outside).
Explainability is explanation for ordinary humans.
Interpretability (in the narrower sense) is explanation for theorists.
Behavioral interpretability analyzes an AI system’s input/output behavior to understand what the system is doing. Mechanistic interpretability analyzes an AI system’s internal mechanisms to help explain (for theorists) what the system is doing.6
Representational interpretability aims to understand the internal representations that an AI system is using.
Propositional interpretability, as we’ve already seen, aims to understand the propositional attitudes that an AI system is using: e.g. believing the Golden Gate Bridge is large.
What are propositional attitudes? Perhaps the canonical propositional attitudes are belief and desire.
Another important propositional attitude is credence, or subjective probability.
Two other important propositional attitudes are intention and supposition.
Propositional attitudes can be divided into dispositional and occurrent.
Talking of models and goals rather than beliefs and desires can sidestep some debates about whether AI systems have minds. Beliefs and desires are usually understood as mental states. If so, then only systems with minds can believe something. It is highly controversial whether AI systems have minds, so it is controversial whether they believe anything. By contrast, it is somewhat less controversial to say that AI systems can have models or goals, because these are not usually understood as mental states that require a mind.
So Lewis’s statement of the problem amounts to: given the physical facts about a system, solve for the system’s beliefs, desires, and meanings. For my purposes, the focus on beliefs and desires is especially important. Since these are propositional attitudes, Lewis’s project is a version of propositional interpretability. Davidson’s version of radical interpretation differs in an important way.
Davidson’s version in effect says: given the behavioral facts about a system, solve for its beliefs, its desires, and its meanings.
A thought logging system is a meta-system that takes a specification of the algorithmic facts about an AI system as input (perhaps along with relevant environmental facts) and produces a list of the system’s current and ongoing propositional attitudes as outputs.
An ideal form of thought logging would include various extensions. Reason logging would display a system’s reasons for holding a given propositional attitude wherever possible, possibly via support links from earlier attitudes to later attitudes whenever the former plays a substantial role in the formation of the latter. Mechanism logging could enhance thought logging with an indication of the internal mechanisms responsible for any given propositional attitude, whenever possible. Reason logging may help a great deal with interpretability by ordinary humans, while mechanism logging may help with scientific and mechanistic interpretability. As I will discuss toward the end of this article, one could even try to develop consciousness logging, which logs a system’s conscious states.
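To make the proposal a little more concrete, here is a minimal sketch (in Python) of what thought-log entries, with optional reason links and mechanism links, might look like as a data structure. The attitude labels, field names, and the `ThoughtLogger` interface are purely illustrative assumptions, not a description of any existing system.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AttitudeEntry:
    """One logged propositional attitude at a given point in processing."""
    step: int                     # time step or token index in the system's processing
    attitude: str                 # e.g. "belief", "desire", "credence", "intention"
    proposition: str              # e.g. "The Golden Gate Bridge is large"
    credence: Optional[float] = None                      # subjective probability, if applicable
    reasons: list[int] = field(default_factory=list)      # earlier entries supporting this one (reason logging)
    mechanisms: list[str] = field(default_factory=list)   # internal mechanisms responsible (mechanism logging)

class ThoughtLogger:
    """Meta-system mapping facts about an AI system to a running log of attitudes."""
    def __init__(self) -> None:
        self.entries: list[AttitudeEntry] = []

    def log(self, entry: AttitudeEntry) -> int:
        """Append an entry and return its index, so later entries can cite it as a reason."""
        self.entries.append(entry)
        return len(self.entries) - 1

# Example: a belief supported by an earlier credence.
logger = ThoughtLogger()
i = logger.log(AttitudeEntry(step=0, attitude="credence",
                             proposition="It is hot outside", credence=0.9))
logger.log(AttitudeEntry(step=1, attitude="belief",
                         proposition="It is hot outside",
                         reasons=[i], mechanisms=["layer_12_mlp"]))
```

On this picture, reason logging corresponds to the links recorded in the `reasons` field, and mechanism logging to the `mechanisms` field; the hard problem, of course, is producing such entries accurately from the algorithmic facts.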
In the case of linguistic semantics, we can distinguish semantics from metasemantics. Where semantics offers theories of what the meanings or contents of various expressions are, metasemantics involves theories of the conditions in virtue of which linguistic expressions have the meanings or contents that they do. For example, semantics tells us that ‘+’ means addition, perhaps in some technical guise, while metasemantics might tell us that it is in virtue of the way ‘+’ is used in the community that it means addition.
In the case of psychosemantics, a similar distinction applies. The semantic branch of psychosemantics offers theories of what the meanings or contents of mental states are. The metasemantic branch of psychosemantics involves theories of the conditions in virtue of which mental states have the meanings and contents they do.
There are many different ways of understanding the information condition. Teleological theories rely especially on correlations in the evolutionary environment, or possibly in the learning environment. Informational theories rely more on correlation in the current environment. Causal theories hold that representations represent whatever normally causes them.16 Use principles say that what a state represents depends on how the state is used: a state represents X roughly when it drives further processing and behavior directed at X. Where information depends on what is upstream from the state (what brings the state about), use depends on what is downstream from the state (what the state brings about).
Psychosemantically, the causal tracing method relies almost wholly on use rather than information as a criterion for what is represented. An activity pattern counts as representing The Eiffel Tower is in Paris in virtue of its effects on downstream outputs (such as “Paris”), with no role for information (correlations with upstream states affecting inputs). This method is clearly a form of propositional interpretability. As such it has a number of limitations.
Robustness (Hoelscher-Obermaier 2022, Thibodeau 2022): The representation of facts such as The Eiffel Tower is in Rome seems quite fragile and prompt-dependent. For example, it seems to work in one direction but not another: the input “Rome has a tower called ...” does not yield “The Eiffel Tower” as an output.
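To illustrate the kind of directional test at issue, here is a rough sketch using the Hugging Face transformers library. A stock GPT-2 model stands in for the relevant edited model (in practice one would run the test on a model whose Eiffel Tower fact has been edited to Rome, as in the editing experiments discussed above), so the outputs here only illustrate the procedure, not the reported result.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"   # placeholder; the test is meant for the relevant edited model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def continue_text(prompt: str, max_new_tokens: int = 8) -> str:
    """Greedy continuation of a prompt, used to probe what the model 'says'."""
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Forward direction: does the (edited) fact drive completions about the Eiffel Tower?
print(continue_text("The Eiffel Tower is located in the city of"))
# Reverse direction: does the same fact surface when prompted from the other side?
print(continue_text("Rome has a famous tower called the"))
```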
One much-discussed unit appears to be devoted to the Golden Gate Bridge. It is triggered especially by text passages mentioning the bridge and by pictures of the bridge. Furthermore, when activity corresponding to this unit is amplified, Claude starts talking obsessively about the Golden Gate Bridge.
This raises the intriguing possibility that we can use sparse auto-encoders for feature logging and perhaps for concept logging. We need only connect the sparse auto-encoder to Claude’s residual stream while it is going about its ordinary business of answering questions. With every input token, we can run the auto-encoder, see which features are active, and log them in our logbook.
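As a rough sketch of what such a feature-logging loop might look like: the code below reads residual-stream activations from a small open model (GPT-2, standing in for Claude) and applies a sparse auto-encoder encoder at each token position. The auto-encoder weights here are random placeholders; in a real application they would come from a trained sparse auto-encoder, and the choice of layer and activation threshold are arbitrary assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder sparse auto-encoder encoder: one linear map plus ReLU.
# In a real application the weights would come from a trained sparse
# auto-encoder; here they are random, purely to show the logging loop.
d_model, n_features = 768, 4096
W_enc = torch.randn(d_model, n_features) / d_model ** 0.5
b_enc = torch.zeros(n_features)

def sae_encode(resid: torch.Tensor) -> torch.Tensor:
    return torch.relu(resid @ W_enc + b_enc)

tok = AutoTokenizer.from_pretrained("gpt2")   # GPT-2 stands in for the target model
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

LAYER, THRESHOLD = 6, 5.0   # which residual stream to read, and what counts as "active"
feature_log = []

text = "The Golden Gate Bridge is a famous landmark."
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).hidden_states[LAYER][0]   # shape: [seq_len, d_model]

for pos, resid in enumerate(hidden):
    acts = sae_encode(resid)
    active = (acts > THRESHOLD).nonzero().flatten().tolist()
    feature_log.append({
        "token": tok.decode([int(inputs["input_ids"][0][pos])]),
        "active_features": active,
    })

for entry in feature_log:
    print(entry)
```

A log of this sort records only which features are active; turning it into a concept log or a thought log would further require an interpretation of what each feature represents.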
Furthermore, chain of thought outputs typically come packaged in a propositional form in a natural language, so that in a sense they are “pre-interpreted”. In some cases attitudes such as goals and probabilities may be included as well. In the best case scenario, chains of thought produced in this way could serve as a sort of thought logging.
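As a toy sketch of that best-case scenario, the snippet below takes a hard-coded, hypothetical chain of thought and converts each step into a thought-log entry, using a deliberately naive rule to guess the attitude type. Everything here, from the chain of thought itself to the parsing rule, is an illustrative assumption.

```python
import re

# A chain of thought produced by a reasoning model (placeholder text here);
# each step is already in propositional, natural-language form.
chain_of_thought = """\
1. The bridge in the photo has two tall towers and is painted red-orange.
2. The Golden Gate Bridge is the most famous bridge with that appearance.
3. Therefore, the photo probably shows the Golden Gate Bridge."""

# Naive conversion of the chain into thought-log entries: each numbered step
# becomes a logged attitude, with "probably" taken as a crude cue for credence.
thought_log = []
for i, line in enumerate(chain_of_thought.splitlines()):
    proposition = re.sub(r"^\d+\.\s*", "", line)
    attitude = "credence" if "probably" in proposition.lower() else "belief"
    thought_log.append({"step": i, "attitude": attitude, "proposition": proposition})

for entry in thought_log:
    print(entry)
```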
Because of this pre-interpretation, chain-of-thought models avoid some of the problems of other probing methods. But unsurprisingly, they have some serious limitations of their own. The most important limitation is that chains of thought are often unfaithful: that is, they are inaccurate reflections of internal processes. For example, results by Turpin et al. in “Language Models Don’t Always Say What They Think” (2023) suggest that chains of thought often make false claims about the reasons why the model has said something. In addition, chains of thought are likely to be highly incomplete as a reflection of a model’s internal processes and to omit key propositional attitudes.
Another limitation arises from restricted generality. Chains of thought will typically serve as a method of propositional interpretability only for chain-of-thought systems: systems that use chains of thought for reasoning. For systems that do not themselves use chains of thought, any chains of thought that we generate will play no role in the system. Once chains of thought are unmoored from the original system in this way, it is even more unclear why they should reflect it. Of course we could try to find some way to train a non-chain-of-thought system to make accurate reports of its internal states along the way – but that is just the thought-logging problem all over again, and chains of thought will play no special role.
AI systems don’t have propositional attitudes. One natural objection to the whole project is to say that AI systems can’t have propositional attitudes. Perhaps this is because there is some X such that X is required for propositional attitudes and AI systems lack X: perhaps X = consciousness, or free will, or concepts, or understanding. Or perhaps it is just because propositional attitudes are mental states and AI systems have no mental states because they lack minds.22
As I suggested earlier, objections of this sort can be evaded by adopting a project of nonmentalistic interpretability: understand (generalized) propositional attitudes in such a way that they don’t require minds. There is clearly some sense in which AI systems (thermostats too) have goals and representations, even if they don’t have beliefs, desires, consciousness, free will, and the rest. We can stipulate a notion of generalized propositional attitude that doesn’t have these demanding requirements.
I am an explanatory pluralist: I think that there are typically multiple explanations of things that need explaining. So I am certainly not arguing that propositional attitudes offer the unique best explanation of AI systems’ actions. An algorithmic explanation will often be superior to a propositional-attitude explanation in its predictive powers. My claim is simply that propositional-attitude explanations are useful for many purposes, and have some explanatory virtues that algorithmic explanations lack.
Externalism makes propositional interpretability difficult. According to the most popular psychosemantic theories, the content of mental states depends on a system’s environment.