C.3 Agent Foundations: Overview
Why agent foundations?
We are trying to align a superintelligence that does not yet exist, under conditions that may look nothing like anything we have observed. This means we need robust formal concepts of agency: concepts that will not break down when subjected to intense optimization pressure or when applied far out of distribution. Agent foundations is the study of such concepts — it aims to characterize the properties of intelligence in general, rather than any specific instantiation of intelligence, so that we can reason about the behavior of agents we have never encountered and may never be able to simulate.
Unlike in most of science, we cannot rely on trial and error: we need to get alignment right on the first critical try when operating at a dangerous level of intelligence, because unaligned operation at that level may be catastrophic, and we do not get to iterate. This rules out the usual scientific methodology, which assumes many retries and a generous time budget, and demands theoretical understanding in advance of deployment. Even if we try to learn about superintelligent agents empirically from weaker systems, we still need a theory of which properties generalize to greater intelligence, and that is itself a theory in agent foundations.
A superintelligent agent will be capable of rewriting its own source code, building smarter successor systems, and radically revising its world model. Any safety property we instill is irrelevant if the agent can self-modify to remove it or construct a successor that lacks it. For a sufficiently capable goal-directed agent, nothing sticks around unless it has to: every property, architecture, and constraint is subject to being optimized away. Agent foundations therefore studies which properties are reflectively stable: properties that an agent would preserve under self-modification, not because we externally enforce them, but because the agent itself has no reason to discard them. The flip side is that reflectively stable properties do not self-correct. An agent will refuse modifications that would change its values, whether those values are good or bad, whereas a false empirical belief tends to be corrected as the agent gets smarter.
This challenge is compounded by Goodhart’s law: when we specify a proxy objective \(U\) that correlates with the true goal \(V\), optimizing \(U\) hard enough will exploit the gap between proxy and goal, selecting outcomes that score highly on \(U\) while diverging from \(V\). The harder you optimize, the worse the divergence. Since a superintelligence will impose extreme optimization pressure on whatever objective it is given, agent foundations seeks “true names” — formal characterizations of optimization, goals, world-models, and embeddedness that constitute the right concepts rather than brittle proxies that break under pressure.
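The proxy/goal gap can be illustrated with a toy simulation. This is only a sketch under assumptions that are not in the readings: the true goal \(V\) and the proxy error are independent standard Gaussians, \(U = V + \text{error}\), and "optimizing harder" means picking the best of more candidates by \(U\). The function name `goodhart_gap` is hypothetical. In this particular model the true value still rises with selection pressure, but the proxy overstates it by a growing margin; other error models can behave differently.

```python
import random


def goodhart_gap(n_candidates, n_trials=500, seed=0):
    """Average (proxy U, true value V, gap U - V) of the candidate that maximizes U."""
    rng = random.Random(seed)
    total_u = 0.0
    total_v = 0.0
    for _ in range(n_trials):
        best_u, best_v = float("-inf"), 0.0
        for _ in range(n_candidates):
            v = rng.gauss(0.0, 1.0)       # true goal V (assumed Gaussian)
            u = v + rng.gauss(0.0, 1.0)   # proxy U = V + independent error (assumed)
            if u > best_u:
                best_u, best_v = u, v
        total_u += best_u
        total_v += best_v
    mean_u = total_u / n_trials
    mean_v = total_v / n_trials
    return mean_u, mean_v, mean_u - mean_v


if __name__ == "__main__":
    # More candidates = more optimization pressure on the proxy.
    for n in (1, 10, 100):
        mean_u, mean_v, gap = goodhart_gap(n)
        print(f"n={n:4d}  proxy U={mean_u:6.3f}  true V={mean_v:6.3f}  gap={gap:6.3f}")
```

Selecting on \(U\) also selects for whatever error happens to inflate \(U\), which is why the gap widens as the candidate pool grows.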
A complementary perspective comes from descriptive agent foundations, which contrasts with the normative (or ideal) approach in both methodology and robustness requirements. Normative agent foundations starts from the agent’s subjective point of view, asking how a perfectly rational agent should reason and act in principle, and derives consequences from idealized desiderata. Descriptive agent foundations instead takes a world-centric view: it starts from properties of the physical world — modularity, selection pressures, thermodynamic constraints — and asks what kinds of optimization processes these properties give rise to. The aim is to look at any physical system, from a bacterium to a neural network to a future superintelligence, and identify its goals, world-model, and decision-making structure. Where normative foundations demands robustness to scaling up (our concepts should not break when the agent becomes vastly more capable than us), descriptive foundations additionally demands robustness to scaling down: the same formal framework should apply to agents of any complexity, degrading gracefully when applied to simpler systems.
Reading session
Instructions
Before Agent Foundations Day, sign up for one pre-reading topic. Slots are limited per topic, so pick whichever sounds most interesting, subject to availability. Read your chosen topic before the day, following the instructions provided. The readings within each topic are arranged roughly in chronological order, with earlier pieces tending to be more conceptual and accessible while some of the later ones are more technical. If you are short on time, it is fine to prioritize building a clear intuitive picture from the earlier readings rather than diving deeply into all technical details.
On Agent Foundations Day itself, everyone will first read the Fundamental Readings (you may also skim these beforehand), which provide a bird’s-eye view tying the pre-reading topics together. Afterwards, you’ll discuss in small groups with people who read different pre-reading topics. The goal is to find connections between your topics and develop a shared picture of how they relate to embedded agency more broadly. The Cross-Pollination Prompts later in this document offer some suggested starting points, but feel free to go beyond them.
Fundamental readings
- Embedded agency — read up to and including section 3.3, then read section 4.1
- Why agent foundations — read entirely
- Reflectively consistent degree of freedom — read entirely
- General purpose search — read entirely
Pre-reading topics
Consequentialist foundations
One of the central challenges in aligning a future AI is that we have to reason about systems that don’t yet exist, potentially far more capable than anything we’ve seen. Coherence arguments give us an initial foothold. The basic insight is that if an agent’s preferences are inconsistent — say, it prefers A to B, B to C, and C to A — a clever adversary can cycle it through trades that each look locally acceptable but leave it strictly worse off, wasting resources for no gain. Any agent that reliably avoids such “dominated strategies” must behave as if it has consistent preferences representable by a utility function. The Complete Class Theorem sharpens this: any decision strategy that is Pareto-optimal across all possible environments must be equivalent to maximizing expected utility under some probability distribution. Together, these results suggest that sufficiently capable, non-self-defeating agents will generically look like Bayesian expected utility maximizers from the outside.
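The money-pump argument can be made concrete with a small sketch. The function `run_money_pump`, the fee, and the starting wealth are illustrative assumptions, not taken from the readings. The agent accepts any trade to an item it strictly prefers to the one it holds, paying a small fee each time.

```python
def run_money_pump(prefers, start_item, offers, fee=1.0, wealth=10.0):
    """Offer items in sequence; the agent pays `fee` to swap whenever it
    strictly prefers the offered item to the one it currently holds."""
    item = start_item
    for offered in offers:
        if (offered, item) in prefers:  # (x, y) in prefers means "x is preferred to y"
            item = offered
            wealth -= fee
    return item, wealth


# Cyclic preferences: A over B, B over C, and C over A.
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}
# Transitive preferences: A over B, B over C, and therefore A over C.
transitive = {("A", "B"), ("B", "C"), ("A", "C")}

offers = ["B", "A", "C"] * 3  # the adversary repeats the same three offers

if __name__ == "__main__":
    print("cyclic:    ", run_money_pump(cyclic, "C", offers))
    print("transitive:", run_money_pump(transitive, "C", offers))
```

The cyclic agent ends up holding the item it started with after paying for every trade, while the transitive agent pays only until it reaches its most-preferred item and then declines all further offers.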
Reading list:
- Coherent decisions imply consistent utilities — read the following sections:
  - “Introduction”, “Why not circular preferences?”
  - “Probabilities and expected utilities” up to and including the “Conditional probability” subsection
  - “Conclusion”
- The measuring stick of utility — read entirely
- Complete class: Consequentialist foundations — read entirely
Löb’s theorem and tiling agents
Any sufficiently advanced AI may eventually be able to modify its own code or build successor systems more capable than itself. But this raises a subtle problem: if the successor is genuinely smarter, the original agent cannot predict exactly what it will do. This means the original agent cannot verify that the successor is safe by simply simulating it — instead, it must reason abstractly about the properties of the successor’s design, guaranteeing that any action the successor takes will fall within acceptable bounds, without knowing which specific action that will be. Tiling agent research formalizes this challenge in the language of mathematical logic, where it immediately runs into a fundamental obstacle. A theorem known as Löb’s theorem implies that a formal system cannot, in general, trust the reasoning of another system of equal or greater logical strength — it can only trust strictly weaker systems. This creates a problem for self-improvement: an agent trying to verify its successor’s reasoning appears forced to use a less powerful proof system at each step, making indefinitely long chains of safe self-modification seemingly impossible.
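For reference, here is the standard statement of Löb’s theorem. The notation is an assumption of this note rather than that of the readings: \(L\) is a sufficiently strong formal system (for example, one extending Peano arithmetic), and \(\Box\phi\) abbreviates the sentence in the language of \(L\) asserting that \(\phi\) is provable in \(L\).

\[
\text{If } L \vdash \Box\phi \to \phi, \text{ then } L \vdash \phi .
\]

The same fact can be stated inside the system:

\[
L \vdash \Box(\Box\phi \to \phi) \to \Box\phi .
\]

So \(L\) can prove “if \(\phi\) is provable then \(\phi\) is true” only for sentences \(\phi\) that it already proves outright. Taking \(\phi = \bot\) recovers Gödel’s second incompleteness theorem: if \(L \vdash \neg\Box\bot\), then \(L \vdash \bot\). This is the sense in which a system cannot, in general, endorse the soundness of reasoning as strong as its own.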
Reading list:
- Introduction to Löb’s theorem — read up to and including section 3
- Vingean reflection — read entirely
- Walkthrough of tiling agents (Löb notes) — read from the start up to and including “Finite Descent Problem”, then read the “What self-modifying agents need” section
Logical induction
Traditional Bayesian reasoning assumes logical omniscience: the agent already knows all the consequences of its beliefs and can instantly perform any computation at no cost. But a computationally bounded agent cannot do this — for instance, it may know the full source code of a program yet still be uncertain what the program outputs, or know the axioms of arithmetic yet be uncertain whether a given large number is prime. Logical induction extends the Dutch book argument underlying Bayesianism to handle this kind of logical uncertainty. Rather than requiring that no Dutch book exists at all, the logical induction criterion requires only that no efficiently computable trading strategy can exploit the agent for unbounded profit. This weaker but computationally realistic requirement turns out to be sufficient to derive many desirable properties: prices converge to coherent probabilities in the limit, the agent learns to respect logical relationships in a timely manner, and it assigns appropriate probabilities to statements that are computationally hard to evaluate even when their truth is in principle determined.
Reading list:
- An Intuitive Guide to Garrabrant Induction — read entirely
- Logical induction (full paper) — read Chapters 1 and 3, skim Chapter 4
Decision theory
For a dualistic agent with a well-defined environment, optimization is straightforward: the agent simply picks whichever action maximizes expected utility. The problem for embedded agents is that their action is just another fact about the world, so there is no well-defined notion of “what would happen if I took a different action.” Different decision theories correspond to different ways of constructing these counterfactuals: conditioning on the action as evidence (EDT), intervening causally on it (CDT), or reasoning about the logical consequences of being the kind of agent that runs a given decision procedure (FDT). Standard decision theories fail in characteristic ways when the agent faces other agents that can predict it, or when its action is logically correlated with the rest of the world. A particularly important desideratum is reflective consistency: an agent reflecting on its own decision procedure should endorse it, in the sense that it would not wish to self-modify into an agent running a different procedure. This matters because a decision theory is a reflectively consistent degree of freedom: an agent using a “bad” decision theory will not automatically correct this flaw as it becomes more capable, in the way it might automatically correct a mistaken factual belief. Functional and updateless decision theories have been developed to address these failure modes, with the aim of specifying a decision procedure that an agent would reflectively endorse and stably preserve across self-modification.
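To see how conditioning and intervening come apart, consider Newcomb’s problem. The sketch below uses the conventional payoffs ($1,000,000 in the opaque box if one-boxing was predicted, $1,000 in the transparent box) and a predictor accuracy of 0.99. These numbers and the function names are illustrative assumptions, and FDT is not modeled here.

```python
ACTIONS = ("one-box", "two-box")


def edt_value(action, accuracy, big=1_000_000, small=1_000):
    """EDT: treat the action as evidence about what the predictor foresaw."""
    # P(opaque box is full | action): the predictor is right with prob `accuracy`.
    p_full = accuracy if action == "one-box" else 1.0 - accuracy
    return p_full * big + (small if action == "two-box" else 0)


def cdt_value(action, p_full, big=1_000_000, small=1_000):
    """CDT: the box contents are already fixed, so intervening on the
    action leaves the probability `p_full` unchanged."""
    return p_full * big + (small if action == "two-box" else 0)


def best_action(value_fn, **kwargs):
    """Return the action with the highest value under `value_fn`."""
    return max(ACTIONS, key=lambda a: value_fn(a, **kwargs))


if __name__ == "__main__":
    print("EDT:", best_action(edt_value, accuracy=0.99))
    for p in (0.0, 0.5, 1.0):
        print(f"CDT with P(full)={p}:", best_action(cdt_value, p_full=p))
```

Under these assumptions EDT one-boxes, because the action shifts the conditional probability that the opaque box is full. CDT two-boxes for every fixed belief about the contents, because taking both boxes adds the small prize without causally affecting the large one.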
Reading list:
- Functional decision theory — read Chapters 1–5
  - Alternatively, read the “An intuitive introduction to functional decision theory” sequence if you find the above confusing
- Updateless decision theory — read entirely
  - You may reference this post for more details
Optimization and thermodynamics
A useful way to characterize powerful agents is that they reliably steer the world into a narrow region of outcome space — one that would be extremely unlikely to arise under any random process. Intuitively, optimization is a form of local entropy reduction: an agent concentrating probability mass from a broad initial distribution over possible states into a tight final distribution around a convergent target. This connects naturally to thermodynamics; many of the concepts relevant to optimization such as entropy production, fluctuation theorems, and the thermodynamic cost of information erasure are already well-developed in frameworks such as stochastic thermodynamics, and hold far from equilibrium. Algorithmic thermodynamics (Ebtekar and Hutter) extends this by replacing Gibbs-Shannon entropy with Kolmogorov complexity, yielding laws that apply to individual states rather than ensembles. Much of the value here, especially for readers with a physics background, is less in new results than in translation: reframing familiar thermodynamic intuitions in the language of information theory connects them to concepts native to agent foundations, such as world-modeling, optimization power, and the physical constraints on embedded agents.
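The "concentrating probability mass" picture can be put in numbers with a toy calculation. It is a sketch under assumptions of this note only: a finite state space of 1024 states, a uniform baseline distribution, and a final distribution spread evenly over 4 target states. The helper names are hypothetical.

```python
import math


def entropy_bits(dist):
    """Shannon entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def optimization_bits(baseline, target_states):
    """Bits of optimization: -log2 of the baseline probability of the target region."""
    return -math.log2(sum(baseline[s] for s in target_states))


n = 1024
baseline = [1.0 / n] * n              # broad initial distribution (assumed uniform)
final = [0.25] * 4 + [0.0] * (n - 4)  # mass concentrated on 4 target states

if __name__ == "__main__":
    print("initial entropy (bits):", entropy_bits(baseline))
    print("final entropy (bits):  ", entropy_bits(final))
    print("optimization (bits):   ", optimization_bits(baseline, range(4)))
```

With a uniform baseline, the entropy reduction and the bits of optimization coincide. In general they are different quantities, and a real physical system must export at least as much entropy elsewhere as it removes locally.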
Reading list:
- The ground of optimization — read up to and including the “Relationship to Garrabrant and Demski’s Embedded Agency” section
- Generalized heat engine — read entirely
- Algorithmic thermodynamics and three types of optimization — read entirely
- Optional: When bits of optimization imply bits of modeling
Descriptive agent foundations
While much of agent foundations focuses on normative questions about what an ideal, perfectly rational agent should look like, descriptive agent foundations instead asks what kinds of agents actually arise in the physical world and how to recognize them. The goal is to develop a framework that can take any system, from bacteria to neural networks to future AI systems, and identify its goals, world-model, and decision-making structure. This involves formalizing core components of agency such as optimization (processes that reliably steer the world toward a narrow set of outcomes), world-models, and general-purpose search, while grounding them in physical constraints like modularity and computational limitations. Methodologically, descriptive agent foundations works bottom-up: rather than starting from ideal principles, it begins with properties of the world such as selection pressures and resource bounds, and derives what kinds of agent-like structures are likely to emerge. Selection theorems aim to formalize which agent properties are favored by these pressures, with a focus on giving a more mechanistic account of beliefs, goals, and decision-making beyond treating agents as abstract utility maximizers. Together, this line of research seeks to characterize the type signature of real-world agents and explain why agents with certain structures arise in the first place.
Reading list:
- Selection theorems — read entirely
- How we picture Bayesian agents — read entirely
- What selection theorems do we expect/want — read entirely
Cross-pollination prompts
Logical induction × Löb’s theorem & tiling agents. The self-trust property (Section 4.12 of logical induction) says that a logical inductor’s current credence in any sentence is a weighted average of its expected future credences — it defers to its future self without needing to witness the reasoning trace of its future self. The tiling agent’s core desideratum is similar: a parent agent should be able to trust that its successor will act well without inspecting individual actions. Both involve a system placing trust in a future or downstream version of itself. Discuss the relationship between these two notions of self-trust.
Löb’s theorem × decision theory. Löb’s theorem constrains what a proof system can prove about its own soundness. Reflective consistency requires that an agent would endorse its own decision procedure if it could choose one from scratch. FDT requires reasoning about the logical consequences of being “the kind of agent that runs this procedure”, which involves the agent reasoning about its own source code. Discuss what these three ideas have in common regarding self-reference.
Consequentialist foundations × optimization & thermodynamics. The ground of optimization and generalized heat engine describe optimization as a process that produces low-entropy (near-deterministic) outputs by moving uncertainty elsewhere — analogous to a heat engine exploiting a temperature gradient. The coherence theorems describe rational agency as EU maximization over uncertain outcomes. Both frameworks are trying to characterize what it means to pursue goals under uncertainty. Discuss how these two pictures of goal-directed behavior relate to each other. Can the thermodynamic picture offer a “measuring stick of utility”?
Logical induction × consequentialist foundations. Logical uncertainty might seem cleanly separable from empirical uncertainty, but are there cases where the two are entangled? (E.g., does learning the output of a program whose source code you already know ever lead you to update on empirical facts about the world? Do new observations ever reduce your uncertainty about logical facts?)
Logical induction × decision theory. Standard decision theory assumes agents have well-defined credences over empirical states of the world, but many core decision-theoretic scenarios involve uncertainty over logical facts: whether a particular program halts, what a Newcomb predictor has computed, what other instances of one’s own decision procedure will output. Discuss what decision-theoretic problems relate to problems of logical uncertainty, and how the logical induction framework bears on them.
Optimization & thermodynamics × descriptive agent foundations. The generalized heat engine describes optimization as moving uncertainty around — producing low-entropy outputs by exploiting entropy gradients elsewhere — with the second law as a hard physical constraint on what is possible. Selection theorems ask what agent type signatures are selected for by optimization processes like evolution or ML training, treating the environment as the external constraint. Discuss how these two perspectives relate to each other, e.g. what constraints might thermodynamics impose on the selection environment?
Consequentialist foundations × descriptive agent foundations. The complete class theorem is representational: it shows that any agent avoiding dominated strategies can be described as an EU maximizer, without making claims about its internal structure. Descriptive agent foundations and selection theorems take the complementary direction, asking what agents actually are and what pressures give rise to them. Discuss the relationship between the representational and mechanistic perspectives on agency.
Consequentialist foundations × logical induction. The Dutch book argument underlying coherence theorems assumes the agent can assign a coherent probability to any bet, but this implicitly presupposes logical omniscience, since computing expected utilities requires knowing the logical consequences of one’s beliefs. The logical induction criterion is motivated by an analogous “no polynomial-time Dutch book” argument, extended to cover logical uncertainty. Discuss how logical uncertainty interacts with the foundations of coherence arguments.
Exercises
After the reading block, move to the exercises page for the afternoon session. All exercises — logical uncertainty and self-reference, coherence and consequentialism, and descriptive agent foundations — are collected in a single sheet, grouped by topic with the topic-specific setup and hints.
Prerequisites
If you want a single read-ahead document, see the shared prerequisites refresher.
Probability theory refresher
The main probability facts used here are:
- conditional probability and Bayes’ theorem;
- marginalization, i.e. summing out variables from a joint distribution;
- independence and conditional independence;
- the chain rule for probabilities.
For this session, the most important practical skill is being comfortable moving between joint, marginal, and conditional distributions.
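As a quick reference, the facts listed above can be written out as follows for discrete variables. The notation is chosen for this note and may differ from the shared refresher.

\[
P(A \mid B) = \frac{P(A, B)}{P(B)}, \qquad
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \quad \text{(Bayes’ theorem, for } P(B) > 0\text{)}
\]

\[
P(x) = \sum_{y} P(x, y) \quad \text{(marginalization)}
\]

\[
X \perp Y \mid Z \iff P(x, y \mid z) = P(x \mid z)\, P(y \mid z) \text{ for all } x, y, z \text{ with } P(z) > 0 \quad \text{(conditional independence)}
\]

\[
P(x_1, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_1, \dots, x_{i-1}) \quad \text{(chain rule)}
\]

Moving between joint, marginal, and conditional distributions amounts to applying the first two lines in either direction.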
For a fuller version, see the shared prerequisites refresher.
Formal logic refresher
The formal-logic prerequisites are mostly lightweight:
- implication, negation, contradiction, and quantifiers;
- direct proof, contrapositive, and contradiction;
- the meaning of \(L \vdash \phi\), i.e. provability inside a formal system.
The self-reference material later in the guide depends heavily on the distinction between proving \(\phi\) and proving that \(\phi\) is provable.
For a fuller version, see the shared prerequisites refresher.
Computability theory refresher
You do not need a full theory course here. The main background is:
- what it means for a program or Turing machine to halt;
- the difference between decidable and recursively enumerable sets;
- the halting problem as the canonical undecidable problem;
- diagonalization and self-reference as proof techniques.
This is the background behind both Gödel-style incompleteness results and reflective self-reference problems.
For a fuller version, see the shared prerequisites refresher.
Information theory, causality, and statistical mechanics refresher
For the descriptive-agent-foundations material, the main ingredients are:
- entropy, KL divergence, and mutual information;
- Bayesian networks, d-separation, and the difference between conditioning and intervention;
- the high-level thermodynamic picture that low-entropy states are atypical and require explanation.
You do not need heavy statistical mechanics machinery to follow the session. The important conceptual link is that optimization can often be understood as steering toward low-entropy outcomes, and that steering is constrained by information.
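For reference, the three information-theoretic quantities listed above have the following standard definitions for discrete distributions. The notation is chosen for this note; logarithms are base 2 when measuring in bits.

\[
H(X) = -\sum_{x} P(x) \log P(x)
\]

\[
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
\]

\[
I(X; Y) = \sum_{x, y} P(x, y) \log \frac{P(x, y)}{P(x)\, P(y)} = H(X) - H(X \mid Y) = D_{\mathrm{KL}}\big(P_{XY} \,\|\, P_X P_Y\big)
\]

The last identity is one way to read the claim that steering is constrained by information: \(I(X; Y)\) measures how much knowing \(Y\) reduces uncertainty about \(X\).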
For a fuller version, see the shared prerequisites refresher.