C.3 Agent Foundations: Logical Uncertainty and Self-Reference
Logical uncertainty and self-reference
Reading: Logical induction
Reading: Vingean reflection
Reading: Löb’s theorem and the tiling agents problem
Setup: Provability and proof search
The following are exercises on agent foundations. Each problem is broken into a sequence of lemmas leading to a main theorem. For each subquestion, try to prove the stated lemma before reading on. If you get stuck, you may treat the lemma as given and proceed to the next part.
Before diving into a formal derivation, try to build an intuition for why the statement should be true. Even if you do not complete the proof, a clear picture of what is going on is more valuable than a mechanical derivation you do not understand.
Formal systems and programs. Fix a formal system \(L\) that is powerful enough to reason about programs. We write \(L \vdash \varphi\) to mean that the statement \(\varphi\) is provable in \(L\).
We say \(L\) is consistent if it never proves a contradiction. Write \(\bot\) for a fixed contradiction, so consistency means \(L \nvdash \bot\).
By a program we mean a mechanical procedure that may halt or run forever. Since \(L\) can reason about programs, it can express the statement “program \(M\) halts,” which we write as \(\mathsf{Halts}(M)\).
The bridge between \(L\) and programs.
- From programs to \(L\): if a program actually halts, then \(L\) can verify the execution trace step by step and prove that it halts.
- From \(L\) to programs: because proofs are finite checkable strings, we can search for them mechanically.
For any statement \(P\), define
\[ \texttt{ProofSeeker}(P) := \text{"try every string; halt iff one is a valid } L\text{-proof of } P\text{."} \]
This program halts if and only if \(P\) is provable in \(L\).
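To make the idea concrete, here is a toy sketch (our own illustration, not a real formal system): a statement asserts "\(n\) is composite," and a "proof" is a nontrivial factor pair, which a checker can verify mechanically. The seeker enumerates candidate proofs and halts exactly when the statement is provable, i.e. when \(n\) really is composite. The names `is_valid_proof` and `proof_seeker` are ours.

```python
from itertools import count

def is_valid_proof(n, proof):
    """Toy proof checker: a 'proof' that n is composite is a pair (a, b)
    with 1 < a, b < n and a * b == n. Checking is purely mechanical."""
    a, b = proof
    return 1 < a < n and 1 < b < n and a * b == n

def proof_seeker(n, max_steps=None):
    """Enumerate every candidate proof in a fixed order; halt iff one
    checks out. With max_steps=None this loops forever on unprovable
    statements, mirroring ProofSeeker(P)."""
    steps = 0
    for total in count(4):                 # enumerate pairs by a + b
        for a in range(2, total - 1):
            candidate = (a, total - a)
            if is_valid_proof(n, candidate):
                return candidate           # halted: a proof exists
            steps += 1
            if max_steps is not None and steps >= max_steps:
                return None                # gave up (demonstration only)

print(proof_seeker(15))                  # finds the factor pair (3, 5)
print(proof_seeker(13, max_steps=10_000))  # no proof exists: None
```

For a prime input the genuine seeker would run forever; the `max_steps` guard exists only so the demonstration terminates.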
The provability predicate \(\Box P\). We write \(\Box P\) for the statement, expressed inside \(L\), that “\(P\) is provable in \(L\).” Since \(L\) can reason about programs, it can reason about whether \(\texttt{ProofSeeker}(P)\) halts.
The key distinction is:
- \(L \vdash P\): there exists a concrete proof of \(P\) in \(L\).
- \(L \vdash \Box P\): \(L\) has proved that some proof of \(P\) exists.
These are not the same thing. A proof that \(\texttt{ProofSeeker}(P)\) halts may reason abstractly about the program without containing an actual proof of \(P\). From the outside we can conclude that if \(\texttt{ProofSeeker}(P)\) halts, then a proof of \(P\) exists. The point of Löb’s theorem is that \(L\) cannot, in general, close that gap internally.
Exercise 1: Gödel’s second incompleteness theorem
🔵🔵🔵🔵🔵 Importance
🔴🔴🔴🔴⚫️ Difficulty
Key fact. If a program \(M\) actually halts, then \(L\) can verify this by checking the execution trace step by step, so \(L \vdash \mathsf{Halts}(M)\). If \(M\) runs forever, \(L\) need not be able to prove \(\neg \mathsf{Halts}(M)\).
No consistent formal system can correctly settle the halting question for every program: if it could, we could search for a proof of either \(\mathsf{Halts}(M)\) or \(\neg \mathsf{Halts}(M)\) for each \(M\) and thereby decide the halting problem.
A self-referencing program. Define
\[ Z(A) := \text{"search for an } L\text{-proof of } \neg\mathsf{Halts}(A(A))\text{; halt if one is found."} \]
Here \(A(A)\) means running program \(A\) on its own source code. Now feed \(Z\) its own source code and define
\[ G := \neg\mathsf{Halts}(Z(Z)). \]
So \(G\) says “the program \(Z(Z)\) runs forever.”
Part 1(a). Show that \(G\) is true, assuming \(L\) is consistent.
Hint for Part 1(a)
Consider two cases. Either \(Z(Z)\) halts or it does not. In each case, use the bridge between programs and proofs:
- If a program halts, then \(L\) can prove that it halts.
- If \(L\) proves something, then the corresponding proof-search program eventually finds that proof and halts.
- One of the two cases contradicts consistency.
Part 1(b). Show that if \(L \vdash \neg \Box \bot\) (that is, \(L\) proves its own consistency), then \(L\) is inconsistent.
Hint for Part 1(b)
In Part 1(a), your outside argument used only one real assumption about \(L\): consistency. If \(L\) can prove its own consistency, then it can formalize that same argument inside itself.
Further hint for Part 1(b)
Formalize the implication “if \(L\) is consistent, then \(G\) holds.” If \(L \vdash \neg \Box \bot\), then \(L\) can derive \(G\) internally. But then the corresponding proof-search program finds a proof of \(G\), so \(Z(Z)\) halts, giving the contradiction.
Part 1(c). Suppose \(L \vdash \Box P \to P\) for all \(P\). First show that \(L\) proves that it never proves a statement it refutes, i.e. \(L \vdash \neg\Box P\) for all \(P\) such that \(L \vdash \neg P\). Then conclude that \(L \vdash \neg\Box\bot\), and hence \(L\) is inconsistent by Part 1(b).
Hint for Part 1(c)
Use the contrapositive of \(\Box P \to P\). Then plug in \(P = \bot\).
Further hint for Part 1(c)
If \(L \vdash \neg P\), then from \(\Box P \to P\) you get \(L \vdash \neg \Box P\). Since \(\neg \bot\) is a tautology, applying this with \(P = \bot\) yields \(L \vdash \neg \Box \bot\), so Part 1(b) finishes the argument.
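Spelled out for the instance \(P = \bot\), the whole chain is three lines:
\begin{align*}
L &\vdash \Box\bot \to \bot && \text{instance of the hypothesis}\\
L &\vdash \neg\bot && \text{propositional tautology}\\
L &\vdash \neg\Box\bot && \text{from the two lines above, by propositional logic}
\end{align*}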
Remark (Self-trust, tiling agents, and the Löbian obstacle). A parent agent that wants to trust a smarter successor cannot simulate every future proof or action. It wants to argue abstractly that whenever the successor proves some safety claim, that claim is actually true. But when both agents use the same formal system, that amounts to asking \(L\) to prove \(\Box P \to P\) for all \(P\), and Part 1(c) shows that this is inconsistent. A consistent system can trust a proof it has directly witnessed, but it cannot vouch for all of its own future proofs in the abstract.
One possible workaround is to let the parent use a strictly stronger proof system than the child. Stronger systems can trust weaker ones. But then each generation must weaken, so a long self-improvement chain burns through its stock of trust. This is one version of the tiling problem.
Exercise 2: Löb’s theorem
🔵🔵🔵🔵🔵 Importance
🔴🔴🔴🔴⚫️ Difficulty
Löb’s theorem says: if \(L \vdash \Box C \to C\), then \(L \vdash C\). We will use the following standard properties of the provability predicate:
| (N) | Necessitation. | If \(L \vdash \varphi\) then \(L \vdash \Box\varphi\). |
| (K) | Distribution. | \(L \vdash \Box(\varphi \to \psi) \to (\Box\varphi \to \Box\psi)\). |
| (4) | Löb condition. | \(L \vdash \Box\varphi \to \Box\Box\varphi\). |
Part 2(a). Understanding Necessitation. Explain why Necessitation is true from the point of view of \(\texttt{ProofSeeker}\).
Hint for Part 2(a)
If a proof of \(\varphi\) exists, then \(\texttt{ProofSeeker}(\varphi)\) eventually finds it and halts.
Part 2(b). Understanding Distribution. Explain why Distribution is true from the point of view of \(\texttt{ProofSeeker}\).
Hint for Part 2(b)
If both \(\texttt{ProofSeeker}(\varphi \to \psi)\) and \(\texttt{ProofSeeker}(\varphi)\) halt, what can you do with their outputs? How does that let \(L\) conclude that \(\texttt{ProofSeeker}(\psi)\) halts?
We now prove Löb’s theorem. The key ingredient is a self-referential sentence \(\lambda\) such that
\[ L \vdash \lambda \leftrightarrow (\Box\lambda \to C). \]
In words, \(\lambda\) says: “if I am provable, then \(C\).”
Part 2(c). Show that \(L \vdash \Box\lambda \to \Box C\).
Hint for Part 2(c)
Start from the implication \(\lambda \to (\Box\lambda \to C)\), apply Necessitation once, then use Distribution twice. The remaining \(\Box\Box\lambda\) is where the Löb condition enters.
Further hint for Part 2(c)
Three steps:
- From \(L \vdash \lambda \to (\Box\lambda \to C)\), apply (N) (necessitation) to get \(L \vdash \Box(\lambda \to (\Box\lambda \to C))\).
- Apply (K) (distribution) to get \(L \vdash \Box\lambda \to \Box(\Box\lambda \to C)\), then apply (K) again to get \(L \vdash \Box\lambda \to (\Box\Box\lambda \to \Box C)\).
- Apply (4) (Löb condition) to replace \(\Box\Box\lambda\) with \(\Box\lambda\), obtaining \(L \vdash \Box\lambda \to \Box C\).
Part 2(d). Now assume \(L \vdash \Box C \to C\). Using Part 2(c) and the Löb sentence, derive \(L \vdash C\).
Hint for Part 2(d)
Three steps:
- From Part 2(c), \(L \vdash \Box\lambda \to \Box C\). Combine with the assumption \(L \vdash \Box C \to C\) to get \(L \vdash \Box\lambda \to C\).
- But the Löb sentence says \(L \vdash \lambda \leftrightarrow (\Box\lambda \to C)\). Since we’ve shown \(L \vdash \Box\lambda \to C\), we get \(L \vdash \lambda\).
- Apply (N) (necessitation) to get \(L \vdash \Box\lambda\), then use \(L \vdash \Box\lambda \to C\) to conclude \(L \vdash C\).
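The derivation in Parts 2(c) and 2(d) is mechanical enough to machine-check. Below is a minimal Python sketch (our own encoding, not a standard library): formulas are nested tuples, a set records everything derived so far, and each rule refuses to fire unless its premises are already theorems. The propositional helpers `hs` and `s` are derivable steps that we take as primitive.

```python
# Formulas are nested tuples; THEOREMS records everything derived so far.
def imp(p, q): return ('imp', p, q)
def box(p):    return ('box', p)

LAM, C = 'lam', 'C'          # the Loeb sentence and the target statement
THEOREMS = set()

def have(p):
    assert p in THEOREMS, f"not yet derived: {p}"

def add(p):
    THEOREMS.add(p)
    return p

def nec(p):                  # (N): from |- p, infer |- Box p
    have(p)
    return add(box(p))

def k(p, q):                 # (K) axiom: Box(p -> q) -> (Box p -> Box q)
    return add(imp(box(imp(p, q)), imp(box(p), box(q))))

def four(p):                 # (4) axiom: Box p -> Box Box p
    return add(imp(box(p), box(box(p))))

def mp(pq, p):               # modus ponens
    have(pq); have(p)
    tag, lhs, rhs = pq
    assert tag == 'imp' and lhs == p
    return add(rhs)

def hs(pq, qr):              # hypothetical syllogism: p->q, q->r |- p->r
    have(pq); have(qr)
    assert pq[0] == qr[0] == 'imp' and pq[2] == qr[1]
    return add(imp(pq[1], qr[2]))

def s(pqr, pq):              # from p->(q->r) and p->q, infer p->r
    have(pqr); have(pq)
    p, (tag, q, r) = pqr[1], pqr[2]
    assert pqr[0] == 'imp' and tag == 'imp' and pq == imp(p, q)
    return add(imp(p, r))

# Given: both directions of the fixed point lam <-> (Box lam -> C),
# plus the assumption of Loeb's theorem, Box C -> C.
fp_fwd     = add(imp(LAM, imp(box(LAM), C)))
fp_bwd     = add(imp(imp(box(LAM), C), LAM))
assumption = add(imp(box(C), C))

# Part 2(c): Box lam -> Box C.
t1 = nec(fp_fwd)                           # Box(lam -> (Box lam -> C))
t2 = mp(k(LAM, imp(box(LAM), C)), t1)      # Box lam -> Box(Box lam -> C)
t3 = hs(t2, k(box(LAM), C))                # Box lam -> (Box Box lam -> Box C)
t4 = s(t3, four(LAM))                      # Box lam -> Box C

# Part 2(d): C.
t5 = hs(t4, assumption)                    # Box lam -> C
t6 = mp(fp_bwd, t5)                        # lam
t7 = nec(t6)                               # Box lam
conclusion = mp(t5, t7)                    # C
print(conclusion)                          # prints: C
```

Each comment matches the corresponding line of the hints above; if any rule is applied to a non-theorem, the sketch fails loudly rather than producing a bogus conclusion.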
Remark. The sentence \(\lambda \leftrightarrow (\Box\lambda \to C)\) is the provability-theoretic cousin of Curry’s paradox, the Santa Claus sentence: “If this sentence is true, then Santa Claus exists.” Replacing “true” by “provable” turns the semantic paradox into a precise theorem. The key point is that \(L\) can talk about its own provability predicate, but not about its own truth predicate.
Löb’s theorem is stronger than Gödel’s second incompleteness theorem in exactly the direction tiling agents care about. Gödel says a system cannot prove its own consistency. Löb says that for any particular statement \(C\), the only way a system can prove “if I prove \(C\), then \(C\) is true” is if \(C\) was already provable anyway.
Part 2(e). In a one-shot Prisoner’s Dilemma, two players each choose to either cooperate (\(C\)) or defect (\(D\)). Instead of choosing directly, each player submits a program that receives the opponent’s source code and outputs \(C\) or \(D\). Consider the following agent, FairBot:
```
FairBot(opponent):
    search for an L-proof that opponent(FairBot) = C
    if a proof is found: return C
    else: return D
```
FairBot cooperates with an opponent if and only if it can find an \(L\)-proof that the opponent cooperates with FairBot. Assuming \(L\) is sound, FairBot is unexploitable: it never cooperates with an opponent that defects against it.
Now consider two copies \(\texttt{FairBot}_1\) and \(\texttt{FairBot}_2\). Let \(A\) be the statement “\(\texttt{FairBot}_1(\texttt{FairBot}_2) = C\)” and \(B\) be the statement “\(\texttt{FairBot}_2(\texttt{FairBot}_1) = C\).”
Prove that \(L \vdash A \wedge B\), i.e. that the two FairBots mutually cooperate.
Hint for Part 2(e)
From FairBot’s source code, \(\Box A\) implies \(B\) and \(\Box B\) implies \(A\). Show that \(L \vdash \Box(A \wedge B) \to (A \wedge B)\), then apply Löb’s theorem to the statement \(A \wedge B\).
Further hint for Part 2(e)
From the source code: if \(L \vdash B\), then \(\texttt{FairBot}_1\) cooperates, so \(\Box B \to A\). Likewise \(\Box A \to B\). Therefore \(\Box(A \wedge B)\) implies both \(\Box A\) and \(\Box B\), which in turn imply \(A\) and \(B\). So \(\Box(A \wedge B) \to (A \wedge B)\), and Löb gives the conclusion.
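The argument in this hint is also mechanical enough to check symbolically. Below is a minimal Python sketch (our own encoding): the two implications read off FairBot’s source code are taken as given, Löb’s theorem is taken as a primitive rule, and box-monotonicity (from \(\vdash p \to q\) infer \(\vdash \Box p \to \Box q\), derivable from (N) and (K)) plus two conjunction steps do the rest.

```python
# Formulas are nested tuples; THMS records derived theorems; each rule
# checks its premises. Loeb's theorem is used here as a primitive rule.
def imp(p, q):  return ('imp', p, q)
def conj(p, q): return ('and', p, q)
def box(p):     return ('box', p)

A = 'A'   # "FairBot_1(FairBot_2) = C"
B = 'B'   # "FairBot_2(FairBot_1) = C"
THMS = set()

def have(p): assert p in THMS, f"not yet derived: {p}"
def add(p):  THMS.add(p); return p

def hs(pq, qr):          # p->q, q->r |- p->r
    have(pq); have(qr)
    assert pq[2] == qr[1]
    return add(imp(pq[1], qr[2]))

def box_mono(pq):        # from |- p->q infer |- Box p -> Box q (via N, K)
    have(pq)
    _, p, q = pq
    return add(imp(box(p), box(q)))

def pair(pq, pr):        # p->q, p->r |- p -> (q and r)
    have(pq); have(pr)
    assert pq[1] == pr[1]
    return add(imp(pq[1], conj(pq[2], pr[2])))

def loeb(bx_x):          # Loeb's theorem: from |- Box x -> x infer |- x
    have(bx_x)
    _, bx, x = bx_x
    assert bx == box(x)
    return add(x)

# Read off FairBot's source code: Box B -> A and Box A -> B.
c1 = add(imp(box(B), A))
c2 = add(imp(box(A), B))
# Conjunction-elimination axioms: (A and B) -> A, (A and B) -> B.
p1 = add(imp(conj(A, B), A))
p2 = add(imp(conj(A, B), B))

m1 = box_mono(p1)        # Box(A and B) -> Box A
m2 = box_mono(p2)        # Box(A and B) -> Box B
h1 = hs(m1, c2)          # Box(A and B) -> B
h2 = hs(m2, c1)          # Box(A and B) -> A
fix = pair(h2, h1)       # Box(A and B) -> (A and B)
result = loeb(fix)       # A and B: mutual cooperation
print(result)            # prints: ('and', 'A', 'B')
```

Note that `loeb` appeals to the theorem proved in Exercise 2 rather than re-deriving it; the sketch shows how the hint’s steps line up once that theorem is available.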