Iliad Intensive: Prerequisites Refresher
Purpose
This page collects the shared background that participants may want to review before the Agent Foundations and Abstractions sessions. It is not meant to be a full textbook. The goal is to put the core definitions and proof tools in one place so the participant guides can stay focused on the main ideas.
Probability Theory
Core identities
- Complement rule. \(\Pr[A^c] = 1 - \Pr[A]\).
- Addition rule. \(\Pr[A \cup B] = \Pr[A] + \Pr[B] - \Pr[A \cap B]\).
- Conditional probability. \(\Pr[A \mid B] = \Pr[A,B]/\Pr[B]\).
- Bayes’ theorem. \(\Pr[A \mid B] = \Pr[B \mid A]\Pr[A]/\Pr[B]\).
- Law of total probability. If \(B_1, \ldots, B_n\) partition the sample space, then \(\Pr[A] = \sum_i \Pr[A \mid B_i]\Pr[B_i]\).
- Chain rule. \(\Pr[A_1,\ldots,A_n] = \Pr[A_1]\Pr[A_2 \mid A_1]\cdots\Pr[A_n \mid A_1,\ldots,A_{n-1}]\).
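These identities are easy to sanity-check numerically. A minimal sketch in Python, using a made-up joint distribution over two binary events (the numbers are purely illustrative):

```python
# Numeric check of Bayes' theorem and the law of total probability
# on a small hypothetical joint distribution over A in {0,1}, B in {0,1}.

joint = {  # Pr[A=a, B=b]; values are made up for illustration
    (0, 0): 0.30, (0, 1): 0.20,
    (1, 0): 0.10, (1, 1): 0.40,
}

def pr_A(a):
    # Marginalize out B.
    return sum(p for (ai, b), p in joint.items() if ai == a)

def pr_B(b):
    # Marginalize out A.
    return sum(p for (a, bi), p in joint.items() if bi == b)

def pr_A_given_B(a, b):
    return joint[(a, b)] / pr_B(b)

def pr_B_given_A(b, a):
    return joint[(a, b)] / pr_A(a)

# Bayes' theorem: Pr[A|B] = Pr[B|A] Pr[A] / Pr[B]
lhs = pr_A_given_B(1, 1)
rhs = pr_B_given_A(1, 1) * pr_A(1) / pr_B(1)
assert abs(lhs - rhs) < 1e-12

# Law of total probability: Pr[A=1] = sum_b Pr[A=1|B=b] Pr[B=b]
total = sum(pr_A_given_B(1, b) * pr_B(b) for b in (0, 1))
assert abs(total - pr_A(1)) < 1e-12
```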
Random variables
Given a joint distribution on \(X\) and \(Y\):
\[ \Pr[X=x] = \sum_y \Pr[X=x,Y=y]. \]
This is marginalization. More generally, to obtain the distribution on any subset of variables, sum over the rest.
Two random variables are independent if
\[ \Pr[X,Y] = \Pr[X]\Pr[Y]. \]
They are conditionally independent given \(Z\) if
\[ \Pr[X,Y \mid Z] = \Pr[X \mid Z]\Pr[Y \mid Z]. \]
Do not conflate these:
- Mutual exclusivity is not the same as independence: two mutually exclusive events with positive probability are strongly dependent, since observing one rules out the other.
- Conditional independence given \(Z\) does not imply marginal independence, nor does marginal independence imply conditional independence.
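One direction of this warning can be seen in a two-coin sketch (the standard XOR construction, chosen for illustration): \(X\) and \(Y\) are marginally independent fair coins, but become dependent once \(Z = X \oplus Y\) is observed.

```python
# Marginal independence does not survive conditioning: a minimal example.
# X, Y are independent fair coin flips; Z = X XOR Y is deterministic.
import itertools

def joint(x, y, z):
    # Pr[X=x, Y=y, Z=z]
    return 0.25 if z == x ^ y else 0.0

def marg(pred):
    # Probability of the event described by pred, by summing the joint.
    return sum(joint(x, y, z)
               for x, y, z in itertools.product((0, 1), repeat=3)
               if pred(x, y, z))

# Marginally independent: Pr[X=1, Y=1] = Pr[X=1] Pr[Y=1]
assert marg(lambda x, y, z: x == 1 and y == 1) == \
       marg(lambda x, y, z: x == 1) * marg(lambda x, y, z: y == 1)

# But conditioning on Z = 0 couples them:
pz0 = marg(lambda x, y, z: z == 0)
p_xy = marg(lambda x, y, z: x == 1 and y == 1 and z == 0) / pz0  # = 0.5
p_x = marg(lambda x, y, z: x == 1 and z == 0) / pz0              # = 0.5
p_y = marg(lambda x, y, z: y == 1 and z == 0) / pz0              # = 0.5
assert p_xy != p_x * p_y  # 0.5 != 0.25: dependent given Z
```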
Deterministic functions
If \(Y = f(X)\), then knowing \(X\) completely determines \(Y\). Equivalently, \(H(Y \mid X) = 0\) once we move to information-theoretic language. This matters repeatedly in the abstractions exercises, where latent variables are often assumed to be deterministic functions of observables.
Formal Logic And Provability
Propositional logic
The basic connectives are:
- Negation: \(\neg \phi\)
- Conjunction: \(\phi \wedge \psi\)
- Disjunction: \(\phi \vee \psi\)
- Implication: \(\phi \Rightarrow \psi\)
- Biconditional: \(\phi \Leftrightarrow \psi\)
A useful identity is
\[ \phi \Rightarrow \psi \quad \equiv \quad \neg \phi \vee \psi. \]
We write \(\top\) for a tautology and \(\bot\) for a contradiction.
Quantifiers
- \(\forall x,\ \phi(x)\) means \(\phi\) holds for every \(x\).
- \(\exists x,\ \phi(x)\) means \(\phi\) holds for at least one \(x\).
Negation interacts with quantifiers as:
\[ \neg \forall x\, \phi(x) \equiv \exists x\, \neg \phi(x), \qquad \neg \exists x\, \phi(x) \equiv \forall x\, \neg \phi(x). \]
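Both the implication identity and the quantifier duality can be checked exhaustively. On a finite domain, the duality laws are exactly De Morgan's laws for `all()`/`any()`; the domain and predicate below are arbitrary illustrations.

```python
# Truth-table check: (phi => psi) is equivalent to (not phi) or psi.
from itertools import product

for phi, psi in product((False, True), repeat=2):
    lhs = not (phi and not psi)  # phi => psi: "never phi true with psi false"
    rhs = (not phi) or psi
    assert lhs == rhs

# Quantifier negation on a finite domain: not-forall = exists-not, etc.
domain = range(10)
def prop(x):
    return x % 3 == 0  # arbitrary predicate for illustration

assert (not all(prop(x) for x in domain)) == any(not prop(x) for x in domain)
assert (not any(prop(x) for x in domain)) == all(not prop(x) for x in domain)
```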
Standard proof moves
- Direct proof. Assume the hypothesis and derive the conclusion.
- Contrapositive. To prove \(\phi \Rightarrow \psi\), it is enough to prove \(\neg \psi \Rightarrow \neg \phi\).
- Contradiction. Assume \(\neg \phi\) and derive \(\bot\).
- Modus ponens. From \(\phi\) and \(\phi \Rightarrow \psi\), conclude \(\psi\).
Provability
We write
\[ L \vdash \phi \]
to mean that \(\phi\) is provable in formal system \(L\). This is a syntactic statement: it says there exists a finite derivation of \(\phi\) from the axioms and inference rules of \(L\).
For sufficiently strong systems, syntax can be encoded arithmetically via Gödel numbering, allowing the system to talk about its own formulas and proofs. This is what makes the provability predicate \(\Box \phi\) possible.
Provability logic facts
The exercises use three standard properties:
- Necessitation (N). If \(L \vdash \phi\), then \(L \vdash \Box \phi\).
- Distribution (K). \(L \vdash \Box(\phi \to \psi) \to (\Box \phi \to \Box \psi)\).
- Löb condition (4). \(L \vdash \Box \phi \to \Box \Box \phi\).
The important distinction is between:
- \(L \vdash \phi\): a proof of \(\phi\) exists.
- \(L \vdash \Box \phi\): the system proves that a proof of \(\phi\) exists.
In general, the system cannot pass from the second back to the first: \(L\) does not, in general, prove \(\Box \phi \to \phi\). That gap is exactly what Gödel’s and Löb’s theorems exploit.
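For reference, the theorem that pins down this gap is Löb's theorem:
\[ \text{if } L \vdash \Box \phi \to \phi, \text{ then } L \vdash \phi. \]
So \(L\) can endorse \(\Box \phi \to \phi\) only for sentences \(\phi\) it already proves outright.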
Computability Theory
Turing machines
A Turing machine \(M\) is an idealized model of computation with a finite control state, an unbounded tape, and a fixed transition rule. On an input \(x\), it either:
- halts, written \(M(x) \downarrow\), or
- runs forever, written \(M(x) \uparrow\).
A universal Turing machine can simulate any other machine on any input. This is the abstract version of a programmable computer.
Decidable vs recursively enumerable
A set \(S\) is:
- decidable if there is a machine that halts on every input and correctly answers membership in \(S\);
- recursively enumerable if there is a machine that halts and accepts whenever the input is in \(S\), but may loop forever otherwise.
Every decidable set is recursively enumerable, but not conversely.
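The recursively-enumerable side of this distinction can be sketched by modeling a machine as a Python generator that "halts" when it is exhausted. This encoding, and the fuel bound used to keep the demonstration finite, are illustrative conveniences, not part of the standard definition.

```python
# A semi-decision procedure: accepts whenever the machine halts,
# but never answers "no" for a machine that runs forever.

def semi_decide(machine, fuel=None):
    """Run `machine` step by step; return True if it halts.

    With fuel=None this may loop forever on non-halting machines --
    exactly the r.e. behavior. The fuel bound exists only so the
    example can be run safely; None here means "no verdict yet",
    never a rejection.
    """
    steps = 0
    gen = machine()
    while True:
        try:
            next(gen)  # one step of computation
        except StopIteration:
            return True  # machine halted: accept
        steps += 1
        if fuel is not None and steps >= fuel:
            return None  # out of fuel: no verdict

def halts_after_3():
    for _ in range(3):
        yield

def loops_forever():
    while True:
        yield

assert semi_decide(halts_after_3) is True
assert semi_decide(loops_forever, fuel=100) is None
```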
The halting problem
The canonical undecidable problem is
\[ \mathrm{HALT} = \{\langle M, x \rangle \mid M(x) \downarrow \}. \]
It is recursively enumerable but not decidable. The proof is by diagonalization: if some machine \(H\) decided halting, we could build a machine \(D\) that, given any machine's description, does the opposite of whatever \(H\) predicts that machine does on its own description; running \(D\) on its own description then contradicts \(H\)'s answer.
This same self-reference pattern is what later reappears in Gödel sentences and Löb-style constructions.
Information Theory
Entropy
For a discrete random variable \(X\) with distribution \(p(x)\),
\[ H(X) = -\sum_x p(x)\log p(x). \]
Intuitively, entropy measures uncertainty. Important facts:
- \(H(X) \geq 0\).
- \(H(X) = 0\) iff \(X\) is deterministic.
- \(H(X) \leq \log |\mathcal X|\), with equality iff \(X\) is uniform on \(\mathcal X\).
Joint and conditional entropy
\[ H(X,Y) = -\sum_{x,y} p(x,y)\log p(x,y), \qquad H(X \mid Y) = H(X,Y) - H(Y). \]
Two key identities:
- Chain rule. \(H(X,Y) = H(X) + H(Y \mid X)\).
- Conditioning reduces entropy. \(H(X \mid Y) \leq H(X)\).
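Both identities are easy to verify on a small example. A sketch with a made-up joint distribution \(p(x,y)\), computing \(H(X \mid Y)\) directly from its definition:

```python
# Numeric check of the entropy chain rule and "conditioning reduces entropy".
from math import log2

p = {  # hypothetical joint p(x, y); numbers chosen for illustration
    (0, 0): 0.4, (0, 1): 0.1,
    (1, 0): 0.2, (1, 1): 0.3,
}

def H(dist):
    # Shannon entropy in bits.
    return -sum(q * log2(q) for q in dist.values() if q > 0)

px = {x: sum(q for (xi, _), q in p.items() if xi == x) for x in (0, 1)}
py = {y: sum(q for (_, yi), q in p.items() if yi == y) for y in (0, 1)}

# H(X|Y) from the definition: -sum p(x,y) log p(x|y) = sum p(x,y) log(p(y)/p(x,y))
H_x_given_y = sum(q * log2(py[y] / q) for (_, y), q in p.items() if q > 0)

# Chain rule: H(X,Y) = H(Y) + H(X|Y)
assert abs(H(p) - (H(py) + H_x_given_y)) < 1e-12
# Conditioning reduces entropy: H(X|Y) <= H(X)
assert H_x_given_y <= H(px) + 1e-12
```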
KL divergence
For distributions \(P\) and \(Q\) on the same space,
\[ D_{\mathrm{KL}}(P \| Q) = \sum_x P(x)\log\frac{P(x)}{Q(x)}. \]
Important facts:
- \(D_{\mathrm{KL}}(P \| Q) \geq 0\), with equality iff \(P=Q\).
- KL divergence is not symmetric.
- Chain rule for KL.
\[ D_{\mathrm{KL}}(P(X,Y)\|Q(X,Y)) = D_{\mathrm{KL}}(P(X)\|Q(X)) + \mathbb E_{X \sim P}\!\left[D_{\mathrm{KL}}(P(Y \mid X)\|Q(Y \mid X))\right]. \]
- Monotonicity under marginalization.
\[ D_{\mathrm{KL}}(P(X)\|Q(X)) \leq D_{\mathrm{KL}}(P(X,Y)\|Q(X,Y)). \]
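These facts can be checked numerically; the two joint distributions below are arbitrary choices for illustration.

```python
# KL divergence: nonnegativity, asymmetry, and monotonicity under marginalization.
from math import log2

# Two hypothetical joint distributions over (X, Y) with full support.
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
Q = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

def kl(p, q):
    # D_KL(p || q) in bits; assumes q > 0 wherever p > 0.
    return sum(p[k] * log2(p[k] / q[k]) for k in p if p[k] > 0)

def marginal_x(p):
    return {x: sum(v for (xi, _), v in p.items() if xi == x) for x in (0, 1)}

assert kl(P, Q) > 0        # nonnegative, zero only for identical distributions
assert kl(P, P) == 0
assert kl(P, Q) != kl(Q, P)  # not symmetric

# Monotonicity: KL(P(X) || Q(X)) <= KL(P(X,Y) || Q(X,Y))
assert kl(marginal_x(P), marginal_x(Q)) <= kl(P, Q) + 1e-12
```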
Mutual information
\[ I(X;Y) = D_{\mathrm{KL}}(P(X,Y)\,\|\,P(X)P(Y)). \]
Equivalent forms:
\[ I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X)+H(Y)-H(X,Y). \]
Important facts:
- \(I(X;Y) \geq 0\), with equality iff \(X \perp\!\!\!\perp Y\).
- \(I(X;Y) = I(Y;X)\).
- If \(Y=f(X)\) deterministically, then \(H(Y \mid X)=0\) and \(I(X;Y)=H(Y)\).
These are the identities used most heavily in both the natural latents and agent foundations materials.
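A quick numeric check that the KL form and the entropy forms of mutual information agree, on a made-up joint distribution:

```python
# I(X;Y) two ways: as a KL divergence and via entropies.
from math import log2

p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}  # hypothetical joint

def H(dist):
    return -sum(q * log2(q) for q in dist.values() if q > 0)

px = {x: sum(q for (xi, _), q in p.items() if xi == x) for x in (0, 1)}
py = {y: sum(q for (_, yi), q in p.items() if yi == y) for y in (0, 1)}

# I(X;Y) = D_KL( P(X,Y) || P(X)P(Y) )
I = sum(q * log2(q / (px[x] * py[y])) for (x, y), q in p.items() if q > 0)

# Agrees with H(X) + H(Y) - H(X,Y), and is nonnegative.
assert abs(I - (H(px) + H(py) - H(p))) < 1e-12
assert I >= 0
```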
Bayesian Networks And Causality
Bayesian networks
A Bayesian network is a directed acyclic graph together with a factorization of the joint distribution:
\[ \Pr[X_1,\ldots,X_n] = \prod_i \Pr[X_i \mid \mathrm{Pa}(X_i)]. \]
The graph says which variables are direct parents of which other variables.
Three basic patterns
- Chain: \(X \to Z \to Y\). Conditioning on \(Z\) blocks the path.
- Fork: \(X \leftarrow Z \to Y\). Conditioning on \(Z\) blocks the path.
- Collider: \(X \to Z \leftarrow Y\). The path is blocked by default, but conditioning on \(Z\) or its descendants opens it.
The collider behavior is the source of “explaining away.”
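Explaining away can be demonstrated with two independent fair coins feeding an OR gate (a minimal hypothetical alarm model): once the alarm \(Z = X \vee Y\) is known to have fired, learning that \(X\) fired makes \(Y\) less likely.

```python
# "Explaining away" at a collider X -> Z <- Y:
# X, Y independent fair coins, Z = X OR Y.
import itertools

def joint(x, y, z):
    return 0.25 if z == (x | y) else 0.0

def pr(pred):
    return sum(joint(x, y, z)
               for x, y, z in itertools.product((0, 1), repeat=3)
               if pred(x, y, z))

# Given the alarm Z=1, Y=1 is fairly likely (2/3)...
p_y_given_z = pr(lambda x, y, z: y == 1 and z == 1) / pr(lambda x, y, z: z == 1)

# ...but learning X=1 already "explains" Z=1, lowering the posterior on Y (1/2).
p_y_given_zx = (pr(lambda x, y, z: y == 1 and z == 1 and x == 1)
                / pr(lambda x, y, z: z == 1 and x == 1))

assert p_y_given_zx < p_y_given_z
```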
D-separation
D-separation is the graphical criterion for deciding when a conditioning set \(Z\) blocks every path between variables \(X\) and \(Y\). The local rules above are enough for most practical reasoning in these sessions:
- conditioning on a chain or fork node blocks information flow;
- conditioning on a collider opens information flow.
Observation vs intervention
It is crucial to distinguish:
- \(\Pr[Y \mid X=x]\): observational conditioning;
- \(\Pr[Y \mid do(X=x)]\): intervention.
Conditioning asks what happens among cases where \(X=x\) happened naturally. Intervention asks what happens if we forcibly set \(X=x\) and sever the usual causes of \(X\).
In a confounded fork \(X \leftarrow Z \to Y\), these differ:
\[ \Pr[Y \mid do(X=x)] = \sum_z \Pr[Y \mid X=x, Z=z]\Pr[Z=z], \]
whereas
\[ \Pr[Y \mid X=x] = \sum_z \Pr[Y \mid X=x, Z=z]\Pr[Z=z \mid X=x]. \]
The difference is exactly the confounding bias.
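The two formulas give genuinely different numbers. A sketch with hypothetical conditional probability tables for the fork \(Z \to X\), \(Z \to Y\):

```python
# Observational vs interventional probability in a confounded fork Z -> X, Z -> Y.
# All conditional probability tables below are made up for illustration.

pz = {0: 0.5, 1: 0.5}                                   # p(z)
px_given_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # p(x | z)
py1_given_xz = {  # p(y=1 | x, z): Y depends on both X and the confounder Z
    (0, 0): 0.1, (1, 0): 0.3,
    (0, 1): 0.6, (1, 1): 0.9,
}

# Pr[Y=1 | do(X=1)]: average over the *marginal* of Z
p_do = sum(py1_given_xz[(1, z)] * pz[z] for z in (0, 1))

# Pr[Y=1 | X=1]: average over Z's *posterior* given X=1
pxz = {z: px_given_z[z][1] * pz[z] for z in (0, 1)}  # p(x=1, z)
px1 = sum(pxz.values())
p_obs = sum(py1_given_xz[(1, z)] * pxz[z] / px1 for z in (0, 1))

assert p_do != p_obs  # the gap is exactly the confounding bias
```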
Statistical Mechanics
Microstates and macrostates
A microstate is a complete specification of the underlying physical degrees of freedom. A macrostate is a coarse-grained description in terms of observables like energy, volume, or temperature.
Many microstates can correspond to the same macrostate.
Entropy and equilibrium
Boltzmann entropy is
\[ S = k_B \ln \Omega, \]
where \(\Omega\) is the number of microstates compatible with a macrostate.
The basic equilibrium heuristic is: the system tends toward macrostates compatible with many microstates.
Canonical ensemble
For a system in contact with a heat bath, maximizing Gibbs entropy under normalization and fixed mean energy yields the Boltzmann distribution:
\[ p_i = \frac{e^{-\beta E_i}}{Z}, \qquad Z = \sum_i e^{-\beta E_i}. \]
Here \(Z\) is the partition function, and \(\beta = 1/(k_B T)\).
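A minimal numeric sketch for a hypothetical three-level system, working in units where \(k_B T = 1\) (so \(\beta = 1\)):

```python
# Boltzmann distribution for a made-up three-level system.
from math import exp

energies = [0.0, 1.0, 2.0]  # E_i in units of k_B T (illustrative values)
beta = 1.0                  # beta = 1/(k_B T) = 1 in these units

Z = sum(exp(-beta * E) for E in energies)    # partition function
p = [exp(-beta * E) / Z for E in energies]   # p_i = exp(-beta E_i) / Z

assert abs(sum(p) - 1.0) < 1e-12  # normalized
assert p[0] > p[1] > p[2]         # lower energy -> higher probability
```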
Why this matters here
For the agent foundations materials, the main reason to know this background is conceptual, not computational:
- entropy tracks how spread out a distribution is;
- low-entropy outcomes are atypical and therefore require explanation;
- optimization can often be understood as steering systems into narrow, low-entropy regions;
- information and thermodynamics are tightly linked, which is exactly the theme behind Maxwell’s demon and related exercises.
How To Use This Page
- If you are reading the Agent Foundations guide, skim probability, logic, computability, information theory, causality, and statistical mechanics.
- If you are reading the Abstractions guide, focus first on information theory and Bayesian networks.
- If you already know the basics, you mostly need the identities and distinctions, not the surrounding exposition.