C.3 Agent Foundations: Descriptive Agent Foundations
Reading: The ground of optimization
Reading: Optimization at a distance
Reading: Algorithmic thermodynamics and three types of optimization
Connection to Maxwell’s demon and the second law
Reading: General purpose search
Reading: Selection theorems
Exercise 4: The Do-Divergence Theorem
🔵🔵🔵🔵⚫️ Importance
🔴🔴🔴⚫️⚫️ Difficulty
Motivation: optimization as steering. A useful way to think about optimization is that an agent reliably steers the world into a narrow set of outcomes that would otherwise be unlikely. A thermostat steers toward a narrow temperature range despite changing weather. A chess player steers toward checkmate despite the opponent’s moves.
This exercise makes that idea precise in an information-theoretic setting.
Background: KL divergence. Given two distributions \(p\) and \(q\) on the same space, the Kullback-Leibler divergence is
\[ D_{\mathrm{KL}}(p \,\|\, q) := \sum_x p(x) \log \frac{p(x)}{q(x)}. \]
This is always nonnegative and becomes large when \(p\) concentrates on outcomes that are surprising under \(q\).
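As a quick numerical sketch (ours, not from the readings; Python with base-2 logs, so divergences come out in bits):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits, for distributions given as lists of probabilities."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.1, 0.9]
print(kl_divergence(p, q))  # large: p concentrates where q puts little mass
print(kl_divergence(p, p))  # 0.0: identical distributions diverge by nothing
```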
Background: mutual information. The mutual information between random variables \(A\) and \(O\) is
\[ \mathrm{MI}(A;\, O) := D_{\mathrm{KL}}\bigl(P[A, O] \,\|\, P[A]P[O]\bigr). \]
It is zero exactly when \(A\) and \(O\) are independent; in the setup below, it measures how much the agent’s action depends on its observation.
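The definition can be computed directly from a joint table (a sketch of ours; base-2 logs, so the answer is in bits):

```python
import math

def mutual_information(joint):
    """MI(A; O) in bits: the KL divergence of the joint from the product
    of its marginals. joint[a][o] is P[A=a, O=o]."""
    p_a = [sum(row) for row in joint]
    p_o = [sum(col) for col in zip(*joint)]
    return sum(
        joint[a][o] * math.log2(joint[a][o] / (p_a[a] * p_o[o]))
        for a in range(len(joint)) for o in range(len(joint[0]))
        if joint[a][o] > 0
    )

# Action copies the observation: one full bit of dependence.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))   # 1.0
# Action independent of the observation: zero.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
```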
Setup. Consider an agent (the “demon”) that observes \(O\), takes an action \(A\), and produces an outcome \(X\). The joint distribution factorizes as
\[ P[X, A, O] = P[X \mid A, O] \, P[A \mid O] \, P[O]. \]
Now compare this to a blind baseline in which the demon ignores its observation and chooses actions independently of \(O\). We write this as the intervention \(do(A)\):
\[ P[X, A, O \mid do(A)] = P[X \mid A, O] \, P[A] \, P[O]. \]
Only the policy changes. The mechanism \(P[X \mid A, O]\) stays fixed.
The theorem. Prove:
\[ D_{\mathrm{KL}}(P[X] \,\|\, P[X \mid do(A)]) \leq \mathrm{MI}(A;\, O). \]
The left side measures how different the sighted demon’s outcomes are from the blind baseline’s. The right side measures how much information the demon actually uses.
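Before proving the bound, it can be sanity-checked numerically on random finite distributions (our sketch; all alphabet sizes and helper names are illustrative choices):

```python
import math
import random

def kl(p, q):
    """D_KL(p || q) in bits for distributions given as lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def random_dist(n):
    """A random distribution on n outcomes."""
    w = [random.random() for _ in range(n)]
    return [x / sum(w) for x in w]

random.seed(0)
nO, nA, nX = 3, 3, 4
p_o = random_dist(nO)                          # P[O]
policy = [random_dist(nA) for _ in range(nO)]  # policy[o][a] = P[A=a | O=o]
mech = [[random_dist(nX) for _ in range(nO)]   # mech[a][o][x] = P[X=x | A=a, O=o]
        for _ in range(nA)]

# Marginal policy P[A]: what the blind demon plays under do(A).
p_a = [sum(p_o[o] * policy[o][a] for o in range(nO)) for a in range(nA)]

# Sighted outcomes P[X] vs. blind baseline P[X | do(A)]; only the policy differs.
p_x = [sum(p_o[o] * policy[o][a] * mech[a][o][x]
           for o in range(nO) for a in range(nA)) for x in range(nX)]
p_x_do = [sum(p_o[o] * p_a[a] * mech[a][o][x]
              for o in range(nO) for a in range(nA)) for x in range(nX)]

# MI(A; O) from the joint P[A, O] = P[A | O] P[O].
mi = sum(p_o[o] * policy[o][a] * math.log2(policy[o][a] / p_a[a])
         for o in range(nO) for a in range(nA) if policy[o][a] > 0)

assert kl(p_x, p_x_do) <= mi + 1e-12   # the do-divergence bound
```

Rerunning with other seeds and alphabet sizes never violates the inequality, which is what the theorem guarantees.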
Hint for Exercise 4
Use monotonicity of KL divergence under marginalization (a form of the data processing inequality). Compare the full joint distributions \(P[X, A, O]\) and \(P[X, A, O \mid do(A)]\), then marginalize out \(A\) and \(O\), which can only decrease the divergence.
When you expand the log ratio for the joint KL divergence, the \(P[X \mid A, O]\) terms cancel, leaving exactly the mutual information term.
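Spelled out (our elaboration of the hint), the joint-level computation is
\[ D_{\mathrm{KL}}\bigl(P[X, A, O] \,\|\, P[X, A, O \mid do(A)]\bigr) = \mathbb{E}_{P}\!\left[\log \frac{P[X \mid A, O]\, P[A \mid O]\, P[O]}{P[X \mid A, O]\, P[A]\, P[O]}\right] = \mathbb{E}_{P}\!\left[\log \frac{P[A \mid O]}{P[A]}\right] = \mathrm{MI}(A;\, O), \]
and marginalizing \(A\) and \(O\) out of both arguments can only decrease the divergence, which gives the bound.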
Remark. In Maxwell’s demon, \(X\) is the final molecular configuration, \(O\) is the demon’s measurement, and \(A\) is its door-opening action. Under the blind baseline, the gas stays broadly spread. If the demon perfectly sorts \(n\) molecules to one side, the outcome distribution shifts by \(n\) bits (\(n \log 2\) nats) relative to the baseline, so the theorem says it must have used at least \(n\) bits of mutual information. Steering toward low-entropy outcomes requires information.
Exercise 5: Channel Additivity
🔵🔵🔵⚫️⚫️ Importance
🔴🔴🔴🔴⚫️ Difficulty
Consider two independent channels \(X_1 \to Y_1\) and \(X_2 \to Y_2\) with a fixed joint channel
\[ P[Y \mid X] = P[Y_1 \mid X_1] \, P[Y_2 \mid X_2]. \]
That is, \(Y_1\) depends only on \(X_1\), \(Y_2\) depends only on \(X_2\), and the channels do not interact. We are free to choose the input distribution \(P[X]\), possibly correlating \(X_1\) and \(X_2\). The goal is to maximize the mutual information \(\mathrm{MI}(X; Y)\), called the information throughput.
Part 5(a). Show that
\[ \mathrm{MI}(X;\, Y) = \mathrm{MI}(X_1;\, Y_1) + \mathrm{MI}(X_2;\, Y_2) - \mathrm{MI}(Y_1;\, Y_2). \]
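The identity can be sanity-checked numerically before proving it (a sketch of ours; the binary alphabets and channel matrices are arbitrary choices):

```python
import math
import random

def mi_from_joint(joint):
    """Mutual information in bits from a joint distribution {(a, b): p}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def group(joint, idx_a, idx_b):
    """Marginalize a full joint {(x1, x2, y1, y2): p} onto two coordinate groups."""
    out = {}
    for k, p in joint.items():
        key = (tuple(k[i] for i in idx_a), tuple(k[i] for i in idx_b))
        out[key] = out.get(key, 0.0) + p
    return out

random.seed(1)
# A correlated input P[X1, X2] over bits, and two arbitrary binary channels.
w = [random.random() for _ in range(4)]
p_x = {(x1, x2): w[2 * x1 + x2] / sum(w) for x1 in (0, 1) for x2 in (0, 1)}
ch1 = {0: [0.9, 0.1], 1: [0.2, 0.8]}   # ch1[x1][y1] = P[Y1=y1 | X1=x1]
ch2 = {0: [0.7, 0.3], 1: [0.4, 0.6]}   # ch2[x2][y2] = P[Y2=y2 | X2=x2]

# Full joint under the factorized channel P[Y | X] = P[Y1 | X1] P[Y2 | X2].
joint = {(x1, x2, y1, y2): p_x[(x1, x2)] * ch1[x1][y1] * ch2[x2][y2]
         for (x1, x2) in p_x for y1 in (0, 1) for y2 in (0, 1)}

mi_xy   = mi_from_joint(group(joint, (0, 1), (2, 3)))   # MI(X; Y)
mi_x1y1 = mi_from_joint(group(joint, (0,), (2,)))       # MI(X1; Y1)
mi_x2y2 = mi_from_joint(group(joint, (1,), (3,)))       # MI(X2; Y2)
mi_y1y2 = mi_from_joint(group(joint, (2,), (3,)))       # MI(Y1; Y2)

assert abs(mi_xy - (mi_x1y1 + mi_x2y2 - mi_y1y2)) < 1e-10
```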
Hint: Part 5(a)
Write each mutual information term as a KL divergence between a joint distribution and the product of its marginals. Then expand the log ratios and use the channel factorization.
Further hint for Part 5(a)
Write \(\mathrm{MI}(X; Y) = D_{\mathrm{KL}}(P[X,Y] \,\|\, P[X]P[Y])\) and factor \(P[X,Y] = P[Y \mid X]\,P[X] = P[Y_1 \mid X_1]\,P[Y_2 \mid X_2]\,P[X]\). In the denominator, write \(P[Y] = P[Y_1]\,P[Y_2] \cdot \frac{P[Y_1, Y_2]}{P[Y_1]\,P[Y_2]}\); note that \(P[Y] \neq P[Y_1]P[Y_2]\) in general, since a correlated input correlates the outputs. The KL divergence then decomposes into terms involving \((X_1, Y_1)\), \((X_2, Y_2)\), and the correction factor, which contributes the cross-term \(-\mathrm{MI}(Y_1;\, Y_2)\).
Part 5(b). Given any input distribution \(P[X_1, X_2]\), construct
\[ Q[X_1, X_2] := P[X_1] \, P[X_2], \]
i.e. the product of the marginals of \(P\). Show that \(\mathrm{MI}_Q(X; Y) \geq \mathrm{MI}_P(X; Y)\).
Hint for Part 5(b)
Use the decomposition from Part 5(a). Under \(Q\), the inputs are independent, so the outputs are too: \(\mathrm{MI}_Q(Y_1; Y_2) = 0\). Meanwhile, \(Q\) preserves the marginals of \(X_1\) and \(X_2\), so the single-channel terms do not change.
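Numerically, decorrelating the inputs indeed never lowers throughput (our sketch; the channel matrices are arbitrary choices):

```python
import math
import random

def mi_from_joint(joint):
    """Mutual information in bits from a joint distribution {(a, b): p}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

# Fixed factorized channel: chi[xi][yi] = P[Yi=yi | Xi=xi].
ch1 = {0: [0.9, 0.1], 1: [0.2, 0.8]}
ch2 = {0: [0.7, 0.3], 1: [0.4, 0.6]}

def throughput(p_x):
    """MI(X; Y) for an input distribution p_x = {(x1, x2): p} under the fixed channel."""
    joint = {((x1, x2), (y1, y2)): p_x[(x1, x2)] * ch1[x1][y1] * ch2[x2][y2]
             for (x1, x2) in p_x for y1 in (0, 1) for y2 in (0, 1)}
    return mi_from_joint(joint)

random.seed(2)
# A correlated input P, and Q = product of its marginals, as in the exercise.
w = [random.random() for _ in range(4)]
p = {(x1, x2): w[2 * x1 + x2] / sum(w) for x1 in (0, 1) for x2 in (0, 1)}
m1 = {x1: sum(p[(x1, x2)] for x2 in (0, 1)) for x1 in (0, 1)}
m2 = {x2: sum(p[(x1, x2)] for x1 in (0, 1)) for x2 in (0, 1)}
q = {(x1, x2): m1[x1] * m2[x2] for x1 in (0, 1) for x2 in (0, 1)}

assert throughput(q) >= throughput(p) - 1e-12   # decorrelating never hurts
```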
Part 5(c). Use Part 5(b) to conclude that there always exists a throughput-maximizing input distribution under which \(X_1 \perp\!\!\!\perp X_2\).
Hint for Part 5(c)
Let \(P^*\) be any throughput-maximizing distribution and construct \(Q^* = P^*[X_1] P^*[X_2]\). By Part 5(b), \(\mathrm{MI}_{Q^*}(X; Y) \geq \mathrm{MI}_{P^*}(X; Y)\). Since \(P^*\) was already optimal, equality must hold, so \(Q^*\) is also optimal and has independent inputs.
Remark. Think of \(X\) as the agent’s actions and \(Y\) as the outcomes it cares about. If the environment is modular in the sense that \(Y_1\) depends only on \(X_1\) and \(Y_2\) only on \(X_2\), then there is always an optimal policy that is modular too: the two groups of actions need not be coordinated at all.