C.5 Abstractions: Condensation
Reading: Condensation
Exercises: Condensation and natural latents
The following are exercises on condensation and natural latents. Each problem is broken into a sequence of lemmas leading to a main theorem. For each subquestion, try to prove the stated lemma before reading on. If you get stuck, you may treat the lemma as given and proceed to the next part.
Setup
Let \(X_1, \ldots, X_n\) be observable random variables with index set \(I = \{1, \ldots, n\}\). We write \(\mathcal{P}^+I\) for the set of all non-empty subsets of \(I\).
Latent Variable Model (LVM). A latent variable model over \(X_1, \ldots, X_n\) is a collection of latent variables \((Y_A)_{A \in \mathcal{P}^+I}\), where \(Y_A\) is indexed by a non-empty subset \(A \subseteq I\). The index \(A\) indicates which observables \(Y_A\) contributes to. Latent variables that are constant (carrying no information) are omitted. We require the latent variable model condition: each \(X_i\) is exactly recoverable from \(Y_{\ni i} = (Y_A : i \in A)\), the collection of all latent variables above \(i\).
Condensation. For a latent variable model \((Y_A)_{A \in \mathcal{P}^+I}\), define the conditioned score for a non-empty subset \(A \subseteq I\) as
\[ \chi(A) = \sum_{B \cap A \neq \emptyset} H(Y_B \mid Y_{\supset B}), \]
where \(Y_{\supset B} = (Y_C : C \supsetneq B)\) denotes the latent variables strictly above \(B\). The conditioned score always satisfies \(\chi(A) \geq H(X_A)\).
We say that \((Y_A)_{A \in \mathcal{P}^+I}\) is an \(\varepsilon\)-approximate condensation of \(X_1, \ldots, X_n\) if
\[ \chi(A) \leq H(X_A) + \varepsilon, \quad \forall A \in \mathcal{P}^+I. \]
A perfect condensation is a \(0\)-approximate condensation, i.e. \(\chi(A) = H(X_A)\) for all \(A\).
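As a quick sanity check on the definition, here is a hedged numerical sketch in Python. The toy model is my own, not from the text: two observables that are exact copies of a fair coin \(Z\), condensed by the single latent \(Y_{\{1,2\}} = Z\).

```python
# Toy check of the perfect-condensation definition. Model (illustrative, not
# from the text): X1 = X2 = Z for a fair coin Z, with one latent Y_{1,2} = Z.
from math import log2

def H(joint, axes):
    """Shannon entropy (bits) of the marginal of `joint` over `axes`,
    where `joint` maps outcome tuples to probabilities."""
    marg = {}
    for outcome, p in joint.items():
        key = tuple(outcome[a] for a in axes)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

# Axes: 0 -> X1, 1 -> X2, 2 -> Y_{1,2}.
joint = {(z, z, z): 0.5 for z in (0, 1)}

# The only latent is Y_{1,2}, with nothing strictly above it, so for every
# non-empty A the conditioned score is chi(A) = H(Y_{1,2}).
chi = H(joint, [2])
print(chi, H(joint, [0]), H(joint, [0, 1]))  # chi(A) = H(X_A) = 1 bit for all A
```

Here \(\chi(A) = H(X_A) = 1\) bit for every non-empty \(A\), so this LVM is a perfect condensation; replacing \(Y_{\{1,2\}}\) with a noisier variable such as \((Z, \text{noise})\) would inflate \(\chi(A)\) above \(H(X_A)\).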
Strong and Weak Natural Latents. Recall from Exercise Sheet 1 that an \(\varepsilon\)-approximate (strong) natural latent \(\Lambda\) over \(X_1, \ldots, X_n\) satisfies:
- Mediation: \(\mathrm{Med}(\Lambda) = \sum_{i=1}^n H(X_i \mid \Lambda) - H(X_1, \ldots, X_n \mid \Lambda) \leq \varepsilon\)
- Strong redundancy: \(H(\Lambda \mid X_i) \leq \varepsilon\) for all \(i \in I\)
We say \(\Lambda\) is an \(\varepsilon\)-approximate weak natural latent over \(X_1, \ldots, X_n\) if it satisfies mediation and the following weaker redundancy condition:
- Weak redundancy: \(H(\Lambda \mid X_{I \setminus \{i\}}) \leq \varepsilon\) for all \(i \in I\),
where \(X_{I \setminus \{i\}} = (X_j : j \neq i)\) denotes all observables except \(X_i\). That is, \(\Lambda\) can be recovered from all-but-one of the observables, rather than from any single one.
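To see how the two redundancy conditions come apart, here is a small hedged example (the model is an illustrative assumption): \(X_1 = X_2 = Z\), \(X_3\) independent noise, and candidate latent \(\Lambda = Z\).

```python
# Lambda = Z is a perfect *weak* natural latent here but fails *strong*
# redundancy: Z cannot be recovered from X3 alone. (Toy model, illustrative.)
from math import log2

def H(joint, axes):
    # Shannon entropy (bits) of the marginal of `joint` over `axes`.
    marg = {}
    for outcome, p in joint.items():
        key = tuple(outcome[a] for a in axes)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def Hc(joint, a, b):
    """Conditional entropy H(A | B) = H(A, B) - H(B)."""
    return H(joint, a + b) - H(joint, b)

# Axes: 0 -> X1, 1 -> X2, 2 -> X3, 3 -> Lambda; X1 = X2 = Lambda = Z.
joint = {(z, z, n, z): 0.25 for z in (0, 1) for n in (0, 1)}

med = sum(Hc(joint, [i], [3]) for i in range(3)) - Hc(joint, [0, 1, 2], [3])
strong = max(Hc(joint, [3], [i]) for i in range(3))  # worst single observable
weak = max(Hc(joint, [3], [j for j in range(3) if j != i]) for i in range(3))
print(med, strong, weak)  # 0.0 1.0 0.0
```

Mediation holds exactly and every all-but-one set determines \(\Lambda\), but \(H(\Lambda \mid X_3) = 1\) bit, so \(\Lambda\) is weakly but not strongly natural.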
Exercise 1: Weak natural latents from condensation
🔵🔵🔵🔵⚫️ Importance
🔴🔴🔴🔴⚫️ Difficulty
Part 1(a). Suppose we have an \(\varepsilon\)-approximate condensation \((Y_A)_{A \in \mathcal{P}^+I}\) of \(X_1, \ldots, X_n\). Fix a non-empty subset \(A \subseteq I\). We construct a candidate weak natural latent over \((X_i)_{i \in A}\) as follows:
- Define \(\Lambda = (Y_B : B \cap A \neq \emptyset,\ |B \cap A| \geq 2)\), i.e. the collection of all latent variables that contribute to at least two observables in \(A\).
- For each \(i \in A\), define \(Y^{\prime}_{\{i\}} = (Y_B : B \cap A = \{i\})\), i.e. the collection of all latent variables whose only overlap with \(A\) is the single index \(i\).
Show that for each \(i \in A\), \(X_i\) is a deterministic function of \((\Lambda, Y^{\prime}_{\{i\}})\), and use this to conclude that
\[ H(X_i \mid \Lambda) \leq H(Y^{\prime}_{\{i\}} \mid \Lambda). \tag{1} \]
Hint for Part 1(a)
Use the latent variable model condition, and consider what \(Y_{\ni i}\) looks like in relation to \(\Lambda\) and \(Y_{\{i\}}^{\prime}\). Every latent variable \(Y_B\) with \(i \in B\) either overlaps \(A\) in at least two indices (and is part of \(\Lambda\)) or overlaps \(A\) only in \(\{i\}\) (and is part of \(Y'_{\{i\}}\)). So \(Y_{\ni i} = (\Lambda \cap Y_{\ni i},\, Y'_{\{i\}})\), and since \(X_i\) is recoverable from \(Y_{\ni i}\), it is a deterministic function of \((\Lambda, Y'_{\{i\}})\). The entropy bound follows because deterministic functions cannot increase entropy.
Part 1(b). Show that
\[ \sum_{i \in A} H(Y^{\prime}_{\{i\}} \mid \Lambda) - \varepsilon \leq H(X_A) - H(\Lambda) \leq H(X_A \mid \Lambda). \tag{2} \]
Hint for Part 1(b)
For the leftmost inequality, proceed in two steps:
- Show that \[\sum_{i \in A} H(Y^{\prime}_{\{i\}} \mid \Lambda) + H(\Lambda) \leq \chi(A)\] by expanding \(H(\Lambda)\) and each \(H(Y^{\prime}_{\{i\}} \mid \Lambda)\) via the chain rule. In each case, use the fact that conditioning can only reduce entropy.
- Apply the definition of \(\varepsilon\)-approximate condensation to bound \(\chi(A)\), then rearrange.
Further hint for Part 1(b): relating to \(\chi(A)\)
To show \(\sum_{i \in A} H(Y'_{\{i\}} \mid \Lambda) + H(\Lambda) \leq \chi(A)\), note:
- Each \(Y_B\) in \(\Lambda\) (i.e. \(B \cap A \neq \emptyset\), \(|B \cap A| \geq 2\)) contributes the term \(H(Y_B \mid Y_{\supset B})\) to \(\chi(A)\). Moreover, every \(C \supsetneq B\) satisfies \(|C \cap A| \geq 2\), so \(Y_{\supset B}\) consists entirely of other components of \(\Lambda\). Expanding \(H(\Lambda)\) by the chain rule in order of decreasing \(|B|\), each term conditions on a superset of \(Y_{\supset B}\), so \(H(\Lambda) \leq \sum_{B \text{ in } \Lambda} H(Y_B \mid Y_{\supset B})\).
- Each \(Y_B\) in \(Y'_{\{i\}}\) (i.e. \(B \cap A = \{i\}\)) likewise contributes \(H(Y_B \mid Y_{\supset B})\) to \(\chi(A)\), and every \(C \supsetneq B\) satisfies \(C \cap A \supseteq \{i\}\), so \(Y_C\) lies either in \(\Lambda\) or in \(Y'_{\{i\}}\). Expanding \(H(Y'_{\{i\}} \mid \Lambda)\) by the chain rule in order of decreasing \(|B|\), each term again conditions on a superset of \(Y_{\supset B}\), hence is at most \(H(Y_B \mid Y_{\supset B})\). Since these families of \(B\)'s are pairwise disjoint, summing both bounds gives \(\sum_{i \in A} H(Y'_{\{i\}} \mid \Lambda) + H(\Lambda) \leq \chi(A)\).
Part 1(c). Using (1) and (2), show that the mediation error of \(\Lambda\) is bounded by \(\varepsilon\):
\[ \mathrm{Med}(\Lambda) = \sum_{i \in A} H(X_i \mid \Lambda) - H(X_A \mid \Lambda) \leq \varepsilon. \]
Hint for Part 1(c)
From (1), \(\sum_{i \in A} H(X_i \mid \Lambda) \leq \sum_{i \in A} H(Y'_{\{i\}} \mid \Lambda)\). From (2), \(\sum_{i \in A} H(Y'_{\{i\}} \mid \Lambda) \leq H(X_A \mid \Lambda) + \varepsilon\). Combining gives the result.
Exercise 2: Weak redundancy
🔵🔵🔵⚫️⚫️ Importance
🔴🔴🔴⚫️⚫️ Difficulty
Part 2(a). For any non-empty subset \(B \subseteq I\), let \(Y_{\cap B} = (Y_C : C \cap B \neq \emptyset)\) denote the collection of all latent variables that contribute to at least one observable in \(B\). Assume the following bound (Proposition 5.2 in the condensation paper):
\[ H(Y_{\cap B}) \leq \chi(B). \]
Using this and the definition of \(\varepsilon\)-approximate condensation, show that
\[ H(Y_{\cap B} \mid X_B) \leq \varepsilon. \]
Hint for Part 2(a)
Use the fact that \(X_B\) is a deterministic function of \(Y_{\cap B}\) (by the latent variable model condition). Therefore \(H(Y_{\cap B} \mid X_B) = H(Y_{\cap B}) - H(X_B) \leq \chi(B) - H(X_B) \leq \varepsilon\) by the condensation bound.
Part 2(b). Using Part 2(a), show that \(\Lambda\) satisfies weak redundancy, i.e.
\[ H(\Lambda \mid X_{A \setminus \{i\}}) \leq \varepsilon, \quad \forall i \in A. \]
Hint for Part 2(b)
Set \(B = A \setminus \{i\}\) in Part 2(a), and consider the relationship between \(\Lambda\) and \(Y_{\cap(A \setminus \{i\})}\). Since every \(Y_C\) in \(\Lambda\) has \(|C \cap A| \geq 2\), removing any single index \(i\) from \(A\) still leaves \(C \cap (A \setminus \{i\}) \neq \emptyset\). In particular, \(\Lambda\) is a deterministic function of \(Y_{\cap (A \setminus \{i\})}\), so \(H(\Lambda \mid X_{A \setminus \{i\}}) \leq H(Y_{\cap(A \setminus \{i\})} \mid X_{A \setminus \{i\}}) \leq \varepsilon\).
Remark. Combining the mediation bound from Exercise 1(c) with the weak redundancy bound from Part 2(b), we have shown that the construction \(\Lambda = (Y_B : B \cap A \neq \emptyset,\ |B \cap A| \geq 2)\) is an \(\varepsilon\)-approximate weak natural latent over \((X_i)_{i \in A}\). That is, any \(\varepsilon\)-approximate condensation of \(X_1, \ldots, X_n\) gives rise to an \(\varepsilon\)-approximate weak natural latent over any subset of the observables.
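The whole pipeline of Exercises 1 and 2 can be spot-checked numerically. A hedged sketch follows; the toy model is my own, chosen so the condensation is exact: \(X_i = (Z, W_i)\) for independent fair bits \(Z, W_1, W_2, W_3\), with latents \(Y_{\{1,2,3\}} = Z\) and \(Y_{\{i\}} = W_i\), so that for \(A = \{1,2,3\}\) the construction gives \(\Lambda = Z\).

```python
# Verify that the Exercise 1-2 construction yields a weak natural latent on a
# toy perfect condensation (model assumed for illustration, not from the text).
from itertools import product
from math import log2

def H(joint, axes):
    # Shannon entropy (bits) of the marginal of `joint` over `axes`.
    marg = {}
    for outcome, p in joint.items():
        key = tuple(outcome[a] for a in axes)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def Hc(joint, a, b):  # conditional entropy H(A | B)
    return H(joint, a + b) - H(joint, b)

# Axes: 0 -> Z, 1..3 -> W1..W3; observable X_i occupies axes (0, i).
joint = {bits: 1 / 16 for bits in product((0, 1), repeat=4)}
X = {i: [0, i] for i in (1, 2, 3)}

# Condensation is perfect: chi(I) = H(Y_{123}) + sum_i H(Y_{i} | Y_{123}).
chi = H(joint, [0]) + sum(Hc(joint, [i], [0]) for i in (1, 2, 3))
hxa = H(joint, [0, 1, 2, 3])

# Lambda = Z (axis 0); check mediation and weak redundancy over (X_1, X_2, X_3).
med = sum(Hc(joint, X[i], [0]) for i in (1, 2, 3)) - Hc(joint, [0, 1, 2, 3], [0])
rest = {1: X[2] + X[3], 2: X[1] + X[3], 3: X[1] + X[2]}  # axes of X_{A \ {i}}
weak = max(Hc(joint, [0], rest[i]) for i in (1, 2, 3))
print(chi, hxa, med, weak)  # 4.0 4.0 0.0 0.0
```

With \(\varepsilon = 0\) here, both the mediation error and the weak-redundancy error of \(\Lambda = Z\) come out exactly zero, as the remark predicts.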
Exercise 3: Strong natural latents
🔵🔵🔵🔵⚫️ Importance
🔴🔴🔴🔴⚫️ Difficulty
Setup. Given an \(\varepsilon\)-approximate condensation \((Y_B)_{B \in \mathcal{P}^+I}\) satisfying \(\chi(B) \leq H(X_B) + \varepsilon\) for all \(B\), fix a non-empty subset \(A \subseteq I\) and define:
- \(\Lambda = Y_{\supseteq A} = (Y_B : B \supseteq A)\), the collection of all latent variables whose index set contains \(A\).
- \(\Lambda^{\prime} = (Y_B : B \cap A \neq \emptyset,\ |B| \geq 2,\ B \not\supseteq A)\), the collection of non-singleton latent variables that partially overlap with \(A\) but do not contain it.
- \(\Lambda^* = (\Lambda, \Lambda^{\prime})\), the combined latent variable.
Assume additionally that \(H(\Lambda^{\prime} \mid \Lambda) \leq \varepsilon\), meaning there is approximately no lower-level structure among the variables in \(A\) beyond what is captured by \(\Lambda\).
The goal of this exercise is to show that \(\Lambda\) is an \((n+1)\varepsilon\)-approximate strong natural latent over \((X_i)_{i \in A}\).
Strong Redundancy
Part 3(a). Show that \(\Lambda\) satisfies strong redundancy:
\[ H(\Lambda \mid X_i) \leq \varepsilon, \quad \forall i \in A. \]
Hint for Part 3(a)
Recall Part 2(b). Since \(\Lambda = Y_{\supseteq A}\) and every \(Y_B\) in \(\Lambda\) has \(B \supseteq A \ni i\), each such \(Y_B\) contributes to \(X_i\). Apply the condensation bound as in Exercise 2 to conclude \(H(\Lambda \mid X_i) \leq \varepsilon\).
Further hint for Part 3(a)
\(\Lambda = Y_{\supseteq A}\) is a deterministic function of \(Y_{\ni i}\) for any \(i \in A\) (since \(B \supseteq A\) implies \(i \in B\)). So \(\Lambda\) is a deterministic function of \(Y_{\cap \{i\}}\), and by Part 2(a), \(H(Y_{\cap \{i\}} \mid X_i) \leq \varepsilon\), hence \(H(\Lambda \mid X_i) \leq \varepsilon\).
Mediation
Part 3(b). Show that for any random variable \(Z\),
\[ H(Z \mid \Lambda) \leq H(Z \mid \Lambda^*) + H(\Lambda^* \mid \Lambda). \tag{1} \]
Hint for Part 3(b)
Write \[H(Z, \Lambda) \leq H(Z, \Lambda, \Lambda^*),\] expand using the chain rule: \(H(Z, \Lambda, \Lambda^*) = H(\Lambda) + H(\Lambda^* \mid \Lambda) + H(Z \mid \Lambda, \Lambda^*)\). Subtracting \(H(\Lambda)\) from both sides gives \(H(Z \mid \Lambda) \leq H(\Lambda^* \mid \Lambda) + H(Z \mid \Lambda^*)\).
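The inequality in Part 3(b) holds for arbitrary joint distributions, which makes it easy to spot-check numerically. A hedged sketch on a random distribution (the binary setup is an illustrative assumption):

```python
# Spot-check H(Z | Lambda) <= H(Z | Lambda*) + H(Lambda* | Lambda) on a random
# joint distribution over (Z, Lambda, Lambda'), with Lambda* = (Lambda, Lambda').
import random
from itertools import product
from math import log2

def H(joint, axes):
    # Shannon entropy (bits) of the marginal of `joint` over `axes`.
    marg = {}
    for outcome, p in joint.items():
        key = tuple(outcome[a] for a in axes)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def Hc(joint, a, b):  # conditional entropy H(A | B)
    return H(joint, a + b) - H(joint, b)

rng = random.Random(0)
outcomes = list(product((0, 1), repeat=3))  # axes: 0 -> Z, 1 -> Lambda, 2 -> Lambda'
weights = [rng.random() for _ in outcomes]
total = sum(weights)
joint = {o: w / total for o, w in zip(outcomes, weights)}

lhs = Hc(joint, [0], [1])                              # H(Z | Lambda)
rhs = Hc(joint, [0], [1, 2]) + Hc(joint, [1, 2], [1])  # H(Z | Lambda*) + H(Lambda* | Lambda)
print(lhs <= rhs + 1e-12)  # True
```

Any seed works, since the inequality is just the chain rule plus monotonicity of conditioning.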
Part 3(c). Argue that
\[ H(X_A \mid \Lambda^*) \leq H(X_A \mid \Lambda). \tag{2} \]
Hint for Part 3(c)
Since \(\Lambda\) is a component of \(\Lambda^* = (\Lambda, \Lambda')\), conditioning on \(\Lambda^*\) gives at least as much information as conditioning on \(\Lambda\) alone. Therefore \(H(X_A \mid \Lambda^*) \leq H(X_A \mid \Lambda)\).
Further hint for Part 3(c)
This is the general fact that conditioning on more information can only reduce entropy: \(H(X_A \mid \Lambda, \Lambda') \leq H(X_A \mid \Lambda)\) because \(\Lambda^* = (\Lambda, \Lambda')\) contains \(\Lambda\).
Part 3(d). Using (1) and (2) together with the assumption \(H(\Lambda^{\prime} \mid \Lambda) \leq \varepsilon\) and the fact that \(\Lambda^*\) is an \(\varepsilon\)-approximate weak natural latent (from Exercise 2), show that \(\Lambda\) satisfies mediation:
\[ \mathrm{Med}(\Lambda) = \sum_{i \in A} H(X_i \mid \Lambda) - H(X_A \mid \Lambda) \leq (n+1)\varepsilon. \]
Hint for Part 3(d)
Apply (1) to each \(Z = X_i\): \(H(X_i \mid \Lambda) \leq H(X_i \mid \Lambda^*) + H(\Lambda^* \mid \Lambda)\). Note \(H(\Lambda^* \mid \Lambda) = H(\Lambda' \mid \Lambda) \leq \varepsilon\). Summing over \(i \in A\): \(\sum_i H(X_i \mid \Lambda) \leq \sum_i H(X_i \mid \Lambda^*) + n\varepsilon\). Since \(\Lambda^*\) is an \(\varepsilon\)-approximate weak natural latent, \(\sum_i H(X_i \mid \Lambda^*) - H(X_A \mid \Lambda^*) \leq \varepsilon\). Using (2) to replace \(-H(X_A \mid \Lambda^*)\) with \(-H(X_A \mid \Lambda)\) gives the bound \((n+1)\varepsilon\).
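On the same style of toy model as before (again an assumption for illustration), the Exercise 3 construction can be checked directly: with \(X_i = (Z, W_i)\), latents \(Y_{\{1,2,3\}} = Z\) and \(Y_{\{i\}} = W_i\), and \(A = \{1,2\}\), we get \(\Lambda = Y_{\supseteq A} = Z\) and \(\Lambda'\) empty, so \(H(\Lambda' \mid \Lambda) = 0\).

```python
# Check strong redundancy and mediation of Lambda = Y_{superset of A} = Z for
# A = {1, 2} in the toy model X_i = (Z, W_i) (model assumed for illustration).
from itertools import product
from math import log2

def H(joint, axes):
    # Shannon entropy (bits) of the marginal of `joint` over `axes`.
    marg = {}
    for outcome, p in joint.items():
        key = tuple(outcome[a] for a in axes)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def Hc(joint, a, b):  # conditional entropy H(A | B)
    return H(joint, a + b) - H(joint, b)

# Axes: 0 -> Z, 1..3 -> W1..W3; X_i occupies axes (0, i); Lambda = Z = axis 0.
joint = {bits: 1 / 16 for bits in product((0, 1), repeat=4)}

strong = max(Hc(joint, [0], [0, i]) for i in (1, 2))   # H(Lambda | X_i), i in A
med = (Hc(joint, [0, 1], [0]) + Hc(joint, [0, 2], [0])
       - Hc(joint, [0, 1, 0, 2], [0]))                 # mediation over (X_1, X_2)
print(strong, med)  # 0.0 0.0
```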
Exercise 4: Natural latents yield condensations
🔵🔵🔵🔵🔵 Importance
🔴🔴🔴⚫️⚫️ Difficulty
Setup. Given an \(\varepsilon\)-approximate natural latent \(\Lambda\) over \((X_i)_{i \in A}\), we construct a latent variable model \((Y_B)_{B \in \mathcal{P}^+A}\) by setting:
- \(Y_A = \Lambda\)
- \(Y_{\{i\}} = X_i\) for each \(i \in A\)
- All other latent variables empty (constant)
The goal of this exercise is to show that this construction yields a \(2\varepsilon\)-approximate condensation over \((X_i)_{i \in A}\).
Part 4(a). Show that the conditioned score satisfies
\[ \chi(A) \leq H(X_A, \Lambda) + \varepsilon. \]
Hint for Part 4(a)
Use the mediation property of \(\Lambda\). Expand the conditioned score with the specific LVM structure: \(\chi(A) = H(\Lambda) + \sum_i H(X_i \mid \Lambda)\). By mediation, \(\sum_i H(X_i \mid \Lambda) \leq H(X_A \mid \Lambda) + \varepsilon\). So \(\chi(A) \leq H(\Lambda) + H(X_A \mid \Lambda) + \varepsilon = H(X_A, \Lambda) + \varepsilon\).
Part 4(b). Using Part 4(a), show that the construction is a \(2\varepsilon\)-approximate condensation, i.e.
\[ \chi(A) \leq H(X_A) + 2\varepsilon. \]
Hint for Part 4(b)
Use the strong redundancy property of \(\Lambda\). Since \(H(\Lambda \mid X_i) \leq \varepsilon\) for all \(i\), we have \(H(\Lambda \mid X_A) \leq \varepsilon\). Therefore \(H(X_A, \Lambda) = H(X_A) + H(\Lambda \mid X_A) \leq H(X_A) + \varepsilon\). Combining with Part 4(a): \(\chi(A) \leq H(X_A) + 2\varepsilon\).
Remark. Both the mediation and strong redundancy properties of \(\Lambda\) transfer to any subset \(B \subset A\). Therefore, the same argument applies with \(A\) replaced by \(B\) throughout, and we obtain the same \(2\varepsilon\) bound on the conditioned score.
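Conversely, the Exercise 4 construction can be checked on the same toy family (assumed for illustration): take \(\Lambda = Z\), a perfect natural latent over \(X_i = (Z, W_i)\), set \(Y_A = \Lambda\) and \(Y_{\{i\}} = X_i\), and compute \(\chi(A) = H(\Lambda) + \sum_i H(X_i \mid \Lambda)\).

```python
# Exercise 4 construction on a toy exact natural latent (illustrative model):
# Lambda = Z, X_i = (Z, W_i); LVM has Y_A = Lambda and Y_{i} = X_i.
from itertools import product
from math import log2

def H(joint, axes):
    # Shannon entropy (bits) of the marginal of `joint` over `axes`.
    marg = {}
    for outcome, p in joint.items():
        key = tuple(outcome[a] for a in axes)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def Hc(joint, a, b):  # conditional entropy H(A | B)
    return H(joint, a + b) - H(joint, b)

# Axes: 0 -> Z = Lambda, 1..3 -> W1..W3; X_i occupies axes (0, i).
joint = {bits: 1 / 16 for bits in product((0, 1), repeat=4)}

# chi(A) = H(Y_A) + sum_i H(Y_{i} | Y_A), since Y_A sits above every Y_{i}.
chi = H(joint, [0]) + sum(Hc(joint, [0, i], [0]) for i in (1, 2, 3))
hxa = H(joint, [0, 1, 2, 3])
print(chi, hxa)  # 4.0 4.0
```

Here \(\varepsilon = 0\), so \(\chi(A) = H(X_A)\) exactly, within the \(2\varepsilon\) bound of Part 4(b).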
Exercise 5: Hierarchical natural latent models (More difficult)
🔵🔵🔵⚫️⚫️ Importance
🔴🔴🔴🔴🔴 Difficulty
Setup. A hierarchical natural latent model formalises the idea that knowledge can be organised into a nested hierarchy of concepts: at the base level we have observables, at the next level we have natural latents that summarise shared information within small groups of observables, and at higher levels we have natural latents that summarise shared information across those groups, and so on. For example, individual images of cats and dogs are the observables; the features shared within each species (βcat-featuresβ, βdog-featuresβ) form first-order latents; and the features shared across all animals form a second-order latent.
Formally, we define a sequence of partitions \(Q^0, Q^1, \ldots, Q^n\) of \(I\), where:
- \(Q^0 = \{\{i\} : i \in I\}\) is the finest partition (singletons).
- \(Q^n = \{I\}\) is the coarsest partition (all of \(I\)).
- \(Q^{k-1}\) is a refinement of \(Q^k\) for each \(k\): every block of \(Q^{k-1}\) is contained in a block of \(Q^k\).
We then define
\[ R = \{\Lambda^k_A : A \in Q^k,\; k = 0, \ldots, n\}, \]
where \(\Lambda^0_{\{i\}} = X_i\) are the observables, \(\Lambda^1_A\) is an \(\varepsilon\)-approximate natural latent over \((X_i)_{i \in A}\) for each \(A \in Q^1\), and \(\Lambda^k_A\) is an \(\varepsilon\)-approximate natural latent over \((\Lambda^{k-1}_B)_{B \subset A, B \in Q^{k-1}}\) for each \(A \in Q^k\), \(k \geq 1\).
For \(B \notin Q^k\) but \(B \in Q^r\) for some \(r > k\), we define \(\Lambda^k_B = (\Lambda^k_C : C \subset B, C \in Q^k)\).
We further require the conditional independence condition: for each \(k\) and each \(B \in Q^k\), \(\Lambda^{k-1}_B\) is conditionally independent of \(\Lambda^{k-1}_{\bar{B}}\) given \(\Lambda^k_B\) (where \(\bar{B}\) denotes the complement of \(B\) within the next block up).
To construct an LVM from \(R\), set \(Y_A = \Lambda^k_A\) for all \(\Lambda^k_A \in R\), and leave all other latent variables empty. The goal of this exercise is to show that this LVM is an approximate condensation.
Part 5(a). Prove the following lemma: for each level \(k\),
\[ \sum_{B \in Q^k} \sum_{\substack{C \in Q^{k-1} \\ C \subset B}} H(\Lambda^{k-1}_C \mid \Lambda^k_B) \leq H(\Lambda^{k-1}_I \mid \Lambda^k_I) + K\varepsilon, \]
where \(K\) is a constant depending on the partition structure.
Hint for Part 5(a)
Use the conditional independence condition to show that \[H(\Lambda^{k-1}_B \mid \Lambda^k_B) = H(\Lambda^{k-1}_B \mid \Lambda^k_I).\] Use the mediation property of \(\Lambda^k_B\) over \((\Lambda^{k-1}_C)_{C \subset B}\) to bound \[\sum_{C \subset B} H(\Lambda^{k-1}_C \mid \Lambda^k_B);\] remember that \(Q^k\) forms a partition.
Remark. For any subset \(A \subseteq I\), an analogous version of the lemma holds with \(\Lambda^{k-1}_I\) and \(\Lambda^k_I\) replaced by their restrictions \(\Lambda^{k-1}_{\cap A}\) and \(\Lambda^k_{\cap A}\), keeping only the variables and blocks that contribute to \(A\):
\[ \sum_{\substack{B \in Q^k \\ B \cap A \neq \emptyset}} \sum_{\substack{C \in Q^{k-1} \\ C \subset B,\ C \cap A \neq \emptyset}} H(\Lambda^{k-1}_C \mid \Lambda^k_B) \leq H(\Lambda^{k-1}_{\cap A} \mid \Lambda^k_{\cap A}) + K\varepsilon. \]
This holds because mediation transfers to subsets.
Part 5(b). Using Part 5(a) and its subset version, show that for any \(A \subseteq I\), the conditioned score satisfies
\[ \chi(A) \leq H(\Lambda^0_{\cap A}, \ldots, \Lambda^n_{\cap A}) + K_2 \varepsilon. \]
Hint for Part 5(b)
Note that \(\Lambda^k_{\cap A}\) is an approximate deterministic function of \(\Lambda^{k-1}_{\cap A}\), which means \[H(Z|\Lambda^k_{\cap A}, \Lambda^{k-1}_{\cap A}) \approx H(Z|\Lambda^{k-1}_{\cap A}).\] Telescope the conditioned scores across levels using Part 5(a) and accumulate the \(\varepsilon\) errors.
Part 5(c). Using Part 5(b), show that
\[ \chi(A) \leq H(X_A) + K_3 \varepsilon, \]
concluding that the LVM constructed from \(R\) is a \(K_3\varepsilon\)-approximate condensation.
Hint for Part 5(c)
From Part 5(b), \(\chi(A) \leq H(\Lambda^k_{\cap A},\, k = 0 \ldots n) + K_2 \varepsilon\). Since \(\Lambda^0_{\cap A} = X_A\) and each higher-level \(\Lambda^k_{\cap A}\) is an approximate deterministic function of \(X_A\) (by strong redundancy, composed across levels), we have \(H(\Lambda^k_{\cap A},\, k = 0 \ldots n) \leq H(X_A) + K' \varepsilon\) for some constant \(K'\). Combining gives the result.
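Finally, the hierarchical construction can be sanity-checked on a hedged two-level toy model (the model and all names are illustrative assumptions): a global bit \(G\), group bits \(Z_{12}, Z_{34}\), and individual bits \(W_i\), with \(X_i = (G, Z_{g(i)}, W_i)\), first-order latents \(\Lambda^1_{\{1,2\}} = (G, Z_{12})\) and \(\Lambda^1_{\{3,4\}} = (G, Z_{34})\), and second-order latent \(\Lambda^2_I = G\).

```python
# Two-level hierarchical LVM sanity check (toy model, illustrative): the
# conditioned score chi(I) equals H(X_I), i.e. a perfect condensation.
from itertools import product
from math import log2

def H(joint, axes):
    # Shannon entropy (bits) of the marginal of `joint` over `axes`.
    marg = {}
    for outcome, p in joint.items():
        key = tuple(outcome[a] for a in axes)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def Hc(joint, a, b):  # conditional entropy H(A | B)
    return H(joint, a + b) - H(joint, b)

# Axes: 0 -> G, 1 -> Z12, 2 -> Z34, 3..6 -> W1..W4, all independent fair bits.
joint = {bits: 1 / 128 for bits in product((0, 1), repeat=7)}

lam2 = [0]                                # Lambda^2_I = G
lam1 = {(1, 2): [0, 1], (3, 4): [0, 2]}   # Lambda^1 blocks = (G, Z_group)
Xax = {1: [0, 1, 3], 2: [0, 1, 4], 3: [0, 2, 5], 4: [0, 2, 6]}
above = {1: [0, 1], 2: [0, 1], 3: [0, 2], 4: [0, 2]}  # latents strictly above {i}

chi = (H(joint, lam2)
       + sum(Hc(joint, ax, lam2) for ax in lam1.values())
       + sum(Hc(joint, Xax[i], above[i]) for i in (1, 2, 3, 4)))
print(chi, H(joint, list(range(7))))  # 7.0 7.0
```

Each level contributes exactly the entropy it adds over the level above (1 bit for \(G\), 1 bit per group, 1 bit per individual), so \(\chi(I) = H(X_I) = 7\) bits and the hierarchy is a perfect condensation.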