C.5 Abstractions: Natural Latents

Natural latents readings and exercises for the Abstractions session.

Natural Latents

Reading: Abstractions as redundant information

Reading: The minimal latents approach

Reading: Natural latents — the concepts

Deeper dive: Bayes net algebra

Exercises: Natural latents

The following are exercises on natural latents. Each problem is broken into a sequence of lemmas leading to a main theorem. For each subquestion, try to prove the stated lemma before reading on. If you get stuck, you may treat the lemma as given and proceed to the next part.

Natural Latent Setup

Let \(X_1, \ldots, X_n\) be observable random variables and \(\Lambda\) a latent variable. We write \(X = (X_1, \ldots, X_n)\). All entropy and mutual information quantities are understood with respect to the joint distribution \(P[X, \Lambda]\).

Key definitions.

  • Mediation. \(\Lambda\) is an exact mediator over \(X_1, \ldots, X_n\) if

\[ P[X_1, \ldots, X_n \mid \Lambda] = \prod_{i} P[X_i \mid \Lambda], \]

i.e. \(X_1, \ldots, X_n\) are conditionally independent given \(\Lambda\). The mediation error of a latent variable \(\Lambda\) is

\[ \varepsilon_{\mathrm{med}}(\Lambda) = D_{\mathrm{KL}}\!\bigl(P[X_1,\ldots,X_n,\Lambda] \;\|\; P[\Lambda]\textstyle\prod_i P[X_i\mid\Lambda]\bigr). \]

  • Redundancy. \(\Lambda\) is an exact redund over \(X_1, \ldots, X_n\) if \(\Lambda\) is fully determined by each \(X_i\) individually, i.e. \(H(\Lambda \mid X_i) = 0\) for all \(i\). The redundancy error of a latent variable \(\Lambda\) is

\[ \varepsilon_{\mathrm{red}}(\Lambda) = \sum_i H(\Lambda \mid X_i). \]

  • Naturality. \(\Lambda\) is a natural latent over \(X_1, \ldots, X_n\) if it satisfies both mediation and redundancy (exactly, or approximately up to some \(\varepsilon > 0\)).

Notation. We write \(TC(X \mid \Lambda)\) for the conditional total correlation:

\[ TC(X \mid \Lambda) = \sum_i H(X_i \mid \Lambda) - H(X \mid \Lambda). \]
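
To make these definitions concrete, here is a minimal numerical sketch (illustrative only, not part of the readings). It assumes a toy model in which \(X_1\) and \(X_2\) are noisy copies of a shared fair bit, considers two deterministic latents \(\Lambda = f(X)\), and evaluates \(\varepsilon_{\mathrm{med}}\) directly from the KL definition alongside \(\varepsilon_{\mathrm{red}}\):

```python
# Illustrative sketch (not from the readings): two binary observables X1, X2
# that are noisy copies of a shared fair bit, and latents Lambda = f(X).
import numpy as np

def H(p):
    """Shannon entropy (bits) of a probability array."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Toy joint P[x1, x2]: each X_i equals the hidden bit with probability 0.9.
noise = 0.1
joint = np.zeros((2, 2))
for z in (0, 1):
    for x1 in (0, 1):
        for x2 in (0, 1):
            joint[x1, x2] += 0.5 * ((1 - noise) if x1 == z else noise) \
                                 * ((1 - noise) if x2 == z else noise)

def errors(f, n_vals=2):
    """eps_med (from the KL definition) and eps_red for Lambda = f(X1, X2)."""
    P = np.zeros((2, 2, n_vals))                  # P[x1, x2, lam]
    for x1 in (0, 1):
        for x2 in (0, 1):
            P[x1, x2, f(x1, x2)] = joint[x1, x2]
    p_lam = P.sum(axis=(0, 1))                    # P[lam]
    p1, p2 = P.sum(axis=1), P.sum(axis=0)         # P[x1, lam], P[x2, lam]
    eps_med = 0.0                                 # D_KL(P || P[lam] P[x1|lam] P[x2|lam])
    for x1 in (0, 1):
        for x2 in (0, 1):
            for lam in range(n_vals):
                if P[x1, x2, lam] > 0:
                    q = p1[x1, lam] * p2[x2, lam] / p_lam[lam]
                    eps_med += P[x1, x2, lam] * np.log2(P[x1, x2, lam] / q)
    # eps_red = sum_i H(Lambda | X_i), via H(Lambda | X_i) = H(X_i, Lambda) - H(X_i)
    eps_red = (H(p1) - H(P.sum(axis=(1, 2)))) + (H(p2) - H(P.sum(axis=(0, 2))))
    return eps_med, eps_red

print(errors(lambda x1, x2: x1))   # Lambda = X1: exact mediator, imperfect redund
print(errors(lambda x1, x2: 0))    # constant Lambda: exact redund, imperfect mediator
```

For \(\Lambda = X_1\) the mediation error vanishes (given \(X_1\), the joint trivially factorizes) while the redundancy error is \(H(X_1 \mid X_2) > 0\); for a constant latent the situation reverses, with \(\varepsilon_{\mathrm{red}} = 0\) and \(\varepsilon_{\mathrm{med}} = I(X_1; X_2)\).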

Exercise 1: Mediation error as conditional total correlation

🔵🔵🔵🔵🔵 Importance
🔴🔴🔴⚫️⚫️ Difficulty

Throughout this exercise, assume \(\Lambda\) is a deterministic function of \(X\), i.e. \(H(\Lambda \mid X) = 0\).

Part 1(a). Show that the mediation error equals the conditional total correlation:

\[ \varepsilon_{\mathrm{med}}(\Lambda) = TC(X \mid \Lambda) = \sum_i H(X_i \mid \Lambda) - H(X \mid \Lambda). \]

You may find the following identity useful:

  • Chain rule for KL divergence. For random variables \(A, B\):

\[ D_{\mathrm{KL}}(P[A,B] \,\|\, Q[A,B]) = D_{\mathrm{KL}}(P[A] \,\|\, Q[A]) + \mathbb{E}_A\bigl[D_{\mathrm{KL}}(P[B\mid A] \,\|\, Q[B\mid A])\bigr]. \]
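
If you want to convince yourself of the chain rule numerically, here is a quick self-contained check on random joints (a sketch; the \(2 \times 3\) alphabet is an arbitrary choice):

```python
# Numerical check of the KL chain rule on random joint distributions over a
# 2x3 alphabet (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((2, 3)); P /= P.sum()   # P[a, b]
Q = rng.random((2, 3)); Q /= Q.sum()   # Q[a, b]

def kl(p, q):
    """KL divergence (bits) between two strictly positive distributions."""
    p, q = np.asarray(p).ravel(), np.asarray(q).ravel()
    return float((p * np.log2(p / q)).sum())

lhs = kl(P, Q)                       # D_KL(P[A,B] || Q[A,B])
rhs = kl(P.sum(1), Q.sum(1)) + sum(  # D_KL(P[A] || Q[A]) + E_A[D_KL(P[B|A] || Q[B|A])]
    P.sum(1)[a] * kl(P[a] / P[a].sum(), Q[a] / Q[a].sum()) for a in range(2))
print(lhs, rhs)                      # agree up to floating-point error
```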

Hint for Part 1(a)

Apply the chain rule for KL divergence with \(A = \Lambda\) and \(B = X\). Note that both the true joint and the product distribution share the same marginal \(P[\Lambda]\), so the first KL term vanishes. The remaining expectation term is exactly the conditional total correlation \(TC(X \mid \Lambda)\), which expands to \(\sum_i H(X_i \mid \Lambda) - H(X \mid \Lambda)\).

Part 1(b). Show that if \(\Lambda\) is a deterministic function of \(X\), then

\[ H(X \mid \Lambda) = H(X) - H(\Lambda). \]

You may find the following identity useful:

  • Mutual information. \(I(X; Y) = H(X) - H(X \mid Y)\)

Hint for Part 1(b)

What is the relationship between \(H(\Lambda)\) and \(I(\Lambda; X)\) if \(\Lambda\) is a deterministic function of \(X\)?

Further hint for Part 1(b)

Since \(\Lambda\) is a deterministic function of \(X\), we have \(H(\Lambda \mid X) = 0\), i.e. knowing \(X\) determines \(\Lambda\) exactly. Therefore \(I(X; \Lambda) = H(\Lambda) - H(\Lambda \mid X) = H(\Lambda)\). But also \(I(X; \Lambda) = H(X) - H(X \mid \Lambda)\). Equating gives \(H(X \mid \Lambda) = H(X) - H(\Lambda)\).
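
A quick numerical illustration (a sketch reusing the toy model above, with the arbitrary choice \(\Lambda = X_1 \oplus X_2\)):

```python
# Check of Part 1(b) for a deterministic latent: Lambda = X1 XOR X2 under the
# noisy-copy toy joint used earlier (illustrative choice, not from the readings).
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

noise = 0.1
P = np.zeros((2, 2, 2))              # P[x1, x2, lam], with lam = x1 ^ x2
for z in (0, 1):
    for x1 in (0, 1):
        for x2 in (0, 1):
            P[x1, x2, x1 ^ x2] += 0.5 * ((1 - noise) if x1 == z else noise) \
                                      * ((1 - noise) if x2 == z else noise)

H_XL  = H(P)                         # H(X, Lambda); equals H(X) since Lambda = f(X)
H_X   = H(P.sum(axis=2))             # H(X)
H_lam = H(P.sum(axis=(0, 1)))        # H(Lambda)
print(H_XL - H_lam)                  # H(X | Lambda) = H(X, Lambda) - H(Lambda)
print(H_X - H_lam)                   # H(X) - H(Lambda): matches, as Part 1(b) claims
```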

Part 1(c).

Let \(\mathcal{A}_k = \{\Lambda' : \exists f \text{ s.t. } \Lambda' = f(X),\ H(\Lambda') = k\}\) be the set of latent variables that are deterministic functions of \(X\) (equivalently, \(H(\Lambda' \mid X) = 0\)) with entropy exactly \(k\).

Show that \(\Lambda \in \mathcal{A}_k\) minimizes the mediation error \(\varepsilon_{\mathrm{med}}(\Lambda)\) (among latent variables in \(\mathcal{A}_k\)) if and only if it maximizes \(\sum_i I(X_i; \Lambda)\).

Hint for Part 1(c)

From Parts 1(a) and 1(b), the mediation error is \(\varepsilon_{\mathrm{med}}(\Lambda) = \sum_i H(X_i \mid \Lambda) - H(X) + H(\Lambda)\). Rewrite \(H(X_i \mid \Lambda) = H(X_i) - I(X_i; \Lambda)\). For fixed \(k\), the terms \(\sum_i H(X_i) - H(X) + k\) are constant, so minimizing \(\varepsilon_{\mathrm{med}}\) is equivalent to maximizing \(\sum_i I(X_i; \Lambda)\).

Stepping-stone hint for Part 1(c)

Three steps:

  1. Apply Part 1(b) to rewrite \(H(X \mid \Lambda) = H(X) - H(\Lambda)\) in the expression from Part 1(a).
  2. Rewrite each \(H(X_i \mid \Lambda)\) as \(H(X_i) - I(X_i; \Lambda)\).
  3. Note that \(\sum_i H(X_i)\), \(H(X)\), and \(H(\Lambda) = k\) are all constant when optimizing over \(\mathcal{A}_k\).

Exercise 2: Redundancy and the Pareto frontier of naturality

🔵🔵🔵🔵🔵 Importance
🔴🔴🔴⚫️⚫️ Difficulty

Part 2(a). Let \(\mathcal{A}_k = \{\Lambda' : H(\Lambda' \mid X) = 0,\ H(\Lambda') = k\}\) as before.

Show that \(\Lambda \in \mathcal{A}_k\) minimizes the redundancy error

\[ \varepsilon_{\mathrm{red}}(\Lambda) = \sum_i H(\Lambda \mid X_i) \]

if and only if it maximizes \(\sum_i I(X_i; \Lambda)\).

Hint for Part 2(a)

Write \(H(\Lambda \mid X_i) = H(\Lambda) - I(\Lambda; X_i)\). Then \(\varepsilon_{\mathrm{red}}(\Lambda) = nH(\Lambda) - \sum_i I(X_i; \Lambda) = nk - \sum_i I(X_i; \Lambda)\), and \(nk\) is fixed on \(\mathcal{A}_k\). So minimizing \(\varepsilon_{\mathrm{red}}\) is equivalent to maximizing \(\sum_i I(X_i; \Lambda)\).
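
Both this equivalence and that of Part 1(c) can be sanity-checked numerically at once: for any deterministic latent, \(\varepsilon_{\mathrm{med}}(\Lambda) + \sum_i I(X_i; \Lambda) - H(\Lambda)\) should equal the constant \(\sum_i H(X_i) - H(X)\), and \(\varepsilon_{\mathrm{red}}(\Lambda) + \sum_i I(X_i; \Lambda) - nH(\Lambda)\) should equal zero. A sketch under the same toy model as before:

```python
# Check of the affine identities behind Parts 1(c) and 2(a) for several
# deterministic latents Lambda = f(X) under the noisy-copy toy joint
# (illustrative, not from the readings; here n = 2).
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

noise = 0.1
joint = np.zeros((2, 2))
for z in (0, 1):
    for x1 in (0, 1):
        for x2 in (0, 1):
            joint[x1, x2] += 0.5 * ((1 - noise) if x1 == z else noise) \
                                 * ((1 - noise) if x2 == z else noise)

for f in (lambda a, b: a, lambda a, b: a ^ b, lambda a, b: a & b, lambda a, b: 0):
    P = np.zeros((2, 2, 2))
    for x1 in (0, 1):
        for x2 in (0, 1):
            P[x1, x2, f(x1, x2)] = joint[x1, x2]
    H_lam = H(P.sum(axis=(0, 1)))
    H_X1L, H_X2L = H(P.sum(axis=1)), H(P.sum(axis=0))
    eps_med = H_X1L + H_X2L - H(P) - H_lam           # TC(X | Lambda), by Part 1(a)
    eps_red = (H_X1L - H(P.sum(axis=(1, 2)))) + (H_X2L - H(P.sum(axis=(0, 2))))
    sum_I = (H(P.sum(axis=(1, 2))) + H_lam - H_X1L) \
          + (H(P.sum(axis=(0, 2))) + H_lam - H_X2L)  # sum_i I(X_i; Lambda)
    print(f"{eps_med + sum_I - H_lam:.4f}  {eps_red + sum_I - 2 * H_lam:.4f}")
# First column: the constant sum_i H(X_i) - H(X); second column: always zero.
```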

Part 2(b). Define

\[ \varepsilon_{\mathrm{med}}^*(k) = \min_{\Lambda' \in \mathcal{A}_k} \varepsilon_{\mathrm{med}}(\Lambda'), \qquad \varepsilon_{\mathrm{red}}^*(k) = \min_{\Lambda' \in \mathcal{A}_k} \varepsilon_{\mathrm{red}}(\Lambda'). \]

Assume that \(\varepsilon_{\mathrm{red}}^*(k)\) weakly increases with \(k\) and \(\varepsilon_{\mathrm{med}}^*(k)\) weakly decreases with \(k\).

Define the (weak) Pareto frontier of naturality of \(X\) as the set of latent variables that are (weakly) Pareto-optimal with respect to the mediation and redundancy errors, i.e. such that no other latent variable achieves strictly lower values of both errors.

Characterize the Pareto frontier of naturality. What are the parameters that matter?

Hint for Part 2(b)

Make use of Parts 1(c) and 2(a), and the provided assumption. Since both objectives (minimizing mediation error and minimizing redundancy error) reduce to maximizing \(\sum_i I(X_i; \Lambda)\) at a fixed entropy level \(k\), the Pareto frontier is parameterized by \(k\). Each point on the frontier corresponds to the optimizer at a given \(k\).

Further hint for Part 2(b)

Two distinct questions: (1) How do you get onto the frontier at entropy level \(k\)? Maximize \(\sum_i I(X_i; \Lambda)\) — this simultaneously minimizes both errors by Parts 1(c) and 2(a). (2) Where along the frontier do you sit? This is determined by \(k\): increasing \(k\) decreases mediation error but increases redundancy error (by the monotonicity assumptions).
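
For a small example, the frontier can be computed by brute force (an exploratory sketch under the same toy joint as the earlier snippets): enumerate every deterministic latent \(\Lambda = f(X_1, X_2)\) (every labeling of the four outcomes of \(X\)), compute \((H(\Lambda), \varepsilon_{\mathrm{med}}, \varepsilon_{\mathrm{red}})\), and keep the weakly Pareto-optimal points:

```python
# Brute-force sweep of the weak Pareto frontier for the noisy-copy toy joint
# (illustrative sketch, not from the readings).
import itertools
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

noise = 0.1
joint = np.zeros((2, 2))
for z in (0, 1):
    for x1 in (0, 1):
        for x2 in (0, 1):
            joint[x1, x2] += 0.5 * ((1 - noise) if x1 == z else noise) \
                                 * ((1 - noise) if x2 == z else noise)

points = []
outcomes = list(itertools.product((0, 1), (0, 1)))     # the four values of X
for labels in itertools.product(range(4), repeat=4):   # every f: outcomes -> labels
    P = np.zeros((2, 2, 4))
    for (x1, x2), lam in zip(outcomes, labels):
        P[x1, x2, lam] = joint[x1, x2]
    H_lam = H(P.sum(axis=(0, 1)))
    H_X1L, H_X2L = H(P.sum(axis=1)), H(P.sum(axis=0))
    eps_med = H_X1L + H_X2L - H(P) - H_lam             # TC(X | Lambda), by Part 1(a)
    eps_red = (H_X1L - H(P.sum(axis=(1, 2)))) + (H_X2L - H(P.sum(axis=(0, 2))))
    points.append((H_lam, eps_med, eps_red))

# Weak Pareto optimality: no other latent is strictly better on BOTH errors.
pareto = sorted({(round(k, 4), round(em, 4), round(er, 4))
                 for (k, em, er) in points
                 if not any(q[1] < em - 1e-9 and q[2] < er - 1e-9 for q in points)})
for k, em, er in pareto:
    print(f"H(Lambda) = {k:6.4f}   eps_med = {em:6.4f}   eps_red = {er:6.4f}")
```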

Exercise 3: Minimal mediators, maximal redunds, and uniqueness

🔵🔵🔵🔵⚫️ Importance
🔴🔴🔴🔴⚫️ Difficulty

Definitions. We distinguish two related but distinct notions for mediators and redunds:

  • A minimal mediator over \(X_1, \ldots, X_n\) is a mediator \(\Lambda\) such that for any other mediator \(\Lambda'\), we have \(H(\Lambda \mid \Lambda') < \varepsilon\), i.e. \(\Lambda\) is an approximate deterministic function of every other mediator.

  • A minimum mediator over \(X_1, \ldots, X_n\) is an exact mediator \(\Lambda^*\) with the smallest entropy \(H(\Lambda^*)\) among all mediators.

  • A maximal redund over \(X_1, \ldots, X_n\) is a redund \(\Lambda\) such that for any other redund \(\Lambda'\), we have \(H(\Lambda' \mid \Lambda) < \varepsilon\), i.e. every other redund is an approximate deterministic function of \(\Lambda\).

  • A maximum redund over \(X_1, \ldots, X_n\) is an exact redund \(\Lambda_*\) with the largest entropy \(H(\Lambda_*)\) among all redunds.

Conceptually, a minimal mediator need not be a minimum mediator, and a maximal redund need not be a maximum redund. The exercises below establish the connection between the two.

Part 3(a). Let \(\Lambda\) and \(\Lambda'\) be two minimal mediators over \(X_1, \ldots, X_n\). Show that they are approximately isomorphic, i.e.

\[ H(\Lambda \mid \Lambda') < \varepsilon \quad \text{and} \quad H(\Lambda' \mid \Lambda) < \varepsilon. \]

Conclude that any two maximal redunds are likewise approximately isomorphic.

Hint for Part 3(a)

Since \(\Lambda\) is a minimal mediator, it is an approximate deterministic function of every other mediator, in particular of \(\Lambda'\), so \(H(\Lambda \mid \Lambda') < \varepsilon\). Since \(\Lambda'\) is also a minimal mediator, the same argument with roles swapped gives \(H(\Lambda' \mid \Lambda) < \varepsilon\). The same reasoning applies to maximal redunds via the symmetric definition.

Part 3(b). Let \(\Lambda\) be a minimal mediator and \(\Lambda^*\) a minimum mediator over \(X_1, \ldots, X_n\). Show that they are approximately isomorphic up to a constant multiple of \(\varepsilon\), i.e. there exists a small constant \(c\) such that

\[ H(\Lambda \mid \Lambda^*) < c\varepsilon \quad \text{and} \quad H(\Lambda^* \mid \Lambda) < c\varepsilon. \]

Prove the analogous result for any maximal redund \(\Lambda\) and maximum redund \(\Lambda_*\): they are approximately isomorphic up to a constant multiple of \(\varepsilon\).

Hint for Part 3(b)

Since \(\Lambda\) is a minimal mediator and \(\Lambda^*\) is a mediator, \(H(\Lambda \mid \Lambda^*) < \varepsilon\). For the other direction, use the minimality of \(H(\Lambda^*)\): \(\Lambda\) is itself a mediator, so \(H(\Lambda^*) \leq H(\Lambda)\). Expanding the joint entropy two ways, \(H(\Lambda^* \mid \Lambda) = H(\Lambda^*) + H(\Lambda \mid \Lambda^*) - H(\Lambda) \leq H(\Lambda \mid \Lambda^*) < \varepsilon\).

Part 3(c). Recall from Exercise 2 the assumption that \(\varepsilon_{\mathrm{red}}^*(k)\) weakly increases with \(k\) and \(\varepsilon_{\mathrm{med}}^*(k)\) weakly decreases with \(k\), where

\[ \varepsilon_{\mathrm{med}}^*(k) = \min_{\Lambda' \in \mathcal{A}_k} \varepsilon_{\mathrm{med}}(\Lambda'), \qquad \varepsilon_{\mathrm{red}}^*(k) = \min_{\Lambda' \in \mathcal{A}_k} \varepsilon_{\mathrm{red}}(\Lambda'). \]

We say that moving from entropy level \(k_1\) to \(k_2\) is a Pareto-improvement on naturality if

\[ \varepsilon_{\mathrm{med}}^*(k_2) \leq \varepsilon_{\mathrm{med}}^*(k_1) \quad \text{and} \quad \varepsilon_{\mathrm{red}}^*(k_2) \leq \varepsilon_{\mathrm{red}}^*(k_1), \]

with at least one inequality strict. In other words, moving from \(k_1\) to \(k_2\) does not worsen either error and strictly improves at least one.

Using the Pareto frontier characterized in Exercise 2(b), answer the following:

  • For which values of \(k\) is a Pareto-improvement on naturality possible by increasing \(k\)? By decreasing \(k\)?
  • For which values of \(k\) is there a real tradeoff between the mediation and redundancy errors, i.e. any change in \(k\) strictly worsens at least one error?

Hint for Part 3(c)

Consider the definitions of the maximum redund \(\Lambda_*\) and minimum mediator \(\Lambda^*\), and what they imply about \(\varepsilon_{\mathrm{red}}^*(k)\) and \(\varepsilon_{\mathrm{med}}^*(k)\) at \(k = H(\Lambda_*)\) and \(k = H(\Lambda^*)\) respectively. Below \(H(\Lambda_*)\), redundancy error is zero so increasing \(k\) is a Pareto-improvement (it can only help mediation). Above \(H(\Lambda^*)\), mediation error is zero so decreasing \(k\) is a Pareto-improvement (it can only help redundancy). In between, there is a genuine tradeoff.

Part 3(d). A natural latent over \(X_1, \ldots, X_n\) is a latent variable \(\Lambda\) that is simultaneously a maximal redund and a minimal mediator.

Using the approximate isomorphism between maximal and maximum redunds (and between minimal and minimum mediators) established in Parts 3(a) and 3(b), show that if a natural latent exists for \(X\), then \(H(\Lambda_*) \approx H(\Lambda^*)\).

What does the existence of a natural latent for \(X\) say about the Pareto frontier of naturality of \(X\)?

Hint for Part 3(d)

If a natural latent \(\Lambda\) exists, it is both a maximal redund (approximately isomorphic to \(\Lambda_*\), so \(H(\Lambda) \approx H(\Lambda_*)\)) and a minimal mediator (approximately isomorphic to \(\Lambda^*\), so \(H(\Lambda) \approx H(\Lambda^*)\)). Therefore \(H(\Lambda_*) \approx H(\Lambda^*)\), meaning the genuine tradeoff region from Part 3(c) collapses to a single point. The Pareto frontier degenerates: there is essentially one optimal entropy level.
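
Finally, a concrete instance where a natural latent (approximately) exists, sketched in the same toy model as the earlier snippets: the shared bit \(Z\) behind the noisy copies. By construction \(X_1\) and \(X_2\) are independent given \(Z\), so \(\varepsilon_{\mathrm{med}}(Z) = 0\) exactly, while \(\varepsilon_{\mathrm{red}}(Z) = \sum_i H(Z \mid X_i)\) shrinks as the noise does. (Here \(Z\) is not a deterministic function of \(X\); the determinism assumption was specific to Exercise 1.)

```python
# Illustrative sketch: the shared bit Z is an approximately natural latent
# over noisy copies (X1, X2). eps_med is computed as TC(X | Z); the chain-rule
# identity of Part 1(a) does not require Z to be a function of X.
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

for noise in (0.1, 0.01, 0.001):
    P = np.zeros((2, 2, 2))          # P[x1, x2, z]
    for z in (0, 1):
        for x1 in (0, 1):
            for x2 in (0, 1):
                P[x1, x2, z] = 0.5 * ((1 - noise) if x1 == z else noise) \
                                   * ((1 - noise) if x2 == z else noise)
    H_z = H(P.sum(axis=(0, 1)))
    tc = (H(P.sum(axis=1)) - H_z) + (H(P.sum(axis=0)) - H_z) - (H(P) - H_z)
    eps_red = (H(P.sum(axis=1)) - H(P.sum(axis=(1, 2)))) \
            + (H(P.sum(axis=0)) - H(P.sum(axis=(0, 2))))
    # eps_med is zero up to floating-point error; eps_red shrinks with the noise.
    print(f"noise = {noise}: eps_med = {tc:.6f}, eps_red = {eps_red:.6f}")
```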