Mutual Information – Categorical Variable Example

Mutual information and its cousin, the uncertainty coefficient (Theil’s U), are useful tools from information theory for discovering dependencies between variables that are not necessarily described by a linear relationship.

I’m short on time, and plenty of good material already exists on the subject: see Section 1.6 in “Pattern Recognition and Machine Learning” by Bishop, freely available as a PDF online. What I find missing are some trivial examples to help build intuition. That’s what I will contribute here, using categorical variables.

We’ll use (and expand) the following running example: Consider a population of n persons labelled 1, 2, \ldots, i, \ldots, n, and let x_i be the favourite dish of person i, with the following distribution:

x           p(x)
pizza       0.3
barbecue    0.5
ramen       0.2

Information contained in an observation

Let h(x) = -\log_2 p(x) be the information, measured in bits, contained in observing x, where the probability of x is given by p(x) (if the natural logarithm \ln is used instead of \log_2, the unit is called “nats” rather than bits). Bishop refers to this quantity as “the surprise” of observing x, assuming it comes from the distribution described by p(x).

Now, in our example, observing that a person’s favourite dish is pizza carries h(x) = -\log_2 0.3 \approx 1.74 bits of information, whereas observing that a person’s favourite dish is barbecue carries h(x) = -\log_2 0.5 = 1 bit of information (less surprise).
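
For readers who like to check numbers in code, here is a minimal Python sketch of the surprise computation; the dictionary p_x simply holds the example distribution from the table above.

    import math

    # p(x): the example distribution over favourite dishes (from the table above)
    p_x = {"pizza": 0.3, "barbecue": 0.5, "ramen": 0.2}

    def surprise_bits(p):
        """Information content h(x) = -log2 p(x), measured in bits."""
        return -math.log2(p)

    print(round(surprise_bits(p_x["pizza"]), 2))     # 1.74 bits
    print(round(surprise_bits(p_x["barbecue"]), 2))  # 1.0 bit
    print(round(surprise_bits(p_x["ramen"]), 2))     # 2.32 bits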

Entropy of a random variable

What we call the entropy, H[x], is then the expected amount of information in an observation of the random variable:

    \[H[x] = E[h(x)] = \sum_x p(x) h(x) = -\sum_x p(x) \log_2 p(x)\]

For our example:

    \[H[x] = - (0.3 \log_2 0.3 + 0.5 \log_2 0.5 + 0.2 \log_2 0.2) \approx 1.49~\text{bits}\]

The information-theoretic interpretation of this is that H[x] is the (theoretical) minimum average number of bits required per x to encode a long string of successive, independent x’s.
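
As a small, self-contained Python sketch (same example distribution as above), the entropy is just the expectation of h(x) under p(x):

    import math

    p_x = {"pizza": 0.3, "barbecue": 0.5, "ramen": 0.2}

    def entropy_bits(p):
        """H[x] = -sum_x p(x) * log2 p(x), in bits; zero-probability outcomes contribute nothing."""
        return -sum(pi * math.log2(pi) for pi in p.values() if pi > 0)

    print(round(entropy_bits(p_x), 2))  # 1.49 bits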

Conditional entropy

If two random variables, x and y, are dependent (i.e. somehow related), then knowing something about y provides additional information about x (and vice versa), decreasing the information (entropy) required to describe x. This is expressed by the conditional entropy:

    \[H[x|y] = - \sum_x \sum_y p(x, y) \log_2 p(x|y) = - \sum_x \sum_y p(x, y) \log_2 \frac{p(x, y)}{p(y)}\]

H[x|y] can then be interpreted as the expected number of additional bits required to encode x given that the value of y is already known. Let’s expand our example, and let y_i represent the nationality of person i. Let the joint distribution p(x,y) be as follows:

                 Italian    American    Japanese    marginal p(x)
pizza            0.09       0.16        0.05        0.3
barbecue         0.02       0.38        0.10        0.5
ramen            0.01       0.05        0.14        0.2
marginal p(y)    0.12       0.59        0.29        1.0

Note that x and y are not independent here: for example, if person i is Italian, they are expected to have a higher preference for pizza (stereotypical, I know) than the population as a whole.
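
For the calculations below, it is convenient to have the joint distribution in code as well. A sketch, with the joint table stored as a nested dictionary and the marginals recovered by summing rows and columns:

    # Joint distribution p(x, y) from the table above: p_xy[dish][nationality]
    p_xy = {
        "pizza":    {"Italian": 0.09, "American": 0.16, "Japanese": 0.05},
        "barbecue": {"Italian": 0.02, "American": 0.38, "Japanese": 0.10},
        "ramen":    {"Italian": 0.01, "American": 0.05, "Japanese": 0.14},
    }

    # Marginals: p(x) = sum_y p(x, y) and p(y) = sum_x p(x, y)
    p_x = {dish: sum(row.values()) for dish, row in p_xy.items()}
    p_y = {}
    for row in p_xy.values():
        for nationality, p in row.items():
            p_y[nationality] = p_y.get(nationality, 0.0) + p

    print(p_x)  # ~{pizza: 0.3, barbecue: 0.5, ramen: 0.2}, up to float rounding
    print(p_y)  # ~{Italian: 0.12, American: 0.59, Japanese: 0.29}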

Then for our example:

    \[\begin{aligned}H[x|y] = -\Big(&0.09 \log_2 \tfrac{0.09}{0.12} + 0.16 \log_2 \tfrac{0.16}{0.59} + 0.05 \log_2 \tfrac{0.05}{0.29}\\+\;&0.02 \log_2 \tfrac{0.02}{0.12} + 0.38 \log_2 \tfrac{0.38}{0.59} + 0.10 \log_2 \tfrac{0.10}{0.29}\\+\;&0.01 \log_2 \tfrac{0.01}{0.12} + 0.05 \log_2 \tfrac{0.05}{0.59} + 0.14 \log_2 \tfrac{0.14}{0.29}\Big) \approx 1.27~\text{bits}\end{aligned}\]

Clearly H[x|y] < H[x] in our example, meaning that if we already know the value of y, less information (fewer bits) is required to describe x. If we did the same exercise for H[y|x], we would see that H[y|x] < H[y]. This is an effect of x and y not being independent: they carry some amount of mutual information.
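
The same calculation in code, continuing the joint-distribution sketch above (it reuses p_xy, p_x, and p_y):

    import math

    # H[x|y] = -sum_{x,y} p(x, y) * log2( p(x, y) / p(y) )
    H_x_given_y = -sum(
        p * math.log2(p / p_y[nat])
        for dish, row in p_xy.items()
        for nat, p in row.items()
        if p > 0
    )

    # H[y|x] = -sum_{x,y} p(x, y) * log2( p(x, y) / p(x) )
    H_y_given_x = -sum(
        p * math.log2(p / p_x[dish])
        for dish, row in p_xy.items()
        for nat, p in row.items()
        if p > 0
    )

    print(round(H_x_given_y, 2))  # ~1.27 bits, less than H[x] ~1.49
    print(round(H_y_given_x, 2))  # ~1.12 bits, less than H[y] ~1.33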

Mutual Information

Mutual information can be defined via the Kullback-Leibler (KL) divergence as:

    \[I[x, y] = \mathrm{KL}\big(p(x,y) \,\|\, p(x)p(y)\big)\]

Note that if x and y were independent, then p(x,y) = p(x)p(y), and the KL divergence (and with it the mutual information) would be 0. An intuitive way to think about mutual information is therefore as a measure of the dissimilarity between the distributions p(x,y) and p(x)p(y).

By manipulating this expression algebraically, however, we can also write I[x, y] in more familiar terms:

    \[I[x, y] = H[x] - H[x|y] = H[y] - H[y|x] \]

Thus, in our running example, I[x,y] = H[x] - H[x|y] \approx 0.21 bits.
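
Finally, a sketch that checks both routes to I[x, y] numerically, continuing the snippets above (it reuses p_xy, p_x, p_y, and H_x_given_y): the KL-divergence definition and the entropy identity give the same answer.

    import math

    # I[x, y] = KL( p(x, y) || p(x) p(y) )
    I_kl = sum(
        p * math.log2(p / (p_x[dish] * p_y[nat]))
        for dish, row in p_xy.items()
        for nat, p in row.items()
        if p > 0
    )

    # I[x, y] = H[x] - H[x|y]
    H_x = -sum(p * math.log2(p) for p in p_x.values() if p > 0)
    I_identity = H_x - H_x_given_y

    print(round(I_kl, 2), round(I_identity, 2))  # both ~0.21 bits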

To be expanded a bit when I get more time…
