Mutual Information – Categorical Variable Example

Mutual information and its cousin, the uncertainty coefficient (Theil’s U), are useful tools from information theory for discovering dependencies between variables that are not necessarily described by a linear relationship.

I’m short on time, and plenty of good material already exists on the subject: see Section 1.6 in “Pattern Recognition and Machine Learning” by Bishop, freely available as a PDF online. What I find missing are some trivial examples to help build intuition. That’s what I will contribute here, using categorical variables.

We’ll use (and expand) the following running example: Consider a population of n persons labelled 1, 2, \ldots, i, \ldots, n, and let x_i be the favourite dish of person i, with the following distribution:

x           p(x)
pizza       0.3
barbecue    0.5
ramen       0.2

Information contained in an observation

Let h(x) = -\log_2 p(x) be the information, measured in bits, contained in observing x, where the probability of x is given by p(x) (if the natural logarithm \ln is used instead of \log_2, the unit is called “nats” rather than bits). Bishop refers to this quantity as “the surprise” of observing x, assuming it comes from the distribution described by p(x).

Now, in our example, observing that a person’s favourite dish is pizza carries h(x) = -\log_2 0.3 \approx 1.74 bits of information, whereas observing that a person’s favourite dish is barbecue carries h(x) = -\log_2 0.5 = 1 bit of information (less surprise).
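
For readers who like to check numbers in code, here is a minimal Python sketch of the surprise computation; the dictionary p_x simply holds the example distribution from the table above.

    import math

    # p(x): the example distribution over favourite dishes (from the table above)
    p_x = {"pizza": 0.3, "barbecue": 0.5, "ramen": 0.2}

    def surprise_bits(p):
        """Information content h(x) = -log2 p(x), measured in bits."""
        return -math.log2(p)

    print(round(surprise_bits(p_x["pizza"]), 2))     # 1.74 bits
    print(round(surprise_bits(p_x["barbecue"]), 2))  # 1.0 bit
    print(round(surprise_bits(p_x["ramen"]), 2))     # 2.32 bits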

Entropy of a random variable

What we call the entropy, H[x], is then the expected amount of information in an observation of the random variable:

    \[H[x] = E[h(x)] = \sum_x p(x) h(x) = -\sum_x p(x) \log_2 p(x)\]

For our example:

    \[H[x] = - (0.3 \log_2 0.3 + 0.5 \log_2 0.5 + 0.2 \log_2 0.2) \approx 1.49~\text{bits}\]

The information-theoretic interpretation of this is that H[x] is the (theoretical) minimum average number of bits required per x to encode a long string of successive, independent x’s.
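
As a small, self-contained Python sketch (same example distribution as above), the entropy is just the expectation of h(x) under p(x):

    import math

    p_x = {"pizza": 0.3, "barbecue": 0.5, "ramen": 0.2}

    def entropy_bits(p):
        """H[x] = -sum_x p(x) * log2 p(x), in bits; zero-probability outcomes contribute nothing."""
        return -sum(pi * math.log2(pi) for pi in p.values() if pi > 0)

    print(round(entropy_bits(p_x), 2))  # 1.49 bits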

Conditional entropy

If two random variables, x and y, are dependent (i.e. somehow related), then knowing something about y provides additional information about x (and vice versa), decreasing the information (entropy) required to describe x. This is expressed by the conditional entropy:

    \[H[x|y] = - \sum_x \sum_y p(x, y) \log_2 p(x|y) = - \sum_x \sum_y p(x, y) \log_2 \frac{p(x, y)}{p(y)}\]

H[x|y] can then be interpreted as the expected number of additional bits required to encode x given that the value of y is already known. Let’s expand our example, and let y_i represent the nationality of person i. Let the joint distribution p(x,y) be as follows:

                 Italian    American    Japanese    marginal p(x)
pizza            0.09       0.16        0.05        0.3
barbecue         0.02       0.38        0.10        0.5
ramen            0.01       0.05        0.14        0.2
marginal p(y)    0.12       0.59        0.29        1.0

Note that x and y are not independent here: for example, if person i is Italian, they are expected to have a higher preference for pizza (stereotypical, I know) than the population as a whole.
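
For the calculations below, it is convenient to have the joint distribution in code as well. A sketch, with the joint table stored as a nested dictionary and the marginals recovered by summing rows and columns:

    # Joint distribution p(x, y) from the table above: p_xy[dish][nationality]
    p_xy = {
        "pizza":    {"Italian": 0.09, "American": 0.16, "Japanese": 0.05},
        "barbecue": {"Italian": 0.02, "American": 0.38, "Japanese": 0.10},
        "ramen":    {"Italian": 0.01, "American": 0.05, "Japanese": 0.14},
    }

    # Marginals: p(x) = sum_y p(x, y) and p(y) = sum_x p(x, y)
    p_x = {dish: sum(row.values()) for dish, row in p_xy.items()}
    p_y = {}
    for row in p_xy.values():
        for nationality, p in row.items():
            p_y[nationality] = p_y.get(nationality, 0.0) + p

    print(p_x)  # ~{pizza: 0.3, barbecue: 0.5, ramen: 0.2}, up to float rounding
    print(p_y)  # ~{Italian: 0.12, American: 0.59, Japanese: 0.29}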

Then for our example:

    \[\begin{aligned}H[x|y] = -\Big(&0.09 \log_2 \tfrac{0.09}{0.12} + 0.16 \log_2 \tfrac{0.16}{0.59} + 0.05 \log_2 \tfrac{0.05}{0.29}\\+\;&0.02 \log_2 \tfrac{0.02}{0.12} + 0.38 \log_2 \tfrac{0.38}{0.59} + 0.10 \log_2 \tfrac{0.10}{0.29}\\+\;&0.01 \log_2 \tfrac{0.01}{0.12} + 0.05 \log_2 \tfrac{0.05}{0.59} + 0.14 \log_2 \tfrac{0.14}{0.29}\Big) \approx 1.27~\text{bits}\end{aligned}\]

Clearly H[x|y] < H[x] in our example, meaning that if we already know the value of y, less information (fewer bits) is required to describe x. If we did the same exercise for H[y|x], we would see that H[y|x] < H[y]. This is an effect of x and y not being independent: they carry some amount of mutual information.
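
The same calculation in code, continuing the joint-distribution sketch above (it reuses p_xy, p_x, and p_y):

    import math

    # H[x|y] = -sum_{x,y} p(x, y) * log2( p(x, y) / p(y) )
    H_x_given_y = -sum(
        p * math.log2(p / p_y[nat])
        for dish, row in p_xy.items()
        for nat, p in row.items()
        if p > 0
    )

    # H[y|x] = -sum_{x,y} p(x, y) * log2( p(x, y) / p(x) )
    H_y_given_x = -sum(
        p * math.log2(p / p_x[dish])
        for dish, row in p_xy.items()
        for nat, p in row.items()
        if p > 0
    )

    print(round(H_x_given_y, 2))  # ~1.27 bits, less than H[x] ~1.49
    print(round(H_y_given_x, 2))  # ~1.12 bits, less than H[y] ~1.33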

Mutual Information

Mutual information can be defined via the Kullback-Leibler (KL) divergence as:

    \[I[x, y] = \mathrm{KL}\big(p(x,y) \,\|\, p(x)p(y)\big)\]

Note that if x and y were independent, then p(x,y) = p(x)p(y), and the KL divergence (and with it the mutual information) would be 0. An intuitive way to think about mutual information is therefore as a measure of the dissimilarity between the distributions p(x,y) and p(x)p(y).

By manipulating this expression algebraically, however, we can also write I[x, y] in more familiar terms:

    \[I[x, y] = H[x] - H[x|y] = H[y] - H[y|x] \]

Thus, in our running example, I[x,y] = H[x] - H[x|y] \approx 0.21 bits.
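
Finally, a sketch that checks both routes to I[x, y] numerically, continuing the snippets above (it reuses p_xy, p_x, p_y, and H_x_given_y): the KL-divergence definition and the entropy identity give the same answer.

    import math

    # I[x, y] = KL( p(x, y) || p(x) p(y) )
    I_kl = sum(
        p * math.log2(p / (p_x[dish] * p_y[nat]))
        for dish, row in p_xy.items()
        for nat, p in row.items()
        if p > 0
    )

    # I[x, y] = H[x] - H[x|y]
    H_x = -sum(p * math.log2(p) for p in p_x.values() if p > 0)
    I_identity = H_x - H_x_given_y

    print(round(I_kl, 2), round(I_identity, 2))  # both ~0.21 bits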

To be expanded a bit when I get more time…
