Mutual information and its cousin, the uncertainty coefficient (Theil’s U), are useful tools from information theory for discovering dependencies between variables that are not necessarily described by a linear relationship.

I’m short on time, and plenty of good material already exists on the subject: see Section 1.6 in “Pattern Recognition and Machine Learning” by Bishop, freely available as a PDF online. What I find missing are some trivial examples to help build intuition. That’s what I will contribute here, using categorical variables.

We’ll use (and expand) the following running example: Consider a population of persons labelled $n = 1, \ldots, N$, and let $X$ be the favourite dish of person $n$, with the following distribution:

| $x$ | $p(x)$ |
| --- | --- |
| pizza | 0.3 |
| barbecue | 0.5 |
| ramen | 0.2 |

### Information contained in an observation

Let $h(x) = -\log_2 p(x)$ be the information, measured in bits (if the natural logarithm is used instead, the unit is called “nats”), contained in observing $x$, where the probability of $x$ is given by $p(x)$. Bishop refers to this quantity as “the surprise” of observing $x$, assuming it comes from the distribution described by $p$.

Now, in our example, observing that a person’s favourite dish is pizza carries $-\log_2 0.3 \approx 1.74$ bits of information, whereas observing that a person’s favourite dish is barbecue carries $-\log_2 0.5 = 1$ bit of information (less surprise).
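These surprise values are quick to check numerically; a minimal sketch in Python:

```python
from math import log2

def surprise(p):
    """Self-information h(x) = -log2 p(x), in bits, of an outcome with probability p."""
    return -log2(p)

print(surprise(0.3))  # pizza: ~1.74 bits
print(surprise(0.5))  # barbecue: exactly 1 bit
```

Note how halving the probability adds exactly one bit of surprise.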

### Entropy of a random variable

What we then call *entropy*, $H[X]$, is the expected amount of information in the observation of a random variable:

$$H[X] = \mathbb{E}[h(X)] = -\sum_x p(x) \log_2 p(x)$$

For our example:

$$H[X] = -(0.3 \log_2 0.3 + 0.5 \log_2 0.5 + 0.2 \log_2 0.2) \approx 1.485 \text{ bits}$$

The information-theory interpretation of this is that $H[X]$ is the (theoretical) minimum number of bits required, on average, per $x$ to encode a string of successive, independent $x$’s.
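The entropy sum above is equally easy to verify; a small sketch, using the same favourite-dish distribution:

```python
from math import log2

# Favourite-dish distribution from the running example
p_dish = {"pizza": 0.3, "barbecue": 0.5, "ramen": 0.2}

def entropy(dist):
    """Entropy H[X] = -sum_x p(x) log2 p(x), in bits."""
    return -sum(p * log2(p) for p in dist.values())

print(f"H[X] = {entropy(p_dish):.3f} bits")  # -> H[X] = 1.485 bits
```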

### Conditional entropy

If two random variables $X$ and $Y$ are dependent (i.e. they are somehow related), then knowing something about $Y$ will provide us additional information about $X$ (and vice versa), decreasing the information required (entropy) to describe $X$. This can be expressed as *conditional entropy*:

$$H[X|Y] = -\sum_{x,y} p(x,y) \log_2 p(x|y)$$

$H[X|Y]$ can then be interpreted as the expected number of *additional* bits required to encode $x$ given that the value of $y$ is already known. Let’s expand our example, and let $Y$ represent the nationality of person $n$. Let the joint distribution $p(x, y)$ be as follows:

| | Italian | American | Japanese | marginal $p(x)$ |
| --- | --- | --- | --- | --- |
| pizza | 0.09 | 0.16 | 0.05 | 0.3 |
| bbq | 0.02 | 0.38 | 0.1 | 0.5 |
| ramen | 0.01 | 0.05 | 0.14 | 0.2 |
| marginal $p(y)$ | 0.12 | 0.59 | 0.29 | 1.0 |

Note that $X$ and $Y$ are *not* independent here – for example, if $Y$ is Italian, $X$ is expected to have a higher preference for pizza (stereotypical, I know) than the population as a whole.

Then for our example:

$$H[X|Y] = -\sum_{x,y} p(x,y) \log_2 \frac{p(x,y)}{p(y)} \approx 1.273 \text{ bits}$$

Clearly $H[X|Y] < H[X]$ in our example, meaning that if we already know the value of $y$, less information (fewer bits) is required to describe $x$. If we did the same exercise for $H[Y|X]$, we’d see that $H[Y|X] < H[Y]$ – this is an effect of them not being independent. They carry some amount of *mutual* information.
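The conditional-entropy sum over the joint table is tedious by hand but trivial in code; a sketch using $p(x|y) = p(x,y)/p(y)$:

```python
from math import log2

# Joint distribution p(x, y) from the table: (dish, nationality) -> probability
joint = {
    ("pizza", "Italian"): 0.09, ("pizza", "American"): 0.16, ("pizza", "Japanese"): 0.05,
    ("bbq",   "Italian"): 0.02, ("bbq",   "American"): 0.38, ("bbq",   "Japanese"): 0.10,
    ("ramen", "Italian"): 0.01, ("ramen", "American"): 0.05, ("ramen", "Japanese"): 0.14,
}

# Marginal p(y), summed over dishes
p_y = {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0.0) + p

# H[X|Y] = -sum_{x,y} p(x,y) log2 p(x|y), with p(x|y) = p(x,y) / p(y)
h_x_given_y = -sum(p * log2(p / p_y[y]) for (x, y), p in joint.items())
print(f"H[X|Y] = {h_x_given_y:.3f} bits")  # -> H[X|Y] = 1.273 bits
```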

### Mutual Information

Mutual information can be defined via KL-divergence as:

$$I[X;Y] = \mathrm{KL}\big(p(x,y) \,\|\, p(x)p(y)\big) = \sum_{x,y} p(x,y) \log_2 \frac{p(x,y)}{p(x)p(y)}$$

Note that if $X$ and $Y$ were independent, then $p(x,y) = p(x)p(y)$, with the KL-divergence (and mutual information) being 0. An intuitive way to think about it is as a measure of dissimilarity between the distributions $p(x,y)$ and $p(x)p(y)$.

Manipulating this algebraically, however, we can also write $I[X;Y]$ in more familiar terms:

$$I[X;Y] = H[X] - H[X|Y] = H[Y] - H[Y|X]$$

Thus, in our running example, $I[X;Y] = H[X] - H[X|Y] \approx 1.485 - 1.273 \approx 0.213$ bits.
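As a final check, computing mutual information directly from the KL-divergence form gives the same answer; a sketch over the joint table:

```python
from math import log2

# Joint distribution p(x, y) from the running example
joint = {
    ("pizza", "Italian"): 0.09, ("pizza", "American"): 0.16, ("pizza", "Japanese"): 0.05,
    ("bbq",   "Italian"): 0.02, ("bbq",   "American"): 0.38, ("bbq",   "Japanese"): 0.10,
    ("ramen", "Italian"): 0.01, ("ramen", "American"): 0.05, ("ramen", "Japanese"): 0.14,
}

# Marginals p(x) and p(y)
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# I[X;Y] = KL(p(x,y) || p(x)p(y)) = sum p(x,y) log2( p(x,y) / (p(x)p(y)) )
mi = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items())
print(f"I[X;Y] = {mi:.3f} bits")  # -> I[X;Y] = 0.213 bits
```

The agreement with $H[X] - H[X|Y]$ is expected, since the two expressions are algebraically identical.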

To be expanded a bit when I get more time…