Mutual information and its cousin, the uncertainty coefficient (Theil’s U), are useful tools from information theory for discovering dependencies between variables that are not necessarily described by a linear relationship.
I’m short on time, and plenty of good material already exists on the subject: see Section 1.6 in “Pattern Recognition and Machine Learning” by Bishop, freely available as a PDF online. What I find missing are some trivial examples to help build intuition. That’s what I will contribute here, using categorical variables.
We’ll use (and expand) the following running example: consider a population of persons labelled $1, \dots, N$, and let $X$ be the favourite dish of a person drawn at random from the population, with some distribution $p(x)$ over dishes such as pizza and barbecue.
Information contained in an observation
Let $h(x) = -\log_2 p(x)$ be the information measured in bits (if the natural logarithm is used instead, the unit is called “nats”) contained in observing $X = x$, where the probability of $x$ is given by $p(x)$. Bishop refers to this quantity as “the surprise” of observing $x$, assuming it comes from the distribution described by $p$.
Now, in our example, observing that a person’s favourite dish is pizza (a low-probability outcome) carries $-\log_2 p(\text{pizza})$ bits of information, whereas observing that a person’s favourite dish is barbecue (a high-probability outcome) carries fewer bits of information (less surprise).
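As a quick numerical sketch, here is the surprise computation for a small hypothetical favourite-dish distribution (the probabilities below are made up for illustration, not the actual table from the example):

```python
import math

def surprise(p: float) -> float:
    """Information, in bits, of observing an event with probability p."""
    return -math.log2(p)

# Hypothetical favourite-dish distribution (invented for illustration).
p_dish = {"pizza": 0.125, "sushi": 0.25, "tacos": 0.125, "barbecue": 0.5}

print(surprise(p_dish["pizza"]))     # rare favourite -> 3.0 bits
print(surprise(p_dish["barbecue"]))  # common favourite -> 1.0 bit
```

The rarer the observation, the more bits of surprise it carries.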
Entropy of a random variable
What we then call entropy, $H(X)$, is the expected amount of information in the observation of a random variable:

$$H(X) = \mathbb{E}\left[h(X)\right] = -\sum_x p(x) \log_2 p(x)$$
For our example, plugging the distribution $p(x)$ into this sum gives the entropy of the favourite-dish variable $X$.
The information-theory interpretation of this is that $H(X)$ is the (theoretical) minimum number of bits required per observation to encode a string of successive, independent draws of $X$.
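A minimal sketch of the entropy sum, again using a hypothetical favourite-dish distribution (the probabilities are invented for illustration):

```python
import math

def entropy(dist: dict) -> float:
    """Expected information, in bits, of one draw from dist."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical favourite-dish distribution (invented for illustration).
p_dish = {"pizza": 0.125, "sushi": 0.25, "tacos": 0.125, "barbecue": 0.5}

print(entropy(p_dish))  # 1.75 bits per observation, on average
```

So a long string of independent favourite-dish observations from this distribution needs, at best, 1.75 bits per symbol on average.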
If two random variables, $X$ and $Y$, are dependent (i.e. they are somehow related), then knowing something about $Y$ will provide us additional information about $X$ (and vice versa), decreasing the information required (entropy) to describe $X$. This can be expressed as conditional entropy:

$$H(X|Y) = -\sum_{x,y} p(x,y) \log_2 p(x|y)$$
$H(X|Y)$ can then be interpreted as the expected number of additional bits required to encode $X$ given that the value of $Y$ is already known. Let’s expand our example, and let $Y$ represent the nationality of a person, with some joint distribution $p(x, y)$ over dish and nationality.
Note that $X$ and $Y$ are not independent here – for example, if a person is Italian, they are expected to have a higher preference for pizza (stereotypical, I know) than the population as a whole.
Then for our example, $H(X|Y)$ follows by summing over the joint table.
Clearly $H(X|Y) < H(X)$ in our example, meaning that if we already know the value of $Y$, less information (fewer bits) is required to describe $X$. If we did the same exercise for $H(Y|X)$, we’d see that $H(Y|X) < H(Y)$ – this is an effect of them not being independent. They carry some amount of mutual information.
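The conditional-entropy calculation can be sketched directly from a joint table. The joint probabilities below are invented for illustration (with Italians favouring pizza), not the actual numbers from the example:

```python
import math
from collections import defaultdict

# Hypothetical joint distribution p(y, x) over (nationality, dish),
# invented for illustration.
joint = {
    ("italian", "pizza"): 3 / 8, ("italian", "barbecue"): 1 / 8,
    ("swedish", "pizza"): 1 / 8, ("swedish", "barbecue"): 3 / 8,
}

# Marginals p(x) over dishes and p(y) over nationalities.
p_x, p_y = defaultdict(float), defaultdict(float)
for (y, x), p in joint.items():
    p_x[x] += p
    p_y[y] += p

# H(X) = -sum_x p(x) log2 p(x)
h_x = -sum(p * math.log2(p) for p in p_x.values())

# H(X|Y) = -sum_{x,y} p(x,y) log2 p(x|y), with p(x|y) = p(x,y) / p(y)
h_x_given_y = -sum(p * math.log2(p / p_y[y]) for (y, x), p in joint.items())

print(h_x)          # 1.0 bit without knowing nationality
print(h_x_given_y)  # fewer bits once nationality is known
```

Because dish and nationality are coupled in this table, $H(X|Y)$ comes out strictly smaller than $H(X)$.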
Mutual information can be defined via the KL-divergence between the joint distribution and the product of the marginals:

$$I(X, Y) = \mathrm{KL}\big(p(x,y) \,\|\, p(x)\,p(y)\big) = \sum_{x,y} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)}$$
Note that if $X$ and $Y$ were independent, then $p(x,y) = p(x)\,p(y)$, with the KL-divergence (and mutual information) being 0. An intuitive way to think about $I(X, Y)$ is as a measure of dissimilarity between the distributions $p(x,y)$ and $p(x)\,p(y)$.
Manipulating this algebraically, we can however also write $I(X, Y)$ in more familiar terms:

$$I(X, Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$$
Thus, in our running example, $I(X, Y) = H(X) - H(X|Y)$ bits – the reduction in description length for $X$ once $Y$ is known.
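Putting it together, here is a sketch that computes $I(X, Y)$ both ways – via the KL-divergence form and via $H(X) - H(X|Y)$ – and normalises it into Theil’s U. The joint table is the same kind of invented illustration as before, not the post’s actual numbers:

```python
import math
from collections import defaultdict

# Hypothetical joint distribution p(y, x) over (nationality, dish),
# invented for illustration.
joint = {
    ("italian", "pizza"): 3 / 8, ("italian", "barbecue"): 1 / 8,
    ("swedish", "pizza"): 1 / 8, ("swedish", "barbecue"): 3 / 8,
}

p_x, p_y = defaultdict(float), defaultdict(float)
for (y, x), p in joint.items():
    p_x[x] += p
    p_y[y] += p

# I(X, Y) as the KL-divergence between p(x, y) and p(x) p(y).
mi_kl = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (y, x), p in joint.items())

# I(X, Y) = H(X) - H(X|Y): the same quantity in "more familiar terms".
h_x = -sum(p * math.log2(p) for p in p_x.values())
h_x_given_y = -sum(p * math.log2(p / p_y[y]) for (y, x), p in joint.items())
mi_diff = h_x - h_x_given_y

# Theil's U (uncertainty coefficient): mutual information normalised by H(X).
theils_u = mi_diff / h_x

print(mi_kl, mi_diff)  # the two forms agree
print(theils_u)        # fraction of the uncertainty in X explained by Y
```

Theil’s U conveniently lives in $[0, 1]$: 0 when the variables are independent, 1 when knowing $Y$ pins down $X$ completely.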
To be expanded a bit when I get more time…