Mutual information and its cousin, the uncertainty coefficient (Theil's U), are useful tools from information theory for discovering dependencies between variables that are not necessarily described by a linear relationship.
I’m short on time, and plenty of good material already exists on the subject: see Section 1.6 in “Pattern Recognition and Machine Learning” by Bishop, freely available as a PDF online. What I find missing are some trivial examples to help build intuition. That’s what I will contribute here, using categorical variables.
We’ll use (and expand) the following running example: Consider a population of persons labelled $i = 1, \dots, N$, and let $X_i$ be the favourite dish of person $i$, with the following distribution:
| $x$ | $p(x)$ |
| --- | --- |
| pizza | 0.3 |
| barbecue | 0.5 |
| ramen | 0.2 |
Information contained in an observation
Let $h(x) = -\log_2 p(x)$ be the information, measured in bits (if the natural logarithm is used instead, the unit is called “nats”), contained in observing $x$, where the probability of $x$ is given by $p(x)$. Bishop refers to this quantity as “the surprise” of observing $x$, assuming it comes from the distribution described by $p(x)$.
Now, in our example, observing that a person’s favourite dish is pizza carries $-\log_2 0.3 \approx 1.74$ bits of information, whereas observing that a person’s favourite dish is barbecue carries $-\log_2 0.5 = 1$ bit of information (less surprise).
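If you want to check these numbers yourself, here is a minimal Python sketch (the dictionary and the helper name `surprise_bits` are my own, just for illustration):

```python
import math

# Distribution of favourite dishes from the table above
p = {"pizza": 0.3, "barbecue": 0.5, "ramen": 0.2}

def surprise_bits(prob):
    """Information content (surprise) of an observation with probability `prob`, in bits."""
    return -math.log2(prob)

print(surprise_bits(p["pizza"]))     # ~1.74 bits
print(surprise_bits(p["barbecue"]))  # 1.0 bit
```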
Entropy of a random variable
What we then call entropy, $H(X)$, is the expected amount of information in the observation of a random variable:

$$H(X) = \mathbb{E}\left[h(X)\right] = -\sum_x p(x) \log_2 p(x)$$

For our example:

$$H(X) = -\left(0.3 \log_2 0.3 + 0.5 \log_2 0.5 + 0.2 \log_2 0.2\right) \approx 1.485 \text{ bits}$$
The information-theory interpretation of this would be that $H(X)$ is the (theoretical) minimum number of bits required per observation to encode a string of successive, independent observations of $X$.
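As a quick sanity check, here is a small Python sketch of the same sum (the helper name `entropy_bits` is mine):

```python
import math

p = {"pizza": 0.3, "barbecue": 0.5, "ramen": 0.2}

def entropy_bits(dist):
    """Entropy H(X) = -sum_x p(x) log2 p(x), in bits."""
    return -sum(px * math.log2(px) for px in dist.values() if px > 0)

print(entropy_bits(p))  # ~1.485 bits
```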
Conditional entropy
If two random variables, $X$ and $Y$, are dependent (i.e. they are somehow related), then knowing something about $Y$ will provide us additional information about $X$ (and vice versa), decreasing the information required (entropy) to describe $X$. This can be expressed as conditional entropy:

$$H(X \mid Y) = -\sum_{x, y} p(x, y) \log_2 p(x \mid y)$$
$H(X \mid Y)$ can then be interpreted as the expected number of additional bits required to encode $X$, given that the value of $Y$ is already known. Let’s expand our example, and let $Y_i$ represent the nationality of person $i$. Let the joint distribution $p(x, y)$ be as follows:
| $p(x, y)$ | Italian | American | Japanese | marginal $p(x)$ |
| --- | --- | --- | --- | --- |
| pizza | 0.09 | 0.16 | 0.05 | 0.3 |
| bbq | 0.02 | 0.38 | 0.1 | 0.5 |
| ramen | 0.01 | 0.05 | 0.14 | 0.2 |
| marginal $p(y)$ | 0.12 | 0.59 | 0.29 | 1.0 |
Note that $X$ and $Y$ are not independent here. For example, if a person is Italian, they are expected to have a higher preference for pizza (stereotypical, I know) than the population as a whole: $p(\text{pizza} \mid \text{Italian}) = 0.09 / 0.12 = 0.75$, versus the marginal $p(\text{pizza}) = 0.3$.
Then for our example:

$$H(X \mid Y) = -\sum_{x, y} p(x, y) \log_2 \frac{p(x, y)}{p(y)} \approx 1.27 \text{ bits}$$
Clearly $H(X \mid Y) < H(X)$ in our example, meaning that if we already know the value of $Y$, less information (fewer bits) is required to describe $X$. If we were to do the same exercise for $H(Y \mid X)$, we’d see that $H(Y \mid X) < H(Y)$. This is an effect of them not being independent: they carry some amount of mutual information.
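To verify the conditional entropy numerically, here is a sketch that computes $H(X \mid Y)$ straight from the joint table, using $p(x \mid y) = p(x, y)/p(y)$ (the dictionary layout is my own choice):

```python
import math

# Joint distribution p(x, y): dishes x, nationalities y, values from the table above
joint = {
    ("pizza", "Italian"): 0.09, ("pizza", "American"): 0.16, ("pizza", "Japanese"): 0.05,
    ("bbq",   "Italian"): 0.02, ("bbq",   "American"): 0.38, ("bbq",   "Japanese"): 0.10,
    ("ramen", "Italian"): 0.01, ("ramen", "American"): 0.05, ("ramen", "Japanese"): 0.14,
}

# Marginal p(y), needed for p(x | y) = p(x, y) / p(y)
p_y = {}
for (x, y), pxy in joint.items():
    p_y[y] = p_y.get(y, 0.0) + pxy

# H(X | Y) = -sum_{x,y} p(x, y) log2 p(x | y)
h_x_given_y = -sum(pxy * math.log2(pxy / p_y[y]) for (x, y), pxy in joint.items())
print(h_x_given_y)  # ~1.27 bits, compared with H(X) ~1.485 bits
```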
Mutual Information
Mutual information can be defined via the KL divergence as:

$$I(X; Y) = D_{\mathrm{KL}}\big(p(x, y) \,\|\, p(x)\,p(y)\big) = \sum_{x, y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\,p(y)}$$

Note that if $X$ and $Y$ are independent, then $p(x, y) = p(x)\,p(y)$, so every logarithm in the sum is $\log_2 1 = 0$ and $I(X; Y) = 0$: independent variables carry no mutual information.
Manipulating this algebraically, however, we can also write $I(X; Y)$ in more familiar terms:

$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$$
Thus, in our running example, $I(X; Y) = H(X) - H(X \mid Y) \approx 1.485 - 1.273 \approx 0.21$ bits.
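To close the loop, here is a sketch that computes $I(X; Y)$ both from the KL-divergence form and as $H(X) - H(X \mid Y)$, confirming the two agree (variable names are mine):

```python
import math

# Joint distribution p(x, y) from the table above
joint = {
    ("pizza", "Italian"): 0.09, ("pizza", "American"): 0.16, ("pizza", "Japanese"): 0.05,
    ("bbq",   "Italian"): 0.02, ("bbq",   "American"): 0.38, ("bbq",   "Japanese"): 0.10,
    ("ramen", "Italian"): 0.01, ("ramen", "American"): 0.05, ("ramen", "Japanese"): 0.14,
}

# Marginals p(x) and p(y)
p_x, p_y = {}, {}
for (x, y), pxy in joint.items():
    p_x[x] = p_x.get(x, 0.0) + pxy
    p_y[y] = p_y.get(y, 0.0) + pxy

# KL-divergence form: I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) )
mi_kl = sum(pxy * math.log2(pxy / (p_x[x] * p_y[y])) for (x, y), pxy in joint.items())

# Entropy form: I(X;Y) = H(X) - H(X|Y)
h_x = -sum(px * math.log2(px) for px in p_x.values())
h_x_given_y = -sum(pxy * math.log2(pxy / p_y[y]) for (x, y), pxy in joint.items())

print(mi_kl, h_x - h_x_given_y)  # both ~0.21 bits
```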
To be expanded a bit when I get more time…