Mutual Information – Categorical Variable Example

Mutual information and its cousin, the uncertainty coefficient (Theil’s U), are useful tools from information theory for discovering dependencies between variables that are not necessarily described by a linear relationship.

I’m short on time, and plenty of good material already exists on the subject: see Section 1.6 of “Pattern Recognition and Machine Learning” by Bishop, freely available as a PDF online. What I find missing are some trivial examples to help build intuition. That’s what I will contribute here, using categorical variables.

We’ll use (and expand) the following running example: consider a population of n persons labelled 1, 2, \ldots, i, \ldots, n, and let x_i be the favourite dish of person i, with the following distribution:

                Pizza   Barbecue   Sushi
    p(x)        0.3     0.5        0.2
Information contained in an observation

Let h(x) = -\log_2 p(x) be the information measured in bits (if the natural logarithm \ln is used instead, the unit is called “nats”) contained in observing x, where the probability of x is given by p(x). Bishop refers to this quantity as “the surprise” of observing x assuming it comes from the distribution described by p(x).

Now, in our example, observing that a person’s favourite dish is pizza carries h(x) = -\log_2 0.3 \approx 1.74 bits of information, whereas observing that a person’s favourite dish is barbecue carries h(x) = -\log_2 0.5 = 1 bit of information (less surprise).

Entropy of a random variable

The entropy, H[x], is the expected amount of information in the observation of a random variable:

    \[H[x] = E[h(x)] = \sum_x p(x) h(x) = -\sum_x p(x) \log_2 p(x)\]

For our example:

    \[H[x] = - (0.3 \log_2 0.3 + 0.5 \log_2 0.5 + 0.2 \log_2 0.2) \approx 1.49~\text{bits}\]

The information-theoretic interpretation of this is that H[x] is the (theoretical) minimum number of bits required per x to encode a string of successive, independent x’s.
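As a quick sanity check of the arithmetic above, the entropy can be computed in a few lines of Python (a minimal sketch; `entropy` is just an illustrative helper, applied to the dish probabilities from the running example):

```python
import math

def entropy(probs, base=2.0):
    """H[x] = -sum_x p(x) log p(x); bits for base 2, nats for base e."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Marginal distribution of favourite dishes from the running example.
print(round(entropy([0.3, 0.5, 0.2]), 2))  # 1.49
```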

Conditional entropy

If two random variables x and y are dependent (i.e. somehow related), then knowing something about y provides additional information about x (and vice versa), decreasing the information (entropy) required to describe x. This can be expressed as the conditional entropy:

    \[H[x|y] = - \sum_x \sum_y p(x, y) \log_2 p(x|y) = - \sum_x \sum_y p(x, y) \log_2 \frac{p(x, y)}{p(y)}\]

H[x|y] can then be interpreted as the expected number of additional bits required to encode x given that the value of y is already known. Let’s expand our example, and let y_i represent the nationality of person i. Let the joint distribution p(x,y) be as follows:

                    Italian   American   Japanese   marginal p(x)
    Pizza            0.09      0.16       0.05          0.3
    Barbecue         0.02      0.38       0.10          0.5
    Sushi            0.01      0.05       0.14          0.2
    marginal p(y)    0.12      0.59       0.29          1.0

Note that x and y are not independent here – for example, if i is Italian, i is expected to have a higher preference for pizza (stereotypical, I know) than the population as a whole.

Then for our example:

    \[H[x|y] = - \begin{pmatrix}0.09 \log_2 \frac{0.09}{0.12} &+ 0.16 \log_2 \frac{0.16}{0.59} &+ 0.05 \log_2 \frac{0.05}{0.29}\\+ 0.02 \log_2 \frac{0.02}{0.12} &+ 0.38 \log_2 \frac{0.38}{0.59} &+ 0.1 \log_2 \frac{0.1}{0.29}\\+ 0.01 \log_2 \frac{0.01}{0.12} &+ 0.05 \log_2 \frac{0.05}{0.59} &+ 0.14 \log_2 \frac{0.14}{0.29}\end{pmatrix} \approx 1.27~\text{bits} \]

Clearly H[x|y] < H[x] in our example, meaning that if we already know the value of y, less information (fewer bits) is required to describe x. Doing the same exercise for H[y|x], we’d see that H[y|x] < H[y]. This is an effect of x and y not being independent: they carry some amount of mutual information.

Mutual Information

Mutual information can be defined via the KL divergence as:

    \[I[x, y] = KL(p(x,y) || p(x)p(y))\]

Note that if x and y were independent, then p(x,y) = p(x)p(y), making the KL divergence (and hence the mutual information) 0. An intuitive way to think about it is as a measure of dissimilarity between the distributions p(x,y) and p(x)p(y).

Manipulating this algebraically, we can however also write I[x, y] in more familiar terms:

    \[I[x, y] = H[x] - H[x|y] = H[y] - H[y|x] \]

Thus, in our running example, I[x,y] = 1.4855 - 1.2728 \approx 0.21 bits.
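The numbers above can be reproduced with a short Python sketch that recomputes H[x], H[x|y], and I[x,y] directly from the joint table (row/column order assumed as in the table, with the third dish labelled sushi for concreteness; note that subtracting the rounded entropies would change the last digit of I[x,y]):

```python
import math

# Joint distribution p(x, y) from the running example:
# rows = dishes (pizza, barbecue, sushi),
# columns = nationalities (Italian, American, Japanese).
joint = [
    [0.09, 0.16, 0.05],
    [0.02, 0.38, 0.10],
    [0.01, 0.05, 0.14],
]
p_x = [sum(row) for row in joint]        # marginal p(x): 0.3, 0.5, 0.2
p_y = [sum(col) for col in zip(*joint)]  # marginal p(y): 0.12, 0.59, 0.29

H_x = -sum(p * math.log2(p) for p in p_x)
H_x_given_y = -sum(p_xy * math.log2(p_xy / py)
                   for row in joint
                   for p_xy, py in zip(row, p_y))
I_xy = H_x - H_x_given_y
print(round(H_x, 2), round(H_x_given_y, 2), round(I_xy, 2))  # 1.49 1.27 0.21
```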

To be expanded a bit when I get more time…

KL Divergence Online Demo

The intersection of two current events has borne fruit:

To try out Shiny, I created an interactive visualization for Kullback-Leibler divergence (or KL Divergence). Right now, it only supports two univariate Gaussians, which should be sufficient to build some intuition.

If you like it, let me know! If it turns out to be popular, I might add more features, or create similar visualizations for other concepts!

What is KL Divergence? What am I seeing?

Consider an unknown probability distribution p(x), which we’re trying to approximate with a probability distribution q(x); then

    \[\text{KL}(p||q) = - \int p(x) \ln \frac{q(x)}{p(x)} dx\]

can informally be interpreted as the amount of information lost by approximating p with q. As you might imagine, this has several applications in Machine Learning. A recurring pattern is to fit parameters to a model by minimizing an approximation of \text{KL}(p||q) (i.e., making q “as similar” to p as possible). This blog post elaborates in a fun and informative way. If you have never heard of KL divergence before, Bishop provides a more formal (but still easy to understand) introduction in Section 1.6 of PRML.

Suggested exercises with the interactive plot

Using the visualization tool, find out (or verify) the answer to the following questions:

  • Is \text{KL}(p||q) = \text{KL}(q||p)? Always? Never?
  • When is \text{KL}(p||q) = 0?
  • Let r(x) = \mathcal{N}(0, 1) and s(x) = \mathcal{N}(0, 2). Which is larger: \text{KL}(r||s) or \text{KL}(s||r)? Why?
  • Is \text{KL}(p||q) ever negative? When, or why not?
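For the exercise comparing \mathcal{N}(0, 1) and \mathcal{N}(0, 2), the KL divergence between two univariate Gaussians has a closed form against which the plot can be checked. A Python sketch, assuming \mathcal{N}(\mu, \sigma^2) is parameterised by mean and variance as in Bishop (`kl_gauss` is an illustrative helper, not part of the demo):

```python
import math

def kl_gauss(mu1, var1, mu2, var2):
    """Closed-form KL(N(mu1, var1) || N(mu2, var2)) in nats,
    for univariate Gaussians given as (mean, variance)."""
    return 0.5 * (math.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

# KL is zero when the two distributions coincide ...
print(kl_gauss(0.0, 1.0, 0.0, 1.0))  # 0.0
# ... and it is not symmetric in general:
print(round(kl_gauss(0.0, 1.0, 0.0, 2.0), 3))  # 0.097
print(round(kl_gauss(0.0, 2.0, 0.0, 1.0), 3))  # 0.153
```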

Proving the set-covering problem is NP-complete (using reduction from the vertex-cover problem)

In this post, we will prove that the decision version of the set-covering problem is NP-complete, using a reduction from the vertex-cover problem (which is NP-complete). This is Exercise 35.3-2 in CLRS3[1]. We will follow the template given in an earlier post.

Problem statement

We will use the definition for a set cover given on page 1118 in CLRS3[1], also given on Wikipedia. Since we’re concerned with the decision version, we seek to answer whether a set cover \mathcal{C} exists for X (the set containing the elements to be covered), containing at most k sets.

Showing the problem is in NP

Let \mathcal{C} = \{S_1, S_2, \ldots, S_n\} be the certificate. We assume that each S \in \mathcal{C} uniquely covers at least one x \in X (i.e. covers some element no other set in \mathcal{C} covers), so:

(1)   \begin{align*}   |\mathcal{C}| &\leq |X|\\   \forall S \in \mathcal{C} : |S| &\leq |X| \end{align*}

Hence we can verify that n \leq k and that \mathcal{C} covers X in \mathcal{O}(|X|^2) time, which proves the problem is in NP.
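The certificate check can be sketched in Python (`verify_set_cover` is a hypothetical helper name, not from CLRS):

```python
def verify_set_cover(X, cover, k):
    """Polynomial-time certificate check: `cover` must contain at most k
    sets whose union includes every element of X."""
    if len(cover) > k:
        return False
    covered = set()
    for S in cover:
        covered |= set(S)
    return set(X) <= covered

# A tiny instance: two chosen sets cover X with k = 2.
print(verify_set_cover({1, 2, 3, 4, 5}, [{1, 2, 3}, {4, 5}], 2))  # True
```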

Showing the problem is NP-hard

We will show that the decision version of the set-covering problem is NP-hard by showing:

(2)   \begin{equation*}   \textsc{Vertex-Cover} \leq_\text{P} \textsc{Set-Cover} \end{equation*}


A vertex cover of an undirected graph G = (V, E) is a subset V' \subseteq V, such that each e \in E is incident to at least one v' \in V'[1]. Its decision version is to determine whether a vertex cover of size at most k exists for G.

We can reduce any instance of decision-vertex-cover for G = (V, E) to an instance of decision-set-cover by letting X = E and letting the family of sets \mathcal{F} = \{S_v \mid v \in V \land \deg(v) > 0\}, such that S_v contains all edges in E incident to v. We can perform this reduction in \mathcal{O}(|E| + |V|), which is polynomial time.
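A Python sketch of this reduction, representing each element of X as an edge tuple (vertices of degree 0 never appear in an edge, so they are omitted automatically):

```python
from collections import defaultdict

def vertex_cover_to_set_cover(E):
    """Reduction sketch: X = E, and for every vertex v with deg(v) > 0,
    S_v is the set of edges incident to v. Runs in O(|V| + |E|)."""
    S = defaultdict(set)
    for e in E:
        u, v = e
        S[u].add(e)
        S[v].add(e)
    return set(E), dict(S)  # (X, F), with F keyed by vertex

E = [("a", "b"), ("b", "c"), ("c", "d")]
X, F = vertex_cover_to_set_cover(E)
print(sorted(F["b"]))  # [('a', 'b'), ('b', 'c')]
```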

Example of vertex cover (shaded nodes are in V’)

The example image above shows a vertex cover of size 4. Applying the reduction to this example gives:

(3)   \begin{align*}\mathcal{F} = \{S_{v_1} &= \{e_1, e_3\}, \\S_{v_3} &= \{e_2, e_5\},\\S_{v_4} &= \{e_3, e_6\},\\S_{v_6} &= \{e_4, e_7, e_8\}\}\end{align*}

We will now prove for any vertex cover V' for G and a set cover \mathcal{C} for sets X and \mathcal{F} constructed in accordance with the reduction above:

(4)   \begin{equation*}   \exists V'\subseteq V : |V'| \leq k \iff   \exists \mathcal{C} \subseteq \mathcal{F} : |\mathcal{C}| \leq k \end{equation*}

Since X = E, and we know V' is a vertex cover for G = (V, E), we can find \mathcal{C} for (X, \mathcal{F}) by selecting from \mathcal{F} all sets corresponding to v' \in V'.

Similarly, if \mathcal{C} exists, then by the way we defined the reduction, we know it covers X = E. This means we can find V' by taking the vertices corresponding to the sets S_v \in \mathcal{C}.

Due to the 1:1 correspondence between vertices in V' and sets in \mathcal{C}, we have |V'| = |\mathcal{C}|~\blacksquare

[1] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 3rd ed. MIT Press, 2009.

Proving the set-partition problem is NP-complete (using reduction from subset sum)

In this post, we will prove that the set-partition problem is NP-complete using a reduction from the subset sum problem (which is NP-complete[1]). This is Exercise 34.5-5 in CLRS3[1]. We will follow the template given in an earlier post.

Problem statement

Given a set S of numbers, determine whether S can be partitioned into A and \overline{A} = S - A, such that:

(1)   \begin{equation*}\sum_{x \in A} x = \sum_{x \in \overline{A}} x\end{equation*}

Showing the problem is in NP

Let (A, \overline{A}) be the certificate. We can verify the certificate in linear time by checking that A and \overline{A} partition S, summing the elements in each set, and comparing the sums.
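A Python sketch of this check (the partition test uses multiset comparison, since S may contain duplicates; `verify_partition` is an illustrative helper):

```python
from collections import Counter

def verify_partition(S, A, A_bar):
    """Linear-time certificate check: (A, A_bar) must partition the
    multiset S, and the two halves must have equal sums."""
    if Counter(A) + Counter(A_bar) != Counter(S):
        return False
    return sum(A) == sum(A_bar)

S = [3, 1, 1, 2, 2, 1]
print(verify_partition(S, [3, 2], [1, 1, 2, 1]))  # True  (5 = 5)
print(verify_partition(S, [3, 1], [1, 2, 2, 1]))  # False (4 != 6)
```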

Showing the problem is NP-hard

We will show that the set-partition problem is NP-hard by showing:

(2)   \begin{equation*}\textsc{Subset-Sum} \leq_\text{P} \textsc{Set-Partition}\end{equation*}


Proving 0-1 integer programming is NP-complete (using reduction from 3-CNF-SAT)

In this post, we will prove that 0-1 integer programming is NP-complete using a reduction from 3-CNF-SAT (which is NP-complete[1]). We will follow the template given in an earlier post.

Problem statement

The decision problem for 0-1 integer programming is formulated as follows[2]:

Given an integer m \times n matrix A and an integer m-vector b, determine whether there exists an integer n-vector x with elements in \{0, 1\}, such that:

(1)   \begin{equation*}Ax \le b\end{equation*}
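For intuition, a brute-force decision procedure for tiny instances can be sketched in Python; it enumerates all 2^n candidate vectors, consistent with the problem being hard in general (`zero_one_ip_feasible` is an illustrative helper):

```python
from itertools import product

def zero_one_ip_feasible(A, b):
    """Does a 0-1 vector x exist with A x <= b (componentwise)?
    Exhaustive search over all 2^n candidates; for intuition only."""
    n = len(A[0])
    for x in product((0, 1), repeat=n):
        if all(sum(a * xj for a, xj in zip(row, x)) <= bi
               for row, bi in zip(A, b)):
            return True
    return False

# x1 + x2 <= 1 together with -x1 - x2 <= -1 forces exactly one of x1, x2 to be 1.
print(zero_one_ip_feasible([[1, 1], [-1, -1]], [1, -1]))  # True
```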


How to show a problem is NP-complete

This is the first post in a series in which I will attempt to give visual, easy-to-understand proofs* of NP-completeness for a selection of decision problems. In this post, I will give a “template” which can be used (and will be used for the proofs I post).
