I have been using EM for years without knowing some details. Thanks to Chris Bishop for the general description of EM in Section 9.4 of Pattern Recognition and Machine Learning. Here follows my learning note.

EM is designed to maximize p(**X**|**θ**) in cases where direct optimization of p(**X**|**θ**) is difficult, but optimization of the complete-data likelihood function p(**X**,**Z**|**θ**) is significantly easier. When we consider p(**X**,**Z**|**θ**) in the context of p(**X**|**θ**), we have to consider the distribution of **Z**, denoted by q(**Z**). Given q(**Z**), we can decompose ln p(**X**|**θ**) into two terms:

ln p(**X**|**θ**) = L(q,**θ**) + KL(q||p)

where

L(q,**θ**) = Σ_**Z** q(**Z**) ln{ p(**X**,**Z**|**θ**) / q(**Z**) }
KL(q||p) = −Σ_**Z** q(**Z**) ln{ p(**Z**|**X**,**θ**) / q(**Z**) }
Note that both terms are functionals of q(**Z**). KL(q||p) is larger than or equal to zero, so L(q,**θ**) is a lower bound on ln p(**X**|**θ**). When q is identical to the true posterior distribution p(**Z**|**X**,**θ**), KL(q||p) is minimized to zero and L(q,**θ**) is maximized to ln p(**X**|**θ**).

The EM algorithm iteratively executes an E-step and an M-step. In the E-step, it estimates p(**Z**|**X**,**θ**^{old}) and takes it as q(**Z**). This minimizes KL(q||p) to zero.
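The E-step can be sketched numerically. Below is a minimal toy example for a two-component 1D Gaussian mixture (the mixture, its parameters, and the function name are my own illustration, not from the note): it computes the posterior responsibilities as q(**Z**), and checks that with this choice the bound L(q,**θ**) coincides with ln p(**X**|**θ**), i.e. KL(q||p) = 0.

```python
import numpy as np

def e_step(x, pi, mu, sigma):
    """E-step: set q(Z) to the posterior p(Z | X, theta_old) by computing
    responsibilities gamma[n, k] = p(z_n = k | x_n, theta_old)."""
    # ln p(x_n, z_n = k | theta): log mixture weight plus log Gaussian density.
    log_joint = (np.log(pi)
                 - 0.5 * np.log(2 * np.pi * sigma**2)
                 - 0.5 * (x[:, None] - mu)**2 / sigma**2)
    log_evidence = np.logaddexp.reduce(log_joint, axis=1)  # ln p(x_n | theta)
    gamma = np.exp(log_joint - log_evidence[:, None])      # q(Z) = posterior
    # With q equal to the posterior, the bound L(q, theta) equals ln p(X | theta).
    bound = np.sum(gamma * (log_joint - np.log(gamma)))
    return gamma, bound, log_evidence.sum()
```

Running `e_step` on any data should give `bound` equal to the log likelihood, which is exactly the statement that KL(q||p) vanishes after the E-step.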

In the M-step, it fixes q(**Z**) and maximizes L(q,**θ**) with respect to **θ**. This maximization not only increases L(q,**θ**), but also increases KL(q||p), unless a local optimum has been reached. This is because the true posterior p changes with **θ**, and any change increases KL(q||p) from zero to a positive number. (We had minimized KL(q||p) to zero in the E-step.) As both terms increase, ln p(**X**|**θ**) increases. This is why EM increases the likelihood monotonically.
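The full alternation of E- and M-steps, and the resulting monotone increase of ln p(**X**|**θ**), can be demonstrated on the same toy two-component 1D Gaussian mixture (again my own illustrative example; the closed-form M-step updates below are the standard ones for a Gaussian mixture):

```python
import numpy as np

def em_gmm_1d(x, pi, mu, sigma, n_iters=30):
    """Alternate E- and M-steps for a 1D two-component Gaussian mixture,
    recording ln p(X | theta) under the current theta at each iteration."""
    history = []
    for _ in range(n_iters):
        # E-step: responsibilities gamma[n, k] = p(z_n = k | x_n, theta_old).
        log_joint = (np.log(pi)
                     - 0.5 * np.log(2 * np.pi * sigma**2)
                     - 0.5 * (x[:, None] - mu)**2 / sigma**2)
        log_evidence = np.logaddexp.reduce(log_joint, axis=1)
        gamma = np.exp(log_joint - log_evidence[:, None])
        history.append(log_evidence.sum())     # ln p(X | theta_old)
        # M-step: closed-form maximization of L(q, theta) w.r.t. theta.
        nk = gamma.sum(axis=0)
        pi = nk / len(x)
        mu = (gamma * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((gamma * (x[:, None] - mu)**2).sum(axis=0) / nk)
    return pi, mu, sigma, history
```

Plotting (or asserting on) `history` shows the log likelihood never decreases across iterations, which is the monotone-convergence property argued above.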

This iterative process is similar to a “growing worm”. (Describe the metaphor in more detail later.)

It is notable that

L(q,**θ**) = Σ_**Z** p(**Z**|**X**,**θ**^{old}) ln p(**X**,**Z**|**θ**) − Σ_**Z** p(**Z**|**X**,**θ**^{old}) ln p(**Z**|**X**,**θ**^{old})

Because the second term on the right of the equation is independent of **θ**, the M-step effectively maximizes only the first term, which is usually denoted by Q(**θ**,**θ**^{old}) in the literature.
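This split of the bound into Q(**θ**,**θ**^{old}) plus a **θ**-independent entropy term can be checked numerically. The sketch below (my own toy setup, for the same hypothetical two-component 1D Gaussian mixture) evaluates both pieces with q fixed to the posterior under **θ**^{old}, so that L − Q is the same constant no matter which **θ** we plug in:

```python
import numpy as np

def bound_and_q(x, pi_old, mu_old, sigma_old, pi, mu, sigma):
    """Compute L(q, theta) and Q(theta, theta_old) for a 1D two-component
    Gaussian mixture, with q(Z) fixed to the posterior under theta_old.
    Their difference is the entropy of q, which does not depend on theta."""
    def log_joint(pi_, mu_, sigma_):
        return (np.log(pi_)
                - 0.5 * np.log(2 * np.pi * sigma_**2)
                - 0.5 * (x[:, None] - mu_)**2 / sigma_**2)
    lj_old = log_joint(pi_old, mu_old, sigma_old)
    gamma = np.exp(lj_old - np.logaddexp.reduce(lj_old, axis=1)[:, None])  # q(Z)
    Q = np.sum(gamma * log_joint(pi, mu, sigma))  # first term: Q(theta, theta_old)
    entropy = -np.sum(gamma * np.log(gamma))      # second term, theta-independent
    return Q + entropy, Q, entropy                # L(q, theta), Q, entropy
```

Because the entropy term is constant in **θ**, maximizing L(q,**θ**) over **θ** in the M-step is the same as maximizing Q(**θ**,**θ**^{old}).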

The figure to the left illustrates the EM algorithm in the model parameter space.

When the complete-data model is an exponential-family distribution or a mixture of exponential-family distributions, L(q, θ) is concave (as shown by the blue curve). Because KL(q||p) is minimized to zero in the E-step, the curve of L(q, θ) touches the curve of ln p(X|θ) at θ^{old}. By maximizing L(q, θ) with respect to θ in the M-step, we find θ^{new}, which corresponds to a larger ln p(X|θ) value. As explained previously, θ^{new} increases not only L(q, θ) but also KL(q||p) = ln p(X|θ) − L(q, θ). (Note in the figure that KL(q||p) at θ^{new} is larger than zero.)

Variational and Stochastic EM

Recall that we need to minimize KL(q||p) in the E-step by estimating p(Z|X,θ^{old}). For some models, this posterior is intractable and we have to use approximate methods. If we use variational methods, EM becomes Variational EM. If we use stochastic (sampling-based) methods, EM becomes Stochastic EM.
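One simple sampling-based flavor of the idea can be sketched as follows: instead of using the exact posterior responsibilities as q(**Z**), draw samples of **Z** from the posterior and use their empirical frequencies. (This is my own illustrative sketch on the toy two-component 1D Gaussian mixture, where the exact posterior happens to be available so the two can be compared; in the models that actually need stochastic EM, only the sampling side is tractable.)

```python
import numpy as np

def mc_e_step(x, pi, mu, sigma, n_samples=5000, rng=None):
    """Monte Carlo E-step: approximate q(Z) by empirical frequencies of
    posterior samples instead of exact responsibilities."""
    rng = np.random.default_rng() if rng is None else rng
    log_joint = (np.log(pi)
                 - 0.5 * np.log(2 * np.pi * sigma**2)
                 - 0.5 * (x[:, None] - mu)**2 / sigma**2)
    post = np.exp(log_joint - np.logaddexp.reduce(log_joint, axis=1)[:, None])
    # Draw n_samples component assignments per data point from the posterior.
    draws = np.array([rng.choice(len(pi), size=n_samples, p=p) for p in post])
    # Empirical assignment frequencies approximate the responsibilities.
    gamma_hat = np.stack([(draws == k).mean(axis=1)
                          for k in range(len(pi))], axis=1)
    return gamma_hat, post
```

As `n_samples` grows, `gamma_hat` converges to the exact responsibilities, so the Monte Carlo E-step approaches the exact one.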

Generalized EM

In the M-step, we need to maximize L(q, θ). If L(q, θ) has multiple optima, or the full maximization is intractable, we can settle for any update that merely increases L(q, θ) rather than maximizes it. In such cases, EM becomes Generalized EM.
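A concrete instance of such a partial M-step, on the same hypothetical toy mixture as before (my own illustration): perform the exact E-step, but update only the means while leaving the mixing weights and variances untouched. Because the update still increases L(q, θ), the log likelihood still never decreases.

```python
import numpy as np

def gem_step(x, pi, mu, sigma):
    """One generalized-EM iteration: exact E-step, partial M-step that
    updates only the means. Increasing L(q, theta) without fully
    maximizing it is enough to keep ln p(X | theta) from decreasing."""
    log_joint = (np.log(pi)
                 - 0.5 * np.log(2 * np.pi * sigma**2)
                 - 0.5 * (x[:, None] - mu)**2 / sigma**2)
    log_evidence = np.logaddexp.reduce(log_joint, axis=1)
    gamma = np.exp(log_joint - log_evidence[:, None])      # exact E-step
    nk = gamma.sum(axis=0)
    mu_new = (gamma * x[:, None]).sum(axis=0) / nk         # partial M-step
    return mu_new, log_evidence.sum()                      # pi, sigma unchanged
```

Iterating `gem_step` and recording the returned log likelihood shows the same monotone behavior as full EM, just with slower progress per iteration.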