Finite models                           Infinite models

Latent Dirichlet allocation  ------->   Hierarchical Dirichlet process    (mixtures)
         ^                                          ^
         |                                          |
Start -> Finite mixture model ------>   Infinite mixture model            (mixture)
         |                                          |
         v                                          v
Hidden Markov model          ------->   Infinite hidden Markov model      (temporal)
Finite Mixture Model
This is the most basic modeling technique: represent a complex thing by a weighted sum of simple things. For example, the Fourier transform represents a complex signal by a weighted sum of sinusoids. A typical example of a finite mixture model is the Gaussian mixture, which is often learned using EM (expectation-maximization). Another example is the multinomial mixture with a Dirichlet prior, which can be learned using Gibbs sampling because the Dirichlet and multinomial distributions are conjugate.
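To make the EM recipe concrete, here is a minimal sketch of EM for a one-dimensional Gaussian mixture (all names and the toy data are my own, not from any paper):

```python
import numpy as np

def em_gmm_1d(x, k=2, iters=100, seed=0):
    """Fit a 1-D Gaussian mixture with k components by EM (a minimal sketch)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Initialize weights, means, and variances.
    w = np.full(k, 1.0 / k)
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, np.var(x))
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point.
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities.
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

# Two well-separated clusters; EM should recover means near 0 and 5.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 500), rng.normal(5, 1, 500)])
w, mu, var = em_gmm_1d(x, k=2)
```

The E-step and M-step each have a closed form here precisely because every component is a simple Gaussian.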
Infinite Mixture Model
By modeling the component prior using a Dirichlet process, the finite mixture model is extended to have an infinite number of mixture components. If the likelihood is conjugate with the Dirichlet process prior, the model can be conveniently learned using Gibbs sampling [Neal 1998, Markov Chain Sampling Methods for Dirichlet Process Mixture Models]. Another interesting model is [Rasmussen 2000, The Infinite Gaussian Mixture Model].
Note that the Dirichlet process can be defined on any measure space, even one in which each sample point is a component (a set of parameters). So the generative process becomes: a Dirichlet process G generates a component θ for each observed object x. Because of the discreteness and clustering property of the Dirichlet process, the components of multiple x's are often identical; that is, these x's belong to the same component/cluster. So we can say that G (of the infinite mixture model) encodes both the components and their weights (of the finite mixture model). We often use another Dirichlet process parameterized by a base measure H as the prior of G. Because G ~ DP(H) and θ ~ G are conjugate, G can be integrated out. The likelihood x ~ F(θ) can be of any form (is this true? I need to read the infinite Gaussian mixture model paper).
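The clustering behavior that appears when G is integrated out is the Chinese restaurant process. A small sketch of it (parameter names are my own):

```python
import random

def crp(n, alpha, seed=0):
    """Chinese restaurant process: draw cluster assignments for n objects.

    Marginalizing out G ~ DP(alpha, H) induces this clustering: object i joins
    an existing cluster c with probability |c| / (i + alpha), or starts a new
    cluster with probability alpha / (i + alpha).
    """
    rng = random.Random(seed)
    assignments = []   # cluster id of each object
    sizes = []         # current size of each cluster
    for i in range(n):
        if rng.random() < alpha / (i + alpha):
            assignments.append(len(sizes))   # open a new cluster
            sizes.append(1)
        else:
            # Join an existing cluster proportionally to its size
            # ("rich get richer" -- this is the clustering property).
            c = rng.choices(range(len(sizes)), weights=sizes)[0]
            assignments.append(c)
            sizes[c] += 1
    return assignments

labels = crp(1000, alpha=2.0)
```

The number of distinct clusters grows only logarithmically in n, so the "infinite" model uses finitely many components for any finite data set.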
Latent Dirichlet Allocation
Mixture models represent the probability distribution of “a bag of” samples. However, in some cases we need to model more than one bag, where the objects in each bag are represented by a mixture model. To capture the commonality between bags, we want the mixture models of these bags to share the same set of components, while each bag has its own set of weights. When the components are multinomial and the weight sets of the bags share a Dirichlet prior, this hierarchical mixture model is known as latent Dirichlet allocation.
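The generative process just described can be sketched as follows (the function name, hyperparameter values, and toy sizes are illustrative assumptions, not from the LDA paper):

```python
import numpy as np

def lda_generate(n_docs, doc_len, n_topics, vocab, alpha=0.1, beta=0.01, seed=0):
    """Generate a toy corpus from the LDA generative process (a sketch).

    Each topic is a multinomial over the vocabulary (the shared components);
    each document draws its own topic weights from a Dirichlet prior.
    """
    rng = np.random.default_rng(seed)
    # Shared components: one multinomial over words per topic.
    topics = rng.dirichlet(np.full(vocab, beta), size=n_topics)
    docs = []
    for _ in range(n_docs):
        # Per-document (per-bag) weights over the shared topics.
        theta = rng.dirichlet(np.full(n_topics, alpha))
        words = []
        for _ in range(doc_len):
            z = rng.choice(n_topics, p=theta)             # pick a topic
            words.append(int(rng.choice(vocab, p=topics[z])))  # pick a word from it
        docs.append(words)
    return docs

docs = lda_generate(n_docs=5, doc_len=20, n_topics=3, vocab=50)
```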
Exact inference in LDA is intractable. Approximate inference can be done using variational methods. In particular, if a Dirichlet prior is assumed to generate the set of multinomial components (shared by the mixture models of all documents), these components can be integrated out, and model inference can be done using Gibbs sampling.
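A minimal sketch of that collapsed Gibbs sampler, with both the per-document weights and the topic-word multinomials integrated out (variable names and the toy corpus are my own):

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA (a minimal sketch).

    Only the topic assignment z of each token is resampled; theta and the
    topic-word multinomials are marginalized into the count arrays.
    """
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))   # document-topic counts
    nkw = np.zeros((n_topics, vocab))       # topic-word counts
    nk = np.zeros(n_topics)                 # tokens per topic
    z = []
    for d, doc in enumerate(docs):          # random initialization
        zd = rng.integers(n_topics, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                 # remove this token's counts
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # Conditional p(z = k | everything else), components collapsed.
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k                 # add the counts back
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return z, nkw

docs = [[0, 1, 0, 1], [0, 0, 1], [8, 9, 8], [9, 9, 8, 8]]  # two word groups
z, nkw = lda_gibbs(docs, n_topics=2, vocab=10)
```

Each sweep touches every token once; the count arrays are all the state the sampler needs because conjugacy lets the multinomials be integrated out.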
Hierarchical Dirichlet Process
In each bag j, each object x_{ji} is generated by its corresponding component θ_{ji}. Within the bag, all θ_{ji}'s are samples of G_{j}, a Dirichlet process (which encodes the components and their weights within bag j). In the introduction of LDA, we explained that we want the G_{j}'s of various documents to share components. This can be done by defining all G_{j}'s on the same measure space, which covers all possible values of θ, and assuming that the G_{j}'s are samples of a global Dirichlet process G_{0}. Note that G_{j} ~ G_{0} mimics multinomial sampling, and θ_{ji} ~ G_{j} also mimics multinomial sampling, so we put a Dirichlet process prior on G_{0} to make the whole generative process conjugate; thus G_{0} and the G_{j}'s can all be integrated out. This is key to estimating the HDP using Gibbs sampling.
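When G_{0} and all the G_{j}'s are integrated out, the induced clustering is the Chinese restaurant franchise: tokens in each bag sit at tables, and tables pick dishes (global components) from a shared top-level process. A sketch under that interpretation (names and parameters are my own):

```python
import random

def chinese_restaurant_franchise(doc_lens, alpha, gamma, seed=0):
    """Chinese restaurant franchise: the clustering induced when G_0 and the
    per-bag G_j's of an HDP are integrated out (a sketch).

    In bag j, token i sits at an existing table with probability proportional
    to the table's occupancy, or opens a new table w.p. proportional to alpha.
    Each new table draws its dish (a global component) from a top-level CRP
    with concentration gamma, so components are shared across bags.
    """
    rng = random.Random(seed)
    dish_counts = []                  # number of tables serving each dish
    comp = []                         # component id of each token, per bag
    for n in doc_lens:
        table_sizes, table_dish, labels = [], [], []
        for _ in range(n):
            weights = table_sizes + [alpha]
            t = rng.choices(range(len(weights)), weights=weights)[0]
            if t == len(table_sizes):            # new table: draw its dish
                dw = dish_counts + [gamma]
                d = rng.choices(range(len(dw)), weights=dw)[0]
                if d == len(dish_counts):        # brand-new global component
                    dish_counts.append(0)
                dish_counts[d] += 1
                table_sizes.append(0)
                table_dish.append(d)
            table_sizes[t] += 1
            labels.append(table_dish[t])
        comp.append(labels)
    return comp

comp = chinese_restaurant_franchise([50, 50, 50], alpha=1.0, gamma=1.0)
```

Because every table's dish comes from the same top-level process, the same component ids recur across bags, which is exactly the sharing the HDP is built for.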
Hidden Markov Model
In a mixture model, components are exchangeable. If we assume first-order temporal dependency between components, the mixture model becomes a hidden Markov model (HMM), where each output pdf of the HMM corresponds to a component of the mixture model. Decoding the most likely state sequence of an HMM can be done using the Viterbi algorithm. A usual hierarchical extension of the HMM is to extend each output pdf into a mixture model.
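A minimal log-space Viterbi decoder for a discrete-emission HMM (the toy model below is my own illustration):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden state sequence for a discrete HMM (log-space Viterbi).

    pi: initial state probs (K,); A: transitions (K, K); B: emissions (K, V).
    """
    K, T = len(pi), len(obs)
    logd = np.log(pi) + np.log(B[:, obs[0]])   # best log-prob ending in each state
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = logd[:, None] + np.log(A)     # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)
        logd = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(logd.argmax())]
    for t in range(T - 1, 0, -1):              # follow the backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy 2-state HMM: state 0 mostly emits symbol 0, state 1 mostly emits symbol 1.
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B = np.array([[0.9, 0.1], [0.1, 0.9]])
states = viterbi([0, 0, 1, 1, 1], pi, A, B)   # -> [0, 0, 1, 1, 1]
```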
Infinite Hidden Markov Model
An HMM assumes that the state/component at time t, denoted by θ_{t}, depends only on θ_{t-1}. Some high-order extensions of the HMM assume that θ_{t} depends on θ_{t-1}, …, θ_{t-L}, where L is known as the order of the HMM and reflects the length of memory/history. Most high-order HMMs assume L is a fixed positive integer, and can be represented by a first-order HMM in which each state corresponds to a combination of θ_{t-1}, …, θ_{t-L}. In contrast to these fixed-order HMMs, the variable-length HMM (VLHMM) learns a set of combinations of variable lengths, under the intuition that sometimes we need only a short memory/history to predict the future accurately, while at other times we need a longer one. Compared with the VLHMM, which uses a prefix-tree pruning approach to learn memories up to a maximum length L, the infinite HMM seems much more elegant and can learn memories of unbounded length. (I need to read [Beal et al., The Infinite Hidden Markov Model] carefully to make sure the description above is not wrong.)
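The reduction of a fixed-order HMM to a first-order one over composite states can be sketched as follows, here for order L = 2 (the function and the toy tensor are illustrative assumptions):

```python
import numpy as np
from itertools import product

def second_order_to_first_order(A2):
    """Reduce a second-order HMM to a first-order one over composite states.

    A2[i, j, k] = p(state k at time t | state i at t-2, state j at t-1).
    The composite state at time t is the pair (state_{t-1}, state_t); the
    first-order transition (i, j) -> (j2, k) is nonzero only when j2 == j,
    because the pairs must overlap in the shared middle state.
    """
    K = A2.shape[0]
    pairs = list(product(range(K), repeat=2))
    A1 = np.zeros((K * K, K * K))
    for a, (i, j) in enumerate(pairs):
        for b, (j2, k) in enumerate(pairs):
            if j2 == j:
                A1[a, b] = A2[i, j, k]
    return A1

# A toy 2-state second-order transition tensor (rows sum to 1 over k).
A2 = np.random.default_rng(0).dirichlet(np.ones(2), size=(2, 2))
A1 = second_order_to_first_order(A2)   # a 4x4 first-order transition matrix
```

The same construction works for any fixed L, at the cost of K^L composite states, which is exactly why fixed-order HMMs become expensive as the memory length grows.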