Latent Variables & Expectation Maximization Algorithm

Bayesian Approach to Machine Learning

May 30, 2020

17 min read

👁 Latent 'Smoke': (Picture Credit: Author)

Latent ‘Smoke’: (Picture Credit: Author)

‘Latent’, originated from Latin which means being hidden. Probably you have all known about latent heat, which is the heat energy required for phase transformation keeping the temperature constant. So we observe a change, but the reason behind it is apparently hidden. The motivation for Latent Variable Model (LVM) is to explain the surface structure of the data with some underlying hidden variable. In this post, we will try to understand LVMs and very well known algorithm to deal with such models, known as Expectation Maximization (EM) algorithm through an example. Personally I think this topic is one of the cores of Bayesian Machine Learning and, it will involve lots of math. But, believe me, it’s gonna be easy. Also, understanding the core concepts of EM algorithm will help you truly appreciate the basis of variational autoencoder and eventually GAN. What you can expect to learn from this post:

Why Latent Variable Model (LVM)?
Gaussian Mixture Models (GMM).
Expectation Maximization Algorithm.
Variational Inference and EM Algorithm.
Example of training GMM using EM Algorithm.

Latent Variables and LVM:

Why hidden variables? Latent variables could be some theoretical concept, or real physical variables that are unobserved. Let’s elaborate with few realistic examples. In a text document, extracted ‘words’ could be treated as features. By factorizing these features, we can find a ‘topic’ for the document. So the observed features (‘words’ – branches, roots, leaves, flowers, fruits) would factorize to latent feature (‘topic’ – Tree, garden). Whenever we have large scale, high dimensional noisy features, it makes sense to build a classifier on latent features. In the case of an object recognition task, each data point (the matrix of pixel intensities) corresponds to an image of one of the objects. In such a case, the latent variable might have an interpretation as the position orientation of the object. As Chris Bishop in his Pattern Recognition book mentioned –

The primary role of the latent variables is to allow a complicated distribution over the observed variables to be represented in terms of a model constructed from simpler (typically exponential family) conditional distributions.

Generally a LVM p, is a probability distribution over 2 sets of variables x, z; p(x, z). x are observed variables at learning time in a data-set D and z are never observed. Joint probability distribution of the model can be written as p(x, z) = p(x|z) p(z).

You will see that developing the fitting algorithm of LVMs is rather difficult but, there are few benefits of studying this –

LVMs often have fewer parameters than original models, which directly represent correlation in the visible space.
LVs in LVMs mostly serve as a compressed representation of the data, which is also known as ‘bottleneck’. This is the basis of unsupervised learning. If you think of an image (ex. human face) as the observed variable x then, the latent variable z could encode the features of the face (which are not seen during training), like it can encode whether the face is happy or sad, male or female etc.
LVMs can also help us to deal with missing data. The previous point and this one are related. If we can do a posterior inference on the latent variables i.e., given an image x what are the latent factors z, then once we find some image with missing parts, the latent factors can be used for reconstruction.

To discuss and develop a realistic LVM, we will formulate Gaussian Mixture Model (GMM) in terms of discrete latent variables. Since the most general technique for obtaining the maximum likelihood estimators in LVMs is the EM algorithm, we will discuss that in detail too in the coming sections.

Probabilistic Clustering and GMM:

Before in my DBSCAN algorithm post I have discussed drawbacks of K-Means algorithm especially, in dealing with spatial clusters of different density, size and data-points including noise and outliers. Another problem of K-Means is that it performs hard clustering. Let’s see an example using the simple [make_blobs](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html) data-set. Below in figure 1, we see the data-set with 3 clusters with and without labels. Once we apply K-Means algorithm it will identify the 3 clusters but, it is not possible to assign any probability to the data points at the boundaries of the clusters. What if we want to tell the probability or likelihood of each data-point to belong in these three clusters (especially significant for the points at the boundaries of the clusters)?

👁 Figure 1: Make Blobs Data-set (left) and same data-set including labels (right). Source: Author

Figure 1: Make Blobs Data-set (left) and same data-set including labels (right). Source: Author

In figure 2, we have shown an example of applying Gaussian Mixture Model (GMM) clustering to the data-set and the size of each data-point is proportional to the certainty of the prediction of GMM. This is compared with the K-Means result and you can clearly see the difference especially for the points at the boundaries.

👁 Figure 2: Comparison of soft clustering (right) using GMM is depicted by assigning certainty of prediction to each data-point. This compared with hard clustering-K-Means (left). Source: Author

Figure 2: Comparison of soft clustering (right) using GMM is depicted by assigning certainty of prediction to each data-point. This compared with hard clustering-K-Means (left). Source: Author

With this soft clustering as a motivation in mind, let’s build up the GMM using Latent Variables but before that let’s just recall the multivariate normal distribution. The symbols used here are pretty usual.

👁 Exp 1: Multivariate normal distribution.

Exp 1: Multivariate normal distribution.

We will use mixture of normal distributions to develop a probabilistic clustering (soft clustering) model. The naive way to think of GMM as linear superposition of Gaussian components. Each component will have some probability associated with it to represent a particular data-point.

👁 Exp 2: Mixture of Gaussian Distributions.

Exp 2: Mixture of Gaussian Distributions.

The πs are known as mixing coefficients in the Exp. 2. For a single data point the sum is over the number of clusters K. The joint probability distribution of variables x and z (latent)is: p(x, z) = p(x|z) p(z). So the marginal distribution over data point x will be given by: p(x) = ∑ p(x|z) p(z) dz (obtained by summing/integrating over variable z). One way to think of this p(x) is a mixture of distributions p(x|z), weighted by p(z). For the GMM context, we can think of z as a binary random variable with 1 of K representation (you can think of this as categorical representation), in which a particular element is equal to 1 and all other elements are 0.

👁 Exp 3: Marginal distribution over latent variable.

Exp 3: Marginal distribution over latent variable.

Now the mixing coefficients and latent variables z, all fall into a place when we consider the marginal distribution over z is specified in terms of the mixing coefficients as defined in the expression above. The conditions satisfied by mixing coefficients are necessary so that they can effectively be interpreted as probability. We think of the kth mixing coefficient (π_k) as the prior probability of z_k equal to 1. As the latent variables use 1 of K representation (think of this as categorical representation), another way we can write p(z) and corresponding conditional distribution is as below –

👁 Exp 4: Marginal and Conditional distribution

Exp 4: Marginal and Conditional distribution

So combining the multivariate normal distribution and latent variable distribution we can rewrite the GMM model once more as below.

👁 Exp 5: GMM in terms of mixing coefficients and multivariate normal distribution.

Exp 5: GMM in terms of mixing coefficients and multivariate normal distribution.

The above expression is valid for a single data-point. For several data-points we will have a corresponding latent variable associated with each observed data-point. Rather than working with only the marginal distribution p(x), we can now work with the joint distribution p(x, z). Did we make things more complicated? To answer this we will have to dive little deeper in the mathematics of likelihood estimation. The section below will reveal itself how latent variables will help us to address the Maximum Likelihood Estimate (MLE) easily.

Expectation Maximization Algorithm (Motivation):

Our main stimulus for GMM was to do soft clustering of data-points. General intuition to obtain the best parameters for each cluster (mixing coefficient, cluster mean, covariant matrix) would be to perform the usual MLE task. Above in expression 4, we have written the GMM model for a single data-point. Let us consider a data-set with N points, such that the set of observation will be {x1, x2, x3, …., xN}. For N data-points the likelihood function will be as below –

👁 Exp 6: Log likelihood function including all the N number of data-points.

Exp 6: Log likelihood function including all the N number of data-points.

The biggest problem with this expression is that the logarithm is not directly acting on the Gaussian but on the sum of the mixtures of Gaussian. Thus we have no advantage over logarithm acting on exponents. So indeed this expression is difficult to reduce. Appreciate that this isn’t a problem for a single Gaussian. Apart from the previous problem, another issue that could arise is when one of the components of GMM explains only a single point. This means one of the μ’s is equal to a data-point x_n. In the limit of a variance tending to 0, the likelihood term will behave like a delta function. So maximizing likelihood isn’t a good idea in this particular problem. Another way to think about the Gaussian mixture problem in comparison with a single Gaussian distribution problem is that while the single Gaussian model belong to the exponent family which have concave log-likelihood (tractable for simple enough families), the mixture of components won’t have this property. So all these reasons call for a new method to determine the parameters and, it’s about time to dive into Expectation Maximization (EM) algorithm, which is indeed based on Bayesian update rules.

EM Algorithm:

Before we learn EM algorithm, we need to first derive some important expressions. First, we will use Bayes’ theorem to get the expression for posterior distribution of z given x from the conditional distribution p(x|z), which is as below –

👁 Exp 7: Posterior distribution of latent variables

Exp 7: Posterior distribution of latent variables

The γ-term is known as responsibility, and it represents the responsibility the kth component of latent variable z takes to explain the observation x. As the name suggests, EM algorithm relies on 2 simple steps: Expectation (E-step) and Maximization (M-step) –

a). E-step: Expectation step is where we calculate the posterior distribution, i.e. the responsibilities (exp: 5) based on the current parameters. This is more of solving the inference problem i.e. which Gaussian is responsible for which data-point?

b). M-Step: The obtained probabilities from E-step are then used to calculate the updated parameters (μ, Σ, and π).

After several iterations this process will converge and we obtain the best fit values of the parameters. Let’s review the steps of EM algorithm more generally –

👁 Exp 8: Steps of EM algorithm in general

Exp 8: Steps of EM algorithm in general

Here it is important to mention that best parameters can be obtained once we know the cluster assignment (which is partially what we are trying on E-step). Also appreciate that because of the obtained posterior probabilities from E-step, we can use the definition of expectation and then we need to maximize the expectation of the complete data likelihood instead of the data likelihood. How we get into this expression of the expectation, will be more clear in the coming section but for now remember that in terms of responsibility (defined in exp. 7) we can calculate the [1] parameters (μ , Σ , π) and the expressions are as below –

👁 Exp 8: Parameters obtained from MLE.

Exp 8: Parameters obtained from MLE.

Variational Lower Bound and EM Algorithm:

It was just mentioned, until convergence repeat the steps of EM algorithm. You might think how to guarantee that the algorithm will converge? Let’s dig deeper! We started the GMM saying that for every data-point _xn there’s a latent variable _zn. The log likelihood of the complete data-set takes the form as below –

👁 Exp 9: The complete likelihood including latent variables

Exp 9: The complete likelihood including latent variables

As mentioned before, the problem with this expression is Log is acting on the sum of the mixtures of the Gaussian. We will play little trick to do away with this issue. Let’s introduce arbitrary probability distribution (without any special property) over latent variable – q(z) and now we can modify the expression above –

👁 Exp 10: Little modification in the data likelihood

Exp 10: Little modification in the data likelihood

Using the definition of Expectation value this expression can be further simplified as –

👁 Exp 11: Use the definition of expectation

Exp 11: Use the definition of expectation

The expectation term can be read as expected value of _zi drawn from a distribution _qi. We can compare this expression with Exp. 8 above to understand how summing over the latent variables gives us the expectation. In Exp. 11, Log is operating on the expectation value. Given that Log is a concave function, let’s apply Jensen’s inequality to further simplify the expression –

👁 Exp 12: Use Jensen's inequality to modify Exp 11.

Exp 12: Use Jensen’s inequality to modify Exp 11.

The expression above says that log likelihood of the complete data-set is greater than or equal to the above expectation. The expectation term is known as Evidence Lower Bound (ELBO) and, expanding the logarithm and applying Bayes’ theorem we can reach more meaningful conclusion –

👁 Exp 13: Evidence Lower Bound

Exp 13: Evidence Lower Bound

Using the definition of KL divergence the above expression becomes –

👁 Exp 14: ELBO and KL Divergence

Exp 14: ELBO and KL Divergence

As KL divergence ≥ 0, this means ELBO ≤ log p(x). Now we know why it is called a lower bound. Our objective is to maximize our data likelihood log p(X), and instead we maximize the ELBO. Based on this expression (Exp. 14) let’s review the steps of EM algorithm one more time –

Start the algorithm by randomly initializing the parameters.
E step: Maximize the ELBO w.r.t q(z) and keep the parameters (π, μ, Σ) fixed. Since log p(X) is independent of q(z), this step is essentially just calculating the KL divergence and in case of GMM the q(z) is set equal to the posterior. This is possible since the calculation of posterior for every data-point is possible in GMM.
M-step: ELBO is maximized w.r.t to the parameters while q(z) is kept fixed and this is very similar like MLE estimation. In this step the parameters are updated.

Why Convergence:

We started this section with a question- why EM algorithm ensures convergence? EM algorithm is an iterative process and thus E and M step goes on in cycle. It is important to remember that in each steps of EM algorithm, first the distribution q(z) is set equal to the posterior p(z|x) (E-step), and the parameters are updated in the M step from maximization. With the updated parameters the lower bound increases (unless the algorithm already reached a local maxima). This, in turn, increases the log likelihood but, logp(x) increases more than the lower bound because, while q(z) was obtained for old parameters but, now in E step when we calculate the posterior p(z|x) again and this time using the new parameters from M-step, gives rise to a non-zero KL divergence term. Thus the lower bound, as well as the data likelihood, increases monotonically. To ensure that the algorithm does not get stuck in a local maxima, we run the algorithm for different initial values of the parameters. This monotonic increasing nature of the data-likelihood is what we wanted to show as a proxy for convergence. This sometimes is very helpful to debug the code if we see the likelihood is not increasing in each step. To run the algorithm we will give the number of iteration as a parameter and also set a tolerance for the fractional increase in loss function. Now, it is time to put all the theory into action and let’s code!! Please check the notebook for detailed code along with other references in the reference section.

Implement EM Algorithm with Python:

Let’s start by defining E-step. Here our main objective is to define the posterior distribution, before we have defined this as responsibility in Exp. 7. Below is the code block –

👁 E-step of EM algorithm (for GMM)

E-step of EM algorithm (for GMM)

Next we implement M-step which is about the maximizing the expectation and for GMM we already know the best parameters, once we’ve obtained the posterior from E-step. The parameters from MLE is given in Exp. 8 and we will use them to define M-step as below –

👁 M-step for GMM

M-step for GMM

👁 Exp. 15: ELBO in terms of Likelihood and posterior

Exp. 15: ELBO in terms of Likelihood and posterior

Once we have defined E and M step, now we will implement Evidence Lower Bound as the loss function. We modify the lower bound expression from Exp. 12 to as shown in Exp. 15. For the distribution of latent variables q(z), we will use the posterior calculated from E-step . With these in mind let’s define ELBO as below—

👁 Image

Now, we are all set to define the training loop where, we set the number of iterations, tolerance and also make sure that the algorithm starts several times with randomly initialized parameters so that it doesn’t get stuck in local maxima. During training we need to specify the number of clusters and for complicated problems this could be one issue. Since EM algorithm is time consuming, usually it is suggested that running K-Means algorithm to determine no. of clusters and then run EM for GMM. Else, another way to determine this is to use Akaike Information Criterion. For my PhD research I did use AIC for separating astrophysical models but that is different from our current purpose. The code for the training loop is below –

Finally, after training we can plot the contours –

👁 Figure: Left: 2D Data-set, clustered into 3 mixtures of Gaussian (right). Source: Author

Figure: Left: 2D Data-set, clustered into 3 mixtures of Gaussian (right). Source: Author

Using Scikit-Learn:

All of these understandings and corresponding long codes for EM algorithm in GMM can be implemented with 3 lines of code using the GaussianMixture class of sklearn. But life is always better when you know the basics! Below is the code block –

👁 With scikit-learn implementing GMM is easy!

With scikit-learn implementing GMM is easy!

Conclusion :

We have made it till the end of this work and in the process we learned several important concepts of Bayesian approach to unsupervised learning. Among many important concepts I would like you to at least take back the most important ones –

Updating the probability distribution based on prior knowledge through the iterative steps of EM algorithm. This is the basis of Bayesian inference – how our belief changes once we update the data (evidence).
Why introducing the latent variables help us to better understand the data and eventually help in the calculation of maximum likelihood?
Relation of KL divergence and ELBO through the likelihood of the data. This you will encounter again and again in the concepts of generative models (GMM is a simple example of generative model).
For GMM the posterior for a single data point can be calculated analytically, this may not be the case in complex models (most of the times the posterior would be intractable) and thus, it cannot be used as a proxy for the distribution of latent variables ( q(z) = p(z|x) ). In this case, we try to select q from a family of distributions Q that is as close as possible to the posterior distribution. This is the basis of Variational Bayes and eventually variational autoencoder (VAE) and Generative Adversarial Network (GAN).

Thanks for reading, hopefully this will help you a bit. Cheers !!!

P.S: I covered another important concept in Bayesian machine learning, known as Conjugate Priors on a separate post.

References:

[1] EM algorithm: Andrew NG Notes from Stanford Repository.

[2] Variational Inference: Review by David Blei.

[3] Chris Bishop’s Pattern Recognition Book: Chapter 9.

[4] Link to Notebook : GitHub.

Written By

Saptashwa Bhattacharyya

See all from Saptashwa Bhattacharyya

Bayesian Machine Learning, Clustering, Data Science, Machine Learning, Towards Data Science

Share This Article

Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.

Write for TDS

URL: https://towardsdatascience.com/latent-variables-expectation-maximization-algorithm-fb15c4e0f32c/