An Introduction to Variational Autoencoders

These people don’t exist! (source)

Variational Autoencoders (VAEs) are really cool machine learning models that can generate new data. This means a VAE trained on thousands of human faces can generate new human faces, as shown above!

Recently, two types of generative models have been popular in the machine learning community: Generative Adversarial Networks (GANs) and VAEs. While GANs have had more success so far, a recent DeepMind paper showed that VAEs can yield results competitive with state-of-the-art GAN models. Furthermore, VAE-generated images retain more of the diversity of the training dataset than their GAN counterparts.

VQ-VAE samples (left) vs BigGAN samples (right). (source)

I understand that some of you don’t like to read long blog posts. No worries! I’ve added a summary at the end.

What is an Autoencoder?

Before I talk about Variational Autoencoders, I need to tell you what an autoencoder is. It looks something like this:

A simple autoencoder. The input is represented as X and the corresponding output is represented as X’. (source)

You can think of it as a “smart” compression algorithm that fits the data at hand. The encoder is a neural network that compresses the input (usually an image) to a much smaller space. The output of the encoder is called the embedding of the input. The decoder is another neural network that takes this tiny embedding as input and tries to reconstruct the original input (image). The goal is to find weights for the encoder and the decoder such that the reconstructed output is as close to the original input as possible. The whole model is trained end to end, with either the pixel-wise binary cross entropy or the L2 norm of the pixel-wise difference as the loss function.
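A minimal sketch of that training loop, assuming a toy linear encoder and decoder written in NumPy instead of real neural networks, and the L2 loss. The data, layer sizes, and learning rate are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in 8 dimensions that really live on a 2-D subspace.
latent_true = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 8))
X = latent_true @ mixing

# Linear encoder (8 -> 2) and decoder (2 -> 8) with small random initial weights.
W_enc = rng.normal(scale=0.1, size=(8, 2))
W_dec = rng.normal(scale=0.1, size=(2, 8))

def reconstruction_loss(X, W_enc, W_dec):
    """Mean squared (L2) error between the inputs and their reconstructions."""
    X_rec = (X @ W_enc) @ W_dec
    return np.mean((X - X_rec) ** 2)

initial_loss = reconstruction_loss(X, W_enc, W_dec)

learning_rate = 0.01
for _ in range(2000):
    Z = X @ W_enc                            # embeddings (latent codes)
    X_rec = Z @ W_dec                        # reconstructions
    grad_out = 2.0 * (X_rec - X) / X.size    # d(loss) / d(X_rec)
    grad_W_dec = Z.T @ grad_out              # backprop through the decoder
    grad_W_enc = X.T @ (grad_out @ W_dec.T)  # backprop through the encoder
    W_enc -= learning_rate * grad_W_enc
    W_dec -= learning_rate * grad_W_dec

final_loss = reconstruction_loss(X, W_enc, W_dec)
print(initial_loss, final_loss)
```

Encoder and decoder are updated together from a single reconstruction loss, which is what “trained at once” means here.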

Note: The embedding space of an autoencoder is also called the latent space. I shall use these terms interchangeably from here on.

The latent space of autoencoders is interesting for many reasons. I’ll give some example use cases below:

  1. The embeddings from autoencoders capture high level information about the input and filter out noise. Hence, they are very useful for dimensionality reduction.
  2. The ability to filter out noise is also useful in denoising the data. In order to do so, first you train an autoencoder on clean data. Then you add noise to the clean data and train the autoencoder on these noisy inputs. The reconstruction loss is however computed using the original clean images. In this way autoencoders learn to associate these noisy images with their clean counterparts, and are able to denoise similar images. Here is an example:

    Left: Clean MNIST digits. Center: Same digits after adding noise. Right: reconstructed digits from the noisy data. (source)
  3. The structure of this embedding space might reveal interesting information about the data. Here is an example from Wikipedia:
    The right image represents the embedding space of an autoencoder. (source)

    The input had 10 class labels. Inputs from different classes were mapped to different locations of the latent space, and inputs from the same class were clustered together.

    Now consider a scenario where you have thousands of images but only a few of them (1-2%) are labelled. In order to recover labels for the rest of the images, you can run a clustering algorithm in the latent space of an autoencoder using labelled data points as cluster centers. It will work because autoencoders map ‘similar’ images close to each other in the embedding space. This is the basic idea behind many techniques in semi-supervised learning.

  4. Every point in the embedding space is technically an embedding for some input. If you attempt to decode a random point \(y\) close to an embedding \(x\), what should you get? Something that looks like the reconstructed image of \(x\), but a bit different (because \(y\neq x\)). The decoder should generate a new, previously unseen input data point!! That is how you generate new data.

Unfortunately, this doesn’t work well in practice. As you can see above, the latent space of an autoencoder is sparse: there can be gaps and empty regions in it. If you happen to decode a point from one of these gaps, you will get garbage output because the decoder never learned how to interpret those regions. Furthermore, how would you fine-tune generated images? For instance, you might want to generate the same face, but alter the expression. This would be very hard to do with simple autoencoders.

Introducing Variational AutoEncoders

The idea behind VAE is to encode the input not just as a point, but as a probability distribution. Thus, every input is encoded into a region of the latent space. A sample from the probability distribution is decoded. Consequently, many similar images can be reconstructed from the same embedding.

It is common to use multivariate normal distributions for embeddings, but you can use any decent distribution of your choice. It is also assumed that, taken together, the embeddings of all the data points follow a standard multivariate normal distribution.

A multivariate normal distribution has two parameters, namely the mean \(\mu\) (a vector) and the covariance matrix \(\Sigma\). The encoder neural network has to predict both. For simplicity, the covariance matrix is assumed to be diagonal, \(\Sigma=\mathrm{diag}(\sigma^2)\) (often written loosely as \(\sigma^2I\)), so that the encoder only has to predict the vector \(\sigma^2\) of per-coordinate variances. For instance, if \(\sigma^2=[1.2, 0.1]\), then \(\Sigma=[[1.2, 0],[0, 0.1]]\). (For the math nitpickers, you can think of it as rewriting the covariance matrix in terms of its eigenbasis, which is orthonormal.)
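The diagonal-covariance assumption is easy to check numerically. The sketch below uses the example values from the text and assumes a zero mean purely for illustration.

```python
import numpy as np

variances = np.array([1.2, 0.1])   # the vector sigma^2 predicted by the encoder
Sigma = np.diag(variances)         # full covariance matrix, zeros off the diagonal

rng = np.random.default_rng(0)
mu = np.zeros(2)                   # assumed mean, for illustration only

# Scaling independent standard normals coordinate by coordinate
# is equivalent to sampling from N(mu, Sigma) when Sigma is diagonal.
samples = mu + np.sqrt(variances) * rng.standard_normal(size=(50_000, 2))
empirical_cov = np.cov(samples, rowvar=False)

print(Sigma)
print(empirical_cov.round(2))
```

The empirical covariance of the samples should come out close to \(\Sigma\), with off-diagonal entries near zero.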

In order to generate an output similar to a particular input, you take a sample from its embedding distribution and pass it to the decoder. Here is what this architecture looks like:

The architecture of a Variational Autoencoder (source)

The input \(x\) is passed into the encoder component, which then predicts the vectors \(\mu_x\) and \(\sigma_x\). Using these parameters, we take a sample \(z\sim\mathcal N(\mu_x, \sigma_x^2I)\). The decoder then takes \(z\) as input and outputs a reconstructed image \(x'\).
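One forward pass through this architecture can be sketched as follows. Everything here is an assumption for illustration: the weights are random and untrained, each “network” is a single linear layer, and the encoder outputs the log-variance instead of \(\sigma_x\) directly, a common choice that keeps \(\sigma_x\) positive.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, latent_dim = 6, 2

# Hypothetical, untrained weights standing in for the encoder and decoder networks.
W_mu = rng.normal(scale=0.1, size=(input_dim, latent_dim))
W_log_var = rng.normal(scale=0.1, size=(input_dim, latent_dim))
W_dec = rng.normal(scale=0.1, size=(latent_dim, input_dim))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def vae_forward(x):
    """Encode x to (mu, sigma), sample z, and decode z to a reconstruction."""
    mu = x @ W_mu                           # encoder head 1: mean vector
    sigma = np.exp(0.5 * (x @ W_log_var))   # encoder head 2: log-variance -> std dev
    z = rng.normal(loc=mu, scale=sigma)     # z ~ N(mu, diag(sigma^2)), sampled directly
    x_rec = sigmoid(z @ W_dec)              # decoder output, each value in (0, 1)
    return mu, sigma, z, x_rec

x = rng.uniform(size=input_dim)             # a fake "image" with 6 pixels
mu, sigma, z, x_rec = vae_forward(x)
print(mu, sigma, z, x_rec)
```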

The sampling procedure itself also matters. We shall come back to it later.

The Loss Function

The loss function of the variational autoencoder comes from a set of optimization techniques called Variational Bayesian (VB) Methods. The basic idea is that you’re trying to approximate some intractable posterior distribution with some known simple distribution. Of course, the approximation won’t be perfect. It only needs to be good enough. This is how the normal distribution comes into play in VAE.

The objective of the VB methods is called the Evidence Lower Bound (ELBO), which I am not going to derive here. You just need to know that the ELBO for Variational Autoencoders turns out to be

\[\mathbb{E}_{z\sim q(z|x)}[\log(p(x|z))]-KL(q(z|x), p(z))\]

Here \(KL(\cdot, \cdot)\) stands for the KL divergence. The ELBO is an objective to be maximized, so the loss that a VAE minimizes is the negative of the ELBO:

\[-\mathbb{E}_{z\sim q(z|x)}[\log(p(x|z))]+KL(q(z|x), p(z))\qquad (1)\]

For those people who just want to know how to implement this, the loss is

\[\begin{aligned}
&BCE(x, x')+KL(\mathcal N(\mu_x, \sigma_x^2I), \mathcal N(0, I))\\
=\;&BCE(x, x')-\frac{1}{2}\sum_j \left(1+\log(\sigma_x(j)^2)-\mu_x(j)^2-\sigma_x(j)^2\right)\qquad (2)
\end{aligned}\]

where \(\mu_x(j)\) and \(\sigma_x(j)\) stand for the \(j\)th coordinates of \(\mu_x\) and \(\sigma_x\), respectively, and BCE stands for Binary Cross Entropy. If you want to understand why (1) and (2) are equivalent, keep reading; otherwise you may skip to the next section.
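For the implementation-minded, here is a hedged NumPy sketch of this loss for a single input. It treats BCE as the usual non-negative quantity (a negative log likelihood), so the quantity to minimize is BCE plus the KL term, and the KL term for a diagonal Gaussian against \(\mathcal N(0, I)\) is \(-\frac{1}{2}\sum_j (1+\log\sigma_x(j)^2-\mu_x(j)^2-\sigma_x(j)^2)\). The function names and the clipping constant are my own choices.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent coordinates."""
    return -0.5 * np.sum(1.0 + np.log(sigma**2) - mu**2 - sigma**2)

def binary_cross_entropy(x, x_rec, tiny=1e-7):
    """Pixel-wise BCE summed over pixels; x and x_rec hold values in [0, 1]."""
    x_rec = np.clip(x_rec, tiny, 1.0 - tiny)   # avoid log(0)
    return -np.sum(x * np.log(x_rec) + (1.0 - x) * np.log(1.0 - x_rec))

def vae_loss(x, x_rec, mu, sigma):
    """Negative ELBO estimate for one input: reconstruction term + KL term."""
    return binary_cross_entropy(x, x_rec) + kl_to_standard_normal(mu, sigma)

x = np.array([1.0, 0.0, 1.0])
x_rec = np.array([0.9, 0.2, 0.7])
mu = np.array([1.0, 0.0])
sigma = np.array([1.0, 1.0])
print(vae_loss(x, x_rec, mu, sigma))
```

The KL term is zero exactly when \(\mu_x=0\) and \(\sigma_x=1\), that is, when the embedding distribution already equals the prior.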

Let’s start with the variable names in equation (1). Here, \(x\) stands for the input variable and \(z\) stands for the embedding vector. \(q(z|x)\) is the embedding of input \(x\). This is a probability distribution. Similarly, \(p(x|z)\) is the likelihood the decoder assigns to the input \(x\) given the embedding \(z\). \(z\sim q(z|x)\) stands for the sampling procedure we described above. \(p(z)\) is the prior over the latent space. It is typically assumed to be a standard normal distribution. \(\log(p(x|z))\) is just the log likelihood. There is an expectation surrounding it because we’re sampling \(z\) from a distribution. We can replace it with an average by sampling \(z\) multiple times. So, the first term of the loss just becomes the negative log likelihood. For images with pixel values in \([0, 1]\), modelling each pixel as a Bernoulli variable turns this negative log likelihood into the pixel-wise binary cross entropy. This term tries to minimize the loss between the original image and the reconstructed image, same as in a classic autoencoder.
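Replacing the expectation with an average is ordinary Monte Carlo estimation. The sketch below illustrates it with a stand-in function \(f(z)=z^2\) in place of \(\log(p(x|z))\), chosen only because its exact expectation under \(\mathcal N(\mu, \sigma^2)\) is known to be \(\mu^2+\sigma^2\). The helper name and the numbers are hypothetical.

```python
import numpy as np

def monte_carlo_expectation(f, mu, sigma, num_samples, rng):
    """Estimate E_{z ~ N(mu, sigma^2)}[f(z)] by averaging f over samples of z."""
    z = mu + sigma * rng.standard_normal(size=num_samples)
    return np.mean(f(z))

rng = np.random.default_rng(0)

# Stand-in integrand f(z) = z^2, whose exact expectation is mu^2 + sigma^2.
estimate = monte_carlo_expectation(lambda z: z**2, mu=1.0, sigma=0.5,
                                   num_samples=100_000, rng=rng)
exact = 1.0**2 + 0.5**2
print(estimate, exact)
```

In VAE training it is common to use very few samples per input, often just one, and rely on averaging over the mini-batch instead.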

As you know from above, the embedding distribution of the input \(x\) is \(q(z|x)\). The KL divergence term tries to force this distribution to be close to \(p(z)\), which is by assumption a multivariate standard normal. This forces all the embeddings to be close to each other so that there are no gaps in the embedding space. Now, the choice of prior depends on the application. For instance, while working on a semi-supervised VAE model, my friends and I found that a Gaussian mixture prior outperforms the standard Gaussian prior. However, keep in mind that if you change the prior, you also need to replace the closed form of the KL divergence term.
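For the standard normal prior, the closed form of the KL term can be checked one coordinate at a time. A sketch, writing \(\mu\) and \(\sigma\) for a single coordinate of \(\mu_x\) and \(\sigma_x\), with \(q=\mathcal N(\mu,\sigma^2)\) and \(p=\mathcal N(0,1)\), and using \(\mathbb{E}_q[(z-\mu)^2]=\sigma^2\) and \(\mathbb{E}_q[z^2]=\mu^2+\sigma^2\):

\[\begin{aligned}
KL(\mathcal N(\mu,\sigma^2), \mathcal N(0,1)) &= \mathbb{E}_{z\sim q}\left[\log q(z)-\log p(z)\right]\\
&= \mathbb{E}_{z\sim q}\left[-\tfrac{1}{2}\log(2\pi\sigma^2)-\frac{(z-\mu)^2}{2\sigma^2}+\tfrac{1}{2}\log(2\pi)+\frac{z^2}{2}\right]\\
&= -\tfrac{1}{2}\log(\sigma^2)-\tfrac{1}{2}+\tfrac{1}{2}(\mu^2+\sigma^2)\\
&= -\tfrac{1}{2}\left(1+\log(\sigma^2)-\mu^2-\sigma^2\right)
\end{aligned}\]

Because the coordinates are independent under a diagonal covariance, the KL divergence of the full distributions is the sum of this expression over all coordinates \(j\).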

The Sampling Procedure

Remember when I said that how we sample \(z\) matters? This is because if you just do naive sampling, backprop becomes hard, as passing gradients through a stochastic process is tricky. The gradient may also exhibit high variance. (Some blog posts erroneously mention that passing gradients is impossible, which is not the case!) As a remedy, the authors of the VAE paper proposed the so-called reparameterization trick. The idea is to sample an \(\epsilon\sim\mathcal N(0, I)\) and compute \(z=\mu+\sigma\cdot \epsilon\), where the product is taken coordinate by coordinate. In this scenario, it’s possible to pass gradients through \(\mu\) and \(\sigma\) because \(z\) is now a deterministic function of them, with all the randomness isolated in \(\epsilon\). The gradients cannot be passed through \(\epsilon\) easily, but that’s okay because we do not update it in the model.
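A minimal NumPy sketch of the trick, with a hypothetical helper name and made-up values of \(\mu\) and \(\sigma\). NumPy does not compute gradients, so the derivatives are only noted in the comments.

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Draw z ~ N(mu, diag(sigma^2)) as a deterministic function of (mu, sigma, eps)."""
    eps = rng.standard_normal(size=np.shape(mu))   # all randomness lives in eps
    z = mu + sigma * eps                           # dz/dmu = 1, dz/dsigma = eps
    return z, eps

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
sigma = np.array([0.5, 2.0])

# Many draws should reproduce the intended mean and standard deviation.
samples = np.stack([reparameterize(mu, sigma, rng)[0] for _ in range(20_000)])
print(samples.mean(axis=0), samples.std(axis=0))
```

In an autodiff framework the same two lines apply unchanged: \(\epsilon\) is treated as a constant input, so gradients flow to \(\mu\) and \(\sigma\) through plain addition and multiplication.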


Summary

  1. A variational autoencoder encodes each input as a probability distribution rather than as a single vector.
  2. The multivariate normal distribution is commonly used, but any other ‘nice’ distribution is also okay.
  3. You train the VAE with the loss \(-\mathbb{E}_{z\sim q(z|x)}[\log(p(x|z))]+KL(q(z|x), p(z))\). \(z\) is sampled indirectly using the reparameterization trick: \(z=\mu+\sigma\cdot\epsilon\) where \(\epsilon\sim\mathcal N(0, I)\). The decoder reconstructs the input from \(z\). The expectation in the loss is computed by averaging the log likelihood over multiple samples of \(z\).
  4. Once the network is trained, you sample a \(z\) from the prior \(p(z)\) and decode it with the decoder to generate new data.


The most notable characteristic of VAEs is that they allow smooth interpolation through the embedding space. For instance, when generating faces, you can “walk” through the embedding space to change hair color without changing other facial features. This is remarkable because each of those embeddings is decoded by the decoder. If the embedding space didn’t have intricate structure, the decoder would abruptly output completely different images.

Latent space interpolation (source)
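The “walk” through the embedding space is often just a straight line between two latent codes. A minimal sketch; the helper name and the two endpoint vectors are hypothetical, and turning each point into an image requires a trained decoder, which is not shown.

```python
import numpy as np

def interpolate_latents(z_start, z_end, num_steps):
    """Return num_steps evenly spaced points on the line from z_start to z_end."""
    weights = np.linspace(0.0, 1.0, num_steps)[:, None]   # column of mixing weights
    return (1.0 - weights) * z_start + weights * z_end

z_a = np.array([0.0, 0.0])    # e.g. the embedding mean of one face
z_b = np.array([2.0, -4.0])   # e.g. the embedding mean of another face
path = interpolate_latents(z_a, z_b, num_steps=5)
print(path)
```

Each row of `path` would be passed to the decoder in turn; with a well-structured latent space, consecutive outputs change gradually.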

That’s it! Now you understand variational autoencoders. Obviously, there is still a lot more to learn. Many interesting VAE papers have been published in the last few years, for instance, Disentangled VAE, Semi-Supervised VAE, and VQ-VAE-2. I shall cover some of them in future blog posts.

Hope you have enjoyed this blog post. Let me know what you think in the comments.
