[Book Notes] Variational Autoencoders and Diffusion Models

Devin Z
Mar 31, 2024

Understanding key technologies behind image generation.

[Photo: Google Moffett Place Campus, February 21, 2024]

Variational Autoencoders (VAE)

  • A probabilistic generative model aims to learn a distribution p(x) over the (observable) data.
  • Latent variable models use two simple distributions to express an unknown complex distribution.
    - The (lower-dimensional) latent variable is assumed to conform to a standard Gaussian distribution.
    - The likelihood of the observable data is a parametric Gaussian distribution (the decoder network).
    - Once the parameters are known, a new sample of x can be generated through ancestral sampling: first draw z from the prior, then draw x from the decoder's likelihood.
[Figure: Latent Variable Model]
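
Concretely, the two simple distributions and the implied marginal can be written as follows (f denotes the decoder network, φ its parameters, and σ² a fixed observation variance — symbols chosen here for illustration):

    p(z) = \mathcal{N}(z;\, 0, I), \qquad
    p(x \mid z, \phi) = \mathcal{N}\bigl(x;\, f(z, \phi),\, \sigma^2 I\bigr), \qquad
    p(x \mid \phi) = \int p(x \mid z, \phi)\, p(z)\, dz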
  • Even evaluating the data log-likelihood for a given set of parameters is intractable, let alone maximizing it during training.
[Figure: Maximum Likelihood Estimate]
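
For reference, the maximum-likelihood objective looks like this (N training samples x_i; notation assumed here). The integral sits inside the logarithm, which is what makes the objective intractable:

    \hat{\phi} = \arg\max_{\phi} \sum_{i=1}^{N} \log p(x_i \mid \phi)
               = \arg\max_{\phi} \sum_{i=1}^{N} \log \int p(x_i \mid z, \phi)\, p(z)\, dz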
  • Instead, we maximize the evidence lower bound (ELBO), a tractable parametric lower bound on the data log-likelihood.
    - Maximizing the ELBO is equivalent to making a variational distribution q(z) approximate the posterior p(z|x), since the gap between the log-likelihood and the ELBO is exactly the KL-divergence between q(z) and p(z|x).
    - Typically, q(z) is a Gaussian distribution with parametric mean and variance that depend on x (the encoder network).
    - The first term of the ELBO, the expected reconstruction log-likelihood, can be estimated by drawing samples of z from q(z).
    - The second term, as the KL-divergence between two Gaussians, can be calculated in closed form.
    - Thus, calculating the ELBO is tractable given the parameters.
[Figure: Evidence Lower Bound]
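
One common way to write the bound (θ for the encoder parameters, φ for the decoder parameters — a labeling assumed here):

    \log p(x \mid \phi) \;\ge\;
    \mathbb{E}_{q(z \mid x, \theta)}\bigl[\log p(x \mid z, \phi)\bigr]
    \;-\; \mathrm{KL}\bigl(q(z \mid x, \theta) \,\|\, p(z)\bigr)
    \;=\; \mathrm{ELBO}(\theta, \phi)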
  • The VAE algorithm:
    - Given an x, calculate the mean and variance of the variational distribution q(z).
    - Given q(z), draw a sample of the latent variable z.
    - Given the sample z, evaluate the parametric likelihood of x.
  • How can the gradient be evaluated despite the sampling step in the middle of the network?
    - The reparameterization trick rewrites the random variable z as a deterministic function of the distribution's parameters and an independent noise variable: z = μ + σ·ε with ε ~ N(0, I) (see the Python sketch after this list).
    - Backpropagation then only needs to pass through the deterministic branch; the noise sample is treated as a constant input.
  • To estimate the marginal probability p(x) of a given sample, draw samples of z from the encoder's distribution q(z|x) and apply importance sampling.
[Figure: Approximating Sample Probability]
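
A minimal PyTorch sketch of the pieces above: an encoder/decoder pair, a reparameterized ELBO for training, and an importance-sampling estimate of log p(x). The layer sizes, names (Encoder, Decoder, latent_dim) and the unit observation variance are assumptions for illustration, not the book's exact setup.

import torch
import torch.nn as nn

latent_dim, data_dim = 2, 784   # sizes assumed for illustration

class Encoder(nn.Module):
    """Maps x to the mean and log-variance of the variational distribution q(z|x)."""
    def __init__(self):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.logvar(h)

class Decoder(nn.Module):
    """Maps z to the mean of p(x|z); the observation variance is fixed to 1 for simplicity."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                 nn.Linear(128, data_dim))

    def forward(self, z):
        return self.net(z)

def negative_elbo(x, enc, dec):
    mu, logvar = enc(x)
    # Reparameterization trick: z is a deterministic function of (mu, sigma)
    # and an independent standard normal sample, so gradients reach the encoder.
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * logvar) * eps
    # Reconstruction term: log N(x | dec(z), I), up to an additive constant.
    recon = -0.5 * ((x - dec(z)) ** 2).sum(dim=1)
    # KL(q(z|x) || N(0, I)) in closed form for diagonal Gaussians.
    kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=1)
    return (kl - recon).mean()

def log_px_estimate(x, enc, dec, n_samples=100):
    """Importance-sampling estimate of log p(x), with q(z|x) as the proposal.
    Gaussian normalizing constants are dropped, so values are only comparable to each other."""
    mu, logvar = enc(x)                                  # shapes (B, d)
    std = torch.exp(0.5 * logvar)
    eps = torch.randn(n_samples, *mu.shape)              # (S, B, d)
    z = mu + std * eps
    log_px_z = -0.5 * ((x - dec(z)) ** 2).sum(dim=-1)    # log p(x|z) + const
    log_pz = -0.5 * (z ** 2).sum(dim=-1)                 # log p(z)   + const
    log_qz = -0.5 * (eps ** 2 + logvar).sum(dim=-1)      # log q(z|x) + const
    log_w = log_px_z + log_pz - log_qz                   # log importance weights
    return torch.logsumexp(log_w, dim=0) - torch.log(torch.tensor(float(n_samples)))

# One training step on a stand-in batch.
enc, dec = Encoder(), Decoder()
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
x = torch.randn(64, data_dim)      # replace with a real data batch
loss = negative_elbo(x, enc, dec)  # maximizing the ELBO = minimizing its negative
opt.zero_grad(); loss.backward(); opt.step()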

Diffusion Models

DDPM²
  • A diffusion model consists of a forward diffusion process (the encoder) and a reverse process (the decoder).
  • The diffusion process gradually adds Gaussian noise to the observed data in an autoregressive (step-by-step) manner.
    - The latent variables are of the same dimension as the observable data.
    - Unlike the parametric encoder of a VAE, the forward process is fixed: a Markov chain whose transitions are Gaussians with pre-specified constant variances (a.k.a. the noise schedule).
    - With sufficiently many steps, the marginal distribution of the final latent variable is effectively a standard Gaussian.
[Figure: The Forward Process]
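
With β_t denoting the variance of step t (the noise schedule) and z_1, …, z_T the latents (notation assumed here), the forward transitions can be written as:

    q(z_1 \mid x) = \mathcal{N}\bigl(\sqrt{1-\beta_1}\, x,\; \beta_1 I\bigr), \qquad
    q(z_t \mid z_{t-1}) = \mathcal{N}\bigl(\sqrt{1-\beta_t}\, z_{t-1},\; \beta_t I\bigr), \quad t = 2, \dots, T

The √(1−β_t) factor shrinks the previous value as noise is added, so the overall variance stays bounded and the marginals drift toward a standard Gaussian.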
  • The reverse process is another Markov chain with learned Gaussian transitions starting from a standard normal prior at t=T.
    - The true reverse-time transitions are complex, multi-modal distributions that depend on the data distribution p(x).
    - But provided each forward step is small enough, they can be approximated by Gaussians with parametric means and fixed constant variances.
    - As in the VAE, we can generate samples of the observable data through ancestral sampling.
[Figure: The Reverse Process]
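
Written out (μ_φ is the learned mean network and σ_t² the fixed variances — symbols assumed here):

    p(z_T) = \mathcal{N}(0, I), \qquad
    p(z_{t-1} \mid z_t, \phi) = \mathcal{N}\bigl(z_{t-1};\; \mu_\phi(z_t, t),\; \sigma_t^2 I\bigr), \qquad
    p(x \mid z_1, \phi) = \mathcal{N}\bigl(x;\; \mu_\phi(z_1, 1),\; \sigma_1^2 I\bigr)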
  • The forward process admits sampling the latent variable at an arbitrary time step t in closed form.
    - This is known as the diffusion kernel, which is also Gaussian.
[Figure: Diffusion Kernel]
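
In the DDPM paper's notation (with the cumulative product written as ᾱ_t; z_t used for the latents here):

    q(z_t \mid x) = \mathcal{N}\bigl(\sqrt{\bar{\alpha}_t}\, x,\; (1-\bar{\alpha}_t)\, I\bigr),
    \qquad \bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s)

Equivalently, z_t = \sqrt{\bar{\alpha}_t}\, x + \sqrt{1-\bar{\alpha}_t}\,\epsilon with \epsilon \sim \mathcal{N}(0, I), which is the form used during training.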
  • The closed-form diffusion kernel leads to a closed-form forward process posterior, which is also Gaussian.
[Figure: Forward Process Posterior]
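
Applying Bayes' rule to the diffusion kernel gives (again in DDPM notation):

    q(z_{t-1} \mid z_t, x) = \mathcal{N}\bigl(\tilde{\mu}_t(z_t, x),\; \tilde{\beta}_t I\bigr), \qquad
    \tilde{\mu}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x
                  + \frac{\sqrt{1-\beta_t}\,\bigl(1-\bar{\alpha}_{t-1}\bigr)}{1-\bar{\alpha}_t}\, z_t, \qquad
    \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t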
  • As with the VAE, instead of maximizing the intractable data log-likelihood, we maximize the ELBO.
    - The ELBO decomposes into a sum of terms, one per step of the reverse process, so gradient descent can be run on time steps sampled independently at random.
    - The loss term for each time step can be reparameterized as the squared difference between the standard Gaussian noise injected by the diffusion kernel and the noise predicted by the network (see the training sketch after the figure below).
    - The noise predictor can be a neural network (e.g. U-Net) taking in a latent variable, a vector representing the time step, and (optionally) a time-invariant embedding for class information.
[Figure: Evidence Lower Bound]
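
A minimal PyTorch sketch of the resulting training step, using the simplified noise-prediction loss from the DDPM paper. The tiny NoisePredictor stands in for a real U-Net, and the linear schedule endpoints follow the paper; everything else (sizes, names) is an assumption for illustration.

import torch
import torch.nn as nn

T, data_dim = 1000, 784                        # number of steps and data size (assumed)
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule, as in the DDPM paper
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

class NoisePredictor(nn.Module):
    """Stand-in for a U-Net: predicts the injected noise from (z_t, t)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(data_dim + 1, 256), nn.ReLU(),
                                 nn.Linear(256, data_dim))

    def forward(self, z_t, t):
        # A real model would use a sinusoidal time embedding; a scalar feature is enough here.
        t_feat = t.float().unsqueeze(1) / T
        return self.net(torch.cat([z_t, t_feat], dim=1))

def ddpm_loss(x, model):
    """Simplified DDPM objective: predict the noise injected at a random time step."""
    b = x.shape[0]
    t = torch.randint(0, T, (b,))                       # one random time step per example
    a_bar = alpha_bars[t].unsqueeze(1)                  # (b, 1)
    eps = torch.randn_like(x)                           # the noise to be predicted
    z_t = a_bar.sqrt() * x + (1 - a_bar).sqrt() * eps   # sample from the diffusion kernel
    return ((eps - model(z_t, t)) ** 2).mean()

# One training step on a stand-in batch.
model = NoisePredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
x = torch.randn(32, data_dim)                 # replace with a real data batch
loss = ddpm_loss(x, model)
opt.zero_grad(); loss.backward(); opt.step()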

References:

  1. Simon J. D. Prince. Understanding Deep Learning. MIT Press, 2023.
  2. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. NeurIPS, 2020.
