[Book Notes] Variational Autoencoders and Diffusion Models

Devin Z
Mar 31, 2024

Understanding key technologies behind image generation.

[Photo: Google Moffett Place Campus, February 21, 2024]

Variational Autoencoders (VAE)

  • A probabilistic generative model aims to learn a distribution p(x) over the (observable) data.
  • Latent variable models use two simple distributions to express an unknown complex distribution.
    - The (lower-dimensional) latent variable is assumed to conform to a standard Gaussian distribution.
    - The likelihood of the observable data is a parametric Gaussian distribution (the decoder network).
    - Once the parameters are known, a new sample of x can be generated through ancestral sampling: first draw z from the prior, then draw x from the decoder's likelihood.
[Figure: Latent Variable Model]
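
Concretely, the two simple distributions and the implied marginal can be written as follows (f denotes the decoder network, φ its parameters, and σ² a fixed observation variance — symbols chosen here for illustration):

    p(z) = \mathcal{N}(z;\, 0, I), \qquad
    p(x \mid z, \phi) = \mathcal{N}\bigl(x;\, f(z, \phi),\, \sigma^2 I\bigr), \qquad
    p(x \mid \phi) = \int p(x \mid z, \phi)\, p(z)\, dz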
  • Even evaluating the data log-likelihood for a given set of parameters is intractable, let alone maximizing it during training.
[Figure: Maximum Likelihood Estimate]
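
For reference, the maximum-likelihood objective looks like this (N training samples x_i; notation assumed here). The integral sits inside the logarithm, which is what makes the objective intractable:

    \hat{\phi} = \arg\max_{\phi} \sum_{i=1}^{N} \log p(x_i \mid \phi)
               = \arg\max_{\phi} \sum_{i=1}^{N} \log \int p(x_i \mid z, \phi)\, p(z)\, dz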
  • Instead, we maximize the evidence lower bound (ELBO), a tractable parametric lower bound on the data log-likelihood.
    - Maximizing the ELBO is equivalent to making a variational distribution q(z) approximate the posterior p(z|x), since the gap between the log-likelihood and the ELBO is exactly the KL-divergence between q(z) and p(z|x).
    - Typically, q(z) is a Gaussian distribution with parametric mean and variance that depend on x (the encoder network).
    - The first term of the ELBO, the expected reconstruction log-likelihood, can be estimated by drawing samples of z from q(z).
    - The second term, as the KL-divergence between two Gaussians, can be calculated in closed form.
    - Thus, calculating the ELBO is tractable given the parameters.
[Figure: Evidence Lower Bound]
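
One common way to write the bound (θ for the encoder parameters, φ for the decoder parameters — a labeling assumed here):

    \log p(x \mid \phi) \;\ge\;
    \mathbb{E}_{q(z \mid x, \theta)}\bigl[\log p(x \mid z, \phi)\bigr]
    \;-\; \mathrm{KL}\bigl(q(z \mid x, \theta) \,\|\, p(z)\bigr)
    \;=\; \mathrm{ELBO}(\theta, \phi)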
  • The VAE algorithm:
    - Given an x, calculate the mean and variance of the variational distribution q(z).
    - Given q(z), draw a sample of the latent variable z.
    - Given the sample z, evaluate the parametric likelihood of x.
  • How can the gradient be evaluated despite the sampling step in the middle of the network?
    - The reparameterization trick rewrites the random variable z as a deterministic function of the distribution's parameters and an independent noise variable: z = μ + σ·ε with ε ~ N(0, I) (see the Python sketch after this list).
    - Backpropagation then only needs to pass through the deterministic branch; the noise sample is treated as a constant input.
  • To estimate the marginal probability p(x) of a given sample, draw samples of z from the encoder's distribution q(z|x) and apply importance sampling.
[Figure: Approximating Sample Probability]
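
A minimal PyTorch sketch of the pieces above: an encoder/decoder pair, a reparameterized ELBO for training, and an importance-sampling estimate of log p(x). The layer sizes, names (Encoder, Decoder, latent_dim) and the unit observation variance are assumptions for illustration, not the book's exact setup.

import torch
import torch.nn as nn

latent_dim, data_dim = 2, 784   # sizes assumed for illustration

class Encoder(nn.Module):
    """Maps x to the mean and log-variance of the variational distribution q(z|x)."""
    def __init__(self):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.logvar(h)

class Decoder(nn.Module):
    """Maps z to the mean of p(x|z); the observation variance is fixed to 1 for simplicity."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                 nn.Linear(128, data_dim))

    def forward(self, z):
        return self.net(z)

def negative_elbo(x, enc, dec):
    mu, logvar = enc(x)
    # Reparameterization trick: z is a deterministic function of (mu, sigma)
    # and an independent standard normal sample, so gradients reach the encoder.
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * logvar) * eps
    # Reconstruction term: log N(x | dec(z), I), up to an additive constant.
    recon = -0.5 * ((x - dec(z)) ** 2).sum(dim=1)
    # KL(q(z|x) || N(0, I)) in closed form for diagonal Gaussians.
    kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=1)
    return (kl - recon).mean()

def log_px_estimate(x, enc, dec, n_samples=100):
    """Importance-sampling estimate of log p(x), with q(z|x) as the proposal.
    Gaussian normalizing constants are dropped, so values are only comparable to each other."""
    mu, logvar = enc(x)                                  # shapes (B, d)
    std = torch.exp(0.5 * logvar)
    eps = torch.randn(n_samples, *mu.shape)              # (S, B, d)
    z = mu + std * eps
    log_px_z = -0.5 * ((x - dec(z)) ** 2).sum(dim=-1)    # log p(x|z) + const
    log_pz = -0.5 * (z ** 2).sum(dim=-1)                 # log p(z)   + const
    log_qz = -0.5 * (eps ** 2 + logvar).sum(dim=-1)      # log q(z|x) + const
    log_w = log_px_z + log_pz - log_qz                   # log importance weights
    return torch.logsumexp(log_w, dim=0) - torch.log(torch.tensor(float(n_samples)))

# One training step on a stand-in batch.
enc, dec = Encoder(), Decoder()
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
x = torch.randn(64, data_dim)      # replace with a real data batch
loss = negative_elbo(x, enc, dec)  # maximizing the ELBO = minimizing its negative
opt.zero_grad(); loss.backward(); opt.step()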

Diffusion Models

DDPM²
  • A diffusion model consists of a forward diffusion process (the encoder) and a reverse process (the decoder).
  • The diffusion process gradually adds Gaussian noise to the observed data in an autoregressive (step-by-step) manner.
    - The latent variables are of the same dimension as the observable data.
    - Unlike the parametric encoder of a VAE, the forward process is fixed: a Markov chain whose transitions are Gaussians with pre-specified constant variances (a.k.a. the noise schedule).
    - With sufficiently many steps, the marginal distribution of the final latent variable is effectively a standard Gaussian.
[Figure: The Forward Process]
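
With β_t denoting the variance of step t (the noise schedule) and z_1, …, z_T the latents (notation assumed here), the forward transitions can be written as:

    q(z_1 \mid x) = \mathcal{N}\bigl(\sqrt{1-\beta_1}\, x,\; \beta_1 I\bigr), \qquad
    q(z_t \mid z_{t-1}) = \mathcal{N}\bigl(\sqrt{1-\beta_t}\, z_{t-1},\; \beta_t I\bigr), \quad t = 2, \dots, T

The √(1−β_t) factor shrinks the previous value as noise is added, so the overall variance stays bounded and the marginals drift toward a standard Gaussian.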
  • The reverse process is another Markov chain with learned Gaussian transitions starting from a standard normal prior at t=T.
    - The true reverse-time transitions are complex, multi-modal distributions that depend on the data distribution p(x).
    - But provided each forward step is small enough, they can be approximated by Gaussians with parametric means and fixed constant variances.
    - As in the VAE, we can generate samples of the observable data through ancestral sampling.
[Figure: The Reverse Process]
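
Written out (μ_φ is the learned mean network and σ_t² the fixed variances — symbols assumed here):

    p(z_T) = \mathcal{N}(0, I), \qquad
    p(z_{t-1} \mid z_t, \phi) = \mathcal{N}\bigl(z_{t-1};\; \mu_\phi(z_t, t),\; \sigma_t^2 I\bigr), \qquad
    p(x \mid z_1, \phi) = \mathcal{N}\bigl(x;\; \mu_\phi(z_1, 1),\; \sigma_1^2 I\bigr)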
  • The forward process admits sampling the latent variable at an arbitrary time step t in closed form.
    - This is known as the diffusion kernel, which is also Gaussian.
[Figure: Diffusion Kernel]
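
In the DDPM paper's notation (with the cumulative product written as ᾱ_t; z_t used for the latents here):

    q(z_t \mid x) = \mathcal{N}\bigl(\sqrt{\bar{\alpha}_t}\, x,\; (1-\bar{\alpha}_t)\, I\bigr),
    \qquad \bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s)

Equivalently, z_t = \sqrt{\bar{\alpha}_t}\, x + \sqrt{1-\bar{\alpha}_t}\,\epsilon with \epsilon \sim \mathcal{N}(0, I), which is the form used during training.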
  • The closed-form diffusion kernel leads to a closed-form forward process posterior, which is also Gaussian.
[Figure: Forward Process Posterior]
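
Applying Bayes' rule to the diffusion kernel gives (again in DDPM notation):

    q(z_{t-1} \mid z_t, x) = \mathcal{N}\bigl(\tilde{\mu}_t(z_t, x),\; \tilde{\beta}_t I\bigr), \qquad
    \tilde{\mu}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x
                  + \frac{\sqrt{1-\beta_t}\,\bigl(1-\bar{\alpha}_{t-1}\bigr)}{1-\bar{\alpha}_t}\, z_t, \qquad
    \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t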
  • As with the VAE, instead of maximizing the intractable data log-likelihood, we maximize the ELBO.
    - The ELBO decomposes into a sum of terms, one per step of the reverse process, so gradient descent can be run on time steps sampled independently at random.
    - The loss term for each time step can be reparameterized as the squared difference between the standard Gaussian noise injected by the diffusion kernel and the noise predicted by the network (see the training sketch after the figure below).
    - The noise predictor can be a neural network (e.g. U-Net) taking in a latent variable, a vector representing the time step, and (optionally) a time-invariant embedding for class information.
[Figure: Evidence Lower Bound]
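
A minimal PyTorch sketch of the resulting training step, using the simplified noise-prediction loss from the DDPM paper. The tiny NoisePredictor stands in for a real U-Net, and the linear schedule endpoints follow the paper; everything else (sizes, names) is an assumption for illustration.

import torch
import torch.nn as nn

T, data_dim = 1000, 784                        # number of steps and data size (assumed)
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule, as in the DDPM paper
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

class NoisePredictor(nn.Module):
    """Stand-in for a U-Net: predicts the injected noise from (z_t, t)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(data_dim + 1, 256), nn.ReLU(),
                                 nn.Linear(256, data_dim))

    def forward(self, z_t, t):
        # A real model would use a sinusoidal time embedding; a scalar feature is enough here.
        t_feat = t.float().unsqueeze(1) / T
        return self.net(torch.cat([z_t, t_feat], dim=1))

def ddpm_loss(x, model):
    """Simplified DDPM objective: predict the noise injected at a random time step."""
    b = x.shape[0]
    t = torch.randint(0, T, (b,))                       # one random time step per example
    a_bar = alpha_bars[t].unsqueeze(1)                  # (b, 1)
    eps = torch.randn_like(x)                           # the noise to be predicted
    z_t = a_bar.sqrt() * x + (1 - a_bar).sqrt() * eps   # sample from the diffusion kernel
    return ((eps - model(z_t, t)) ** 2).mean()

# One training step on a stand-in batch.
model = NoisePredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
x = torch.randn(32, data_dim)                 # replace with a real data batch
loss = ddpm_loss(x, model)
opt.zero_grad(); loss.backward(); opt.step()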

References:

  1. Simon J. D. Prince. Understanding Deep Learning. MIT Press, 2023.
  2. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. NeurIPS, 2020.
