Understanding key technologies behind image generation.
Variational Autoencoders (VAE)
- A probabilistic generative model aims to learn a distribution p(x) over the (observable) data.
- Latent variable models use two simple distributions to express an unknown complex distribution.
- The (lower-dimensional) latent variable is assumed to conform to a standard Gaussian distribution.
- The likelihood of the observable data given the latent, p(x|z), is a Gaussian distribution whose parameters are produced by the decoder network.
- Once the parameters are known, a new sample of x can be generated through ancestral sampling.
- Even evaluating the data log-likelihood for a given set of parameters is intractable, let alone maximizing it for training.
- Instead, introduce the parametric evidence lower bound (ELBO) to approximate the data log-likelihood.
- Maximizing the ELBO is equivalent to making the variational distribution q(z) approximate the true posterior p(z|x).
- Typically, q(z) is a Gaussian distribution with parametric mean and variance that depend on x (the encoder network).
- The first term of the ELBO can be estimated via sampling.
- The second term, as the KL-divergence between two Gaussians, can be calculated in closed form.
- Thus, calculating the ELBO is tractable given the parameters (see the sketch of the bound after this list).
- The VAE algorithm:
- Given an x, calculate the mean and variance of the variational distribution q(z).
- Given q(z), draw a sample of the latent variable z.
- Given the sample z, evaluate the parametric likelihood of x.
- How to evaluate the gradient of the VAE despite the sampling step?
- The reparameterization trick rewrites a random variable as a deterministic function of some noise variable.
- The backpropagation algorithm then only needs to pass through the non-stochastic branch (see the training-step sketch after this list).
- To estimate the marginal distribution p(x), use the encoder to draw samples of z and do importance sampling (a sketch follows below).
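For reference, the bound discussed above is usually written as follows (φ denotes the encoder parameters, θ the decoder parameters; the two terms are the sampling-estimated and closed-form parts mentioned in the bullets):

```latex
\log p_\theta(x) \;\ge\; \mathrm{ELBO}(x;\theta,\phi)
  \;=\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  \;-\; \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big),
  \qquad p(z) = \mathcal{N}(0, \mathbf{I}).
```

The gap between log p_θ(x) and the ELBO is exactly KL(q_φ(z|x) || p_θ(z|x)), which is why tightening the bound pushes q towards the true posterior.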
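A minimal PyTorch sketch of one VAE training step under the assumptions above (standard-normal prior, fixed-variance Gaussian likelihood, diagonal-Gaussian q). The layer sizes, the MLP architecture, and the MSE reconstruction term (which corresponds to a fixed-variance Gaussian likelihood) are illustrative choices, not prescribed by these notes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, h_dim=256):
        super().__init__()
        # Encoder: outputs mean and log-variance of q(z | x).
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        # Decoder: outputs the mean of the Gaussian likelihood p(x | z).
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
        # so gradients flow through mu and logvar, not through the sampling.
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        x_mean = self.dec(z)
        # Negative ELBO = reconstruction term + KL(q(z|x) || N(0, I)).
        # MSE corresponds to a Gaussian likelihood with fixed variance.
        recon = F.mse_loss(x_mean, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)   # stand-in batch of data
loss = model(x)
opt.zero_grad()
loss.backward()           # backprop passes only through the deterministic branch
opt.step()
```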
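The importance-sampling estimate of p(x) from the last bullet could look like the following rough sketch; it continues from the hypothetical VAE class above and keeps the fixed-variance Gaussian likelihood assumption:

```python
import math
import torch
from torch.distributions import Normal

@torch.no_grad()
def log_px_estimate(model, x, n_samples=64):
    """Importance-sampling estimate of log p(x) with q(z|x) as the proposal:
    p(x) ~= (1/K) * sum_k p(x|z_k) p(z_k) / q(z_k|x),  z_k ~ q(z|x)."""
    h = model.enc(x)
    mu, logvar = model.enc_mu(h), model.enc_logvar(h)
    q = Normal(mu, torch.exp(0.5 * logvar))
    log_w = []
    for _ in range(n_samples):
        z = q.rsample()
        x_mean = model.dec(z)
        log_p_x_given_z = Normal(x_mean, 1.0).log_prob(x).sum(-1)  # unit-variance Gaussian likelihood
        log_p_z = Normal(0.0, 1.0).log_prob(z).sum(-1)             # standard-normal prior
        log_q_z = q.log_prob(z).sum(-1)
        log_w.append(log_p_x_given_z + log_p_z - log_q_z)
    log_w = torch.stack(log_w)                       # (n_samples, batch)
    return torch.logsumexp(log_w, dim=0) - math.log(n_samples)
```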
Diffusion Models
- A diffusion model consists of a forward diffusion process (the encoder) and a reverse process (the decoder).
- The forward diffusion process gradually blurs the observed data with noise, one step at a time, each step conditioned on the previous one.
- The latent variables are of the same dimension as the observable data.
- Unlike the parametric encoder in a VAE, it is fixed to a Markov chain, where each transition is a Gaussian distribution with fixed constant variances (a.k.a. the noise schedule).
- With sufficiently many steps, the marginal distribution of the final latent variable is (approximately) a standard Gaussian distribution.
- The reverse process is another Markov chain with learned Gaussian transitions starting from a standard normal prior at t=T.
- The true reverse process consists of complex multi-modal distributions that depend on p(x).
- But we approximate them as Gaussian distributions with parametric means and constant variances, provided that each step is small enough.
- As in the VAE, we can generate new samples of the observable data through ancestral sampling (a sampling sketch follows the list).
- The forward process admits sampling the latent variable at an arbitrary time step t in closed form.
- This is known as the diffusion kernel, which is also Gaussian (written out after this list).
- The closed-form diffusion kernel leads to a closed-form forward process posterior, which is also Gaussian.
- As with the VAE, we maximize the ELBO as a tractable surrogate for the data log-likelihood.
- The ELBO decomposes into separate terms for the different steps of the reverse process, so each time step can be trained by gradient descent separately.
- The loss term for each time step can be reparameterized as the squared difference between a standard Gaussian noise sample and the predicted noise (written out, with a training-step sketch, after this list).
- The noise predictor can be a neural network (e.g. U-Net) taking in a latent variable, a vector representing the time step, and (optionally) a time-invariant embedding for class information.
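For reference, the closed-form diffusion kernel and the reparameterized per-step loss mentioned above, in the notation of Ho et al. (2020), with α_t = 1 − β_t and ᾱ_t = ∏_{s≤t} α_s (the loss is the simplified, unweighted form):

```latex
q(x_t \mid x_0) = \mathcal{N}\!\big(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1-\bar\alpha_t)\,\mathbf{I}\big),
\qquad
L_t = \mathbb{E}_{x_0,\,\epsilon \sim \mathcal{N}(0,\mathbf{I})}
  \Big[\big\lVert \epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\big)\big\rVert^2\Big].
```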
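A minimal PyTorch sketch of one training step built on that loss. The linear β schedule, the number of steps T, and the `noise_predictor(x_t, t)` call signature (standing in for the U-Net) are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # fixed noise schedule (assumed linear)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # \bar{alpha}_t

def training_step(noise_predictor, x0, opt):
    """One diffusion training step: sample t and eps, form x_t via the
    diffusion kernel, and regress the predicted noise onto the true noise."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                          # random time step per example
    a_bar = alphas_bar[t].view(b, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps   # closed-form diffusion kernel
    loss = F.mse_loss(noise_predictor(x_t, t), eps)        # reparameterized per-step ELBO term
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```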
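And a sketch of ancestral sampling through the reverse process, following the DDPM update with fixed variances σ_t² = β_t; it continues from the training sketch above (same imports, schedule, and hypothetical `noise_predictor`):

```python
@torch.no_grad()
def sample(noise_predictor, shape):
    """Ancestral sampling: start from x_T ~ N(0, I) and apply the learned
    Gaussian reverse transitions down to t = 0."""
    x = torch.randn(shape)                      # x_T from the standard normal prior
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_hat = noise_predictor(x, t_batch)
        alpha_t = 1.0 - betas[t]
        a_bar_t = alphas_bar[t]
        # Mean of the learned reverse transition p(x_{t-1} | x_t).
        mean = (x - (1.0 - alpha_t) / (1.0 - a_bar_t).sqrt() * eps_hat) / alpha_t.sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)  # fixed variance sigma_t^2 = beta_t
        else:
            x = mean
    return x
```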
References:
- Simon J. D. Prince. Understanding Deep Learning. MIT Press, 2023.
- Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. NeurIPS, 2020.