Understanding key technologies behind image generation.

## Variational Autoencoders (VAE)

- A probabilistic generative model aims to learn a distribution *p(x)* over the (observable) data.

- Latent variable models use two simple distributions to express an unknown, complex distribution.

- The (lower-dimensional) latent variable is assumed to conform to a standard Gaussian distribution.

- The likelihood of the observable data given the latent variable is a parametric Gaussian distribution whose parameters are computed by the decoder network.

- Once the parameters are learned, a new sample of *x* can be generated through ancestral sampling: draw *z* from the prior, then draw *x* from the likelihood.
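
A minimal sketch of ancestral sampling, assuming a trained decoder and a Gaussian likelihood with constant standard deviation; the architecture, dimensions, and `sigma` below are illustrative stand-ins:

```python
import torch

latent_dim, data_dim = 16, 784      # assumed dimensions
sigma = 0.1                         # assumed constant likelihood std

# Stand-in for a trained decoder: maps z to the mean of p(x | z).
decoder = torch.nn.Sequential(
    torch.nn.Linear(latent_dim, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, data_dim),
)

z = torch.randn(1, latent_dim)               # step 1: z ~ p(z) = N(0, I)
mean = decoder(z)
x = mean + sigma * torch.randn_like(mean)    # step 2: x ~ p(x | z) = N(mean, sigma^2 I)
```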

- Even calculating the data log-likelihood of the parameters is intractable (it requires marginalizing over the latent variable), let alone maximizing it for training.

- Instead, we introduce the evidence lower bound (ELBO), a tractable, parametric lower bound on the data log-likelihood.

- Maximizing the ELBO is equivalent to making a variational distribution *q(z)* approximate the true posterior *p(z|x)*, as the decomposition below shows.
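
In symbols, the data log-likelihood decomposes as

```latex
\log p(x)
  = \underbrace{\mathbb{E}_{q(z)}\big[\log p(x \mid z)\big]
    - \mathrm{KL}\big(q(z) \,\|\, p(z)\big)}_{\text{ELBO}}
  + \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big)
```

Because the final KL term is non-negative and the left-hand side does not depend on *q*, pushing the ELBO up forces *q(z)* toward *p(z|x)*. The two terms under the brace are the ones discussed next.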

- Typically, *q(z)* is a Gaussian distribution with parametric mean and variance that depend on *x* (the encoder network).

- The first term of the ELBO (the expected reconstruction log-likelihood) can be estimated via sampling.

- The second term, the KL divergence between two Gaussians, can be calculated in closed form.

- Thus, calculating the ELBO is tractable given the parameters.
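
A minimal sketch of this computation for a batch, assuming a Gaussian likelihood with constant standard deviation and a diagonal-Gaussian *q(z)*; the tensor names stand for hypothetical encoder/decoder outputs:

```python
import torch

def elbo(x, enc_mean, enc_logvar, dec_mean, sigma=0.1):
    """One-sample Monte Carlo estimate of the ELBO.

    dec_mean is assumed to be the decoder output for a z drawn from q(z);
    the reconstruction term is computed up to an additive constant.
    """
    # First term: E_q[log p(x | z)], estimated with a single sample of z.
    recon = -0.5 * ((x - dec_mean) ** 2).sum(dim=1) / sigma**2
    # Second term: KL(N(mean, diag(var)) || N(0, I)) in closed form.
    kl = 0.5 * (enc_mean**2 + enc_logvar.exp() - enc_logvar - 1.0).sum(dim=1)
    return (recon - kl).mean()
```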

- The VAE algorithm:

  - Given an *x*, calculate the mean and variance of the variational distribution *q(z)*.

  - Given *q(z)*, draw a sample of the latent variable *z*.

  - Given the sample *z*, evaluate the parametric likelihood of *x*.

- How to evaluate the gradient of the VAE despite the sampling step?

- The reparameterization trick rewrites a random variable as a deterministic function of some noise variable, e.g. *z = μ + σ ⊙ ε* with *ε* drawn from a standard Gaussian.

- The backpropagation algorithm only needs to pass through the non-stochastic branch (see the sketch below).

- To estimate the marginal distribution *p(x)*, use the encoder to draw samples of *z* and do importance sampling.
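
A minimal sketch of the reparameterization trick described above, assuming *q(z)* is a diagonal Gaussian parameterized by a mean and a log-variance:

```python
import torch

def reparameterized_sample(mu, logvar):
    """Draw z ~ N(mu, diag(exp(logvar))) as a deterministic function of
    (mu, logvar) plus external noise, so gradients can flow to mu and logvar."""
    eps = torch.randn_like(mu)               # noise variable, carries no parameters
    return mu + torch.exp(0.5 * logvar) * eps

# Usage with hypothetical encoder outputs for a batch of 4 examples:
mu = torch.zeros(4, 16, requires_grad=True)
logvar = torch.zeros(4, 16, requires_grad=True)
z = reparameterized_sample(mu, logvar)       # differentiable w.r.t. mu and logvar
```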

## Diffusion Models

- A diffusion model consists of a forward diffusion process (the encoder) and a reverse process (the decoder).

- The diffusion process gradually corrupts the observed data with noise in an autoregressive (step-by-step) way.

- The latent variables are of the same dimension as the observable data.

- Unlike the parametric encoder in a VAE, the forward process is fixed to a Markov chain, where each transition is a Gaussian distribution with a fixed variance (the sequence of variances is known as the noise schedule).

- With sufficiently many steps, the marginal distribution of the final latent variable is approximately a standard Gaussian distribution.

- The reverse process is another Markov chain with learned Gaussian transitions, starting from a standard normal prior at *t = T*.

- The true reverse process consists of complex multi-modal distributions that depend on *p(x)*.

- But we approximate them as Gaussian distributions with parametric means and constant variances, provided that each step is small enough.

- As in the VAE, we can generate samples of the observable data through ancestral sampling.
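
A minimal sketch of that sampling loop, assuming a linear noise schedule and a placeholder for the learned transition mean (a trained network in practice); fixing each transition variance to *βt* is one common choice:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)   # assumed linear noise schedule

def mu_theta(x_t, t):
    """Placeholder for the learned Gaussian transition mean; in a real model
    this is a neural network taking x_t and the time step t."""
    return x_t

x = torch.randn(1, 784)                 # start from the prior: x_T ~ N(0, I)
for t in reversed(range(T)):
    noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)  # last step is deterministic
    x = mu_theta(x, t) + betas[t].sqrt() * noise   # one learned Gaussian transition
# x is now a sample of the observable data
```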

- The forward process admits sampling the latent variable at an arbitrary time step *t* in closed form.

- This is known as the diffusion kernel, which is also Gaussian.
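
A minimal sketch of sampling with the diffusion kernel, assuming a linear noise schedule (the schedule endpoints and dimensions are illustrative):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # assumed linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # running product of alpha_t = 1 - beta_t

def diffuse(x0, t):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)
    in one shot, without simulating the t intermediate transitions."""
    eps = torch.randn_like(x0)
    return alphas_bar[t].sqrt() * x0 + (1.0 - alphas_bar[t]).sqrt() * eps

x0 = torch.randn(4, 784)      # stand-in data batch
x_500 = diffuse(x0, 500)      # latent at an arbitrary time step, in closed form
```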

- The closed-form diffusion kernel leads to a closed-form forward process posterior, which is also Gaussian.
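
Concretely, writing *x₀* for the data, *x_t* for the latent at step *t*, *α_t = 1 − β_t*, and *ᾱ_t* for the running product of the *α*'s, the posterior given by Ho et al. (2020) is:

```latex
q(x_{t-1} \mid x_t, x_0)
  = \mathcal{N}\big(x_{t-1};\, \tilde{\mu}_t(x_t, x_0),\, \tilde{\beta}_t I\big),
\qquad
\tilde{\mu}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0
  + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t,
\qquad
\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t
```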

- As with the VAE, to maximize the data log-likelihood, we maximize the ELBO.

- The ELBO separates into a sum of per-time-step terms for the reverse process, which enables running gradient descent on different time steps separately (in practice, a random time step is sampled for each training example).

- The loss term for each time step can be reparameterized as the squared difference between the standard Gaussian noise that was added and the noise predicted by a network.

- The noise predictor can be a neural network (e.g. U-Net) taking in a latent variable, a vector representing the time step, and (optionally) a time-invariant embedding for class information.
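
A minimal sketch of one training step under this loss, in the spirit of the DDPM training algorithm; the tiny linear noise predictor and the scalar time feature are illustrative stand-ins for a U-Net with a proper time embedding:

```python
import torch

T, data_dim = 1000, 784
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

# Stand-in noise predictor: takes (x_t, t) and returns the predicted noise.
eps_theta = torch.nn.Sequential(torch.nn.Linear(data_dim + 1, data_dim))

def loss_step(x0):
    """Pick a random time step per example, noise x0 with the diffusion
    kernel, and regress the noise that was added (squared error)."""
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    abar = alphas_bar[t].unsqueeze(1)
    x_t = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps
    t_feature = t.float().unsqueeze(1) / T              # crude time-step encoding
    pred = eps_theta(torch.cat([x_t, t_feature], dim=1))
    return ((eps - pred) ** 2).mean()

loss = loss_step(torch.randn(8, data_dim))              # one step on a stand-in batch
loss.backward()                                         # gradients flow to eps_theta
```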

## References

- Simon J. D. Prince. *Understanding Deep Learning*. MIT Press, 2023.

- Jonathan Ho, Ajay Jain, and Pieter Abbeel. *Denoising Diffusion Probabilistic Models*. NeurIPS, 2020.