Last Update:

DDPM Math

Forward process

Some inspiration from this post Forward is an adding noise process, so we simply modeling the process as

$ xt = a_t x{t-1} + b_t \epsilon_t \quad \epsilon_1, \epsilon_2, ..., \epsilon_t \in N(0,I) \; i.i.d \tag{1} $

$$ \begin{align} x1 = a_1 x_0 + b_1 \epsilon_1 \ x_2 = a_2 x_1 + b_2 \epsilon_2 \ \dots \ x_t = a_t x{t-1} + b_t \epsilon_t \ \end{align} \tag{2} $$

Elimination,

$$ xt = (a_t a{t-1} \dots a{1})x{0} + (a{t}\dots a{2}b{1})\epsilon{1} + (a{t}\dots a{3}b{2})\epsilon{2} + \dots \

  • (a{t}b{t-1})\epsilon{t-1} + b{t}\epsilon_{t} \tag{3} $$

A good observation is that,

$$ \begin{align} (at a{t-1} \dots a{1})^2 + (a{t}\dots a{2}b{1})^2 + (a{t}\dots a{3}b{2})^2 + \dots + (a{t}b{t-1})^2 + b{t}^2 \ = a{t}^2(a{t-1}^2(a{t-2}^2 \dots (a{2}^2(a{1}^2+b{1}^2) + b{2}^2) +\dots +b{t-2}^2) + b{t-1}^2) + b{t}^2 \end{align} \tag{4} $$

If we make $a{i}^2 + b{i}^2 = 1$, Eq. (4) = 1, to follow the original paper, we rewrite the $a$ and $b$. Let $ai^2 = \alpha{i}$, $bi^2 = \beta{i}$, $\alpha{i} = 1- \beta{i}$,

$$ \begin{align} x{t} = \sqrt{ \alpha{t} } x{t-1} + \sqrt{ 1 - \alpha{t} } \epsilon{t} \end{align} \quad (q(x{t}|x_{t-1})) \tag{5} $$

Let $\bar{ \alpha}{t} = (a{t}a{t-1}\dots a{2}a{1})^2$, for Eq. (3), since $\epsilon{1}, \epsilon{2},\dots, \epsilon{t}$ are independent, $N(0,I)$, the mean of their sum is 0, variance is (covariance is 0 due to the independence)

$$ Var = (a{t}\dots a{2}b{1})^2 + (a{t}\dots a{3}b{2})^2 + \dots + (a{t}b{t-1})^2 + b_{t}^2 = 1 - \bar{\alpha} \tag{6} $$

Thus, from $t=0$ to $t=t$, the forward process can be viewed as one step adding a gaussian noise $\bar{\epsilon}{t}$ with $E[\bar{\epsilon}{t}]=0$ and $V[\bar{\epsilon}_{t}] = 1 - \bar{\alpha}$

$$ \begin{align} x{t} = \sqrt{ \bar{\alpha{t}} } x{0} + \sqrt{ 1 - \bar{\alpha{t}} } \bar{\epsilon{t} } \end{align} \quad (q(x{t}|x_{0})) \tag{7} $$

Maximum Likelihood Loss Function Derivation

Refer also to Explanation from Hung-yi Lee Rethink this problem, what we have are some samples $x{1}, x{2}, \dots, x{m}$ sampled from the unknown distribution $p{data}(x)$ that we hope to estimate. (it can be also considered as $f_X(x)$ in probability theory)

We have a network parameterized with $\theta$, we hope the samples that we observed, i.e. $x{1}, x{2}, \dots, x{m}$, have the biggest probability to happen. But currently, we have 2 unknown things, i.e. $\theta$ and $\frac{p{\theta}(x)}{p(x_{i}|\theta)} $ (probability density function).

If $p(x_{i}|\theta)$ is known, then we can simply do

$$ \begin{align} \theta^\star & = argmax{\theta}\left( \prod{i=1}^m p(x{i}|\theta) \right) \ & = argmax{\theta}\left( \sum{i=1}^m \log(p(x{i}|\theta)) \right) \tag{When m is large, it is mean} \ & \approx argmax{\theta}\left( \frac{1}{m} \sum{i=1}^m \log(p(x{i}|\theta)) \right) \ \tag{$x$: R.V.; $x \sim p{data}(x)$ } & = argmax{\theta}(\mathbb{E} [\log(p(x|\theta))] ) \ & = argmax{\theta} \int{x} \log(p(x|\theta)) p{data}(x) \, dx \ & = argmax{\theta} \int{x} \log(p(x|\theta)) p{data}(x) \, dx \ & - \int{x} \log(p{data}(x)) p{data}(x) \, dx \ & = argmax{\theta} \int{x} p{data}(x) \log \frac{p(x|\theta)}{p{data}(x)} dx \tag{8} \end{align} $$

Recall KL divergence, it measures the difference between 2 distributions. KL Divergence

$$ D{KL}(P||Q) = \int{x}p(x) log \frac{p(x)}{q(x)} dx $$ where, $p(x)$ and $q(x)$ are R.V. $P, Q$'s probability density function.

Eq. (8) turns to,

$$ \begin{align} \theta^\star & = argmax{\theta} -D{KL}(p_{data}(x)||p(x|\theta)) \tag{8} \end{align} $$

But unfortunately, we don't know $p(x|\theta)$, so we are not able to perform the maximization above. So, we hope to find the lower bound to optimize. Back to VAE, we hope to maximize $\log p(x|\theta)$

$$ \begin{align} \theta^\star & = argmax{\theta}\log p(x|\theta) \ & = argmax{\theta} \log p(x|\theta) \int{z}q(z|x) dz \qquad \text{(integrate w.r.t $z$, the int is 1)} \ & = argmax{\theta} \int{z}q(z|x) \log p(x|\theta) dz \ & = argmax{\theta} \int{z}q(z|x) \log \frac{p(x,z|\theta)}{p(z|x, \theta)} dz \ & = argmax{\theta} \int{z}q(z|x) \log \frac{p(x,z|\theta)}{q(z|x)} \frac{q(z|x)}{p(z|x, \theta)} dz \ & = argmax{\theta} \int{z}q(z|x) \log \frac{p(x,z|\theta)}{q(z|x)} dz + \int{z}q(z|x) \log \frac{q(z|x)}{p(z|x, \theta)} dz \ & = argmax{\theta} \int{z}q(z|x) \log \frac{p(x,z|\theta)}{q(z|x)} dz + D{KL}[q(z|x)||p(z|x,\theta)] \qquad \text{($D{KL}\geq0$)} \ & \geq argmax{\theta} \int{z}q(z|x) \log \frac{p(x,z|\theta)}{q(z|x)} dz \ & = argmax{\theta} \mathbb{E}{q(z|x)} \left[\log \frac{p(x,z|\theta)}{q(z|x)}\right] \qquad \text{VAE Lower Bound} \end{align} $$

For DDPM, it is similar, the hidden states are the noisy states $x{1}, x{2,}..., x_T$, similarly, for DDPM,

$$ \begin{align} \theta^\star & = argmax{\theta} \mathbb{E}{q(x1,x_2,...,x_T|x)} \left[\log \frac{p(x{0,}x{1,}x{2,}..., x{T}|\theta)}{q(x{1,}x{2,}..., x{T}|x)}\right] \ & = argmax{\theta} \mathbb{E}{q(x{1:T}|x)} \left[\log \frac{p(x{0:T}|\theta)}{q(x_{1:T}|x_0)}\right] \qquad \text{Lower Bound} \end{align} $$

This Lower Bound can be simplified to

$$ \mathbb{E}{q} [ \underbrace{\log p(x_0|x{1,}\theta)}{L{0}\text{Reconstruction term}}- \underbrace{D{KL}(q(x_T|x_0)||p(x_T))}{L{T}(\text{Prior matching term})} - \sum{t>1} \underbrace{D{KL}(q(x{t-1}|x{t,}x_0)||p(x{t-1}|x{t,}\theta))}{L_{t}\text{Denoising matching term}} ] \tag{9} $$

$L_T$

It has no trainable parameters, $q$ is encoder, and $x_T$ is gaussian noise that we randomly sampled.

$L_t$

This term is the similarity between $q$ and $p$, $q$ is a gaussian distribution with a fixed mean and variance. Here, we fix $p$ 's variance (For simple computation? I guess. In Improved Denoising Diffusion Probabilistic Models, they make variance also trainable). Only mean is trainable. So we hope to minimize the difference between 2 means, i.e. $|| \mu_{p} - \mu_q ||^2$.

First, we compute $\mu_q$, more derivation see here Understanding Diffusion Models: A Unified Perspective

$$ \begin{align} q(x{t-1}|x{t,}x{0}) & = \frac{q(x{t-1},x{t}|x{0})}{q(x{t}|x{0})} = \underbrace{\frac{q(x{t-1}|x{0}) q(xt|x{t-1})}{q(x{t}|x{0})}}{\text{Still Gaussian}} \ & = ... \ & \propto N(x{t-1}; \underbrace{\frac{\sqrt{\alphat}(1-\bar{\alpha}{t-1})x{t} + \sqrt{\bar{\alpha}{t-1}}(1-\alphat)x_0}{1-\bar{\alpha}{t}}}{\mu_q(x{t,}x0)} , \underbrace{\frac{(1-\alpha_t)(1-\bar{\alpha}{t-1})}{1 - \bar{\alphat}} I}{\Sigma_q(t)}) \end{align} \tag{10} $$

Thus,

$$ \begin{align} \muq(x{t,}x{0}) & = \frac{\sqrt{\alpha_t}(1-\bar{\alpha}{t-1})x{t} + \sqrt{\bar{\alpha}{t-1}}(1-\alphat)x_0}{1-\bar{\alpha}{t},} \tag{11} \ \sigma^2{q}(t) & = \frac{(1-\alpha_t)(1-\bar{\alpha}{t-1})}{1 - \bar{\alpha_{t}}}\tag{12} \end{align} $$

We fix $\sigma^{2}{\theta}(t)=\sigma^{2}{q}(t)$, plug means and variances to KL divergence,

$$ \begin{align} \theta^\star & = argmin{\theta} D{KL}(q(x{t-1}|x{t,}x0)||p(x{t-1}|x{t,}\theta)) \ & =... \ & = argmin{\theta} \frac{1}{2\sigma^{2}{q}(t)} [|| \mu{\theta} - \mu_{q} ||^{2}] \tag{13} \end{align} $$

From Eq. (13), we can design 3 different loss functions

  • To predict $x0$: $\hat x{\theta}(x{t}, t)$ - We design $\mu{\theta}(x{t,}, t) = \frac{\sqrt{\alpha_t}(1-\bar{\alpha}{t-1})x{t} + \sqrt{\bar{\alpha}{t-1}}(1-\alphat)\hat x{\theta}(x{t}, t)}{1-\bar{\alpha}{t},}$, and plug back to Eq. (13), we obtain

$$ \begin{align} \theta^\star & = argmin{\theta} D{KL}(q(x{t-1}|x{t,}x0)||p(x{t-1}|x{t,}\theta)) \ & = argmin{\theta} \frac{1}{2\sigma^{2}{q}(t)} \frac{\bar{\alpha}{t-1} (1-\alphat)^2}{(1 - \bar{\alpha{t}})^{2}} [|| \hat x{\theta}(x{t}, t) - x_{0} ||^{2}] \qquad \text{Simplify to} \ & = ... \ & = argmin{\theta} \frac{1}{2} \left(\frac{\bar{\alpha}{t-1}}{1 - \bar{\alpha}{t-1}} - \frac{\bar{\alpha}{t}}{1 - \bar{\alpha}{t}}\right) \left[|| \hat x{\theta}(x{t}, t) - x{0} ||^{2} \right] \tag{14} \end{align} $$

  • To predict $\bar{\epsilon}{t}$: $\bar \epsilon{\theta}(x{t}, t)$. From Eq. (7), $x{0} = \frac{x{t} - \sqrt{ 1 - \bar{\alpha{t}} } \bar{\epsilon{t}}}{\sqrt{ \bar{\alpha{t}}}}$, plug to $\muq(x{t,}x{0})$, we get $\mu_q(x{t,}x{0}) = \frac{1}{\sqrt{\alpha_t}}x_t - \frac{1 - \alpha_t}{\sqrt{1-\bar \alpha_t} \sqrt{\alpha_t}} \bar \epsilon_t$, so we design $\mu_p(x{t,}x{0}) = \frac{1}{\sqrt{\alpha_t}}x_t - \frac{1 - \alpha_t}{\sqrt{1-\bar \alpha_t} \sqrt{\alpha_t}} \bar \epsilon\theta(x_t, t)$. Hence,

$$ \begin{align} \theta^\star & = argmin{\theta} D{KL}(q(x{t-1}|x{t,}x0)||p(x{t-1}|x{t,}\theta)) \ & = argmin{\theta} \frac{1}{2\sigma^{2}{q}(t)} \frac{(1-\alpha_t)^2}{(1 - \bar{\alpha{t}}) \alpha{t} } [|| \bar \epsilon{\theta}(x{t}, t) - \bar \epsilon{t} ||^{2}] \tag{15} \end{align} $$

  • To predict score function $s_\theta(x_t, t)$. Based on Tweedie's Formula:

    Tweedie's Formula:

    For a gaussian variable $z \sim N(z; \mu_z, \Sigma_z)$, $\mathbb{E} [\mu_z|z] = z + \Sigma_z \nabla_z \log p(z)$

For $q(xt|x_0)$, we have $\mathbb{E} [\mu{xt}|x_t] = x_t + (1 - \bar{\alpha{t}}) \nabla_{x_t} \log p(x_t)$, hence,

$$ \begin{align} \sqrt{\bar \alphat} x_0 & = x_t + (1 - \bar{\alpha{t}}) \nabla{x_t} \log p(x_t) \ x_0 &= \frac{x_t + (1 - \bar{\alpha{t}}) \nabla_{x_t} \log p(x_t)}{\sqrt{\bar \alpha_t}} \end{align} $$

Thus, $\muq(x{t,}x{0}) = \frac{1}{\sqrt{x_t}} x_t + \frac{1 - \alpha{t}}{\sqrt{xt}} \nabla{xt} \log p(x_t)$, and we design $\mu\theta(x{t,}x{0}) = \frac{1}{\sqrt{xt}} x_t + \frac{1 - \alpha{t}}{\sqrt{xt}} s\theta(x_t, t)$, hence

$$ \begin{align} \theta^\star & = argmin{\theta} D{KL}(q(x{t-1}|x{t,}x0)||p(x{t-1}|x{t,}\theta)) \ & = argmin{\theta} \frac{1}{2\sigma^{2}{q}(t)} \frac{(1-\alpha_t)^2}{ \alpha{t} } [|| s\theta(x{t}, t) - \nabla_{x_t} \log p(x_t) ||^{2}] \tag{16} \end{align} $$

We can drop the coefficients to train the model, it turns out that the 2nd way generates the best result ( $L_{simple}$ in the DDPM paper, Eq. (14)).

Sampling

Above is about training. When the model is trained, how do we generate the images from gaussian noises? 1. Sample Gaussian noise $xT \sim N(0,I)$ 2. Compute $x{t-1} = \frac{1}{\sqrt{\alphat}}x_t - \frac{1 - \alpha_t}{\sqrt{1-\bar \alpha_t} \sqrt{\alpha_t}} \bar \epsilon\theta(xt, t)$, which is actually the mean $\mu_p(x{t,}x{0})$, i.e. the mean for process $p\theta(x{t-1}|x_t)$ , we need to add variance (A kind of explanation) to form a complete signal $x{t-1}$. 3. Sample variance $z \sim N(0,I)$, give it the variance $\sigmaq(t)$. $$x{t-1} = \frac{1}{\sqrt{\alphat}}x_t - \frac{1 - \alpha_t}{\sqrt{1-\bar \alpha_t} \sqrt{\alpha_t}} \bar \epsilon\theta(x_t, t) + \sigma_q z \tag{17} $$

Intuition behind forward diffusion (Frequency domain)

CVPR2022 Diffusion Tutorial Part1 Here, we have $x_t = \sqrt{\bar \alpha_t} x_0 + \sqrt{1 - \bar \alpha_t} \epsilon$, we perform the Fourier transform, and obtain $F(x_t) = \sqrt{\bar \alpha_t} F(x_0) + \sqrt{1 - \bar \alpha_t} F(\epsilon)$.

image

  • $t$ is small, $\bar \alpha_t \sim 1$: Noise is small, mainly perturb the high frequency component, affecting the details/low-level component.
  • $t$ is large, $\bar \alpha_t \sim 0$: Noise is large, noise affect low frequency component, i.e. coarse content.

Improved DDPM

2 main improvements:

  • Making $\sigma_p(t)$ also trainable
  • Change the noise scheduling from Linear to Cosine

DDIM

A good explanation

  • Main Goal: Acceleration
  • Features: - Same objective function as DDPM - Acceleration: Use fewer steps for generation, i.e. backward process. - Deterministic: Generation process becomes a deterministic process, i.e. given a gaussian noise $x_t$, the generated image is fixed.
  • Motivation - We can find that the forward process is only related to $q(xt|x_0)$, so can we omit the intermediate steps? That is to say, we hope to derive the $q(x{t-1}|x{t,}x{0})$, w/o $q(xt|x{t-1})$.

Forward Process

In DDPM, what we do is that we use Bayes' theorem $q(x{t-1}|x{t,}x{0}) = \frac{q(x{t-1}|x{0}) q(x_t|x{t-1})}{q(x{t}|x{0})}$ to compute $q(x{t-1}|x{t,}x{0})$. However, it doesn't work here, since we have no knowledge of $q(x_t|x{t-1})$. But we can find that

$$ \begin{equation} \int{x_t} q(x{t-1} | xt, x_0) q(x_t|x_0) d x_t = q(x{t-1}|x_0) \end{equation} \qquad \tag{Marginal Distribution} $$

We set $q(x{t-1} | x_t, x_0)$ as gaussian distribution, and we know $q(x_t|x_0)$ and $q(x{t-1}|x0)$. Solve the equation (See Here ) (Constructing a solution $q(x{t-1} | x_t, x_0)$ )

$$ \begin{align} q(x{t-1} | x_t, x_0) & = N\left(x{t-1}; \sqrt{\bar \alpha{t-1}} x_0 + \sqrt{1 - \bar \alpha{t-1} - \sigmat^2} \cdot \frac{x_t - \sqrt{\bar \alpha_t} x_0}{\sqrt{1 - \bar \alpha_t}}, \sigma_t^2 I\right) \ \mu_q(x_t, x_0) & = \sqrt{\bar \alpha{t-1}} x0 + \sqrt{1 - \bar \alpha{t-1} - \sigma_t^2} \cdot \frac{x_t - \sqrt{\bar \alpha_t} x_0}{\sqrt{1 - \bar \alpha_t}} \ \sigma_t^2(t) & = \sigma^2_t \end{align} $$

Backward

We hope to minimize $L{t-1}$, so we hope $q(x{t-1} | xt, x_0)$ and $p\theta(x_{t-1} | x_t)$ are similar. So here we just construct (variance is also fixed here)

$$ \begin{equation} p\theta(x{t-1} | xt) = \begin{cases} \hat x\theta(x1, 1) + \sigma_1^2 \epsilon_1 & \text{if bank $i$ issues ABs at time $t$}\ q(x{t-1}|xt, \hat x\theta(x_t, t)) & \text{otherwise} \end{cases}
\end{equation} $$

where,

$$ \hat x\theta(x_t, t) = \frac{x_t - \sqrt{1 - \bar \alpha_t} \bar \epsilon\theta(x_t, t)}{\sqrt{\bar \alpha_t}} \tag{Used to estimate $x_0$} $$

The objective function stays the same.

$$ L{simple} = || \bar \epsilon{\theta}(x{t}, t) - \bar \epsilon{t} ||^{2} $$

Sampling

Similarly,

$$ x{t-1} = \sqrt{\bar \alpha{t-1}} \frac{xt - \sqrt{1 - \bar \alpha_t} \bar \epsilon\theta(xt, t)}{\sqrt{\bar \alpha_t}} + \sqrt{1 - \bar \alpha{t-1} - \sigmat^2} \cdot \bar \epsilon\theta(x_t, t) + \sigma_t \epsilon_t $$

Different variance strategies form different algorithm,

  • $\sigma_t = 0$: DDIM. Sampling process is deterministic, i.e. given $x_t$, we have a fixed $x_0$
  • $\sigmat = \sqrt{\frac{1 - \bar \alpha{t-1}}{1 - \bar \alpha_t}} \sqrt{1 - \alpha_t}$: DDPM.
  • Usually, they use $\eta$ to control how the sampling process is deterministic (DDIM) or stochastic (DDPM) A reddit link to explain this - $\eta = 0$: DDIM - $\eta = 1$: DDPM

Latent Diffusion & Stable Diffusion

DDPM and DDIM are both working on the image fields, which is computationally expensive.

  • Latent Diffusion - Move images to latent space $z_0$ by introducing a pre-trained VAE - Have conditions
  • Stable Diffusion: It is just an open source pre-trained version of latent diffusion.