
*This is Part 2 of a blog post about DiffWave and WaveGrad. If you haven't already, read Part 1!*

In this post, I'll derive the equations for DiffWave and WaveGrad using diffusion probabilistic processes. As far as I can tell, there's no cohesive and simple explanation for this in any of the referenced papers, so I hope this is accurate and useful.

A diffusion probabilistic process (in this context) consists of the following:

- $x_0$: A random variable representing your data distribution.
- $x_1$, $x_2$, ..., $x_T$: A sequence of random variables which gradually add noise, starting from $x_0$ and ending at $x_T$.
- $q(x_t | x_{t-1})$: A Gaussian distribution (also called the forward diffusion process) describing how each step adds noise to the previous one.
- $p_\theta(x_{t - 1} | x_t)$: A parameterized Gaussian (also called the reverse process) which attempts to "undo" the forward process.

Let's assume that there are $T$ steps of corruption with noise variances $\beta_1$ through $\beta_T$. Then, each step corresponds to a random variable $x_i$:

$$\begin{align*}x_1 &= \sqrt{1 - \beta_1} x_0 + \sqrt{\beta_1} \epsilon_1 \\ x_2 &= \sqrt{1 - \beta_2} x_1 + \sqrt{\beta_2} \epsilon_2 \\ \vdots \\ x_T &= \sqrt{1 - \beta_T} x_{T-1} + \sqrt{\beta_T}\epsilon_T \end{align*}$$
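To see how these steps collapse, expand $x_2$ in terms of $x_0$: the two independent Gaussian noise terms combine into a single one whose variances add:

$$\begin{align*}x_2 &= \sqrt{1 - \beta_2}\left(\sqrt{1 - \beta_1} x_0 + \sqrt{\beta_1}\epsilon_1\right) + \sqrt{\beta_2}\epsilon_2 \\ &= \sqrt{(1 - \beta_1)(1 - \beta_2)}\, x_0 + \sqrt{(1 - \beta_2)\beta_1 + \beta_2}\,\epsilon \\ &= \sqrt{(1 - \beta_1)(1 - \beta_2)}\, x_0 + \sqrt{1 - (1 - \beta_1)(1 - \beta_2)}\,\epsilon\end{align*}$$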

Since each $x_t$ is a linear combination of $x_0$ and independent Gaussian noise terms, we can write any step in a closed form:

$$x_t = \left(\prod_{i=1}^t\sqrt{1 - \beta_i}\right) x_0 + \sqrt{1 - \prod_{i=1}^t (1 - \beta_i)}\,\epsilon$$

The factor on $x_0$ is intuitive (every step adds a multiplicative factor), but the factor on $\epsilon$ is not, and requires proof by induction to verify. To match the standard notation, define $\alpha_n$ and $\bar\alpha_n$ as

$$\alpha_n = 1 - \beta_n \\ \bar \alpha_n = \prod_{i=1}^n \alpha_i,$$

at which point the above equation defines the forward process distribution

$$q(x_t|x_0) = N(\sqrt{\bar \alpha_t}x_0, (1 - \bar \alpha_t)I).$$

We'll need this closed form solution later!
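The closed form is easy to sanity-check numerically: simulate the step-by-step corruption many times and compare the resulting mean and variance against $q(x_T|x_0)$. The linear $\beta_t$ schedule below is arbitrary, chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary linear schedule, purely for illustration.
T = 50
betas = np.linspace(1e-4, 0.05, T)
alpha_bar = np.cumprod(1.0 - betas)

# Simulate the step-by-step corruption many times...
x0, n = 2.0, 100_000
x = np.full(n, x0)
for beta in betas:
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(n)

# ...and compare against the closed form q(x_T | x_0) = N(sqrt(abar_T) x0, 1 - abar_T).
print(x.mean(), np.sqrt(alpha_bar[-1]) * x0)  # sample mean vs. closed-form mean
print(x.var(), 1.0 - alpha_bar[-1])           # sample var  vs. closed-form var
```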

The reverse process $p_\theta(x_{t - 1} | x_t)$ is a Gaussian parameterized by a learned $\theta$:

$$p_\theta(x_{t-1} | x_t) = N(\mu_\theta(x_t, t), {\sigma_\theta(x_t, t)}^2 I).$$

We are free to choose the representation $\mu_\theta$ and $\sigma_\theta$. The training process will find parameters $\theta$ which best reverse the effects of the diffusion process, but the quality of this reversal (and thus synthesis quality) will depend on our choices for $\mu_\theta$ and $\sigma_\theta$.

To run inference, we will additionally need some latent prior, $p(x_T)$, which we can easily sample from. Once $p_\theta$ is trained, we can run inference on our model by sampling from $p(x_T)$ and then iteratively sampling $x_{T-1}$, $x_{T-2}$, and so on, until we get our result $x_0$. Sampling $x_{t-1}$ given $x_t$ and $p_\theta$ is straightforward: run $\mu_\theta$ and $\sigma_\theta$ to get a mean and variance for your Gaussian for $x_{t-1}$, and then sample from it.
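In code, the sampling loop might look like the sketch below. The `mu_theta` and `sigma_theta` functions here are simple stand-ins for a trained network, not the real models:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50

# Stand-ins for the trained network; a real model would be learned.
def mu_theta(x_t, t):
    return 0.9 * x_t

def sigma_theta(x_t, t):
    return 0.1

def sample(shape, rng):
    """Ancestral sampling: draw x_T from the prior, then walk the chain down to x_0."""
    x = rng.standard_normal(shape)  # x_T ~ p(x_T) = N(0, I)
    for t in range(T, 0, -1):
        mean, std = mu_theta(x, t), sigma_theta(x, t)
        # Many implementations skip the noise on the very last step.
        noise = rng.standard_normal(shape) if t > 1 else 0.0
        x = mean + std * noise
    return x  # x_0

audio = sample((4, 16000), rng)  # e.g., a batch of four 1-second waveforms
```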

In order for this to work, the latent prior $p(x_T)$ must be sufficiently close to $q(x_T | x_0)$. (The KL-divergence $KL(q(x_T | x_0) || p(x_T))$ must be low.) We're going to use a zero-mean unit-variance prior $p(x_T) = N(0, I)$, which means that we need enough diffusion steps $T$ and large enough diffusion variances $\beta_t$ that the final distribution $q(x_T | x_0)$ resembles white noise. WaveGrad gets $T$ down to as few as 6 iterations by carefully tuning $\beta_t$, but most of the papers that use these methods have dozens or hundreds of iterations for this reason.
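We can check this prior-matching condition directly: both $q(x_T|x_0)$ and $p(x_T)$ are Gaussians, so the KL-divergence has a closed form. The DDPM-style linear schedule below is illustrative; the papers tune these values carefully:

```python
import numpy as np

# An illustrative DDPM-style linear schedule; the papers tune these carefully.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar_T = np.prod(1.0 - betas)

# Per-dimension KL( q(x_T | x_0) || N(0, 1) ) for a data value x0, using the
# closed forms m = sqrt(alpha_bar_T) * x0 and s2 = 1 - alpha_bar_T.
x0 = 1.0
m = np.sqrt(alpha_bar_T) * x0
s2 = 1.0 - alpha_bar_T
kl = 0.5 * (s2 + m ** 2 - 1.0 - np.log(s2))
print(alpha_bar_T, kl)  # both tiny: q(x_T | x_0) is essentially white noise
```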

We want to train $p_\theta(x_{t-1} | x_t)$ to most accurately sample $x_{t-1}$ given $x_t$. To do this, we would (ideally) like to minimize the KL-divergence to the forward process posterior $q(x_{t-1} | x_t)$:

$$\min_\theta J(\theta) = KL\left(q(x_{t-1} | x_t) \; || \; p_\theta(x_{t-1} | x_t)\right)$$

The forward process posterior is related to the forward process distribution via Bayes rule:

$$q(x_{t-1} | x_t) = \frac{q(x_t | x_{t-1}) q(x_{t-1})}{q(x_t)}$$

This distribution cannot be computed, as we do not have access to $q(x_t)$ or $q(x_{t-1})$. However, we *can* compute it **if we condition upon $x_0$:**

$$q(x_{t-1} | x_t, x_0) = \frac{q(x_t | x_{t-1}, x_0) q(x_{t-1} | x_0)}{q(x_t | x_0)}$$

Recall (from above) the closed form equation

$$q(x_t|x_0) = N(\sqrt{\bar \alpha_t}x_0, (1 - \bar \alpha_t)I).$$

We now have closed form expressions for the Gaussians $p_\theta(x_{t-1} | x_t)$ and $q(x_{t-1} | x_t, x_0)$, and the KL-divergence between two Gaussians can be computed analytically, which means that we can minimize this loss.

From Wikipedia,

$$KL(N_0 || N_1) = \frac{1}{2} \left( \frac{{\sigma_0}^2 + (\mu_1 - \mu_0)^2}{{\sigma_1}^2} - 1 + 2 \log \frac{\sigma_1}{\sigma_0} \right)$$
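Transcribed directly (with the $2\log$ term pulled out of the parentheses), this is a one-liner:

```python
import math

def kl_normal(mu0, sigma0, mu1, sigma1):
    """KL( N(mu0, sigma0^2) || N(mu1, sigma1^2) ) for one-dimensional Gaussians."""
    return 0.5 * ((sigma0 ** 2 + (mu1 - mu0) ** 2) / sigma1 ** 2 - 1.0) \
        + math.log(sigma1 / sigma0)
```

For the diagonal-covariance Gaussians used here, the full KL-divergence is just this quantity summed over dimensions.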

The two normal distributions in our case are

$$\begin{align*}N_0 &= q(x_{t-1} | x_t, x_0) = \frac{q(x_t | x_{t-1}, x_0) q(x_{t-1} | x_0)}{q(x_t | x_0)}. \\ N_1 &= p_\theta(x_{t-1} | x_t) = N(\mu_\theta(x_t, t), {\sigma_\theta(x_t, t)}^2 I). \end{align*}$$

The algebra for simplifying $N_0$ gets gnarly. You start by explicitly writing the PDFs for the distributions involved:

$$q(x_t | x_{t-1}, x_0) = N(\sqrt{1 - \beta_t}x_{t-1}, \beta_t I) = \frac{1}{\sqrt{2\pi\beta_t}}\exp\left(-\frac{(x_t - \sqrt{1 - \beta_t}x_{t-1})^2}{2\beta_t}\right) \\ q(x_{t-1} | x_0) = N(\sqrt{\bar \alpha_{t-1}}x_0, (1 - \bar \alpha_{t-1})I) = \frac{1}{\sqrt{2\pi(1-\bar\alpha_{t-1})}}\exp\left(-\frac{(x_{t-1} - \sqrt{\bar\alpha_{t-1}} x_0)^2}{2(1-\bar\alpha_{t-1})}\right)\\ q(x_t | x_0) = N(\sqrt{\bar \alpha_t}x_0, (1 - \bar \alpha_t)I) = \frac{1}{\sqrt{2\pi(1-\bar\alpha_t)}}\exp\left(-\frac{(x_t - \sqrt{\bar\alpha_t} x_0)^2}{2(1-\bar\alpha_t)}\right)$$

Next, substitute these into the forward process posterior conditioned upon $x_0$. After a lot of painful simplification, you get a PDF that corresponds to the following normal distribution:

$$ \frac{q(x_t | x_{t-1}, x_0) q(x_{t-1} | x_0)}{q(x_t | x_0)} = N(\tilde \mu(x_t, x_0), \tilde\beta_t) \\ \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1 - \bar\alpha_t}\beta_t\; ; \; \tilde\mu(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\beta_t}{1 - \bar\alpha_t}x_0 + \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1 - \bar\alpha_t}x_t$$
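If you'd rather not redo the algebra, the result can be checked numerically: evaluate the unnormalized posterior density on a 1-D grid (the numerator of Bayes' rule; the denominator $q(x_t|x_0)$ does not depend on $x_{t-1}$), and compare its mean and variance to $\tilde\mu$ and $\tilde\beta_t$. The schedule and values below are arbitrary:

```python
import numpy as np

# Arbitrary short schedule, purely for a 1-D sanity check of the last step.
betas = np.array([0.01, 0.02, 0.05, 0.1])
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)
beta_t, alpha_t = betas[-1], alphas[-1]
abar_t, abar_prev = alpha_bar[-1], alpha_bar[-2]
x0, xt = 1.5, 0.7

# Unnormalized posterior density over x_{t-1} on a grid: the numerator of
# Bayes' rule (the denominator is constant in x_{t-1}).
grid = np.linspace(-5, 5, 200001)
log_q = (-(xt - np.sqrt(alpha_t) * grid) ** 2 / (2 * beta_t)
         - (grid - np.sqrt(abar_prev) * x0) ** 2 / (2 * (1 - abar_prev)))
w = np.exp(log_q - log_q.max())
mean = (grid * w).sum() / w.sum()
var = ((grid - mean) ** 2 * w).sum() / w.sum()

# Closed-form posterior mean and variance.
beta_tilde = (1 - abar_prev) / (1 - abar_t) * beta_t
mu_tilde = (np.sqrt(abar_prev) * beta_t / (1 - abar_t) * x0
            + np.sqrt(alpha_t) * (1 - abar_prev) / (1 - abar_t) * xt)
```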

Now, finally, we can substitute this and our reverse process Gaussian into the closed form for the KL-divergence between two Gaussians to get our loss function. Once again, the algebra is somewhat gnarly, but if you work through it, fixing the variance to $\beta_t$ (instead of learning it), you arrive at Eq. 10 in this paper. From here on out, the reasoning is straightforward. Your network must learn $\mu_\theta$ to be

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\epsilon\right).$$

Since we have access to $x_t$ itself, we can have the network directly predict $\epsilon$. Theoretically, it doesn't really matter if the network is trained to predict $\mu$ or $\epsilon$ or even $x_0$, but some configurations may be worse in practice.
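A training step then needs only the closed form $q(x_t|x_0)$ to generate a noisy input and its target noise. This is a sketch with an arbitrary schedule; `training_example` is a hypothetical helper, and a real step would compare $\epsilon$ against the network's prediction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary schedule, for illustration only.
T = 50
betas = np.linspace(1e-4, 0.05, T)
alpha_bar = np.cumprod(1.0 - betas)

def training_example(x0, rng):
    """One epsilon-prediction training pair, built from the closed form q(x_t | x_0)."""
    t = rng.integers(1, T + 1)           # uniform step index in [1, T]
    eps = rng.standard_normal(x0.shape)  # the noise the network must predict
    x_t = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1.0 - alpha_bar[t - 1]) * eps
    return x_t, t, eps

# A real training step would minimize || eps - eps_theta(x_t, t) ||^2.
x_t, t, eps = training_example(np.zeros(8), rng)
```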

In the first post, I described the overall structure of the DiffWave and WaveGrad models, but didn't explain how to choose the parameters $c_1$, $c_2$, and $\sigma$ in the model. In this post, we derived values for those parameters – the coefficients in the equation above for $\mu_\theta$. The derivation requires quite a bit of nasty algebra, which is probably why, unfortunately, it is not presented in full in any of the linked papers. According to the authors, using these precise values may be important for best performance, which, if true, would be a rare case of theory driving practice in deep learning.