<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Andrew's Notes]]></title><description><![CDATA[Engineering, machine learning, and maybe a bit of biology.]]></description><link>https://andrew.gibiansky.com/</link><image><url>https://andrew.gibiansky.com/favicon.png</url><title>Andrew&apos;s Notes</title><link>https://andrew.gibiansky.com/</link></image><generator>Ghost 3.12</generator><lastBuildDate>Fri, 04 Jul 2025 17:07:53 GMT</lastBuildDate><atom:link href="https://andrew.gibiansky.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Efficient WaveRNN: Optimizing Nonlinearities]]></title><description><![CDATA[Implementing efficient nonlinearities for WaveRNN CPU inference is tricky but critical.]]></description><link>https://andrew.gibiansky.com/wavernn-demystified-optimizing-nonlinearities/</link><guid isPermaLink="false">6121716d78c79a0007f277bd</guid><dc:creator><![CDATA[Andrew Gibiansky]]></dc:creator><pubDate>Sun, 09 Jun 2024 20:44:18 GMT</pubDate><content:encoded><![CDATA[<p>In this series of posts, we're going to go through the WaveRNN neural vocoder for audio waveform synthesis, along with a variety of implementation details and commonly used extensions. 
For a real implementation, check out the <a href="https://github.com/gibiansky/wavernn">gibiansky/wavernn</a> repository.</p><p><strong>Posts in the Series:</strong></p><ul><li><a href="https://andrew.gibiansky.com/wavernn-demystified-part-1/">Intro</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-quantization-noise-and-pre-emphasis/">Reducing Quantization Noise</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-inference/">Autoregressive Inference</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-advanced-optimizations/">Optimizing Inference Arithmetic</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-sparsity/">Efficient Block-Sparse Weight Matrices</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-optimizing-nonlinearities/">Optimizing Nonlinearities</a></li></ul><h1 id="accelerating-nonlinearities">Accelerating Nonlinearities</h1><p>In previous posts, we talked about a wide variety of optimizations, including rewriting the compute kernel in C++, batching GRU input matrix multiplication, using block-sparse matrix-vector multiplication, SIMD intrinsics, and quantization. Implementing all of these can create a faster-than-realtime synthesis kernel for WaveRNN, but there's still room to squeeze more out of our processors. Benchmarking our kernel, we observed that 10-20% of the time is spent in nonlinearities, including the tanh and sigmoid in the GRU and the softmax in the output layer.</p><h3 id="approximating-tanh-and-sigmoid">Approximating Tanh and Sigmoid</h3><p>The core autoregressive component of WaveRNN is a Gated Recurrent Unit (GRU) RNN. A <a href="https://pytorch.org/docs/stable/generated/torch.nn.GRU.html">GRU</a> uses a state update function which requires two sigmoids and one tanh evaluation per state dimension. 
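</p><p>Concretely, in the PyTorch formulation linked above, a single GRU update computes</p><p>$$\begin{align*}r_t &amp;= \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{t-1} + b_{hr}) \\ z_t &amp;= \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{t-1} + b_{hz}) \\ n_t &amp;= \tanh(W_{in} x_t + b_{in} + r_t \odot (W_{hn} h_{t-1} + b_{hn})) \\ h_t &amp;= (1 - z_t) \odot n_t + z_t \odot h_{t-1},\end{align*}$$</p><p>where the sigmoids for $r_t$ and $z_t$ and the tanh for $n_t$ are applied element-wise over the state.</p><p>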
To compute these, you can use C++ standard library functions (very slow) or the Intel MKL library on x86 (faster); in either case, these nonlinearities will end up consuming a significant percentage of your inference time.</p><p>To speed these up, we can implement our own tanh and sigmoid which are slightly less accurate than standard library or MKL variants, but are significantly faster. First of all, we can write sigmoid as a rescaled tanh:</p><p>$$\sigma(x) = \frac{1}{2} \tanh\left(\frac{x}{2}\right) + \frac{1}{2}.$$</p><p>We will then use a Padé approximation of tanh. A Padé approximation of order [p/q] for a function $f$ is a rational function $F(x)$:</p><p>$$F(x) = \frac{a_0  + a_1 x + a_2 x^2 + \cdots + a_p x^p}{1  + b_1 x + b_2 x^2 + \cdots + b_q x^q}.$$</p><p>Higher values of $p$ and $q$ allow for more precise approximations at the expense of additional computation.</p><p>The coefficients of a Padé approximation are defined to be those for which the first $p + q$ derivatives of $F$ match the corresponding derivatives of tanh at zero:</p><p>$$\begin{align*}F(0) &amp;= \tanh(0) \\ F'(0) &amp;= \tanh'(0) \\ F''(0) &amp;= \tanh''(0) \\ \vdots \\ F^{(p+q)}(0) &amp;= \tanh^{(p+q)}(0)  \end{align*}$$</p><p>This system of equations can be solved to yield unique coefficients for this approximation.
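</p><p>As a small concrete example, solving this system for a [3/2] approximation gives $\tanh(x) \approx \frac{x(15 + x^2)}{15 + 6x^2}$, which matches the Taylor series of tanh through $x^5$. A scalar C++ sketch (for illustration only; real kernels use a higher-order approximant and SIMD):</p><!--kg-card-begin: html--><pre><code class="language-cpp">// [3/2] Pade approximant of tanh. Clamping keeps the rational function
// inside [-1, 1], where it would otherwise diverge for large inputs.
float tanh_approx(float x) {
    float x2 = x * x;
    float y = x * (15.0f + x2) / (15.0f + 6.0f * x2);
    if (y > 1.0f) y = 1.0f;
    if (y < -1.0f) y = -1.0f;
    return y;
}

// Sigmoid via the rescaled-tanh identity above.
float sigmoid_approx(float x) {
    return 0.5f * tanh_approx(0.5f * x) + 0.5f;
}
</code></pre><!--kg-card-end: html--><p>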
That said, I'm lazy and will instead take advantage of some resources to make this easy:</p><ul><li><a href="https://en.wikipedia.org/wiki/Pad%C3%A9_approximant">Wikipedia page on Padé approximants</a></li><li><a href="https://mathr.co.uk/blog/2017-09-06_approximating_hyperbolic_tangent.html">Some Haskell code for solving this system of equations</a></li><li><a href="https://pdfs.semanticscholar.org/bb2a/f84f8a179ac5486cf197c409c01289ff9064.pdf">A paper comparing quality of tanh approximations</a></li><li><a href="https://github.com/hfp/libxsmm/blob/55c6a9f92a6ff0b7124ff351aa3f7c20ec789170/include/libxsmm_intrinsics_x86.h#L653">Some LibXSMM code with the [7/8] coefficients</a></li></ul><p>Given these resources, we can implement an efficient tanh approximation with AVX, AVX-512, or NEON. This approximation can be fused with the sparse GRU GEMV if necessary and results in a 3-4X speedup for these nonlinearities over Intel MKL (which itself is a huge speedup over C++ standard library functions).</p><h3 id="faster-sampling-with-gumbel-softmax">Faster Sampling with Gumbel-Softmax</h3><p>In addition to the tanh and sigmoid in the GRU layer, our WaveRNN performs a softmax after its output layer and prior to sampling. Traditionally, this process has the following steps:</p><ol><li>Given the final layer outputs $x_i$, compute the maximum value $x_{\text{max}}$.</li><li>Subtract the maximum value from each $x_i$.</li><li>Compute $e^{x_i - x_{\text{max}}}$ for every $x_i$ in the logits.</li><li>Compute the normalization term, $\left(\Sigma e^{x_i - x_{\text{max}}}\right)^{-1}$.</li><li>Multiply the exponentiated values by the normalization to get a probability distribution $p_i$.</li><li>Sample a random value $v \sim U(0, 1)$ from the uniform distribution [0, 1]. 
(In order to accelerate sampling in a tight inner loop, pre-compute several thousand random numbers and cycle through them.)</li><li>Find the first index $i$ at which the cumulative sum of $p_i$ exceeds $v$.</li></ol><p>The index $i$ is a sample from your discrete probability distribution. </p><p>This sampling procedure has a few downsides. It requires scanning through your logits at least three times: once to find the maximum, once to exponentiate and compute normalization terms, and once to sample the final value. It also requires computing $e^x$, which is an expensive nonlinearity to compute accurately. </p><p>We can address both of these downsides using Gumbel-Softmax. <a href="https://arxiv.org/abs/1611.01144">Gumbel-Softmax was originally introduced</a> in order to approximate sampling during training of a neural network, so that a discrete sampling step could be introduced in an intermediate layer of a network. The key point for our purposes, however, is the following:</p><ul><li>Sample $v_i \sim U(0, 1)$ from the uniform distribution [0, 1].</li><li>Compute $g_i = -\ln(-\ln v_i)$. $g_i$ is a sample from the Gumbel distribution.</li><li>Compute modified logits ${\hat x}_i = x_i + g_i$. (You must sample a distinct $g_i$ for each element of the logits.)</li><li>The index $i = \text{argmax}\; \hat x$ is a sample from the discrete distribution $\text{softmax}(x)$.</li></ul><p>You can find the derivation for this neat property in the <a href="https://arxiv.org/pdf/1611.01144.pdf">Gumbel-Softmax paper</a> or the <a href="https://arxiv.org/abs/1611.00712">concrete distribution paper</a> (which describes the same distribution). </p><p>Using this property, we can do our sampling in a single pass. Given a vector of logits $x_i$ and a vector of pre-computed samples $g_i$ from the Gumbel distribution, we can compute $x_i + g_i$. As we compute this sum, keep track of the maximum value so far and the index of the maximum value.
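</p><p>This single-pass loop might be sketched as follows (names are illustrative, and the Gumbel samples are assumed to be precomputed as described above):</p><!--kg-card-begin: html--><pre><code class="language-cpp">// Sample from softmax(logits) with the Gumbel-argmax trick.
// gumbel[i] holds precomputed samples of -ln(-ln(U(0, 1))).
int sample_gumbel_argmax(int size, const float* logits, const float* gumbel) {
    int best_index = 0;
    float best_value = logits[0] + gumbel[0];
    for (int i = 1; i < size; i++) {
        float value = logits[i] + gumbel[i];
        if (value > best_value) {
            best_value = value;
            best_index = i;
        }
    }
    return best_index;
}
</code></pre><!--kg-card-end: html--><p>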
When you reach the end of $x$, the resulting index is a sample from your distribution.</p><p>Although sampling was a small fraction (5%) of our inference costs, applying this small optimization sped it up by a factor of five, making the cost of sampling negligible.</p><h2 id="summary">Summary</h2><p>Once the matrix-vector multiplies in WaveRNN are sufficiently optimized, the nonlinearities (tanh, sigmoid, and softmax) start being a significant fraction of the WaveRNN inference time. Using <a href="https://en.wikipedia.org/wiki/Pad%C3%A9_approximant">Padé approximants</a>, we can approximate tanh and sigmoid with easy-to-compute rational functions which can be evaluated in only a few arithmetic instructions. We can speed up sampling from a softmax by using the Gumbel-Softmax trick, drawing samples from a Gumbel distribution and taking an argmax of the sampled value plus the logit in order to sample from the original softmax distribution. These two optimizations, while small, can speed up inference by 10-15%.</p><p>Check out the implementation at <a href="https://github.com/gibiansky/wavernn">gibiansky/wavernn</a> or any of the other blog posts in this series:</p><ul><li><a href="https://andrew.gibiansky.com/wavernn-demystified-part-1/">Intro</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-quantization-noise-and-pre-emphasis/">Reducing Quantization Noise</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-inference/">Autoregressive Inference</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-advanced-optimizations/">Optimizing Inference Arithmetic</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-sparsity/">Efficient Block-Sparse Weight Matrices</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-optimizing-nonlinearities/">Optimizing Nonlinearities</a></li></ul>]]></content:encoded></item><item><title><![CDATA[Efficient WaveRNN: Block
Sparsity]]></title><description><![CDATA[WaveRNN inference can be accelerated by using block-sparse weight matrices combined with specialized block-sparse matrix-vector multiplication kernels.]]></description><link>https://andrew.gibiansky.com/wavernn-demystified-sparsity/</link><guid isPermaLink="false">6112d11278c79a0007f2721f</guid><dc:creator><![CDATA[Andrew Gibiansky]]></dc:creator><pubDate>Sun, 09 Jun 2024 20:43:44 GMT</pubDate><media:content url="https://andrew.gibiansky.com/content/images/2021/08/Matrix-Packing-1.png" medium="image"/><content:encoded><![CDATA[<img src="https://andrew.gibiansky.com/content/images/2021/08/Matrix-Packing-1.png" alt="Efficient WaveRNN: Block Sparsity"><p>In this series of posts, we're going to go through the WaveRNN neural vocoder for audio waveform synthesis, along with a variety of implementation details and commonly used extensions. For a real implementation, check out the <a href="https://github.com/gibiansky/wavernn">gibiansky/wavernn</a> repository.</p><p><strong>Posts in the Series:</strong></p><ul><li><a href="https://andrew.gibiansky.com/wavernn-demystified-part-1/">Intro</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-quantization-noise-and-pre-emphasis/">Reducing Quantization Noise</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-inference/">Autoregressive Inference</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-advanced-optimizations/">Optimizing Inference Arithmetic</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-sparsity/">Efficient Block-Sparse Weight Matrices</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-optimizing-nonlinearities/">Optimizing Nonlinearities</a></li></ul><h1 id="accelerating-inference-with-block-sparsity-matrices">Accelerating Inference with Block Sparsity Matrices</h1><p>As we reviewed in the previous blog post on WaveRNN inference, a single step of WaveRNN consists of sample 
embeddings, a GRU layer, two linear layers, and a sampling step.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://andrew.gibiansky.com/content/images/2021/08/WaveRNN-Single-Inference-Step.png" class="kg-image" alt="Efficient WaveRNN: Block Sparsity"><figcaption>Diagram of a single step of inference for WaveRNN. A sample is embedded with a sample embedding matrix and added to the conditioner output to create the GRU input. The result is run through a GRU, a linear layer, a ReLU, another linear layer, and a softmax activation. The distribution is sampled to produce the next sample in the waveform.</figcaption></figure><p>The bulk of the required compute (arithmetic) in this process can be grouped into four parts:</p><ol><li>The GRU state-to-activation matrix multiply. For example, for a GRU with a state dimension of 512, this is a 512 x 1536 matrix-vector multiplication.</li><li>The hidden layer matrix multiply. For a hidden layer with 512 units and GRU with 512 units, this is a 512 x 512 matrix-vector multiplication.</li><li>The output layer matrix multiply. For a hidden layer with 512 units and 256 output buckets, this is a 512 x 256 matrix-vector multiplication.</li><li>The nonlinearities and element-wise operations, including tanh and sigmoid (for the GRU), ReLU (for the hidden layer), and softmax (for the output layer). </li></ol><p>(As we saw in the previous post on WaveRNN inference, the GRU input-to-activation matrix multiply can be accelerated by precomputing it and thus takes negligible compute.) Of these four steps, the first three (the matrix-vector multiplications) can be sped up by replacing the matrices involved with sparse matrices.</p><h3 id="sparse-matrix-multiplication">Sparse Matrix Multiplication</h3><p>A sparse matrix is one where a significant proportion of its entries are zero. For example, a matrix with 90% sparsity has 90% of its entries filled with zeros.
Since multiplying by zero produces zero, these entries contribute nothing to the final result. This means that if we know which entries are zero and don't bother computing with them, we can reduce the amount of compute by 10X and in theory realize a 10X speedup.</p><p>Of course, it's not that easy. In practice, efficient sparse matrix multiplication algorithms are very challenging to write and require high degrees of sparsity (&gt;80%) to obtain <em>any</em> speedup, and the speedup they do obtain can be meager (for example, a 2X speedup for a matrix with 95% sparsity). </p><p>Sparse matrix multiplication algorithms are often slow due to the overhead of tracking which matrix entries are zeros and due to poor memory access patterns. For example, a simple sparse multiply can be implemented by storing a list of coordinates of non-zero entries and iterating over them to perform the multiplications and accumulations on those elements. However, if you load the coordinates from memory, then load the value at that coordinate from memory, then do the multiplication and accumulation, you end up doing two memory reads (one for the coordinates and one for the value) per multiplication – twice as many memory reads as a dense matrix multiply. Additionally, since you are accessing non-contiguous parts of your matrix, you will have a very unpredictable and cache-unfriendly memory access pattern. When benchmarked against dense matrix multiplies with optimized tiling for cache-friendly memory access implemented with vector instructions (AVX, SSE, NEON), a naive sparse matrix multiply will end up much slower until you reach extremely high levels of sparsity and very large matrices.
Thus, we can use <em>block sparsity</em> to make our sparse matrix multiplication kernels much easier to write and much more efficient.</p><p>A block sparse matrix is a sparse matrix where entire blocks (rectangular grids of values in the matrix) are either zero or non-zero. With block sparsity, we only need to store the indices of non-zero blocks, which can be a significant reduction in the amount of indexing we need to perform (relative to unstructured sparsity).</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://andrew.gibiansky.com/content/images/2021/08/Matrix-Sparsity.png" class="kg-image" alt="Efficient WaveRNN: Block Sparsity"><figcaption>Diagram of four 8x8 matrices: a dense matrix, a matrix with unstructured sparsity, a block sparse matrix with 1x4 blocks, and a block sparse matrix with 2x2 blocks.</figcaption></figure><p>Additionally, we can choose the block size so that we can use our processor's vector registers to do our arithmetic. Pretty much all modern processors support some sort of vector arithmetic. Vector instructions allow us to execute a single instruction to load, store, or perform arithmetic on multiple values at the same time, while still taking only one clock cycle (approximately). For instance, if our matrix block size is equal to our vector register length (e.g. 8 floats with AVX), we can implement simple kernels which load and multiply blocks in just a few instructions per block.</p><h3 id="matrix-packing">Matrix Packing</h3><p>While block sparsity easily addresses the amount of indexing, loading, and arithmetic we need to do in our compute kernel, our memory access patterns can still be quite unfriendly to the cache. Since the non-zero blocks may be far away from each other in memory when the matrix is laid out densely, the memory accesses will not be on the same cache line.
Additionally, the CPU prefetcher will not be able to predict access patterns and fetch the needed cache line in advance.</p><p>To improve our memory access patterns, we can repack our matrices in memory for easier access. For the purposes of WaveRNN inference, our matrix is fixed and we reuse it thousands of times, so the cost of repacking the matrix is negligible.</p><p>We can repack our matrix into a representation consisting of three arrays:</p><ol><li>A float array consisting of the matrix data. Only non-zero blocks are kept; all zeros are removed.</li><li>An integer array indicating the input indices corresponding to each matrix block.</li><li>An integer array indicating how many blocks correspond to each output row.</li></ol><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://andrew.gibiansky.com/content/images/2021/08/Matrix-Packing.png" class="kg-image" alt="Efficient WaveRNN: Block Sparsity"><figcaption>Packing a block-sparse matrix turns it into a linearly accessed data array, input index array, and blocks per row array.</figcaption></figure><p>Our block sparse matrix-vector multiplication kernels can then read through these arrays from start to finish in a linear pattern. Since we store only non-zero entries, our packed matrix might fit entirely in cache, and the prefetcher together with our linear access patterns can ensure that all our data has been loaded into the cache by the time we need it.</p><p>A simple algorithm for multiplying with these packed matrices (assuming 1x4 blocks) is to loop over the rows of the output vector. For each row, look up the number of blocks you need to multiply. For each block you need to multiply, read it from the data buffer, find the index it is being multiplied by, read from that index, and perform your multiplication and accumulation.
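</p><p>As a scalar sketch (with illustrative names; a real kernel would replace the inner block multiply with SIMD loads and fused multiply-adds), this algorithm might look like:</p><!--kg-card-begin: html--><pre><code class="language-cpp">// Multiply a packed block-sparse matrix (1x4 blocks) by a vector.
// data:           non-zero blocks, packed contiguously (4 floats each)
// input_indices:  input offset each block is multiplied against
// blocks_per_row: number of non-zero blocks in each output row
void block_sparse_gemv(int rows, const float* data, const int* input_indices,
                       const int* blocks_per_row, const float* input,
                       float* output) {
    int block = 0;
    for (int row = 0; row < rows; row++) {
        float accumulator = 0.0f;
        for (int b = 0; b < blocks_per_row[row]; b++, block++) {
            const float* w = data + 4 * block;
            const float* x = input + input_indices[block];
            accumulator += w[0] * x[0] + w[1] * x[1] + w[2] * x[2] + w[3] * x[3];
        }
        output[row] = accumulator;
    }
}
</code></pre><!--kg-card-end: html--><p>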
With this algorithm, the data buffer, the input indices, and the blocks per row are accessed in a completely predictable linear fashion, leading to good performance without much modification.</p><h3 id="inducing-sparsity-during-training">Inducing Sparsity During Training</h3><p>So far, we have discussed how to accelerate WaveRNN inference through switching out our dense matrix-vector multiplications for sparse (or block sparse) matrix-vector multiplications. To do so, we need to ensure that the weight matrices in the WaveRNN are primarily composed of zeros in a block sparse manner.</p><p>We can force our model to learn sparse weight matrices using magnitude-based weight pruning. During training, we identify the least important weights (as evidenced by their absolute value or magnitude) and then forcibly set them to zero. As training progresses, we snap progressively more and more weights to zero. Since the model starts out completely random, we allocate a bit of time (a warmup period) for the model to learn prior to beginning sparsification. The specific sparsification schedule used with WaveRNN is usually a cubic function which starts out rapidly pruning weights but, as the number of non-zero weights falls, slows down towards the end until it reaches its full expected level of sparsity.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://andrew.gibiansky.com/content/images/2021/08/sparsity.png" class="kg-image" alt="Efficient WaveRNN: Block Sparsity"><figcaption>Plot of enforced sparsity levels throughout training, starting with a warmup period with no pruned weights and ramping up to a very sparse model after a million iterations.</figcaption></figure><p>To get a block-sparse matrix, instead of pruning individual weights based on their magnitude, we prune blocks, where the magnitude of a block is defined as the maximum magnitude of the weights in the block. 
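</p><p>The cubic pruning schedule described above might be sketched as follows (parameter names and the exact schedule shape are illustrative; they are hyperparameters in practice):</p><!--kg-card-begin: html--><pre><code class="language-cpp">// Target sparsity at a given training step under a cubic schedule:
// zero during warmup, then ramping quickly at first and flattening out
// as it approaches the final sparsity level.
float target_sparsity(long step, long warmup_steps, long schedule_steps,
                      float final_sparsity) {
    if (step < warmup_steps) return 0.0f;
    float t = (float)(step - warmup_steps) / (float)schedule_steps;
    if (t > 1.0f) t = 1.0f;
    float remaining = 1.0f - t;
    return final_sparsity * (1.0f - remaining * remaining * remaining);
}
</code></pre><!--kg-card-end: html--><p>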
We can implement sparsity by computing a block mask after every training iteration based on the block magnitudes and zeroing out the weights in the lowest-magnitude blocks.</p><p>Deep neural networks tend to be vastly overparameterized, and so the models learned this way with a very high degree of sparsity (90-95% sparse) are only slightly worse than dense models. However, sparse models take much longer to converge during training.</p><h2 id="summary">Summary</h2><p>A large fraction of WaveRNN inference time consists of matrix-vector multiplications. We can train deep neural networks which use sparse matrices – matrices in which a large fraction of the entries are zero. Since zero entries don't contribute to the final output, we can write highly efficient inference kernels for sparse matrix-vector multiplications, speeding up WaveRNN inference significantly. Sparse matrix-vector multiplications with unstructured sparsity (where non-zero entries may be located anywhere) require very high levels of sparsity, but we can instead enforce block sparsity (where non-zero entries are contiguous in blocks), which allows for much more efficient memory access patterns and higher speedups.
Block sparsity integrates well with vector instructions on modern processors (such as AVX and NEON instructions) which allow processing multiple values with a single instruction.</p><p>Check out the implementation at <a href="https://github.com/gibiansky/wavernn">gibiansky/wavernn</a> or proceed to the subsequent blog posts:</p><ul><li><a href="https://andrew.gibiansky.com/wavernn-demystified-optimizing-nonlinearities/">Optimizing Nonlinearities</a></li></ul>]]></content:encoded></item><item><title><![CDATA[Efficient WaveRNN: Optimizing Arithmetic]]></title><description><![CDATA[WaveRNN inference can be made dramatically faster with SIMD inference kernels and int16 quantization.]]></description><link>https://andrew.gibiansky.com/wavernn-demystified-advanced-optimizations/</link><guid isPermaLink="false">6121202f78c79a0007f275a5</guid><dc:creator><![CDATA[Andrew Gibiansky]]></dc:creator><pubDate>Sun, 09 Jun 2024 20:43:09 GMT</pubDate><content:encoded><![CDATA[<p>In this series of posts, we're going to go through the WaveRNN neural vocoder for audio waveform synthesis, along with a variety of implementation details and commonly used extensions. 
For a real implementation, check out the <a href="https://github.com/gibiansky/wavernn">gibiansky/wavernn</a> repository.</p><p><strong>Posts in the Series:</strong></p><ul><li><a href="https://andrew.gibiansky.com/wavernn-demystified-part-1/">Intro</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-quantization-noise-and-pre-emphasis/">Reducing Quantization Noise</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-inference/">Autoregressive Inference</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-advanced-optimizations/">Optimizing Inference Arithmetic</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-sparsity/">Efficient Block-Sparse Weight Matrices</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-optimizing-nonlinearities/">Optimizing Nonlinearities</a></li></ul><h1 id="advanced-kernel-optimizations">Advanced Kernel Optimizations</h1><p>Using WaveRNN in production relies on a host of optimizations which can accelerate the model by several orders of magnitude. As we discussed in previous posts, we can start our optimizations by rewriting the inference kernel in C++, batching the GRU input matrix multiply, and by block-sparsifying and repacking the weight matrices for use with a custom matrix-vector multiply (GEMV) kernel. Although these drastically increase the performance of our model, there's still a lot we can do to squeeze speed out of our processors.</p><h3 id="simd-intrinsics">SIMD Intrinsics </h3><p>Generally speaking, a computer processor reads and executes a stream of instructions, where each instruction operates on one or two values in memory or in processor registers. However, in order to accelerate repeating the same operation across thousands or millions of values, most modern processors support some form of Single-Instruction Multiple-Data (SIMD) instructions. 
These instructions operate on vectors of a few contiguous values  (and hence are often called vector instructions). Different processors use different SIMD instructions: for our purposes, we care about x86 AVX2 instructions (pre-2017 Intel), AVX-512 instructions (post-2017 Intel), and NEON instructions (ARM). NEON instructions operate on 128 bits of data, AVX2 on 256 bits of data, and AVX-512 on (you guessed it!) 512 bits of data.</p><p>In an ideal world, we would never have to think about what instructions our C++ compiler is generating to perform our arithmetic, and indeed, GCC and Clang try hard to auto-vectorize code and use SIMD instructions as much as they can. But look around you – the world is not ideal, not by a long stretch. For performance-sensitive parts of code, you can get significant speedups by directly writing SIMD instructions instead of relying on a compiler to guess what you mean. In fact, if you check out the kernel code in <a href="https://github.com/gibiansky/wavernn">gibiansky/wavernn</a>, you'll find direct SIMD implementations of almost every performance-sensitive part.</p><p>SIMD instructions are used in C / C++ through SIMD intrinsics, special functions which the compiler recognizes and converts to SIMD instructions. To give you a taste, let's go through how we would hand-vectorize a simple function which adds two float32 vectors:</p><!--kg-card-begin: html--><pre><code class="language-cpp">// Computes out[i] = a[i] + b[i]
void elementwise_add(int size, float* out, float* a, float* b) {
    for(int i = 0; i < size; i++) {
        out[i] = a[i] + b[i];
    }
}
</code></pre><!--kg-card-end: html--><p>For a function this simple, you should not expect a huge performance increase from rewriting it with SIMD intrinsics; the compiler should do a good job auto-vectorizing this with -Ofast (though you may need to tell it these are not aliasing pointers with __restrict__). So treat this as an opportunity to look at some SIMD code, not as a real-world example.</p><p>When using AVX, we use __m256 and __m512 data types to represent vectors of 256 or 512 bits storing float data. Instructions for working with these are prefixed _mm256_ or _mm512_, respectively, and suffixed for the type of data they are working with (_ps for "packed single", _pd for "packed double", _ss for "scalar single", etc). For example, the AVX2 unaligned load intrinsic is _mm256_loadu_ps. (An unaligned load is a load from memory that may not be on a 32-byte boundary. Older processors execute aligned loads much faster than unaligned loads, though this penalty is lower on recent CPUs.)</p><p>Putting this all together, here is an equivalent function using AVX2 intrinsics:</p><!--kg-card-begin: html--><pre><code class="language-cpp">#include &lt;immintrin.h&gt;

// Computes out[i] = a[i] + b[i]
void elementwise_add(int size, float* out, float* a, float* b) {
    int i = 0;
    for(; i + 7 < size; i += 8) {
        // Load 8 floats from a.
        __m256 x = _mm256_loadu_ps(a + i);
        
        // Load 8 floats from b.
        __m256 y = _mm256_loadu_ps(b + i);
        
        // Sum up the floats.
        __m256 sum = _mm256_add_ps(x, y);
        
        // Write out the 8 floats to out.
        _mm256_storeu_ps(out + i, sum);
    }    
    
    // In case size is not divisible by 8.
    for(; i < size; i++) {
        out[i] = a[i] + b[i];
    }
}
</code></pre><!--kg-card-end: html--><p>AVX-512 will look very similar, using __m512 and _mm512 instead of __m256 and _mm256, respectively. (AVX-512 adds lots of other functionality besides longer vector registers, but it's not very relevant for this simple function.)</p><!--kg-card-begin: html--><pre><code class="language-cpp">#include &lt;immintrin.h&gt;	

// Computes out[i] = a[i] + b[i]
void elementwise_add(int size, float* out, float* a, float* b) {
    int i = 0;
    for(; i + 15 < size; i += 16) {
        // Load 16 floats from a.
        __m512 x = _mm512_loadu_ps(a + i);
        
        // Load 16 floats from b.
        __m512 y = _mm512_loadu_ps(b + i);
        
        // Sum up the floats.
        __m512 sum = _mm512_add_ps(x, y);
        
        // Write out the 16 floats to out.
        _mm512_storeu_ps(out + i, sum);
    }    
    
    // In case size is not divisible by 16.
    for(; i < size; i++) {
        out[i] = a[i] + b[i];
    }
}
</code></pre><!--kg-card-end: html--><p>NEON intrinsics for ARM use 128-bit vector registers. The types are of the form {data}x{count}_t; for example, float32x4_t is a 128-bit register with 4 float32 values in it. Intrinsics start with "v" (for "vector") and end with a suffix indicating the data type, such as "_f32" for 32-bit floats. Instructions which operate on 128-bit registers have names that end in "q". For example, loading from memory is done with the vld1q_f32 intrinsic, storing to memory uses the vst1q_f32 intrinsic, and vaddq_f32 adds float32x4_t values. Putting it together, you get the following elementwise sum function:</p><!--kg-card-begin: html--><pre><code class="language-cpp">#include &lt;arm_neon.h&gt;

// Computes out[i] = a[i] + b[i]
void elementwise_add(int size, float* out, float* a, float* b) {
    int i = 0;
    for(; i + 3 < size; i += 4) {
        // Load 4 floats from a.
        float32x4_t x = vld1q_f32(a + i);
        
        // Load 4 floats from b.
        float32x4_t y = vld1q_f32(b + i);
        
        // Sum up the floats.
        float32x4_t sum = vaddq_f32(x, y);
        
        // Write out the 4 floats to out.
        vst1q_f32(out + i, sum);
    }    
    
    // In case size is not divisible by 4.
    for(; i < size; i++) {
        out[i] = a[i] + b[i];
    }
}
</code></pre><!--kg-card-end: html--><h3 id="quantized-inference">Quantized Inference</h3><p>SIMD registers generally fit a fixed number of bits (128, 256, or 512), but depending on our data type, these can hold different numbers of values. For example, a 256-bit register can hold 8 32-bit floats, 16 16-bit floats or ints, and 32 8-bit ints. Instructions for multiplication and addition generally take a single cycle (that is, you can complete one such instruction per cycle) no matter what data they are operating on, which means that reducing the bit precision of our operands is a great way to accelerate our inference kernels.</p><p>Unfortunately, unlike GPUs, CPUs thus far tend to have poor support for float16. This means that in order to squeeze more speed out of our kernels, we're going to have to shift to quantized arithmetic and do our matrix-vector multiplies in int8 or int16.</p><p>Quantizing WaveRNN to 8 bits results in significant quality degradation unless it is trained in a quantization-aware way, but if we stick to 16-bit inference, we can accelerate inference while keeping audio quality high.</p><p>In order to do an int16 matrix-vector multiply, we can:</p><ol><li>Compute the maximum magnitude of each row, $\beta_r$.</li><li>Rescale each row to the range [-8192, 8192] by multiplying by $\frac{8192}{\beta_r}$.</li><li>Round each row to the nearest integer in int16.</li><li>Compute the maximum magnitude $\alpha$ of the input vector.</li><li>Rescale the input vector to the range [-8192, 8192] by multiplying by $\frac{8192}{\alpha}.$</li><li>Round the input vector elements to the nearest integer in int16.</li><li>For every row, compute the dot product with the input, doing multiplication in int16 and accumulation in int32.</li><li>Scale the result of the dot product to undo the scaling done on the inputs, multiplying the results by $\frac{\alpha \beta_r}{8192^2}.$</li></ol><p>Storing a per-row maximum weight magnitude is convenient if the matrix-vector multiply
is done row-wise; another alternative with slightly reduced precision is to store a single scaling factor for the entire matrix. </p><p>Since the weights are fixed, we can perform steps (1), (2), and (3) in advance. This allows us to reduce the amount of data we load from RAM by 2x. In theory, we could get up to a 2X speedup, but in practice, getting a 1.5X speedup from quantization is more doable.</p><h2 id="summary">Summary</h2><p>WaveRNN inference can be fast, but making it fast requires a variety of low-level optimizations to the inference kernels. One crucial optimization is using SIMD instructions such as SSE, AVX, AVX-512, NEON, and SVE for arithmetic when the processor the kernel is running on supports it. Although compilers have auto-vectorizers to take advantage of these instructions, manually writing your arithmetic routines using compiler intrinsics or assembly can still provide a speed boost. A second optimization is int16 quantization – since twice as many int16 values fit in vector registers as float32 values, rewriting matrix-vector multiplies to operate primarily on int16 data can yield a speed boost. 
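</p><p>As a concrete illustration, the int16 matrix-vector multiply recipe above can be sketched in NumPy (a plain scalar reference, not the vectorized C++ kernel; the helper names are ours):</p>

```python
import numpy as np

SCALE = 8192  # rows and inputs are rescaled into [-8192, 8192]

def quantize_rows(w):
    """Steps (1), (2), (3): per-row max magnitudes, rescale, round to int16."""
    beta = np.maximum(np.abs(w).max(axis=1), 1e-12)   # per-row max |w|
    q = np.rint(w * (SCALE / beta)[:, None]).astype(np.int16)
    return q, beta

def int16_matvec(q, beta, x):
    """Steps (4) through (8): quantize the input, multiply, undo the scaling."""
    alpha = max(np.abs(x).max(), 1e-12)               # input max magnitude
    qx = np.rint(x * (SCALE / alpha)).astype(np.int16)
    # Multiply in int16 values, accumulate in int32 (cast up to avoid overflow).
    acc = q.astype(np.int32) @ qx.astype(np.int32)
    return acc * (alpha * beta / SCALE**2)            # undo both rescalings

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))
x = rng.normal(size=8)
q, beta = quantize_rows(w)
err = np.abs(int16_matvec(q, beta, x) - w @ x).max()  # tiny quantization error
```

<p>Here <code>quantize_rows</code> covers steps (1), (2), and (3), which run offline, while <code>int16_matvec</code> performs the remaining per-call steps.</p><p>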
Together, these optimizations can speed up a WaveRNN kernel significantly, allowing you to synthesize audio faster than realtime.</p><p>Check out the implementation at <a href="https://github.com/gibiansky/wavernn">gibiansky/wavernn</a> or proceed to the subsequent blog posts:</p><ul><li><a href="https://andrew.gibiansky.com/wavernn-demystified-sparsity/">Efficient Block-Sparse Weight Matrices</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-optimizing-nonlinearities/">Optimizing Nonlinearities</a></li></ul>]]></content:encoded></item><item><title><![CDATA[Efficient WaveRNN: Autoregressive Inference]]></title><description><![CDATA[If implemented in Python and Pytorch, WaveRNN inference is too slow, but we can make it faster with several simple optimizations.]]></description><link>https://andrew.gibiansky.com/wavernn-demystified-inference/</link><guid isPermaLink="false">6111cb0678c79a0007f270c5</guid><dc:creator><![CDATA[Andrew Gibiansky]]></dc:creator><pubDate>Sun, 09 Jun 2024 20:42:28 GMT</pubDate><content:encoded><![CDATA[<p>In this series of posts, we're going to go through the WaveRNN neural vocoder for audio waveform synthesis, along with a variety of implementation details and commonly used extensions. 
For a real implementation, check out the <a href="https://github.com/gibiansky/wavernn">gibiansky/wavernn</a> repository.</p><p><strong>Posts in the Series:</strong></p><ul><li><a href="https://andrew.gibiansky.com/wavernn-demystified-part-1/">Intro</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-quantization-noise-and-pre-emphasis/">Reducing Quantization Noise</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-inference/">Autoregressive Inference</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-advanced-optimizations/">Optimizing Inference Arithmetic</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-sparsity/">Efficient Block-Sparse Weight Matrices</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-optimizing-nonlinearities/">Optimizing Nonlinearities</a></li></ul><h2 id="wavernn-autoregressive-inference">WaveRNN Autoregressive Inference</h2><p>As we discussed in the previous blog post, WaveRNN is an autoregressive neural vocoder which synthesizes audio sample-by-sample, as shown in the following diagram. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://andrew.gibiansky.com/content/images/2021/08/Baseline-WaveRNN--inference---1-.png" class="kg-image"><figcaption>Neural network diagram for WaveRNN inference.</figcaption></figure><p>This means that after we run the conditioning network on the input spectrograms, we need to, for each sample:</p><ul><li>Compute a sample embedding vector for the previously synthesized sample.</li><li>Add the sample embedding to the conditioning vector from the conditioning network to get the GRU input vector.</li><li>Run the GRU RNN on the GRU input. 
This consists of multiplying the input vector and the GRU state vector by the GRU weight matrices and then applying the GRU nonlinearities to compute a new GRU state (which is also its output).</li><li>Run the linear layer and the ReLU nonlinearity on the GRU output to get the hidden layer activations.</li><li>Run the final linear layer on the hidden layer activations and the softmax nonlinearity to get a discrete probability distribution over the next sample.</li><li>Randomly sample from the probability distribution to get the next sample.</li></ul><p>Once the waveform is generated, dequantize the discretized samples and apply µ-law expanding to get the final waveform.</p><h3 id="streaming-inference">Streaming Inference</h3><p>WaveRNN takes a lot of compute to run – for every synthesized sample, you need to run one timestep of the neural network. As a result, it can be quite slow to synthesize with. For interactive applications of TTS (such as voice assistants), you may want to start playing audio to the user before the entire synthesis is finished, which means you want to be able to stream through WaveRNN synthesis to minimize latency between receiving a TTS request and responding with initial synthesized audio.</p><p>To stream through the conditioning network, you can chunk up the input spectrograms into overlapping chunks and run those chunks separately through the network. (The chunks must be overlapping to avoid discontinuities at the boundaries; don't use zero padding when streaming through convolutions!) Alternatively, you can use a <a href="https://andrew.gibiansky.com/streaming-audio-synthesis/">more clever approach</a> to streaming through audio synthesis to avoid repeating computation in the conditioning network.</p><p>Streaming through the autoregressive network requires keeping track of two pieces of state: the previously synthesized sample (initialized to 128, representing zero) and the current GRU state (initialized to a vector of zeros). 
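</p><p>A toy NumPy sketch of this stateful loop (random stand-in weights, toy dimensions, biases omitted; all names here are illustrative rather than the real kernel API):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
H = 16                                      # toy GRU state size

# Random stand-in parameters (a real kernel would load trained weights).
# The sample embedding table is stored pre-multiplied by the GRU input
# matrices, so each row is already a 3H-dimensional GRU input contribution.
embedding = 0.1 * rng.normal(size=(256, 3 * H))
W_h = 0.1 * rng.normal(size=(3 * H, H))
W_out = 0.1 * rng.normal(size=(256, H))

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, h):
    # One GRU update (PyTorch gate ordering r, z, n; biases omitted).
    g = W_h @ h
    r = sigmoid(x[:H] + g[:H])
    z = sigmoid(x[H:2 * H] + g[H:2 * H])
    n = np.tanh(x[2 * H:] + r * g[2 * H:])
    return (1 - z) * n + z * h

def synthesize_chunk(conditioning, prev_sample, h):
    # One kernel invocation: a few hundred samples, then return the state.
    samples = []
    for c in conditioning:                  # one 3H-dim vector per sample
        h = gru_step(c + embedding[prev_sample], h)
        logits = W_out @ h
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        prev_sample = int(rng.choice(256, p=probs))
        samples.append(prev_sample)
    return samples, prev_sample, h

# Outer streaming loop: carry (prev_sample, h) across chunk boundaries.
prev_sample, h = 128, np.zeros(H)           # 128 represents zero
audio = []
for _ in range(4):                          # e.g. 4 chunks of 100 samples
    cond = 0.1 * rng.normal(size=(100, 3 * H))
    chunk, prev_sample, h = synthesize_chunk(cond, prev_sample, h)
    audio.extend(chunk)                     # stream this chunk to the user
```

<p>A real kernel would implement the inner loop in C++; the point here is only how the two pieces of state cross chunk boundaries.</p><p>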
At each timestep, you run the autoregressive network to update the GRU state and generate a new sample.</p><p>Starting compute kernels generally has some overhead, so it is best to stream in chunks. A single invocation of the WaveRNN inference kernel should synthesize at least a few hundred samples (a few milliseconds of audio), and an outer loop should repeatedly call the inference kernel to synthesize the whole audio clip while sending intermediate results to the user.</p><h3 id="optimizations">Optimizations</h3><p>Productionizing an implementation of WaveRNN requires a heavy focus on optimizing inference speed to achieve faster-than-realtime synthesis. </p><p><strong>C++ Implementation: </strong>The first and largest optimization you can make is simply removing Python from the equation and implementing your inner inference loop in C++. For example in recent testing, using Python-based inference logic took about 100 seconds to synthesize 10 seconds of audio, but the same logic implemented in C++ (using the same matrix multiplication kernels, etc) took only 30 seconds (a roughly 3X speed-up). Implementing the kernel in C++ also opens the door to further optimizations, such as minimizing memory allocations, fine-grained multithreading, and more. </p><p><strong>GRU Input Matrix Multiply Batching: </strong>Another optimization opportunity arises in the way GRUs are implemented. If you look at the <a href="https://pytorch.org/docs/stable/generated/torch.nn.GRU.html">PyTorch GRU implementation</a>, we have two matrix multiplies: one that applies to the state and one that applies to the inputs. 
</p><p>$$\begin{align*}r_t &amp;= \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{(t-1)} + b_{hr}) \\            z_t &amp;= \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{(t-1)} + b_{hz}) \\            n_t &amp;= \tanh(W_{in} x_t + b_{in} + r_t * (W_{hn} h_{(t-1)}+ b_{hn})) \\            h_t &amp;= (1 - z_t) * n_t + z_t * h_{(t-1)}\end{align*}$$</p><p>For WaveRNN, the input is the sum of the conditioner network output $c_t$ (which changes once per frame) and the sample embedding $s_t$:</p><p>$$x_t = c_t + s_t.$$</p><p>We can compute a new sample embedding matrix which incorporates $W_{ir}$, $W_{iz}$, $W_{in}$, and their respective biases:</p><p>$$S = [W_{ir}; W_{iz}; W_{in}] s + [b_{ir}; b_{iz}; b_{in}].$$</p><p>We can also compute the product of the conditioner network $c_t$ with all these matrices. Since the conditioner network output is available in advance and only changes once per frame, we can batch these computations and do them once per frame outside the critical loop:</p><p>$$C = [W_{ir}; W_{iz}; W_{in}] c.$$</p><p>Then the GRU equations end up with one fewer matrix multiply:</p><p>$$\begin{align*}r_t &amp;= \sigma(C_{rt} + S_{rt} + W_{hr} h_{(t-1)} + b_{hr}) \\            z_t &amp;= \sigma(C_{zt} + S_{zt} + W_{hz} h_{(t-1)} + b_{hz}) \\            n_t &amp;= \tanh(C_{nt} + S_{nt} + r_t * (W_{hn} h_{(t-1)}+ b_{hn})) \\            h_t &amp;= (1 - z_t) * n_t + z_t * h_{(t-1)}\end{align*}$$</p><p>Implementing these optimizations leads to a 15-25% speedup, since you remove a significant fraction of the GRU layer compute from the inner loop.</p><p><strong>Further Optimization: </strong>These optimizations are just the start and are insufficient for high quality synthesis at faster-than-realtime speeds. 
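</p><p>The input matrix multiply batching described above can be verified with a small NumPy sketch (toy sizes and random weights; the variable names are ours):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
H, E = 16, 8                               # toy GRU size and embedding size

# Toy stand-ins for the trained weights.
W_i = rng.normal(size=(3 * H, E))          # stacked [W_ir; W_iz; W_in]
b_i = rng.normal(size=3 * H)               # stacked [b_ir; b_iz; b_in]
embeddings = rng.normal(size=(256, E))     # sample embedding table

# Offline, once: fold W_i and b_i into the sample embedding table.
S = embeddings @ W_i.T + b_i               # shape (256, 3H)

# Once per frame, outside the critical loop: premultiply the conditioner output.
c = rng.normal(size=E)                     # conditioner output for one frame
C = W_i @ c                                # shape (3H,)

# Inside the per-sample loop, the GRU input is now just a lookup and an add...
sample = 193
x_fast = C + S[sample]

# ...which matches the unbatched input-side matrix multiply exactly.
x_slow = W_i @ (c + embeddings[sample]) + b_i
assert np.allclose(x_fast, x_slow)
```

<p>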
Further optimizations include block sparsity, int8 quantization, approximate nonlinearities, vector intrinsics (AVX-512, NEON, CUDA WMMA), multithreading, and so on.</p><h2 id="summary">Summary</h2><p>The WaveRNN inference process consists of two distinct pieces: the conditioner network and the autoregressive inference. The conditioner network is simple and fast and can be implemented using standard PyTorch and Python tools. The autoregressive network runs once per audio sample and requires very heavy optimizations, starting with an implementation in C++. One straightforward optimization is batching the GRU input matrix multiply to reduce the amount of compute required for the GRU layer at each timestep. Many more inference optimizations are required to get a high quality faster-than-realtime neural vocoder.</p><p>Check out the implementation at <a href="https://github.com/gibiansky/wavernn">gibiansky/wavernn</a> or proceed to the subsequent blog posts:</p><ul><li><a href="https://andrew.gibiansky.com/wavernn-demystified-advanced-optimizations/">Optimizing Inference Arithmetic</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-sparsity/">Efficient Block-Sparse Weight Matrices</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-optimizing-nonlinearities/">Optimizing Nonlinearities</a></li></ul>]]></content:encoded></item><item><title><![CDATA[Efficient WaveRNN: Reducing Quantization Noise]]></title><description><![CDATA[Discretizing audio samples to 8 bits each during WaveRNN inference introduces significant quantization noise, which we can reduce using µ-law companding and pre-emphasis.]]></description><link>https://andrew.gibiansky.com/wavernn-demystified-quantization-noise-and-pre-emphasis/</link><guid isPermaLink="false">611af1fa78c79a0007f273c1</guid><dc:creator><![CDATA[Andrew Gibiansky]]></dc:creator><pubDate>Sun, 09 Jun 2024 20:41:04 GMT</pubDate><content:encoded><![CDATA[<p>In this series of posts, we're going 
to go through the WaveRNN neural vocoder for audio waveform synthesis, along with a variety of implementation details and commonly used extensions. For a real implementation, check out the <a href="https://github.com/gibiansky/wavernn">gibiansky/wavernn</a> repository.</p><p><strong>Posts in the Series:</strong></p><ul><li><a href="https://andrew.gibiansky.com/wavernn-demystified-part-1/">Intro</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-quantization-noise-and-pre-emphasis/">Reducing Quantization Noise</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-inference/">Autoregressive Inference</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-advanced-optimizations/">Optimizing Inference Arithmetic</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-sparsity/">Efficient Block-Sparse Weight Matrices</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-optimizing-nonlinearities/">Optimizing Nonlinearities</a></li></ul><h2 id="quantization-noise">Quantization Noise</h2><p>As we discussed in our introductory post, our WaveRNN represents audio using an 8-bit quantization, with each audio sample being stored as an integer between 0 and 255. Since audio is a continuous-valued signal (usually stored with 16 or 24 bits per sample), quantizing it into 256 different values causes audible degradation. The quantization noise is effectively uniform noise added to the signal, distributed evenly across the different frequencies as you can see below. 
</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://andrew.gibiansky.com/content/images/2021/08/quantization.gif" class="kg-image"><figcaption>Log Mel spectrogram of an audio clip as well as the spectrograms of the same clip quantized to different bit depths, from 5-bit quantization (32 distinct values) to 10-bit quantization (1024 distinct values).</figcaption></figure><p>In addition to being easily visible in a spectrogram, you can hear the quantization noise in the audio samples below. (All audio samples in this blog post are based on the first sample in the <a href="https://keithito.com/LJ-Speech-Dataset/">LJ Speech dataset</a> by Keith Ito.) With 5-bit quantization, you'll be able to easily hear the noise with laptop speakers, but as you get closer to 10-bit quantization, you'll need high volume or headphones to hear it clearly.</p><!--kg-card-begin: html--><strong>Original</strong><br>
<audio controls>
  <source src="https://andrew.gibiansky.com/images/audio/preemphasis-samples/original.wav" type="audio/wav">
</audio>
<br><br>


<strong>5-bit Quantization</strong><br>
<audio controls>
  <source src="https://andrew.gibiansky.com/images/audio/preemphasis-samples/quantized-5-bit.wav" type="audio/wav">
</audio>
<br><br>


<strong>6-bit Quantization</strong><br>
<audio controls>
  <source src="https://andrew.gibiansky.com/images/audio/preemphasis-samples/quantized-6-bit.wav" type="audio/wav">
</audio>
<br><br>

<strong>7-bit Quantization</strong><br>
<audio controls>
  <source src="https://andrew.gibiansky.com/images/audio/preemphasis-samples/quantized-7-bit.wav" type="audio/wav">
</audio>
<br><br>


<strong>8-bit Quantization</strong><br>
<audio controls>
  <source src="https://andrew.gibiansky.com/images/audio/preemphasis-samples/quantized-8-bit.wav" type="audio/wav">
</audio>
<br><br>

<strong>9-bit Quantization</strong><br>
<audio controls>
  <source src="https://andrew.gibiansky.com/images/audio/preemphasis-samples/quantized-9-bit.wav" type="audio/wav">
</audio>
<br><br>


<strong>10-bit Quantization</strong><br>
<audio controls>
  <source src="https://andrew.gibiansky.com/images/audio/preemphasis-samples/quantized-10-bit.wav" type="audio/wav">
</audio>
<br><br>
<!--kg-card-end: html--><h2 id="-law-companding">µ-law Companding</h2><p>The simple quantization we demonstrated in the spectrograms and audio clips above was done by splitting the audio range (-1, 1) uniformly into segments, with each segment covering an equal part of the audio range. </p><p>Part of the reason that the quantization noise is so audible, though, is that human hearing is sensitive to audio on a wide range of volume scales. (This is why the unit used to measure audio volume, decibels, is a logarithmic scale!) Thus, one approach to reducing quantization noise is to instead apply a non-uniform quantization, where the audio range near zero is quantized with much more detail than the range far from zero.</p><p>This non-uniform quantization is done by applying a transformation called µ-law companding prior to quantization. For a signal $x$, the companded signal $\hat x$ is</p><p>$$\hat x = \text{sgn}(x) \frac{\ln(1+ \mu |x|)}{\ln(1+\mu)}~~~~-1 \leq x \leq 1$$</p><p>When listening to the audio, the inverse operation (µ-law expanding) is done after dequantizing. The resulting audio, as you can see below for a variety of different quantization levels, sounds better – although you can still hear the effects of quantization noise as a subtle background buzz, even with 8-bit quantization.</p><!--kg-card-begin: html--><strong>5-bit Quantization, No Companding</strong><br>
<audio controls>
  <source src="https://andrew.gibiansky.com/images/audio/preemphasis-samples/quantized-5-bit.wav" type="audio/wav">
</audio>
<br><br>


<strong>5-bit Quantization, Companded with µ=255</strong><br>
<audio controls>
  <source src="https://andrew.gibiansky.com/images/audio/preemphasis-samples/quantized-5-bit-mu-law-255.wav" type="audio/wav">
</audio>
<br><br>


<strong>6-bit Quantization, Companded with µ=255</strong><br>
<audio controls>
  <source src="https://andrew.gibiansky.com/images/audio/preemphasis-samples/quantized-6-bit-mu-law-255.wav" type="audio/wav">
</audio>
<br><br>

<strong>7-bit Quantization, Companded with µ=255</strong><br>
<audio controls>
  <source src="https://andrew.gibiansky.com/images/audio/preemphasis-samples/quantized-7-bit-mu-law-255.wav" type="audio/wav">
</audio>
<br><br>


<strong>8-bit Quantization, Companded with µ=255</strong><br>
<audio controls>
  <source src="https://andrew.gibiansky.com/images/audio/preemphasis-samples/quantized-8-bit-mu-law-255.wav" type="audio/wav">
</audio>
<br><br><!--kg-card-end: html--><h2 id="pre-emphasis">Pre-Emphasis</h2><p>To reduce the quantization noise even further, we'll apply another operation, pre-emphasis, prior to µ-law companding. (Correspondingly, the inverse operation, de-emphasis, is applied after µ-law expanding in order to listen to the audio.)</p><p>Quantization noise is effectively uniform noise, and thus has an equal power across the entire frequency spectrum. However, human perception of pitch is not linear, but rather logarithmic. For example, a pitch difference of 100 Hz is much more perceptible if you are comparing 100 Hz to 200 Hz than if you are comparing 10 kHz to 10.1 kHz. The <a href="https://en.wikipedia.org/wiki/Mel_scale">Mel scale</a> is a logarithmic scale meant to emulate human hearing; pitches equally distant on the Mel scale are perceived by people to be equally distant. Because human perception of pitch is logarithmic, and quantization noise is equal across frequencies, human perception of quantization noise is <em>primarily</em> driven by high frequency quantization noise.</p><p>Thus, to reduce perceptible quantization noise, we can <em>boost</em> high frequencies using a pre-emphasis filter prior to quantization and <em>attenuate</em> those same frequencies using a de-emphasis filter after dequantization. This will leave the speech content (primarily in the lower frequencies) unaffected and will reduce the power of the added quantization noise in the higher frequencies (since the quantization noise is added right before high frequencies are attenuated).</p><p>To apply pre-emphasis, replace a signal $x[t]$ with $y[t]$ as defined by</p><p>$$y[t] = x[t] - \alpha x[t - 1].$$</p><p>$\alpha$ is a value between 0 and 1, frequently around 0.9 or 0.95 for audio applications.</p><p>For the sake of intuition, you can consider what this filter would do to a fixed or low frequency signal and what it would do to a quickly varying signal. 
If the signal is constant or slowly varying ($x[t] \approx x[t-1]$), then this reduces to approximately $(1 - \alpha)x[t]$, effectively attenuating it by $1 - \alpha$. If the signal is quickly varying, then $y[t]$ will have a high magnitude (little attenuation). (If you are of a more analytical bent, you can compute the frequency response of this filter by taking the Fourier transform of $y[t]$ and evaluating the resulting magnitude as a function of frequency – you will find that attenuation is minimal at the Nyquist frequency of half your sampling rate.)</p><p>You can observe the effect of pre-emphasis visually in the log mel spectrograms below. As you increase $\alpha$ and the strength of the pre-emphasis, the lower frequencies are attenuated and the higher ones become dominant.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://andrew.gibiansky.com/content/images/2021/08/preemphasis-2.gif" class="kg-image"><figcaption>Log mel spectrogram of an audio clip as well as the same audio clip with pre-emphasis with varying levels (alpha of 0.1, 0.5, 0.9, and 0.99).</figcaption></figure><p>You can also hear the effects of pre-emphasis in the audio clips below.</p><!--kg-card-begin: html--><strong>Original</strong><br>
<audio controls>
  <source src="https://andrew.gibiansky.com/images/audio/preemphasis-samples/original.wav" type="audio/wav">
</audio>
<br><br>


<strong>No Quantization, Pre-Emphasis with α=0.5</strong><br>
<audio controls>
  <source src="https://andrew.gibiansky.com/images/audio/preemphasis-samples/preemphasis-0.5.wav" type="audio/wav">
</audio>
<br><br>

<strong>No Quantization, Pre-Emphasis with α=0.9</strong><br>
<audio controls>
  <source src="https://andrew.gibiansky.com/images/audio/preemphasis-samples/preemphasis-0.9.wav" type="audio/wav">
</audio>
<br><br>

<strong>No Quantization, Pre-Emphasis with α=0.97</strong><br>
<audio controls>
  <source src="https://andrew.gibiansky.com/images/audio/preemphasis-samples/preemphasis-0.97.wav" type="audio/wav">
</audio>
<br><br><!--kg-card-end: html--><p>To undo pre-emphasis by applying de-emphasis, compute $x[t]$ from $y[t]$ via the autoregressive equation</p><p>$$x[t] = y[t] + \alpha x[t-1].$$</p><p>The effect of pre-emphasis on quantization noise can be heard below. In these examples, the audio is pre-emphasized, companded (with µ = 255), quantized (8 bits), dequantized, expanded, and then de-emphasized.</p><!--kg-card-begin: html--><strong>Original</strong><br>
<audio controls>
  <source src="https://andrew.gibiansky.com/images/audio/preemphasis-samples/original.wav" type="audio/wav">
</audio>
<br><br>


<strong>8-bit Quantization, Companding, No Pre-Emphasis (α=0.0)</strong><br>
<audio controls>
  <source src="https://andrew.gibiansky.com/images/audio/preemphasis-samples/quantized-companded-preemphasis-0.0.wav" type="audio/wav">
</audio>
<br><br>


<strong>8-bit Quantization, Companding, Pre-Emphasis with α=0.5</strong><br>
<audio controls>
  <source src="https://andrew.gibiansky.com/images/audio/preemphasis-samples/quantized-companded-preemphasis-0.5.wav" type="audio/wav">
</audio>
<br><br>

<strong>8-bit Quantization, Companding, Pre-Emphasis with α=0.9</strong><br>
<audio controls>
  <source src="https://andrew.gibiansky.com/images/audio/preemphasis-samples/quantized-companded-preemphasis-0.9.wav" type="audio/wav">
</audio>
<br><br>

<strong>8-bit Quantization, Companding, Pre-Emphasis with α=0.97</strong><br>
<audio controls>
  <source src="https://andrew.gibiansky.com/images/audio/preemphasis-samples/quantized-companded-preemphasis-0.97.wav" type="audio/wav">
</audio>
<br><br><!--kg-card-end: html--><h3 id="training-with-pre-emphasis">Training with Pre-Emphasis</h3><p>As you can hear above, µ-law companding and pre-emphasis can be used to reduce the effect of quantization noise in audio. In order to model audio with WaveRNN, we have to quantize it to 8 bits (though 9 or 10 is also feasible), so that our final layer softmax outputs can have a reasonable dimension (256 for 8 bits, 512 or 1024 for 9 or 10 bits). In order to synthesize clean audio with WaveRNN, we can train our WaveRNN to produce audio which has been pre-emphasized and µ-law companded. To generate the final real-valued audio clip from the WaveRNN-generated integers, we dequantize, µ-law expand, and then de-emphasize the audio.</p><p>To recap, when training WaveRNN we:</p><ol><li>Load and resample the audio clip to the needed sample rate.</li><li>Apply pre-emphasis with $\alpha \approx 0.9$ or 0.97.</li><li>Apply µ-law companding with $\mu = 255$.</li><li>Quantize to 8 bits symmetrically around zero.</li></ol><p>When synthesizing with WaveRNN, we:</p><ol><li>Autoregressively synthesize an 8-bit integer signal with WaveRNN.</li><li>Apply de-quantization, ensuring that 128 maps exactly to zero.</li><li>Apply µ-law expanding.</li><li>Apply the autoregressive de-emphasis filter as shown above.</li></ol><p>This allows us to model high-fidelity audio with only an 8-bit output.</p><h2 id="summary">Summary</h2><p>Our discrete-valued WaveRNN requires our audio to be quantized to an 8-bit representation. Naive quantization introduces a significant amount of noise, known as quantization noise. µ-law companding and pre-emphasis are two transformations we can apply to our signal prior to quantization in order to reduce the impact of quantization noise. If you apply pre-emphasis and companding to the input audio, you need to apply expanding and de-emphasis to the synthesized audio prior to listening to it. 
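</p><p>The two recipes above can be sketched end-to-end in NumPy (a simplified illustration; the helper names and test signal are ours):</p>

```python
import numpy as np

def preemphasis(x, alpha=0.97):
    y = np.copy(x)
    y[1:] -= alpha * x[:-1]                # y[t] = x[t] - alpha * x[t-1]
    return y

def deemphasis(y, alpha=0.97):
    x = np.zeros_like(y)
    prev = 0.0
    for t in range(len(y)):                # x[t] = y[t] + alpha * x[t-1]
        prev = y[t] + alpha * prev
        x[t] = prev
    return x

def mu_compand(x, mu=255):
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_expand(y, mu=255):
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

def quantize(y):                           # [-1, 1] -> {0..255}; 0 -> 128
    return np.floor(127.5 * (y + 1) + 0.5).astype(np.int64).clip(0, 255)

def dequantize(d):                         # inverse of quantize; 128 -> 0
    return (d - 128) / 127.5

# Training direction: pre-emphasize, compand, quantize.
x = 0.5 * np.sin(np.linspace(0, 20 * np.pi, 1000))
d = quantize(mu_compand(preemphasis(x)))

# Synthesis direction: dequantize, expand, de-emphasize.
x_hat = deemphasis(mu_expand(dequantize(d)))
# x_hat closely tracks x; the residual is the remaining quantization noise.
```

<p>The reconstruction <code>x_hat</code> differs from <code>x</code> only by the residual quantization noise.</p><p>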
Together, these two transformations allow us to model high-fidelity audio with only 8 bits.</p><p>Check out the implementation at <a href="https://github.com/gibiansky/wavernn">gibiansky/wavernn</a> or proceed to the subsequent blog posts:</p><ul><li><a href="https://andrew.gibiansky.com/wavernn-demystified-inference/">Autoregressive Inference</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-advanced-optimizations/">Optimizing Inference Arithmetic</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-sparsity/">Efficient Block-Sparse Weight Matrices</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-optimizing-nonlinearities/">Optimizing Nonlinearities</a></li></ul>]]></content:encoded></item><item><title><![CDATA[Efficient WaveRNN: Intro]]></title><description><![CDATA[WaveRNN is an autoregressive neural vocoder, a neural network based process for converting low-dimensional acoustic features into an audio waveform by predicting the next sample in a stream of samples. Specialized compute kernels are necessary to make WaveRNN inference fast.]]></description><link>https://andrew.gibiansky.com/wavernn-demystified-part-1/</link><guid isPermaLink="false">6111a1b078c79a0007f26e34</guid><dc:creator><![CDATA[Andrew Gibiansky]]></dc:creator><pubDate>Sun, 09 Jun 2024 20:38:15 GMT</pubDate><content:encoded><![CDATA[<p>In this series of posts, we're going to go through the WaveRNN neural vocoder for audio waveform synthesis, along with a variety of implementation details and commonly used extensions. 
For a real implementation, check out the <a href="https://github.com/gibiansky/wavernn">gibiansky/wavernn</a> repository.</p><p><strong>Posts in the Series:</strong></p><ul><li><a href="https://andrew.gibiansky.com/wavernn-demystified-part-1/">Intro</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-quantization-noise-and-pre-emphasis/">Reducing Quantization Noise</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-inference/">Autoregressive Inference</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-advanced-optimizations/">Optimizing Inference Arithmetic</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-sparsity/">Efficient Block-Sparse Weight Matrices</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-optimizing-nonlinearities/">Optimizing Nonlinearities</a></li></ul><h2 id="autoregressive-neural-vocoders">Autoregressive Neural Vocoders</h2><p>WaveRNN is an autoregressive neural vocoder. Before diving into WaveRNN itself, let's break that statement down a bit.</p><p>A <a href="https://en.wikipedia.org/wiki/Speech_synthesis">text-to-speech (TTS) system</a> which converts text into spoken audio is composed of many components. For instance, the frontend of a TTS engine converts input text to <a href="https://en.wikipedia.org/wiki/Phoneme">phonemes</a>. Prosody and acoustic models assign durations to those phonemes and convert them into <a href="https://en.wikipedia.org/wiki/Spectrogram">spectrograms</a> (or equivalent acoustic features). Finally, a unit selection algorithm or a vocoder converts those acoustic features into an audio waveform. A neural vocoder is the <em>last</em> step in a speech synthesis pipeline.</p><p>To summarize, a neural vocoder is a neural network based algorithm for converting audio from an acoustic feature representation (such as log mel spectrograms) to a waveform. 
</p><p>(To synthesize audio, the spectrogram representation must already exist and have been created by another step in the TTS pipeline. <a href="https://andrew.gibiansky.com/wavenet-and-tacotron-arent-tts-systems">WaveRNN alone cannot generate speech from text</a>.)</p><p>There are many possible ways to build a neural vocoder – autoregressive models, GANs, invertible normalizing flows, diffusion models. Audio synthesis is one of the most well-studied areas of <a href="https://openai.com/blog/generative-models/">neural generative modeling</a>, lagging only behind image synthesis.</p><p>An audio waveform consists of thousands of samples. Each sample is a single number corresponding to an instantaneous reading from a microphone. An autoregressive model (such as WaveRNN) generates a stream of audio by predicting the next sample given all previous samples and the spectrograms of nearby audio frames.</p><p>To generate audio, we start with an empty audio stream and predict  the first sample. Using the first sample, we predict the second sample. Using the first and second samples, we predict the third sample. This continues until the entire waveform is generated. The spectrograms are an extra input to each of these predictions and guide all the predictions made by WaveRNN, which is trained to generate audio which corresponds to the spectrograms it is given.</p><h2 id="baseline-wavernn">Baseline WaveRNN</h2><p>In this section (and elsewhere), we assume that you are familiar with the basics of neural networks and deep learning. 
If not, grabbing a book (like <a href="http://neuralnetworksanddeeplearning.com/">this one</a> or <a href="https://www.deeplearningbook.org/">this one</a>) may help.</p><p>WaveRNN consists of a few conceptual pieces:</p><ul><li>The <strong>sample embeddings</strong> take the raw audio samples and process them to be used as inputs to the autoregressive network.</li><li>A <strong>conditioning network</strong> processes the log mel spectrogram features. This network is small and fast and consists of a few layers of convolutions. The outputs of this network are used as inputs to the autoregressive network.</li><li>The <strong>autoregressive network</strong> takes the conditioning information and sample embeddings up to a time <em>t</em> and predicts an audio sample for time <em>t + 1</em>. </li></ul><p>The conditioning network can be run once on the entire input spectrogram, and the autoregressive network and sample embedding layer are alternated: a sample is generated, then prepared for the next timestep, then the next timestep is generated, and so on. At training time, though, the entire (real, non-synthesized) audio clip is available, and can be fed to the autoregressive network, as shown below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://andrew.gibiansky.com/content/images/2021/08/Baseline-WaveRNN--training--1.png" class="kg-image"><figcaption>Diagram of the WaveRNN neural vocoder architecture.</figcaption></figure><h3 id="sample-embeddings-discretized-law-audio">Sample Embeddings: Discretized µ-law Audio</h3><p>Audio waveforms are usually represented as a sequence of floating point numbers between -1 and 1. For WaveRNN, however, we apply two transformations to the audio waveform prior to using it:</p><ul><li><strong><a href="https://en.wikipedia.org/wiki/%CE%9C-law_algorithm">µ-law Companding</a>: </strong>We remap the -1 to 1 scale from a linear scale to a log scale to make small differences near zero much more perceptible. 
</li><li><strong>Discretization: </strong>We divide the range -1 to 1 into 256 different equally spaced regions and represent each sample by the index of the region it falls into, with -1 mapping to 0 and 1 mapping to 255. </li></ul><p>(You could equivalently say that we discretize the range -1 to 1 into 256 different regions with regions far away from zero having an exponentially larger width than regions close to zero.)</p><p>For a given value $x$, µ-law companding maps it to $F(x)$ via</p><p>$$F(x) = \text{sgn}(x) \frac{\ln(1+ \mu |x|)}{\ln(1+\mu)}~~~~-1 \leq x \leq 1$$</p><p>As shown in this animation, companding stretches the example audio waveform scale, so that small variations near zero become big variations. The human ear is sensitive to the <em>log</em> of amplitude / intensity, so without this, the generated audio would be perceived as noisy.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://andrew.gibiansky.com/content/images/2021/08/mu_law_companding.gif" class="kg-image"><figcaption>Animation alternating between a waveform and the companded version of the waveform.</figcaption></figure><p>The companded waveform is then converted to an integer (from 0 to 255) by subdividing the range -1 to 1 into equally spaced chunks. Mathematically, the quantization is done via $D(y)$ where</p><p>$$D(y) = \lfloor \frac{255}{2}(y + 1)+ \frac{1}{2}\rfloor.$$</p><p>We add $\frac{1}{2}$ so that zero (a common value in audio!) can be exactly mapped to the integer 128. You can verify that values near -1 will map to 0 and values near (but less than) 1 will map to 255. </p><p>Discretizing the audio waveform is crucial, because the way we predict the next audio sample is by treating the prediction as a multi-class classification problem. The network is trained to predict which class (from 0 to 255) the next audio sample falls in using a softmax cross entropy loss. 
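As a concrete sketch of the two transformations above (the function names here are illustrative, not taken from the WaveRNN repository):

```python
import numpy as np

def mu_law_compand(x, mu=255):
    """F(x) = sgn(x) * ln(1 + mu*|x|) / ln(1 + mu), for x in [-1, 1]."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def discretize(y):
    """D(y) = floor((255/2)*(y + 1) + 1/2), mapping [-1, 1] to {0, ..., 255}."""
    return np.floor(127.5 * (y + 1) + 0.5).astype(np.int64)

# Silence (x = 0) maps exactly to the center class, 128.
```

Note that `discretize(mu_law_compand(0.0))` is exactly 128, and values near -1 and 1 map to 0 and 255, as described above.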
To generate a sample, we sample from the multinomial distribution defined by the softmax probabilities.</p><p>To feed the audio samples into the autoregressive network, we convert them into sample <em>embeddings</em> by learning an <a href="https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html">embedding matrix</a> with one row for each of the possible values. We can then obtain the input to the autoregressive network corresponding to an audio sample by looking at the corresponding row of the embedding matrix (that is, if the sample value is 193, we take row 193 of the matrix and use it as the input).</p><h3 id="conditioning-network">Conditioning Network</h3><p>The input to a neural vocoder is a spectrogram or similar low-dimensional acoustic representation. A common choice is a log mel magnitude spectrogram. Let's briefly break that down!</p><p>Recall that a waveform <a href="https://en.wikipedia.org/wiki/Fourier_transform">can be represented</a> as the sum of a large number of oscillating sine waves of varying frequencies and magnitudes. A magnitude spectrogram (computed via a <a href="https://en.wikipedia.org/wiki/Short-time_Fourier_transform">short-time Fourier transform</a>) indicates the power of the signal in each frequency band at a given time. In other words, a value in a spectrogram tells you the amplitude of the oscillating waves of a given frequency at a given time. A spectrogram looks like this:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://andrew.gibiansky.com/content/images/2021/08/spectrogram.png" class="kg-image"><figcaption>Magnitude spectrogram of an audio clip.</figcaption></figure><p>As mentioned earlier, human hearing is sensitive to the <em>log</em> of the power, which means that small differences near zero are as audible as large differences far away from zero. 
In this plot, it's hard to see those small differences, so instead we use a log spectrogram:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://andrew.gibiansky.com/content/images/2021/08/log_spectrogram.png" class="kg-image"><figcaption>Log magnitude spectrogram of an audio clip.</figcaption></figure><p>In addition to being sensitive to the log of power, human hearing is also sensitive to the log of <em>frequency </em>(pitch). High volumes in a narrow low frequency band near zero are as audible as high volumes spread throughout a wide high frequency band. To visualize this, we usually plot log <em>mel</em> magnitude spectrograms, as shown below. <a href="https://en.wikipedia.org/wiki/Mel_scale">Mel spectrograms</a> compress the frequency bands in a way that maps to human hearing.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://andrew.gibiansky.com/content/images/2021/08/log_mel_spectrogram.png" class="kg-image"><figcaption>Log magnitude mel spectrogram of an audio clip.</figcaption></figure><p>The conditioning subnetwork of WaveRNN takes log mel spectrograms (or other low-dimensional acoustic features) as input, normalizes them (to be roughly -1 to 1 in magnitude), and then processes them with several layers of alternating convolutions and nonlinearities. It is crucial that the convolutions are non-causal and have both forwards and backwards context, as predicting the next sample at any given point depends not only on the audio sample history but also on the future spectrogram. Including the future context is what makes the spectrogram conditioning information helpful for next sample prediction and thus what causes WaveRNN to closely follow the spectrograms in its generated audio.</p><p>A single frame of the spectrogram corresponds to many samples. 
Depending on the hop length of the short-time Fourier transform used to compute the spectrogram, each frame of the spectrogram will correspond to dozens or hundreds of samples. Thus, the output of the final convolution and nonlinearity  needs to be upsampled by a corresponding factor. For example, if a spectrogram frame is computed from a hop length of 256 samples, then each output timestep must be upsampled to 256 timesteps prior to being provided as input to WaveRNN.</p><p>The output of the conditioning network (post-upsampling) is a sequence of hidden layer output vectors, each vector corresponding to exactly one sample in the audio being synthesized.</p><h3 id="autoregressive-network">Autoregressive Network</h3><p>The autoregressive network takes as input a sequence of sample embeddings and a sequence of conditioning vectors and uses them to predict the next sample in the sequence. This is done by:</p><ul><li>Sum the sample embeddings and conditioning vectors to form a single input vector per timestep.</li><li>Run the input vectors through a <a href="https://andrew.gibiansky.com/p/32d39ecc-8733-4101-a72d-2448a00090da/html">Gated Recurrent Unit (GRU) RNN</a>.</li><li>Take the last output of the GRU and run it through a <a href="https://pytorch.org/docs/stable/generated/torch.nn.Linear.html">linear layer</a> followed by a <a href="https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html">ReLU activation</a>.</li><li>Run the output of that through a linear layer with 256 outputs (for a 256-wide discretization) followed by a <a href="https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html">softmax activation</a>.</li></ul><p>The output of the softmax is a probability distribution. 
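In a pure-numpy sketch, one step of this four-layer stack might look like the following (all weight names and dimensions are illustrative and biases are omitted for brevity — this is not the actual implementation):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())  # subtract max for numerical stability
    return e / e.sum()

def gru_step(x, h, p):
    """One GRU timestep; p is a dict of (hypothetical) weight matrices."""
    z = 1 / (1 + np.exp(-(p["Wz"] @ x + p["Uz"] @ h)))  # update gate
    r = 1 / (1 + np.exp(-(p["Wr"] @ x + p["Ur"] @ h)))  # reset gate
    n = np.tanh(p["Wn"] @ x + r * (p["Un"] @ h))        # candidate state
    return (1 - z) * n + z * h

def next_sample_distribution(sample_embedding, conditioning, h, p):
    x = sample_embedding + conditioning      # 1. sum embedding + conditioning
    h = gru_step(x, h, p)                    # 2. GRU
    hidden = np.maximum(0.0, p["W1"] @ h)    # 3. linear + ReLU
    logits = p["W2"] @ hidden                # 4. linear to 256 classes
    return softmax(logits), h                #    softmax -> distribution
```

At inference time you would then draw the next sample with something like `np.random.choice(256, p=probs)` and feed its embedding back in at the next timestep.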
You can <a href="https://pytorch.org/docs/stable/generated/torch.multinomial.html">sample</a> from that distribution to choose the next sample in the audio waveform.</p><p>To generate the full waveform, start by feeding the autoregressive network a sample corresponding to zero (assuming that most audio clips start with a bit of silence) and predicting the first sample. Then, feed the generated sample back to the autoregressive network to generate the second sample. Continue this process until the entire waveform is generated. </p><p>Since our audio is discretized, after the waveform is generated, the audio needs to be un-discretized and expanded (the opposite of companding is expanding) prior to being played back to the user.</p><p>Audio waveforms consist of many thousands of samples; a single second will usually contain between 16,000 samples (for 16 kHz audio) and 48,000 samples (for 48 kHz audio). This means that the GRU and linear layers must be run tens of thousands of times. This process is very computationally intense, and any non-computational overhead will result in it taking many seconds or even minutes to synthesize a short audio clip. In order to make WaveRNN usable for real-world applications, highly specialized and optimized implementations (compute kernels) are needed to make the synthesis process fast.</p><h2 id="summary">Summary</h2><p>WaveRNN is an autoregressive neural vocoder. A neural vocoder is a neural network based process for converting low-dimensional acoustic features (such as log mel spectrograms) into an audio waveform. Autoregressive vocoders work by predicting the next sample in a stream of samples when given the acoustic features and all the previous samples. WaveRNN uses a discretized µ-law representation of audio to represent the audio as 8-bit integers and then predicts those values with a GRU, a fully connected layer, and a softmax layer, trained as a multi-class classification problem with a cross-entropy loss. 
Specialized compute kernels are necessary to make WaveRNN inference fast because an audio clip consists of tens or hundreds of thousands of samples, each of which requires running a neural network to generate.</p><p>Check out the implementation at <a href="https://github.com/gibiansky/wavernn">gibiansky/wavernn</a> or proceed to the subsequent blog posts:</p><ul><li><a href="https://andrew.gibiansky.com/wavernn-demystified-quantization-noise-and-pre-emphasis/">Reducing Quantization Noise</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-inference/">Autoregressive Inference</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-advanced-optimizations/">Optimizing Inference Arithmetic</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-sparsity/">Efficient Block-Sparse Weight Matrices</a></li><li><a href="https://andrew.gibiansky.com/wavernn-demystified-optimizing-nonlinearities/">Optimizing Nonlinearities</a></li></ul>]]></content:encoded></item><item><title><![CDATA[Streaming Audio Synthesis]]></title><description><![CDATA[The naive approach to streaming audio synthesis using deep neural networks is to break up the input into chunks and then run synthesis on each chunk. Unfortunately, this introduces wasted computation and discontinuities. 
In this blog post, I present a simple and robust alternative.]]></description><link>https://andrew.gibiansky.com/streaming-audio-synthesis/</link><guid isPermaLink="false">605d4fbd08b9c30006591acc</guid><dc:creator><![CDATA[Andrew Gibiansky]]></dc:creator><pubDate>Fri, 26 Mar 2021 18:23:20 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1453503795393-c496eee08c98?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDZ8fHN0cmVhbXxlbnwwfHx8fDE2MTY3MzM2MTY&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1453503795393-c496eee08c98?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=MnwxMTc3M3wwfDF8c2VhcmNofDZ8fHN0cmVhbXxlbnwwfHx8fDE2MTY3MzM2MTY&ixlib=rb-1.2.1&q=80&w=2000" alt="Streaming Audio Synthesis"><p>Nowadays, there are a boatload of different deep neural networks for audio synthesis. There's <a href="https://arxiv.org/abs/1703.10135">Tacotron</a>, <a href="https://ai.googleblog.com/2017/12/tacotron-2-generating-human-like-speech.html">Tacotron2</a>, <a href="https://arxiv.org/abs/1802.08435">WaveRNN</a>, <a href="https://arxiv.org/abs/1811.00002">WaveGlow</a>, <a href="https://arxiv.org/abs/1912.01219">WaveFlow</a>, <a href="https://arxiv.org/abs/1810.11846">LPCNet</a>, <a href="https://arxiv.org/abs/1910.06711">MelGAN</a>, <a href="https://arxiv.org/abs/2005.05106">MB-MelGAN</a>, and fifty more other networks handling every part of the text-to-speech audio synthesis pipeline.</p><p>When synthesizing audio in a text-to-speech scenario, you might want to synthesize minutes of audio at a time, for example, if you are reading a Wikipedia page or a news article. At the same time, you want the user to experience low latency (&lt;200ms), so that they can start listening to it immediately. 
None of the aforementioned networks can synthesize several minutes of audio within 200 milliseconds – which means that to satisfy both of these constraints,  you have to start streaming your output before you're done synthesizing.</p><p>In this blog post, I'd like to share an easy way to implement audio streaming for any network.</p><h2 id="naive-approach-run-inference-on-chunks">Naive Approach: Run Inference on Chunks</h2><p>The naive approach to audio streaming is to break up the input into chunks, and then run synthesis on each chunk separately, glueing the results together to form the final audio stream. For RNNs (Tacotron, WaveRNN, etc), pass the hidden state between the chunks.</p><p>For RNNs, this approach works fine – it's identical to just running inference on the entire utterance at once. With CNNs, however, you run into trouble.</p><p>For a non-causal convolution with width <em>k</em>, you need to have <em>k-1</em> timesteps of padding (half that on either side) in order to output something of the same size as the input. This means that if you've broken your input into chunks, you now need to add padding on either side of your input.</p><p>If you add zero-padding, then the outputs near the edges of your chunk may create a discontinuity, resulting in a periodic artifact or slight quality degradation. If you add padding from the original input to the CNN, you have to pad by the total receptive field of the network, which ends up duplicating computation and slowing down your overall synthesis speed. 
(For example, for a 5-layer CNN with width 7 kernels, your receptive field is 1+5*(7-1)=31, so you need 30 timesteps of padding.)</p><p>One approach is to synthesize overlapping chunks and then average them to reduce the effect of the discontinuities; however, this just patches over the problem, rather than solving it.</p><p>When trying to stream through a model with transposed convolutions (such as MelGAN), you have the same issue but for transposed convolutions. If you mix convolutions and transposed convolutions, even <em>computing</em> the receptive field becomes a bit confusing!</p><p>In summary – although breaking your input into chunks and running inference on each chunk separately <em>does</em> work, it can introduce wasted computation and discontinuities near the boundaries.</p><h2 id="alternative-approach-perfect-streaming">Alternative Approach: "Perfect Streaming"</h2><p>The approach of chunking up your input, synthesizing the chunks, and then glueing them together leads to wasted computation and discontinuities at chunk boundaries. What can we do instead?</p><p>Instead, with a little bit of work, we can implement "perfect streaming". Perfect streaming results in an output that is <em>identical</em> (barring floating point error) to the same utterance synthesized without streaming – no wasted computation, no discontinuities at chunk boundaries.</p><p>To do this, we take the following approach:</p><ul><li>For every layer in your network, define an initial state, an <em>update</em> function, and a <em>finish</em> function. 
The <em>update</em> function incorporates new input and returns any available output; and the <em>finish</em> function completes the synthesis for that layer, returning any leftover outputs.</li><li>Compose those layers together: to initialize the full network, collect the initial states of each of the layers; to run an input through the network, run the <em>update</em> functions of the network layers sequentially; to finish synthesis, call <em>finish</em> and <em>update</em> on each of the layers of the network.</li></ul><p>The result will be a network that can be initialized, run streaming synthesis on chunks of input with <em>update</em>, and can be finalized to get the last bit of output with <em>finish</em>.</p><p>Next, we'll go through how to implement these for each of the common layers. Reading through these examples will hopefully make it clear how this ends up working.</p><h3 id="lstms-and-grus">LSTMs and GRUs</h3><p>Streaming through RNNs such as LSTMs and GRUs is easy! The initial state is simply the usual zero state of the LSTM or GRU. </p><p>The <em>update</em> function will take the input and run it through the network, outputting exactly as many timesteps as it had in the input while also updating the state with the final state of the RNN after it has run on all the inputs.</p><p>The <em>finish</em> function in this case does nothing – there are no leftover outputs to be emitted.</p><h3 id="conv1d-non-causal-">Conv1D (Non-Causal)</h3><p>Streaming through a Conv1D is slightly more complex than through an RNN, because you need to manage the extra state, composed of past inputs to the model.</p><p>The initial state for a Conv1D with (odd) width <em>k</em> is a tensor of zeros with <em>(k-1)/2</em> timesteps. 
(That is, a tensor of shape <em>batch_size </em>x <em>(k-1)/2 </em>x <em>num_input_channels.</em>)</p><p>The <em>update</em> function will take the input, concatenate it with the state, and then run the Conv1D (in "valid" mode, with no extra padding), returning any resulting outputs. The new state is the <em>last</em> <em>k-1</em> timesteps of the concatenated input tensor.</p><p>Finally, the <em>finish</em> function will take the state and pad it on the right with <em>(k-1)/2</em> timesteps of zeros, run the Conv1D (with no extra padding), and return the resulting outputs.</p><p>To make this concrete, let's work through an example. Let's say we have a Conv1D layer with width 7, 256 input channels, operating with batch size 16. We initialize the state to a tensor of zeros of shape [16, 3, 256]. We have three input chunks of 4 timesteps each (total input size 12). We start by running <em>update</em> with the first chunk, creating a tensor of 3 + 4 = 7 inputs; after running the Conv1D, the layer produces 1 output. The last 6 of these are kept as the state. When we run <em>update</em> with the second chunk, we create a tensor of shape 6 + 4 = 10, which after we run Conv1D, produces 4 outputs. When we run <em>update</em> with the third chunk, we again produce 4 outputs. Finally, we run <em>finish</em>, which takes the state of 6 timesteps, pads them with 3 timesteps of zeros on the right, resulting in input size 9; after we run Conv1D, we produce 3 outputs. In total, we produce 1 + 4 + 4 + 3 outputs, for a total of 12 outputs – exactly as many outputs as we had inputs. </p><h3 id="conv1d-causal-">Conv1D (Causal)</h3><p>A causal Conv1D is very similar to a non-causal Conv1D. However, instead of the initial state having <em>(k-1)/2</em> timesteps, and then <em>finish</em> padding the sequence with <em>(k-1)/2 </em>timesteps on the right, we start with an initial state of <em>k-1</em> timesteps of zeros. 
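To make both recipes concrete, here is a single-channel numpy sketch covering the non-causal and causal variants (the class and its interface are hypothetical; it assumes stride 1 and odd kernel width):

```python
import numpy as np

class StreamingConv1d:
    """Streaming 1-D convolution over a single channel (illustrative sketch)."""

    def __init__(self, kernel, causal=False):
        self.kernel = np.asarray(kernel, dtype=float)
        self.k = len(self.kernel)  # odd kernel width
        # Initial state: (k-1)/2 zeros if non-causal, k-1 zeros if causal.
        pad = self.k - 1 if causal else (self.k - 1) // 2
        self.state = np.zeros(pad)
        self.causal = causal

    def _conv(self, buf):
        # Cross-correlation (what NN "convolutions" compute), "valid" mode.
        if len(buf) < self.k:
            return np.zeros(0)
        return np.convolve(buf, self.kernel[::-1], mode="valid")

    def update(self, chunk):
        buf = np.concatenate([self.state, chunk])
        out = self._conv(buf)
        self.state = buf[-(self.k - 1):]  # keep the last k-1 inputs as state
        return out

    def finish(self):
        if self.causal:
            return np.zeros(0)  # causal convs have no leftover outputs
        # Pad the remaining state with (k-1)/2 zeros on the right.
        return self._conv(np.concatenate([self.state, np.zeros((self.k - 1) // 2)]))
```

Running a width-7 filter over three chunks of 4 samples reproduces the 1 + 4 + 4 + 3 output counts from the worked example above, and the concatenated streaming outputs match running the convolution over the whole padded signal at once.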
Besides that, everything stays the same.</p><h3 id="transposed-conv1d">Transposed Conv1D</h3><p>A transposed convolution with stride upsamples the input; for this example, though, we'll assume a stride of one for simplicity. The implementation for a transposed conv looks very similar to a standard convolution – but instead of the state representing future inputs, the state represents a component of future <em>outputs</em>.</p><p>The key observation to make here is that when we break up a transposed convolution into two chunks, the outputs near the edge have contributions from both chunks. Each input timestep contributes to <em>k</em> different outputs, which means the last input timestep in a chunk affects the first <em>(k-1)/2</em> outputs of the <em>next</em> chunk. We need to make sure to accumulate those outputs and return them only when <em>all</em> of their inputs' contributions have been accounted for.</p><p>The initial state for a transposed Conv1D with (odd) width <em>k</em> is a tensor of zeros with <em>(k-1)/2</em> timesteps. (That is, a tensor of shape <em>batch_size </em>x <em>(k-1)/2 </em>x <em>num_<strong>output</strong>_channels.</em> Output channels, not input channels!)</p><p>The <em>update</em> function for an input with <em>n</em> timesteps will take the input and run it through the transposed convolution, generating <em>n + k - 1</em> output timesteps. Take the current state and add it to the left edge of the output; that is, if you have <em>(k-1)/2</em> timesteps of state, set the left <em>(k-1)/2</em> timesteps to the elementwise sum of the state and the first timesteps of the transposed convolution outputs. If this is the first time <em>update</em> is being called, throw away the first <em>(k-1)/2</em> timesteps. Return all but the last <em>k-1</em> timesteps. 
Keep the last <em>k-1</em> timesteps of the <strong>output </strong>as your state.</p><p>The <em>finish</em> function returns the first <em>(k-1)/2</em> timesteps of the state as output.</p><p>To help understand this one, let's go through an example again; we'll use the same setup as we did with non-causal Conv1D. Let's say we have a transposed Conv1D layer with width 7, 256 output channels, operating with batch size 16. We initialize the state to a tensor of zeros of shape [16, 3, 256]. We have three input chunks of 4 timesteps each (total input size 12). We start by running <em>update</em> with the first chunk. With an input of width 4, the transposed Conv1D layer will create a tensor of 4 + 6 = 10 outputs. The first time we call <em>update</em>, we drop the left 3 timesteps and return the next timestep of outputs. The last 6 timesteps of output are kept as the state. When we run <em>update</em> with the second chunk, we get a tensor of shape 6 + 4 = 10 again, and we add the state (6 timesteps) to the left edge of that tensor. We return the first four timesteps and keep the last 6 timesteps as state again. When we run <em>update</em> with the third chunk, we again produce 4 outputs. Finally, we run <em>finish</em>, which takes the first 3 timesteps of the state and returns them. In total, we produce 1 + 4 + 4 + 3 outputs, for a total of 12 outputs – exactly as many outputs as we had inputs.</p><h3 id="concat-and-sum">Concat and Sum</h3><p>For some models (MelGAN, Resnets), you will end up with the output of two different subnetworks being concatenated or summed. These different subnetworks may have different inputs and thus might, during streaming, output different numbers of timesteps. To address this, you need to define <em>update</em> and <em>finish</em> for concat and sum operators, just like you do for compute layers such as convolutions and RNNs.</p><p>In MelGAN, you'll see this used for the generator resnets. 
The resnets use convolutions, which means that they output fewer timesteps than their input (initially), so naively summing the output of the resnet with its input doesn't work (since they'll have different numbers of timesteps).</p><p>For both concat and sum, the <em>update</em> function should take the minimum number of timesteps in the inputs and return the concatenation or sum of that many timesteps, storing all leftovers in the state. The <em>finish</em> function should – if everything works out – be left with nothing in the state, since by the time <em>finish</em> is called, all the inputs should be available with the same number of timesteps.</p><h3 id="implementation">Implementation</h3><p>You can implement these in practice as methods on the relevant PyTorch or TensorFlow layers. Then, when you use torch.nn.Sequential or keras.Sequential, you can compose the different layers into a single network with the same methods. </p><p>In the end, you end up with a flexible structure where you can freely experiment with RNNs, convolutions (with stride, dilation, etc), transposed convolutions, concat, resnets, and anything else, mixing the layers in any order and size, knowing that at inference your streaming implementation will be exactly identical to a non-streaming implementation. 
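As a minimal sketch of this composition (a hypothetical interface, demonstrated with a toy delay layer rather than real convolutions):

```python
import numpy as np

class Delay:
    """Toy streaming layer: delays its input by d samples; finish flushes leftovers."""

    def __init__(self, d):
        self.buf = np.zeros(d)

    def update(self, chunk):
        full = np.concatenate([self.buf, chunk])
        out, self.buf = full[:len(chunk)], full[len(chunk):]
        return out

    def finish(self):
        out, self.buf = self.buf, np.zeros(0)
        return out

class StreamingSequential:
    """Compose streaming layers; mirrors torch.nn.Sequential at inference."""

    def __init__(self, *layers):
        self.layers = list(layers)

    def update(self, chunk):
        for layer in self.layers:
            chunk = layer.update(chunk)
        return chunk

    def finish(self):
        # Each layer's leftover outputs must still pass through the update
        # functions of all downstream layers, in order.
        outputs = []
        for i, layer in enumerate(self.layers):
            chunk = layer.finish()
            for later in self.layers[i + 1:]:
                chunk = later.update(chunk)
            outputs.append(chunk)
        return np.concatenate(outputs)
```

Feeding a signal through in chunks and then calling `finish` yields exactly the same output as processing the whole signal in one `update` call.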
To test your implementation, you can write a unit test which tests every layer or network by running inference with an utterance and then running the same input but with its input broken into chunks, and then using np.allclose to verify that the outputs are identical.</p><p>If you have any questions about this technique, feel free to reach out and email me or find me <a href="https://twitter.com/agibiansky">on Twitter</a>!</p>]]></content:encoded></item><item><title><![CDATA[Brainstorming: Neural Transducers for Speech Synthesis]]></title><description><![CDATA[<p><a href="https://arxiv.org/pdf/1211.3711.pdf">Neural transducers</a> are commonly used for automatic speech recognition (ASR), often achieving state-of-the-art results for quality and inference speed; for instance, they <a href="https://ai.googleblog.com/2019/03/an-all-neural-on-device-speech.html">power Google's offline ASR engine</a>. In this post, I'd like to propose a neural transducer model for speech synthesis. I'm writing this idea down before trying this model,</p>]]></description><link>https://andrew.gibiansky.com/neural-transducers-for-speech-synthesis/</link><guid isPermaLink="false">5f8c6584535e140006f0f306</guid><dc:creator><![CDATA[Andrew Gibiansky]]></dc:creator><pubDate>Mon, 16 Nov 2020 00:49:25 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1562369083-e501b585fd2c?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1562369083-e501b585fd2c?ixlib=rb-1.2.1&q=80&fm=jpg&crop=entropy&cs=tinysrgb&w=2000&fit=max&ixid=eyJhcHBfaWQiOjExNzczfQ" alt="Brainstorming: Neural Transducers for Speech Synthesis"><p><a href="https://arxiv.org/pdf/1211.3711.pdf">Neural transducers</a> are commonly used for automatic speech recognition (ASR), often achieving state-of-the-art results for quality and inference speed; for instance, they 
<a href="https://ai.googleblog.com/2019/03/an-all-neural-on-device-speech.html">power Google's offline ASR engine</a>. In this post, I'd like to propose a neural transducer model for speech synthesis. I'm writing this idea down before trying this model, so the reality is that this is just a fun avenue for me to brainstorm – and maybe someday I'll try this, or it'll inspire someone else to go down a similar path!</p><h2 id="neural-transducers-for-asr">Neural Transducers for ASR</h2><p>A neural transducer models the alignment between an input and output sequence,  but does so in a way that allows aggregating label probabilities over all possible (monotonic) alignments. Given that ability, we can maximize the probability of a sample label and marginalize over the latent alignment. </p><p>A transducer model has the following inputs:</p><ol><li>An input sequence $x_i$ of length $N$. For ASR, this is usually spectrogram frames or an analogous audio representation.</li><li>A label sequence $y_t$ of length $T$. For ASR, this is a sequence of letters, phonemes, or word pieces.</li><li>A character set of size $V$. The output sequence consists of these characters.</li></ol><p>The model itself has three components:</p><ol><li>An encoder network, $E(x)$. Outputs an encoding vector $h_t$ for each input timestep.</li><li>A decoder network, $D(y_{1:t})$. Outputs an encoding vector $g_u$ for each output prefix. (Similar to an autoregressive language model.)</li><li>A joint prediction network, $P(h_t, g_u)$. Outputs a probability distribution over the model character set plus the blank token $\varnothing$. Let index zero in this distribution represent $\varnothing$.</li></ol><p>The encoder network is usually a bidirectional network of some kind, the decoder network is a unidirectional RNN or causal convolution model, and the prediction network is a multilayer feedforward model. 
The encoder runs over all $N$ input timesteps, the decoder runs over all $T$ timesteps, and the prediction network runs over all $N\cdot T$ pairs of encoding vectors from those two models.</p><p>The output of the prediction network $P(h_t, g_u)_k$ is the probability that, if input timestep $t$ was aligned to output timestep $u$, the next token is the $k$th character in the character set (if $k \geq 1$) or that the alignment should advance to the next encoder vector (if $k = 0$ for $\varnothing$).</p><p>Given this matrix of probability distributions, we can maximize over the probability of a label while marginalizing (summing) over alignments using a <a href="https://en.wikipedia.org/wiki/Forward%E2%80%93backward_algorithm">forward-backward algorithm</a>. Given a specific alignment, the loss reduces to a cross-entropy next-character language-model-like prediction loss; introducing the blank character and joint network allows you to write out every possible alignment and its loss. Since there's an exponentially large number of possible alignments, we use a forward-backward algorithm, which allows summing over the alignments with a dynamic programming algorithm. </p><h2 id="proposed-neural-transducers-for-tts">Proposed Neural Transducers for TTS</h2><p>The RNN-transducer (RNN-T) model and loss are a great fit for end-to-end ASR models. However, it's initially a poor fit for text-to-speech (TTS) models: the model relies on a single discrete output per timestep, but TTS models such as <a href="https://arxiv.org/abs/1703.10135">Tacotron</a> output continuous-valued spectrograms.</p><p>To fix this, I'd like to propose a modified RNN-T model called the "<em>generative transducer</em>". This model has the following components:</p><ol><li>An encoder network, $E(x)$. Outputs an encoding vector $h_t$ for each input timestep.</li><li>A decoder network, $D(y_{1:t})$. Outputs an encoding vector $g_u$ for each output prefix.</li><li>A controller network, $C(y)$. 
Outputs an encoding vector $c_u$ for each output timestep.</li><li>A joint prediction network, $P(h_t, g_u)$. Outputs whatever the output of the model should be; in the case of TTS, this could for example be an 80-dimensional log-mel spectrogram.</li><li>A joint controller network, $\varnothing(h_t, c_u)$. Outputs the probability of advancing to the next encoder timestep instead of making this prediction.</li></ol><p>In this model, we've decoupled the model output network $P(h_t, g_u)$ from the alignment generator network $\varnothing$; this model is strictly a generalization of the standard transducer model.</p><p>In the same way as before, we can use a forward-backward algorithm to sum over alignments. Instead of maximizing the probability of a particular label, we can minimize the expected value of the loss, as if we are sampling from the alignment distribution.</p><p>This setup provides us with an extra interesting modeling choice. If $C(y)$ is a bidirectional network, it can use future output context to improve its transition prediction. However, this means that we cannot use $C(y)$ at inference time. For TTS, this means that we must extract the most likely alignments once this model is trained, convert them to phoneme durations, and then at inference have a separate model to predict phoneme durations. 
If $C(y) = C(y_{1:t})$ is a unidirectional model, however, we don't need an external phoneme duration model; we can alternate running $D(y_{1:t})$ to predict a frame and sampling from $C(y_{1:t})$ to decide when to transition to the next phoneme.</p><h3 id="implementing-the-generative-transducer-loss">Implementing the Generative Transducer Loss</h3><p>Next, I'd like to go through the practical considerations of computing this loss function, similar to how the <a href="https://arxiv.org/pdf/1211.3711.pdf">RNN-T paper</a> derives forward and backward equations for the RNN-T loss.</p><p><strong>Forward Pass</strong></p><p>Similar to RNN-T, we'll compute transition probabilities $\varnothing(t, u)$ and outputs $P(h_t, g_u)$ for every point $(t, u)$. To compute the expected loss, we'll sum the loss for every $(t, u)$ weighted by the probability of encoder timestep $t$ being aligned to decoder timestep $u$.</p><p>So, let $\alpha(t, u)$ be the probability that encoder timestep $t$ is aligned with decoder timestep $u$. We can compute $\alpha(t, u)$ recursively via</p><p>$$\begin{align*}\alpha(t, u) &amp;= \alpha(t - 1, u) \varnothing(t - 1, u) + \alpha(t, u - 1) (1 - \varnothing(t, u  - 1)) \\ \alpha(1, 0) &amp;= 1\end{align*}$$</p><p>The probability of emitting an output when encoder timestep $t$ is aligned with decoder timestep $u$ is then $\alpha(t, u) (1 - \varnothing(t, u))$.</p><p>The prediction network $P(h_t, g_u)$ yields an output vector for every point $(t, u)$. 
Let the loss for that output vector be $L(t, u).$ To create a Tacotron-like model, we would use an autoregressive L2 loss such as</p><p>$$L(t, u) = \left(P(h_t, g_u) - y_{u+1}\right)^2.$$</p><p>We could similarly use a discrete cross-entropy loss if our outputs are discretized.</p><p>Finally, the overall loss $L$ will be the expected value of $L(t, u)$ when summed over all the timesteps:</p><p>$$L = \sum_{t=1}^T \sum_{u=1}^U \alpha(t, u) (1 - \varnothing(t, u)) L(t, u).$$</p><p><strong>Backward Pass</strong></p><p>Backpropagating through the tensor operations that yield $L$ (from $\alpha(t, u)$) is easy. However, we will need a custom op to compute the full partials with respect to $\alpha(t, u)$ as well as $\varnothing(t, u)$, since $\alpha(t, u)$ is used in the computation of future $\alpha(t + 1, u)$ and $\alpha(t, u + 1).$ For notational simplicity, let $\delta(t, u) = \frac{\partial L}{\partial \alpha(t, u)}$. Reading the partial derivatives off the forward recurrence, we can compute $\delta(t, u)$ via the recurrence relation (with base cases)</p><p>$$\begin{align*}\delta(t, u) &amp;= \delta(t, u)_\text{base} + \varnothing(t, u)\delta(t + 1, u) + (1 - \varnothing(t, u)) \delta(t, u + 1) \\ \delta(T, u) &amp;= \delta(T, u)_\text{base}  + (1 - \varnothing(T, u)) \delta(T, u + 1) \\ \delta(t, U) &amp;= \delta(t, U)_\text{base}  + \varnothing(t, U)\delta(t + 1, U) \\ \delta(T, U) &amp;= \delta(T, U)_\text{base}\end{align*}$$</p><p>$\delta(t, u)_\text{base} = (1 - \varnothing(t, u)) L(t, u)$ is the contribution to the partial that was backpropagated from $L$ via tensor operations.</p><h3 id="summary">Summary</h3><p>This proposed model is an extension of the RNN-T model to situations with a continuous-valued output.
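</p><p>To make the forward pass concrete, here is a minimal NumPy sketch of the $\alpha(t, u)$ recurrence and the expected loss (my own illustration, not code from any real system; a practical implementation would work in log space and vectorize over a batch):</p>

```python
import numpy as np

def expected_transducer_loss(trans, loss):
    """Expected loss over all alignments.

    trans[t, u]: probability of advancing to encoder step t+1 while at
                 decoder step u (the controller output at (t, u)).
    loss[t, u]:  loss of the output emitted when t is aligned with u.
    Both arrays have shape (T, U).
    """
    T, U = trans.shape
    alpha = np.zeros((T, U))
    alpha[0, 0] = 1.0  # base case: alpha(1, 0) = 1
    for t in range(T):
        for u in range(U):
            if t > 0:  # arrived by advancing the encoder
                alpha[t, u] += alpha[t - 1, u] * trans[t - 1, u]
            if u > 0:  # arrived by emitting an output
                alpha[t, u] += alpha[t, u - 1] * (1 - trans[t, u - 1])
    # An emission happens at (t, u) with probability alpha * (1 - trans).
    return (alpha * (1 - trans) * loss).sum()
```

<p>In practice, summing probabilities this way underflows for long sequences, which is why real forward-backward implementations store $\log \alpha$ instead.</p><p>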
We create an external alignment model which just models the transitions in the encoder, and then use its outputs to compute an expected value over the loss in all possible alignments.</p><p>As described, there's nothing penalizing the model for not using the encoder inputs; we in no way require the model to use all of its encoder inputs. Perhaps extra loss terms that encourage $\alpha(T, U)$ to be high and $\alpha(t, U)$ to be low for $t &lt; T$ would be valuable.</p>]]></content:encoded></item><item><title><![CDATA[PQMF: Sub-band Coding for Neural Vocoders (Part 2)]]></title><description><![CDATA[<p>This is a continuation of <a href="https://andrew.gibiansky.com/pqmf-subband/">Part 1</a> of this two-part series. In this post, I'll try to go over the implementation of PQMF filters in sufficient detail such that you'll be able to use this technique in your own code.</p><h2 id="overview">Overview</h2><p>In the previous post, I summarized by presenting a</p>]]></description><link>https://andrew.gibiansky.com/pqmf-sub-band-coding-for-neural-vocoders-part-2/</link><guid isPermaLink="false">5fa8d174535e140006f0f993</guid><dc:creator><![CDATA[Andrew Gibiansky]]></dc:creator><pubDate>Mon, 16 Nov 2020 00:48:04 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1605423357213-765e720a4bf7?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1605423357213-765e720a4bf7?ixlib=rb-1.2.1&q=80&fm=jpg&crop=entropy&cs=tinysrgb&w=2000&fit=max&ixid=eyJhcHBfaWQiOjExNzczfQ" alt="PQMF: Sub-band Coding for Neural Vocoders (Part 2)"><p>This is a continuation of <a href="https://andrew.gibiansky.com/pqmf-subband/">Part 1</a> of this two-part series. 
In this post, I'll try to go over the implementation of PQMF filters in sufficient detail such that you'll be able to use this technique in your own code.</p><h2 id="overview">Overview</h2><p>In the previous post, I summarized by presenting a recipe for converting a neural vocoder into a sub-band vocoder:</p><ol><li><strong>Design a Prototype Filter: </strong>Choose a prototype filter $h[n]$ to use for your PQMF filter bank. The <a href="https://ir.nctu.edu.tw/bitstream/11536/32569/1/000073958700002.pdf">Kaiser window approach</a> seems easiest here.</li><li><strong>Compute PQMF Filter Bank: </strong>Choose the number of bands $K$ you plan to use. Then, compute your analysis and synthesis filters $h_k[n]$ and $g_k[n]$ using the equations above.</li><li><strong>Training Data Analysis: </strong>Calculate sub-band signals for all your training data using the analysis filters.</li><li><strong>Vocoder Training: </strong>Train your neural vocoders to predict the sub-band signals. Although you can train separate models per sub-band, to get inference speed improvements you must modify your model to output all sub-band signals in each timestep. This will reduce the number of output timesteps by a factor of $K$, which should reduce inference time by approximately that same factor. For example, for WaveRNN, you can have each timestep output $K$ values and input the previous samples for each sub-band. For MelGAN or WaveGlow, you can have the output consist of $K$ channels which get combined to create the final audio using your synthesis filters.</li><li><strong>Inference-time Synthesis: </strong>After running your vocoder during inference, run synthesis on the outputs to get your new audio stream.</li></ol><p>I left some of these pieces rather vague, and now it's time to fill in the details. 
Specifically, I'd like to address:</p><ul><li><strong>Designing a Prototype Filter: </strong>How do we compute a prototype filter?</li><li><strong>Computing Analysis-Synthesis Filters: </strong>How do we create the necessary filters?</li><li><strong>Implementation with Vocoders: </strong>How do we use these analysis and synthesis filters in standard deep learning frameworks?</li></ul><p>The bulk of this post will be addressing prototype filters, and we'll be using NumPy and SciPy for the code samples.</p><h2 id="designing-a-prototype-filter">Designing a Prototype Filter</h2><p>In this section, we'll dive into a prototype filter design method based on <a href="https://ir.nctu.edu.tw/bitstream/11536/32569/1/000073958700002.pdf">Kaiser windows</a>.</p><h3 id="vocabulary">Vocabulary</h3><p>Prior to jumping in, let's go over some vocabulary from filter design. For me, all of these rang bells from my signal processing classes years ago, but it was valuable to go through each of them.</p><ul><li><strong>Window Function: </strong>A <a href="https://en.wikipedia.org/wiki/Window_function">window function</a> is a function which is zero outside of a given interval and, usually, symmetric around zero. Window functions are multiplied with a signal before doing Fourier transforms to avoid discontinuities at the edge (since Fourier transforms assume that the signal is repeating). </li><li><strong>Bessel Function: </strong><a href="https://en.wikipedia.org/wiki/Bessel_function#Modified_Bessel_functions:_I%CE%B1,_K%CE%B1">Bessel functions</a> are a family of functions which are the solution to Bessel's differential equation, and are used in a variety of areas in physics, signal processing, etc. 
For our purposes, they're functions which can be used to create a bell-shaped window function.</li><li><strong>Cutoff Frequency: </strong>The <a href="https://en.wikipedia.org/wiki/Cutoff_frequency">cutoff frequency</a> is the frequency at which the response of a filter begins to attenuate significantly, for example, at which the filter reduces the power by a factor of two.</li><li><strong>Stopband Attenuation: </strong>The <a href="https://en.wikipedia.org/wiki/Stopband">stopband</a> attenuation is the attenuation that must be attained in the stopband (the frequency range in which the filter is not supposed to let signal through).</li><li><strong>Transition Bandwidth: </strong>The <a href="https://en.wikipedia.org/wiki/Transition_band">transition bandwidth</a> (as the name implies) is the width of the frequency range between the passband and the stopband; good filters will minimize this width.</li></ul><h3 id="kaiser-window-filter">Kaiser Window Filter</h3><p><strong>tl; dr: </strong>You can do this with one call to <code><a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.firwin.html">scipy.signal.firwin</a></code>.</p><p>Given a window length $N$, the Kaiser window filter $w(n)$ is the following:</p><figure class="kg-card kg-image-card"><img src="https://andrew.gibiansky.com/content/images/2020/11/image.png" class="kg-image" alt="PQMF: Sub-band Coding for Neural Vocoders (Part 2)"></figure><p>This defines a window of length $N+1$ using the modified Bessel function $I_0$ of the first kind. In practice, this can be computed using <code><a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.windows.kaiser.html">scipy.signal.windows.kaiser</a></code> (which assumes $n$ is centered around zero by default). 
</p><p>The Kaiser window looks like you'd expect a window to look like:</p><figure class="kg-card kg-image-card"><img src="https://andrew.gibiansky.com/content/images/2020/11/image-1.png" class="kg-image" alt="PQMF: Sub-band Coding for Neural Vocoders (Part 2)"></figure><p>Given a stopband attenuation $As$ and transition bandwidth $\Delta w$, the filter length should be approximately:</p><p>$$N \approx \frac{As -7.95}{14.36\Delta w} 2\pi.$$</p><p>The derivation for this specific fact is supposedly available <a href="https://ieeexplore.ieee.org/document/376856">here</a>, but I am for the time being willing to take this on faith. Similarly, the needed value for $\beta$ is also a function of your stopband attenuation $As$.</p><p>Given your chosen window function $w(n)$, your final prototype filter can be computed from your desired cutoff frequency $\omega_c$:</p><p>$$h(n) = \frac{\sin(\omega_c n) w(n)}{\pi n}.$$</p><p>We have yet to compute the cutoff frequency $\omega_c$, which, according to <a href="https://ir.nctu.edu.tw/bitstream/11536/32569/1/000073958700002.pdf">our reference paper</a>, is best computed by minimizing the objective function</p><p>$$\phi_{\text{new}}(\omega_c) = \max |h(n) * h(N - n)|.$$</p><p>(That is, given an $\omega_c$, compute the prototype filter, convolve it with its reverse, and take the maximum absolute value. More details <a href="https://github.com/kan-bayashi/ParallelWaveGAN/issues/195#issuecomment-671408750">discussed here</a>, but this seems to be right.)</p><p>For computing this in practice, you can use <code><a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.firwin.html">scipy.signal.firwin</a></code>. 
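</p><p>To make that relationship concrete (the parameter values below are hypothetical examples of my own, not from the paper), the <code>width</code> argument of <code>firwin</code> is shorthand for computing the attenuation and $\beta$ yourself:</p>

```python
import numpy as np
import scipy.signal

# Hypothetical example: 63 taps, with cutoff and transition width given as
# fractions of the Nyquist frequency (firwin's default normalization).
numtaps, cutoff, width = 63, 0.125, 0.05

# Attenuation achievable by a Kaiser window of this length and transition
# width, and the corresponding beta...
atten = scipy.signal.kaiser_atten(numtaps, width)
beta = scipy.signal.kaiser_beta(atten)
h_explicit = scipy.signal.firwin(numtaps, cutoff, window=("kaiser", beta))

# ...which is exactly what passing `width=` computes internally.
h_width = scipy.signal.firwin(numtaps, cutoff, width=width)
assert np.allclose(h_explicit, h_width)
```

<p>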
<code>firwin</code> allows you to specify a window function; additionally, if you specify the <code>width</code> parameter, you can directly set the transition bandwidth and it will calculate the appropriate value of <code>beta</code> for your Kaiser window.</p><h2 id="implementing-analysis-synthesis-filters">Implementing Analysis-Synthesis Filters</h2><p>Now that we have a prototype filter computed using <code>scipy.signal.firwin</code>, we need to create a set of $K$ analysis filters $h_k(n)$ and synthesis filters $g_k(n)$. Recall <a href="https://andrew.gibiansky.com/pqmf-subband/">from the previous post</a> that these are defined as follows in PQMF filter banks:</p><p>$$\begin{align*}h_k[n] &amp;= h[n] \cos\left(\frac{\pi}{4K}\left(2k + 1\right)\left(2n - N + 1\right) + \Phi_k\right)\\g_k[n] &amp;= h_k[N - 1 - n]\\\Phi_k &amp;= (-1)^{k} \frac{\pi}{4}.\end{align*}$$</p><p>Computing this is not too hard, just a bit of arithmetic. The process of using these filters is then equivalent to standard convolution.</p><h3 id="pseudocode">Pseudocode</h3><p>With the math out of the way, let's convert this to code. Warning – <em>don't</em> copy this code and run it. I haven't tested it and I haven't run it, so it won't be useful to you that way. Instead, view this as a formalization of what I wrote above and trace through the logic to understand it.</p><pre><code class="language-python">import numpy as np
import scipy.signal

def create_prototype_filter(bands: int, cutoff: float) -&gt; np.ndarray:
    """Create a prototype filter."""
    transition_bandwidth = 0.9 * np.pi / 2 / bands
    attenuation = 100 # dB
    taps = int(2 * np.pi * (attenuation - 7.95) / (14.36 * transition_bandwidth))
    return scipy.signal.firwin(
        taps, cutoff, width=transition_bandwidth, fs=2 * np.pi
    )
    
def optimize_cutoff(bands: int) -&gt; float:
    """Choose the best cutoff frequency."""
    # firwin requires 0 &lt; cutoff &lt; Nyquist (pi, given fs=2*pi above), so
    # search a coarse grid that excludes the endpoints.
    options = np.linspace(0.01, np.pi - 0.01, 1000)
    best_cutoff, best_objective = None, 1e9
    for option in options:
        h = create_prototype_filter(bands, option)
        # For near-perfect reconstruction, the non-zero even-lag taps of the
        # prototype's autocorrelation should vanish (the zero-lag peak is
        # always large, so it is excluded from the objective).
        acorr = np.abs(np.convolve(h, h[::-1]))
        objective = acorr[h.size + 1::2].max()
        if objective &lt; best_objective:
            best_cutoff, best_objective = option, objective
    return best_cutoff
    
def create_filter_bank(bands: int):
    """Create the entire filter bank."""
    cutoff = optimize_cutoff(bands)
    proto = create_prototype_filter(bands, cutoff)
    taps = proto.size
    h = np.zeros((bands, taps))
    g = np.zeros((bands, taps))
    factor = (np.pi / (2 * bands)) * (
        np.arange(taps) - ((taps - 1) / 2)
    )
    for k in range(bands):
        scale = (2 * k + 1) * factor
        phase = (-1) ** k * np.pi / 4
        h[k] = 2 * proto * np.cos(scale + phase)
        g[k] = 2 * proto * np.cos(scale - phase)
        
    return h, g</code></pre><h2 id="conclusion">Conclusion</h2><p>All in all, as is often the case with mathematical ideas, the amount of <em>code</em> it takes to implement all of this is pretty small. Thank you additionally to @kan-bayashi whose <a href="https://github.com/kan-bayashi/ParallelWaveGAN">Github repo</a> was hugely helpful in tracing through some of the logic required.</p><p>This approach definitely raises additional questions, which I look forward to answering or seeing answered:</p><ul><li>How does the speed and quality depend on the number of bands, the number of taps, and the prototype filter used?</li><li>Instead of using this method to design the prototype filter, can we learn the prototype filter as part of vocoder training?</li><li>Do we need to use PQMF filters? Could we instead learn both analysis and synthesis filters entirely from scratch?</li></ul><p>Definitely looking forward to seeing where all of this goes! I'm consistently impressed with the rapid progress in neural speech synthesis, so I'm sure the answers will come quickly.</p>]]></content:encoded></item><item><title><![CDATA[PQMF: Sub-band Coding for Neural Vocoders (Part 1)]]></title><description><![CDATA[<p>In the past year or so, there's been several papers that investigate using sub-band coding with neural vocoders to model audio and accelerate inference:</p><ul><li><a href="https://ieeexplore.ieee.org/document/8639687?denied=">FFTNet with sub-band coding</a></li><li><a href="https://ieeexplore.ieee.org/document/8462237">WaveNet with sub-band coding</a></li><li><a href="https://arxiv.org/pdf/1909.01700.pdf">DurIan TTS System from Tencent</a></li><li><a href="https://arxiv.org/abs/2005.05106">MelGAN with sub-band coding</a></li><li><a href="https://www.isca-speech.org/archive/VCC_BC_2020/pdfs/VCC2020_paper_27.pdf">Sogou System for Blizzard 2020</a></li></ul><p>In this blog post,</p>]]></description><link>https://andrew.gibiansky.com/pqmf-subband/</link><guid 
isPermaLink="false">5f9d9077535e140006f0f624</guid><dc:creator><![CDATA[Andrew Gibiansky]]></dc:creator><pubDate>Sun, 01 Nov 2020 00:27:41 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1595598237436-bf64a3bf18cd?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1595598237436-bf64a3bf18cd?ixlib=rb-1.2.1&q=80&fm=jpg&crop=entropy&cs=tinysrgb&w=2000&fit=max&ixid=eyJhcHBfaWQiOjExNzczfQ" alt="PQMF: Sub-band Coding for Neural Vocoders (Part 1)"><p>In the past year or so, there's been several papers that investigate using sub-band coding with neural vocoders to model audio and accelerate inference:</p><ul><li><a href="https://ieeexplore.ieee.org/document/8639687?denied=">FFTNet with sub-band coding</a></li><li><a href="https://ieeexplore.ieee.org/document/8462237">WaveNet with sub-band coding</a></li><li><a href="https://arxiv.org/pdf/1909.01700.pdf">DurIan TTS System from Tencent</a></li><li><a href="https://arxiv.org/abs/2005.05106">MelGAN with sub-band coding</a></li><li><a href="https://www.isca-speech.org/archive/VCC_BC_2020/pdfs/VCC2020_paper_27.pdf">Sogou System for Blizzard 2020</a></li></ul><p>In this blog post, I'd like to go over the ideas behind sub-band coding and specifically the math behind PQMF coding. You can find a more textbook-style approach to this in <a href="https://link.springer.com/chapter/10.1007/978-1-4615-0327-9_4">chapter 4 of this book</a> (and, if you don't have a copy of this book or some of the above-linked papers, a wonderful person named Alexandra Elbakyan can likely help you). </p><h2 id="sub-band-coding">Sub-band Coding</h2><p>When you work with audio, you can represent it in the time domain (as a waveform) or in the frequency domain (as a spectrogram). 
A spectrogram is obtained through a <a href="https://en.wikipedia.org/wiki/Discrete_Fourier_transform">Discrete Fourier Transform</a>, and tells you the power and phase of the audio at every frequency present in the audio.</p><p>The key idea behind sub-band coding is that instead of representing the audio as a single high-frequency (24kHz) signal that covers the entire range of frequencies, we can instead represent it as <em>multiple</em> lower-frequency (e.g. 6kHz) signals that cover ranges of frequencies (sub-bands). </p><p>We can use this alternate representation for different applications. In file compression (MP3), this is used to apply different levels of quantization and compression to the different bands (because human hearing is sensitive to them in different ways). In neural vocoding for TTS, we can use this to accelerate inference by having our neural vocoders output multiple sub-band values per output timestep, thus reducing the total amount of compute needed while keeping quality high.</p><p>Sub-band coding consists of two phases. In the first phase, <em>analysis</em>, the signal is processed with a set of $k$ <em>analysis filters</em>, creating $k$ new signals. These signals are downsampled by a factor of $k$ by taking every $k$th value (thus maintaining the same total amount of data). In audio compression applications, the downsampled signals are processed (quantization, compression, etc) and transmitted. The second phase, <em>synthesis</em>, reconstructs the original signal from the downsampled and processed signals. To reconstruct the original signal, the signals are then upsampled by a factor of $k$ to the original data rate by inserting $k - 1$ zeros after each value. These upsampled signals are processed with a set of $k$ <em>synthesis filters</em> (one filter per signal) and added together. 
If the analysis and synthesis filters are chosen appropriately, the output signal can either approximate or perfectly reconstruct the input signal.</p><p>For the application of neural vocoding, we apply the analysis filters to the training data and train our networks to produce all $k$ sub-band signals as outputs. We then apply synthesis to our network outputs to produce the final audio stream. This can accelerate inference over standard modeling techniques because the signal we are producing has been downsampled by a factor of $k$, and different bands for a single timestep are modeled as conditionally independent.</p><p>Next up, let's talk about the math behind sub-band coding.</p><h2 id="the-z-transforms">The Z Transforms</h2><p>We'll soon be talking a lot about discrete signal filters. A filter is a function that, given an input signal $x[t]$, produces a modified signal $y[t]$. Just like we use spectrograms to analyze audio signals in the frequency domain, analyzing the behavior of filters is commonly done in the frequency domain as well, and we use the <a href="https://en.wikipedia.org/wiki/Z-transform">Z transform</a> (a discrete equivalent of a Laplace transform) to convert filters into the frequency domain.</p><p>Since the Z transform is a little less common than the discrete Fourier transform, it's worth going over. If you're familiar with it, skip to the next section.</p><p>The discrete Fourier transform of $x[n]$ is defined as:</p><p>$$X[f] = \sum_{n=0}^{N-1} x[n] e^{\frac{-i f n 2 \pi}{N}}.$$</p><p>The Z transform generalizes the $e^{\frac{i 2 \pi}{N}f}$ term to be <em>any</em> complex value $z$ instead of being restricted to the unit circle, and is defined as:</p><p>$$X[z] = \sum_{n=0}^\infty x[n] z^{-n}.$$</p><p>The Z transform has <a href="https://en.wikipedia.org/wiki/Z-transform#Properties">similar properties to the Fourier transforms</a>. 
The relevant ones to our discussions are:</p><ul><li><strong>Linearity</strong>: The transform of $a x[n] + b y[n]$ is $a X[z] + b Y[z].$</li><li><strong>Convolution: </strong>The transform of the convolution $x[n] * y[n]$ is $X[z] Y[z].$</li><li><strong>Time Delay: </strong>The transform of $x[n + k]$ is $X[z] z^k.$</li></ul><p>We will also need the <a href="https://web.eecs.umich.edu/~fessler/course/451/l/pdf/updown.pdf">Z transform for downsampling and upsampling</a> (by dropping samples or inserting zeros):</p><ul><li><strong>Downsampling: </strong>The transform of $y[n] = x[nK]$ is $$Y[z] = \frac{1}{K} \sum_{k=0}^{K-1} X\left[e^{-i\frac{2\pi k}{K}} z^{1/K}\right].$$</li><li><strong>Upsampling</strong>: The transform of $y[n] = x[n / K] \text{ if $K$ divides $n$ else } 0$ is $Y[z] = X[z^K].$</li></ul><p>If you need to, rederive the above properties to make sure they make sense to you!</p><h2 id="frequency-domain-sub-band-coding">Frequency Domain Sub-Band Coding</h2><p>Now that we have reviewed the Z transform, let's use it to analyze the synthesis and analysis processes and derive a set of filters for a pair of bands ($k=2$).</p><p>Consider the following diagram of analysis and synthesis (from Ch 4. of the aforementioned book):</p><figure class="kg-card kg-image-card"><img src="https://andrew.gibiansky.com/content/images/2020/10/image-1.png" class="kg-image" alt="PQMF: Sub-band Coding for Neural Vocoders (Part 1)"></figure><p>In our setup, we have analysis filters $H_0$ and $H_1$ and synthesis filters $G_0$ and $G_1$. The Z transform of $x'[n]$ (expressed using the transforms of the bands and the synthesis filters) is then:</p><p>$$X'[z] = Y_0(z^2) G_0(z) + Y_1(z^2) G_1(z).$$</p><p>This relies on two previously-discussed properties. First, passing a signal through a series of filters multiplies the filters' transforms, because applying a filter is convolving a signal with the filter's impulse response. 
Second, upsampling a signal by a factor of 2 turns its transform $Y_0(z)$ into $Y_0(z^2)$, as mentioned above.</p><p>By the same general approach, we can write $Y_0(z)$ and $Y_1(z)$ using the analysis filters and the input signal:</p><p>$$Y_i(z) = \frac{1}{2}H_i(z^{1/2}) X[z^{1/2}] + \frac{1}{2}H_i(-z^{1/2}) X[-z^{1/2}].$$</p><p>Combining this, we can write $X'[z]$:</p><p>$$\begin{align*}X'[z] =&amp; \frac{1}{2}X[z]\left(H_0(z)G_0(z) + H_1(z)G_1(z)\right) + \\  &amp;\frac{1}{2}X[-z](H_0(-z)G_0(z) + H_1(-z)G_1(z)).\end{align*}$$</p><p>To achieve perfect reconstruction, we need to choose filters such that $X'[z] = X[z].$ There are many possible filters that satisfy this.</p><h2 id="quadrature-mirror-filters-qmf-">Quadrature Mirror Filters (QMF)</h2><p>One common choice of filters is a set of <a href="https://en.wikipedia.org/wiki/Quadrature_mirror_filter">Quadrature Mirror Filters</a>.</p><p>To motivate this choice, we first of all want to get rid of the $X[-z]$ term in the equation above (this term causes aliasing). We can do so by setting:</p><p>$$G_0(z) = -H_1(-z)\\G_1(z) = H_0(-z).$$</p><p>Plugging this into the equation for $X'[z],$ we get:</p><p>$$X'[z] =\frac{1}{2}X[z](-H_0(z) H_1(-z) + H_1(z)H_0(-z)).$$</p><p>The QMF solution continues to simplify by setting $H_1(z) = -H_0(-z),$ so that</p><p>$$X'[z] =\frac{1}{2}X[z](H_0(z)^2 - H_0(-z)^2).$$</p><p>We can then choose any $H_0(z),$ as long as it satisfies:</p><p>$$H_0(z)^2 - H_0(-z)^2 = 2z^{-D}.$$</p><p>The factor of $z^{-D}$ allows us to create a delayed output signal.</p><p>One way to achieve this is using a <strong>2-tap Haar filter</strong>. 
This filter has the impulse response</p><p>$$h_0[n] = \left\{\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}, 0, 0, \ldots\right\}.$$</p><p>To clarify, this means that $y[n] = \frac{1}{\sqrt{2}} (x[n] + x[n-1]).$ Using linearity and time delay properties, the Z transform of this Haar filter's impulse response is</p><p>$$H_0(z) = \frac{1}{\sqrt{2}}(1 + z^{-1}),$$</p><p>which we can verify meets the aforementioned constraint on $H_0(z)$.</p><p>From the equations we had earlier, we can derive all our analysis and synthesis filters:</p><p>$$\begin{align*}h_0[n] &amp;= \left\{\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}, 0, 0, \ldots\right\}\\h_1[n] &amp;= \left\{-\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}, 0, 0, \ldots\right\}\\g_0[n] &amp;= \left\{\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}, 0, 0, \ldots\right\}\\g_1[n] &amp;= \left\{\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}}, 0, 0, \ldots\right\}\end{align*}$$</p><p>With a bunch of arithmetic, you can verify that the output signal is indeed identical to the input signal (up to a one-sample delay, since here $D = 1$).</p><p>While these filters do work <em>mathematically</em>, they end up being less useful in practice. For filters to be useful, you want them to separate the frequency domain neatly, such that each filter significantly attenuates frequencies outside its band and does <em>not</em> attenuate frequencies in its band. These Haar filters have incredibly wide bands, as shown below, and thus are not very useful:</p><figure class="kg-card kg-image-card"><img src="https://andrew.gibiansky.com/content/images/2020/10/image-2.png" class="kg-image" alt="PQMF: Sub-band Coding for Neural Vocoders (Part 1)"></figure><p>There exist better but approximate QMF solutions that have tighter frequency bands; these are longer (they have more than two taps).</p><h2 id="pseudo-quadrature-mirror-filters-pqmf-">Pseudo-Quadrature Mirror Filters (PQMF)</h2><p>The QMF filter bank works for two channels. However, in practice, we may want more than two channels. 
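</p><p>Before generalizing, the two-band Haar bank above is easy to check numerically. This is a small sketch of my own (not from the book), reconstructing a random signal up to the one-sample delay $z^{-D}$ with $D = 1$:</p>

```python
import numpy as np

s = 1 / np.sqrt(2)
h0, h1 = np.array([s, s]), np.array([-s, s])  # analysis filters
g0, g1 = np.array([s, s]), np.array([s, -s])  # synthesis filters

rng = np.random.default_rng(0)
x = rng.standard_normal(64)

# Analysis: filter, then downsample by keeping every other sample.
y0, y1 = np.convolve(x, h0)[::2], np.convolve(x, h1)[::2]

# Synthesis: upsample by inserting zeros, filter, and sum the bands.
u0, u1 = np.zeros(2 * y0.size), np.zeros(2 * y1.size)
u0[::2], u1[::2] = y0, y1
x_rec = np.convolve(u0, g0) + np.convolve(u1, g1)

# Perfect reconstruction, up to the one-sample delay (D = 1).
assert np.allclose(x_rec[1:1 + x.size], x)
```

<p>Swapping in longer, better-localized filters changes only the arrays above; the analysis and synthesis steps stay the same.</p><p>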
A generalization of the QMF filter bank to many channels exists, and is called the Pseudo-Quadrature Mirror Filter Bank (PQMF). These filters are called "pseudo-QMF" because these are approximate, not exact.</p><p>The filters used for <a href="https://arxiv.org/pdf/1909.01700.pdf">DurIAN</a> (see Appendix A) and <a href="https://arxiv.org/pdf/2005.05106.pdf">MB-MelGAN</a> are PQMF filters. These are created by following the methodology in <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.18.2036&amp;rep=rep1&amp;type=pdf">"Near-Perfect-Reconstruction Pseudo-QMF Banks"</a>, and use four bands with a filter order of 63 (that is, have 63 "taps"). PQMF filter banks are also used in MP3 and other audio codecs.</p><p>PQMF filter banks consist of $K$ channels based on a low-pass filter $h[n]$. The $K$ analysis and synthesis filters are then computed from this chosen low-pass filter as follows:</p><p>$$\begin{align*}h_k[n] &amp;= h[n] \cos\left(\frac{\pi}{4K}\left(2k + 1\right)\left(2n - N + 1\right) + \Phi_k\right)\\g_k[n] &amp;= h_k[N - 1 - n]\end{align*}$$</p><p>The analysis filters $h_k[n]$ are cosine-modulated versions of the original filter (making PQMF filters part of a class of filters known as Cosine-Modulated Filter Banks or CMFBs). The synthesis filter is a time-reversed version of the analysis filter. Additionally, the cosine phases of adjacent bands are constrained and must satisfy (for integral $r$)</p><p>$$\Phi_k - \Phi_{k-1} = \frac{\pi}{2}(2r + 1).$$</p><p>One option, recommended in <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.18.2036&amp;rep=rep1&amp;type=pdf">Eq. 11 here</a>, is the following choice:</p><p>$$\Phi_k = (-1)^{k} \frac{\pi}{4}.$$</p><p>We can verify that this indeed meets the criterion above. </p><p>The choice of the prototype low-pass filter $h[n]$ is important for reducing reconstruction error. 
In fact, this filter can be found using computational methods by optimizing an objective function which minimizes the magnitude of the reconstruction error and maximizes stopband attenuation. </p><p>One approach to computationally designing the prototype filter is <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.18.2036&amp;rep=rep1&amp;type=pdf">presented here</a>; another approach based on limiting the search to <a href="https://en.wikipedia.org/wiki/Kaiser_window">Kaiser windows</a> is <a href="https://ir.nctu.edu.tw/bitstream/11536/32569/1/000073958700002.pdf">available here</a>. The latter approach seems significantly simpler to understand and easier to implement.</p><h2 id="training-neural-vocoders-with-pqmf-sub-band-coding">Training Neural Vocoders with PQMF Sub-band Coding</h2><p>Finally, let's summarize the approach into a series of steps we can take to modify any neural vocoder with sub-band coding.</p><ol><li><strong>Design a Prototype Filter: </strong>Choose a prototype filter $h[n]$ to use for your PQMF filter bank. The <a href="https://ir.nctu.edu.tw/bitstream/11536/32569/1/000073958700002.pdf">Kaiser window approach</a> seems easiest here.</li><li><strong>Compute PQMF Filter Bank: </strong>Choose the number of bands $K$ you plan to use. Then, compute your analysis and synthesis filters $h_k[n]$ and $g_k[n]$ using the equations above.</li><li><strong>Training Data Analysis: </strong>Calculate sub-band signals for all your training data using the analysis filters.</li><li><strong>Vocoder Training: </strong>Train your neural vocoders to predict the sub-band signals. Although you can train separate models per sub-band, to get inference speed improvements you must modify your model to output all sub-band signals in each timestep. This will reduce the number of output timesteps by a factor of $K$, which should reduce inference time by approximately that same factor. 
For example, for WaveRNN, you can have each timestep output $K$ values and input the previous samples for each sub-band. For MelGAN or WaveGlow, you can have the output consist of $K$ channels which get combined to create the final audio using your synthesis filters.</li><li><strong>Inference-time Synthesis: </strong>After running your vocoder during inference, run synthesis on the outputs to get your new audio stream.</li></ol><p>As some meta-commentary, I'm fascinated that this clever idea took such a long time to reach neural vocoders, and I suspect that now that it's been shown to be effective in several works, it will spread quickly. This seems like a classic case of slow progress due to siloed knowledge: the people with the deep understanding of MPEG filter banks for the most part were unlikely to be training neural vocoders, and the people training neural vocoders were unlikely to have deep knowledge of MPEG filter banks. It took <em>several</em> skilled and cross-functional researchers to make this happen – at least one person to have the initial idea, and then several more to reproduce it in several more works to get wider acclaim for this idea. 
This also really highlights how valuable domain experience can be – you can't dive into a new field, sprinkle some machine learning fairy dust, and get great results!</p>]]></content:encoded></item><item><title><![CDATA[Facebook's Knowledge-Assisted NLP]]></title><description><![CDATA[A deep dive into several Facebook publications about knowledge-augmented language tasks, such as question answering and entity linking.]]></description><link>https://andrew.gibiansky.com/facebooks-knowledge-assisted-nlp/</link><guid isPermaLink="false">5f73cb5c6cc48f0006cca024</guid><dc:creator><![CDATA[Andrew Gibiansky]]></dc:creator><pubDate>Tue, 06 Oct 2020 20:08:30 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1565462905102-140e712045aa?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1565462905102-140e712045aa?ixlib=rb-1.2.1&q=80&fm=jpg&crop=entropy&cs=tinysrgb&w=2000&fit=max&ixid=eyJhcHBfaWQiOjExNzczfQ" alt="Facebook's Knowledge-Assisted NLP"><p>Facebook recently published <a href="https://ai.facebook.com/blog/retrieval-augmented-generation-streamlining-the-creation-of-intelligent-natural-language-processing-models/">a blog post</a> about <a href="https://arxiv.org/abs/2005.11401">their Retrieval-Augmented Generation (RAG) paper</a> (published in May 2020). The blog post is light on detail, but, as usual, the news coverage is <a href="https://www.marktechpost.com/2020/09/29/facebook-ai-open-sources-rag-an-innovation-in-intelligent-nlp-models/">much worse</a> (filled with ads and poorly written). 
I decided to dive in and figure out what the work was about.</p><p>In 2020, Facebook has had several publications about knowledge-aided NLP (<a href="https://arxiv.org/abs/2009.02252">KILT</a>, <a href="https://arxiv.org/abs/2005.11401">RAG</a>, <a href="https://arxiv.org/abs/1910.13461">BART</a>, <a href="https://arxiv.org/abs/2004.04906">DPR</a>, <a href="https://arxiv.org/pdf/1911.03814.pdf">BLINK</a>), so in this blog post I'd like to go through what all these acronyms are and how they fit together.</p><p>To summarize:</p><ul><li>Dense Passage Retrieval (DPR): A model which, given a question, retrieves relevant passages from a database of passages built from Wikipedia.</li><li>BART: A sequence-to-sequence (seq2seq) version of BERT.</li><li>BLINK: A DPR-based entity linker.</li><li>Retrieval-Augmented Generation (RAG): A DPR- and BART-based question answering model.</li><li>Knowledge Intensive Language Tasks (KILT): A benchmark for evaluating quality of knowledge-based tasks, tested on baselines and Facebook's models.</li></ul><p>I'll go through these models in more detail below.</p><h2 id="dense-passage-retrieval-dpr-">Dense Passage Retrieval (DPR)</h2><p>One of the core components of knowledge-based question answering is <em>passage retrieval</em>. Given a question, passage retrieval selects from a large database candidate passages that may be relevant to the question.</p><p>Traditional approaches to this problem include <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">TF-IDF</a> and <a href="https://en.wikipedia.org/wiki/Okapi_BM25">BM25</a>, which, at their core, are fairly simple models for assessing document similarity based on word frequencies in those documents. 
There are many variations and changes to the core models that make these work well, such as stemming, smoothing, stopword removal, etc.</p><p><a href="https://arxiv.org/pdf/2004.04906.pdf">Dense Passage Retrieval (DPR)</a> (April 2020), a neural-network based passage retriever, is Facebook's approach to this challenge. Although it's published as an independent paper, it seems like it's really part of the same effort as RAG – published less than two months apart with significant author overlap.</p><p>The DPR model consists of two parts: a passage encoder (fine-tuned based on BERT) and a question encoder (also fine-tuned based on BERT). Each encoder returns a dense vector representation of its input by selecting the output embedding of the [CLS] token. Given a question vector and a passage vector, the similarity between the two is the dot product of the two vectors.</p><p>A single training sample for this model consists of a question, a positive passage (the passage which has the answer to the question), and a set of negative passages (which are irrelevant to the question). The model is trained to maximize similarity of the question vector to the positive passage vector, and minimize similarity to negative passage vectors. (The loss is equivalent to using the passage similarities as logits and then using a softmax cross-entropy loss.)</p><p>Models trained like this are dependent on the quality of the negative passages chosen. If all the negative passages are just completely random unrelated passages, the task is too easy, and the model will learn shallow features based on word frequency, etc, and won't generalize well. In order to learn high-quality representations, the negative passages for a question must be a mix of arbitrary unrelated passages and passages that are <em>close</em> to the positive but are still wrong; the latter of these two are called "hard negatives". 
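The training objective just described can be sketched in a few lines (a toy NumPy version, not the paper's code: the real model uses BERT encoders and in-batch negatives, while here the encoders are stubbed out as precomputed vectors):

```python
import numpy as np

def dpr_loss(q_vec, pos_vec, neg_vecs):
    """Negative log-likelihood of the positive passage, treating the
    dot-product similarities as logits over [positive] + negatives."""
    logits = np.array([q_vec @ pos_vec] + [q_vec @ n for n in neg_vecs])
    logits -= logits.max()                              # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())   # log-softmax
    return -log_probs[0]                                # positive sits at index 0
```

Minimizing this loss pushes the question vector toward the positive passage vector and away from the negatives, which is exactly the softmax-cross-entropy-over-similarities view described above.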
DPR is trained with a single hard negative per question, which is sourced by running a BM25 passage retriever and choosing one of its retrieved candidates.</p><p>During inference, passage embeddings are generated by the passage encoder and cached. To retrieve passages related to a question, the question is encoded with the question encoder, and then a fast similarity search algorithm (<a href="https://engineering.fb.com/data-infrastructure/faiss-a-library-for-efficient-similarity-search/">FAISS</a>) is used to find the top $k$ cached passage encodings with maximal similarity to the question encoding.</p><p>In the end, this neural passage retriever works better on most datasets than a Lucene-based TF-IDF or BM25 retriever. Given the recent experience in NLP, this isn't too surprising.</p><p>In the DPR paper, not only do they implement their passage retriever, but they also implement an extractive question answerer. This is an alternative to RAG, which generates the answer with a seq2seq model, instead of selecting a span in the supporting documents. As we'll see below, RAG, in some sense, is just DPR with a slightly more advanced answer-generating model.</p><h2 id="bart-bert-for-seq2seq-models">BART: BERT for Seq2Seq Models</h2><p><a href="https://arxiv.org/pdf/1910.13461.pdf">BART</a> (Oct 2019) is a model from Facebook that attempts to answer the question: How do you do BERT, but for seq2seq models?</p><p>BERT is (effectively) a denoising autoencoder for text, replacing noised [MASK] tokens with the original tokens. BART is a denoising autoencoder as well, but one where the noise function can alter the sequence length, and thus it uses a seq2seq transformer instead of BERT's vanilla feedforward transformer. 
In some sense, BART is an extension of BERT, since it allows for a strictly more powerful noise function than BERT.</p><p>The biggest question in all of this is, what noise function do you use in this setup that yields a useful pretrained model?</p><p>In this paper, the following noise functions are evaluated:</p><ul><li><strong>Token Masking</strong>: Replace tokens with [MASK] (as in BERT).</li><li><strong>Token Deletion</strong>: Delete tokens. Do not replace them with [MASK].</li><li><strong>Token Infilling</strong>: Replace a span of text (length 0 upward) with [MASK]. (Thus, [MASK] isn't guaranteed to be a single token.)</li><li><strong>Sentence Permutation</strong>: Shuffle sentences, delineated by periods. (Not ultimately helpful.)</li><li><strong>Document Rotation</strong>: "A token is chosen uniformly at random, and the document is rotated so that it begins with that token. This task trains the model to identify the start of the document." (Not ultimately helpful.)</li></ul><p>The encoder encodes a noised input, and the decoder (autoregressively) predicts the original input.</p><p>On first glance, this model felt weird to me. The additional noise function flexibility is obviously a positive, but using a seq2seq model to predict the output feels like overkill. However, the decoder effectively learns a smarter copy function, one which alternates between text copying and text generation appropriately.</p><p>The pretrained model can be fine-tuned for a variety of tasks, including sequence classification (using final timestep decoder state as output layer), token classification (using last layer decoder state for each token as outputs), and sequence generation (using the full model). Machine translation into English is also tried by re-initializing encoder token embeddings (and keeping the rest of the model).</p><p>Interpreting the results here is hard. 
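To make the most important of these noise functions concrete, here is a toy sketch of text infilling (all constants are illustrative rather than the paper's hyperparameters; the paper draws span lengths from a Poisson distribution):

```python
import random

MASK = "[MASK]"

def infill_noise(tokens, mask_fraction=0.3, mean_span=3.0, rng=None):
    """Replace contiguous spans with a single [MASK] each, until roughly
    mask_fraction of the tokens have been hidden. A length-0 span inserts
    a [MASK] with nothing behind it, so output length can differ from
    input length -- the property that forces a seq2seq model."""
    rng = rng or random.Random(0)
    out, i, masked = [], 0, 0
    budget = int(mask_fraction * len(tokens))
    while i < len(tokens):
        if masked < budget and rng.random() < 0.15:
            span = rng.randint(0, int(2 * mean_span))  # crude Poisson stand-in
            out.append(MASK)   # one [MASK] covers the whole span
            i += span
            masked += span
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Because each [MASK] can hide zero or many tokens, the decoder must learn how much text to generate at each mask, not just which token goes where.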
Document rotation and sentence shuffling do not improve performance, which is unsurprising, given that they resemble next sentence prediction (NSP) in BERT, a loss which has been shown to <a href="https://arxiv.org/pdf/1906.08237.pdf">be unnecessary or even harmful</a>. Text infilling seems to be superior to other noise functions, which isn't too surprising -- it's strictly more general than masking or deletion. BART mostly does well on all the tested tasks, except for one, which seems to be an outlier, as it is best handled by a straight language model. BART isn't any better than SotA (state-of-the-art) on SQuAD and GLUE, but isn't any worse either. BART works better for summarization than other approaches, likely since summarization is a seq2seq task with a lot of copying in it.</p><p>All in all, it's a valuable data point, but I don't see this approach becoming popular outside of Facebook, possibly with the exception of summarization. The performance isn't generally superior, and there are too many details to interpret in this paper; it's hard to tell a single cohesive story about this paper. Regardless, it's one of the building blocks of RAG, the paper that initiated this blog post.</p><h2 id="blink-retrieval-augmented-entity-linking">BLINK: Retrieval-Augmented Entity Linking</h2><p><a href="https://arxiv.org/pdf/1911.03814.pdf">BLINK</a> (September 2020) is a recent Facebook model for entity linking. Entity linking is the task or process of connecting a short span of text (a "mention") to an entity, an object in some sort of database with an associated description. Entity linking is inherently knowledge-based, since there can be millions of candidate entities. 
In some sense, question answering with a database (RAG) and entity linking are very similar tasks, with the caveat that entity linking is guaranteed to only link to a single entity, whereas knowledge-assisted question answering may require multiple supporting sources.</p><p>More specifically, BLINK is a zero-shot entity linker, making it even more similar to knowledge-assisted QA. Zero-shot, in this case, means that the entities are not part of the model, so the set of entities can be different during inference than during training. You won't find any learned entity embeddings in this paper, but you will find an entity encoder, so adding an entity just corresponds to running its description through the entity encoder. (In fact, to evaluate this fairly, the entity set used in training is disjoint from the test entity set.)</p><p>BLINK operates in two phases for performance reasons. The first phase chooses a set of candidate entities for each mention. The second phase links precisely one of those candidates to the mention. Since the first phase needs to consider millions of entities, it must be incredibly fast, while the second phase can involve more computation for each entity-mention candidate.</p><p>Model-wise, BLINK is more or less what you would expect. Phase one of BLINK is more-or-less identical to DPR (see above), with the difference that the hard negatives are sourced by running BLINK itself (rather than BM25, as in DPR).</p><p>Phase two of BLINK is yet another transformer (initialized with BERT), this time taking both entities (titles and descriptions) and mentions (along with context) at the same time and outputting a single vector by using the [CLS] output embedding. The output vector for each pair is reduced to a single logit with a fully-connected layer, and these logits are used with a softmax loss (with the target being the correct candidate entity). 
The candidates are generated for each mention by phase one, which means that any time phase one is retrained, phase two must also be retrained; the training distribution for phase two depends on the phase one performance.</p><p>As with DPR, selecting the top $k$ candidates in phase one is done by fast approximate nearest neighbor search (<a href="https://engineering.fb.com/data-infrastructure/faiss-a-library-for-efficient-similarity-search/">FAISS</a>). A hyperparameter sweep suggests $k=10$ is optimal, and searching through 5.9M entities takes just 2ms at inference time.</p><p>To summarize the results: apply this at scale (5.9M entities from wikipedia), and it works great. As usual, train on a large dataset, fine-tune on your smaller dataset. As often lately in NLP, simple model designs and scale dominate the benchmarks.</p><h2 id="retrieval-augmented-generation-rag-">Retrieval Augmented Generation (RAG)</h2><p>Now, finally, onto the paper that spawned this blog post.</p><p><a href="https://arxiv.org/pdf/2005.11401.pdf">Retrieval-Augmented Generation (RAG)</a> is a question answering model. It's roughly what you would get if you took DPR and then used your retrieved passages (along with your question) as input to a seq2seq model (pretrained via BART), which was trained to generate your answer.</p><p>If you have a passage retriever, you could take its outputs and then feed them as inputs to your seq2seq model, trained to generate the answer to your questions. However, this means that your two models need to be trained in sequence, and that your second model depends on your trained first models. Pipelines like this are harder operationally and generally more brittle – so instead, RAG opts to train this system end-to-end.</p><p>This is the key question for RAG, as I see it: <em>How do you jointly train a passage retriever and a seq2seq answer generator?</em></p><p>To train this model end-to-end, you cannot simply choose and use the top passage from DPR. 
RAG, instead, <em>marginalizes</em> over the top-$k$ passages, and does so in two different ways. This bit is crucial, so I'm going to just screenshot the relevant passage in <a href="https://arxiv.org/pdf/2005.11401.pdf">the paper</a>:</p><figure class="kg-card kg-image-card"><img src="https://andrew.gibiansky.com/content/images/2020/10/image.png" class="kg-image" alt="Facebook's Knowledge-Assisted NLP"></figure><p>In both of these models, we sum over the probabilities given the different top-$k$ passages, weighted by the probability assigned to each passage by the passage retriever. In sequence-level marginalization, we compute the probability of the target sequence conditional on the chosen passage (for the entire sequence), and then take the weighted average of those probabilities. In token-level marginalization, we compute the probability of the sequence as the product of the probabilities of the tokens, where each token probability is the weighted average of the token probabilities of the model conditioned upon different retrieved passages.</p><p>Decoding from these models must be done in different ways. When using token-level marginalization, decoding is easy since we can compute token probabilities (marginalized over passages); we can use a simple beam search. When using sequence-level marginalization, we cannot use a single beam search. Instead, we do $k$ separate beam searches and take the set of all their final candidates. We then evaluate all candidates' likelihoods under all possible conditioning passages and score each candidate based on the weighted sum of those likelihoods. 
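The two marginalization schemes are easy to state in code (a toy NumPy sketch over precomputed probabilities, not the actual model):

```python
import numpy as np

def token_marginal(passage_probs, token_dists):
    """Token-level marginalization: the distribution over the next token is
    the retriever-weighted average of the generator's distributions, one per
    retrieved passage.

    passage_probs: shape (k,), retriever probabilities over the top-k passages.
    token_dists:   shape (k, vocab), generator distributions per passage.
    """
    return passage_probs @ token_dists            # shape (vocab,)

def sequence_marginal(passage_probs, emitted_probs):
    """Sequence-level marginalization: probability of the whole target
    sequence under each passage (a product over timesteps), then a
    retriever-weighted sum of those k sequence probabilities.

    emitted_probs: shape (k, T), probability of the emitted target token at
    each timestep, conditioned on each passage.
    """
    return float(passage_probs @ np.prod(emitted_probs, axis=1))
```

Token-level mixing happens inside every decoding step (so a standard beam search applies), while sequence-level mixing only happens after full candidate sequences exist, which is why decoding strategies for the two differ.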
(Unfortunately, this is much slower, since each candidate must be evaluated with each possible conditioning passage; for $n$ candidates per search, you might have to do $k^2n$ evaluations, since each of $k$ passages might generate $n$ candidates which each must be evaluated under all $k$ conditionings.)</p><p>Much of the paper focuses on evaluating the created model and decoding schemes, and in general, the results are good. They show that substituting the knowledge base (by using a Wikipedia snapshot from a different year) significantly changes the answers, which is important as it demonstrates that the passages are being used effectively. There's no clear conclusion as to whether token-level or sequence-level marginalization is preferable in general; it depends somewhat on the task. Questions that require using multiple sources (such as Jeopardy) are easier for token-level marginalization, but if the answer is generally contained in one passage, the results are less clear.</p><p>It's worth noting that even though this QA system seems state-of-the-art on many metrics, it still only achieves 50% accuracy (give or take) in human evaluations of its answers to Jeopardy questions. So we're pretty far from a simple end-to-end system which can reliably synthesize cohesive responses to any Jeopardy question when using Wikipedia as a database.</p><h2 id="knowledge-intensive-language-tasks-kilt-">Knowledge Intensive Language Tasks (KILT)</h2><p>Machine learning research is driven not only by model development, but also by a variety of other factors, such as hardware developments, datasets, and metrics. In NLP, there exist several commonly-used benchmarks for assessing model quality, such as SQuAD (for question answering) and GLUE (for general language understanding). These benchmarks are crucial for measuring the quality of models and the progress of the field as a whole. 
Additionally, benchmarks are key to every researcher's dream – claiming state-of-the-art performance on their task of choice.</p><p><a href="https://arxiv.org/pdf/2009.02252.pdf">Knowledge Intensive Language Tasks (KILT)</a> (September 2020) is a new benchmark from Facebook to assess progress in NLP for areas that require access to a large database of factual information. KILT is based on a snapshot of Wikipedia with five key tasks: fact checking, entity linking, slot filling, question answering, and dialogue.</p><p>Facebook's desire for a new benchmark is easily understandable, given all the work described above. On one hand, a benchmark is necessary for them to evaluate their own models and measure modeling progress, and given that a benchmark is necessary, it might as well be public and have an associated publication. On the other hand, this benchmark is all but explicitly created for Facebook's models to do well on – so it should come as no surprise that the best-performing models in the KILT paper are Facebook's BLINK, DPR, BART, and RAG. The paper ends up being half benchmark and half showing off the quality of the aforementioned models.</p><p>Even though the benchmark seems tailored to Facebook's prior work, it nonetheless seems like a very useful addition. Judging by the best current results, there is probably still room to improve, with maximum accuracy across all the tasks peaking at about 80%. We can't know for sure, as the benchmark doesn't include a human evaluation – it's possible that human performance would be no higher than the current models (although I doubt that's the case). The release also includes <a href="https://github.com/facebookresearch/KILT">a library</a>, so that future papers can evaluate their models on KILT using the same data and evaluation criteria.</p><p>It's too early to tell if this ends up being useful for the field. 
In the month since publication, there have been no citations, and the Github project is moderately quiet with 12 commits and 3 (closed) issues. However, it's only been a month, and the benchmark is also available through HuggingFace's library, which receives quite a bit of use and may drive more adoption. We'll see in 3-6 months whether this benchmark gets any uptake.</p><h2 id="summary">Summary</h2><p>In the past year, folks at Facebook have done a ton of good work on knowledge-aided NLP. Almost all of the work is based on taking snapshots of Wikipedia, chunking it up into small BERT-sized passages, and then using BERT-based encoders and dot-product similarity to look up passages relevant to various target tasks. There's clearly room for improvement on all fronts, but explicitly incorporating knowledge databases into neural NLP seems like a great direction, and the results generally support that. </p><h3 id="references">References</h3><ul><li><a href="https://arxiv.org/pdf/2005.11401.pdf">Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks</a></li><li><a href="https://arxiv.org/pdf/2009.02252.pdf">KILT: a Benchmark for Knowledge Intensive Language Tasks</a></li><li><a href="https://arxiv.org/pdf/2004.04906.pdf">Dense Passage Retrieval for Open-Domain Question Answering</a></li><li><a href="https://arxiv.org/pdf/1911.03814.pdf">Scalable Zero-Shot Entity Linking with Dense Entity Retrieval</a></li><li><a href="https://arxiv.org/pdf/1910.13461.pdf">BART: Denoising Sequence-to-Sequence Pretraining for Natural Language Generation, Translation, and Comprehension</a></li><li><a href="https://arxiv.org/pdf/1906.08237.pdf">XLNet: Generalized Autoregressive Pretraining for Language Understanding</a></li></ul>]]></content:encoded></item><item><title><![CDATA[DiffWave and WaveGrad: Theory (Part 2)]]></title><description><![CDATA[In this post, I'll derive the equations for DiffWave and WaveGrad using diffusion probabilistic processes. 
]]></description><link>https://andrew.gibiansky.com/diffwave-and-wavegrad-theory/</link><guid isPermaLink="false">5f6e82ab6cc48f0006cc9dd9</guid><dc:creator><![CDATA[Andrew Gibiansky]]></dc:creator><pubDate>Mon, 28 Sep 2020 03:14:36 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1565556601977-f2241a07ab32?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1565556601977-f2241a07ab32?ixlib=rb-1.2.1&q=80&fm=jpg&crop=entropy&cs=tinysrgb&w=2000&fit=max&ixid=eyJhcHBfaWQiOjExNzczfQ" alt="DiffWave and WaveGrad: Theory (Part 2)"><p><em>This is Part 2 of a blog post about DiffWave and WaveGrad. If you haven't, read <a href="https://andrew.gibiansky.com/diffwave-and-wavegrad-overview/">Part 1</a>! </em></p><p>In this post, I'll derive the equations for DiffWave and WaveGrad using diffusion probabilistic processes. As far as I can tell, there's no cohesive and simple explanation for this in any of the referenced papers, so I hope this is accurate and useful.</p><p>A diffusion probabilistic process (in this context) consists of the following:</p><ul><li>$x_0$: A random variable representing your data distribution.</li><li>$x_1$, $x_2$, ..., $x_T$: A sequence of random variables which gradually add noise, starting from $x_0$.</li><li>$q(x_t | x_{t-1})$: A Gaussian distribution (also called the forward diffusion process) describing a single noising step.</li><li>$p_\theta(x_{t - 1} | x_t)$: A parameterized Gaussian (also called the reverse process) which attempts to "undo" the forward process.</li></ul><h2 id="forward-process">Forward Process</h2><p>Let's assume that there are $T$ steps of corruption with noise variances $\beta_1$ through $\beta_T$. 
Then, each step corresponds to a random variable $x_i$:</p><p>$$\begin{align*}x_1 &amp;= \sqrt{1 - \beta_1} x_0 + \sqrt{\beta_1} \epsilon_1 \\ x_2 &amp;= \sqrt{1 - \beta_2} x_1 + \sqrt{\beta_2} \epsilon_2 \\ \vdots \\ x_T &amp;= \sqrt{1 - \beta_T} x_{T-1} + \sqrt{\beta_T}\epsilon_T \end{align*}$$</p><p>Since these are all linear combinations of $x_0$ and $\epsilon$ noise, we can write any step in closed form:</p><p>$$x_t = \left(\prod_{i=1}^t\sqrt{1 - \beta_i}\right) x_0 + \sqrt{1 - \prod_{i=1}^t (1 - \beta_i)}\,\epsilon$$</p><p>The factor on $x_0$ is intuitive (every step contributes a multiplicative factor), but the factor on $\epsilon$ is not, and requires a proof by induction to verify. To match the standard notation, define $\alpha_n$ and $\bar \alpha_n$ as</p><p>$$\alpha_n = 1 - \beta_n \\ \bar \alpha_n = \prod_{i=1}^n \alpha_i,$$</p><p>at which point the above equation defines the forward process distribution</p><p>$$q(x_t|x_0) = N(\sqrt{\bar \alpha_t}x_0, (1 - \bar \alpha_t)I).$$</p><p>We'll need this closed form solution later! </p><h2 id="reverse-process">Reverse Process </h2><p>The reverse process $p_\theta(x_{t - 1} | x_t)$ is a Gaussian parameterized by a learned $\theta$:</p><p>$$p_\theta(x_{t-1} | x_t) = N(\mu_\theta(x_t, t), {\sigma_\theta(x_t, t)}^2 I).$$</p><p>We are free to choose the representations $\mu_\theta$ and $\sigma_\theta$. The training process will find parameters $\theta$ which best reverse the effects of the diffusion process, but the quality of this reversal (and thus synthesis quality) will depend on our choices for $\mu_\theta$ and $\sigma_\theta$.</p><h3 id="inference">Inference</h3><p>To run inference, we will additionally need some latent prior, $p(x_T)$, which we can easily sample from. Once $p_\theta$ is trained, we can run inference on our model by sampling from $p(x_T)$ and then iteratively sampling $x_{T-1}$, $x_{T-2}$, and so on, until we get our result $x_0$. 
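This sampling loop can be sketched in a few lines (a minimal NumPy sketch; `mu_theta` and `sigma_theta` are hypothetical callables standing in for the trained network, not code from either paper):

```python
import numpy as np

def reverse_sample(mu_theta, sigma_theta, T, dim, rng=None):
    """Ancestral sampling through the reverse process: draw x_T from the
    latent prior N(0, I), then repeatedly sample
    x_{t-1} ~ N(mu_theta(x_t, t), sigma_theta(x_t, t)^2 I)."""
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal(dim)  # x_T ~ p(x_T) = N(0, I)
    for t in range(T, 0, -1):
        mean = mu_theta(x, t)
        std = sigma_theta(x, t)
        x = mean + std * rng.standard_normal(dim)
    return x  # x_0, the synthesized sample
```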
Sampling $x_{t-1}$ given $x_t$ and $p_\theta$ is straightforward – run $\mu_\theta$ and $\sigma_\theta$ to get a mean and variance for your Gaussian for $x_{t-1}$ and then sample from it.</p><p>In order for this to work, the latent prior $p(x_T)$ must be sufficiently close to $q(x_T | x_0)$. (The KL-divergence $KL(q(x_T | x_0) || p(x_T))$ must be low.) We're going to use a zero-mean unit-variance prior $p(x_T) = N(0, I)$, which means that we need enough diffusion steps $T$ and large enough diffusion variances $\beta_t$ that the final distribution $q(x_T | x_0)$ resembles white noise. WaveGrad gets $T$ down to as few as 6 iterations by carefully tuning $\beta_t$, but most of the papers that use these methods have dozens or hundreds of iterations for this reason.</p><h3 id="training">Training</h3><p>We want to train $p_\theta(x_{t-1} | x_t)$ to most accurately sample $x_{t-1}$ given $x_t$. To do this, we would (ideally) like to minimize the KL-divergence to the forward process posterior $q(x_{t-1} | x_t)$:</p><p>$$\min_\theta J(\theta) = KL\left(q(x_{t-1} | x_t) \; ||  \; p_\theta(x_{t-1} | x_t)\right)$$</p><p>The forward process posterior is related to the forward process distribution via Bayes rule:</p><p>$$q(x_{t-1} | x_t) = \frac{q(x_t | x_{t-1}) q(x_{t-1})}{q(x_t)}$$</p><p>This distribution cannot be computed, as we do not have access to $q(x_t)$ or $q(x_{t-1}).$ However, we <em>do</em>, <strong>if we condition upon $x_0$:</strong></p><p>$$q(x_{t-1} | x_t, x_0) = \frac{q(x_t | x_{t-1}, x_0) q(x_{t-1} | x_0)}{q(x_t | x_0)}$$</p><p>Recall (from above) the closed form equation</p><p>$$q(x_t|x_0) = N(\sqrt{\bar \alpha_t}x_0, (1 - \bar \alpha_t)I).$$</p><p>We now have closed form expressions for the Gaussians $p_\theta(x_{t-1} | x_t)$ and $q(x_{t-1} | x_t, x_0),$ and the KL-divergence between two Gaussians can be computed analytically, which means that we can minimize this loss.</p><p>From <a href="https://en.wikipedia.org/wiki/Normal_distribution">Wikipedia</a>,</p><p>$$KL(N_0
|| N_1) = \frac{1}{2} \left( \frac{{\sigma_0}^2 + (\mu_1 - \mu_0)^2}{{\sigma_1}^2} - 1 + 2 \log \frac{\sigma_1}{\sigma_0} \right)$$</p><p>The two normal distributions in our case are</p><p>$$\begin{align*}N_0 &amp;= q(x_{t-1} | x_t, x_0) = \frac{q(x_t | x_{t-1}, x_0) q(x_{t-1} | x_0)}{q(x_t | x_0)}. \\ N_1 &amp;= p_\theta(x_{t-1} | x_t) = N(\mu_\theta(x_t, t), {\sigma_\theta(x_t, t)}^2 I). \end{align*}$$</p><p>The algebra for simplifying $N_0$ gets gnarly. You start by explicitly writing the PDFs for the distributions involved:</p><p>$$q(x_t | x_{t-1}, x_0)=N(\sqrt{1 - \beta_t}x_{t-1}, \beta_t) = \frac{1}{\sqrt{2\pi\beta_t}}\exp\left(-\frac{(x_t - \sqrt{1 - \beta_t}x_{t-1})^2}{2\beta_t}\right) \\ q(x_{t-1} | x_0) = N(\sqrt{\bar \alpha_{t-1}}x_0, (1 - \bar \alpha_{t-1})I) = \frac{1}{\sqrt{2\pi(1-\bar\alpha_{t-1})}}\exp\left(-\frac{(x_{t-1} - \sqrt{\bar\alpha_{t-1}} x_0)^2}{2(1-\bar\alpha_{t-1})}\right)\\ q(x_t | x_0) = N(\sqrt{\bar \alpha_t}x_0, (1 - \bar \alpha_t)I) = \frac{1}{\sqrt{2\pi(1-\bar\alpha_t)}}\exp\left(-\frac{(x_t - \sqrt{\bar\alpha_t} x_0)^2}{2(1-\bar\alpha_t)}\right)$$</p><p>Next, substitute these into the forward process posterior conditioned upon $x_0$. After a lot of painful simplification, you get a PDF that corresponds to the following normal distribution:</p><p>$$ \frac{q(x_t | x_{t-1}, x_0) q(x_{t-1} | x_0)}{q(x_t | x_0)} = N(\tilde \mu(x_t, x_0), \tilde\beta_t) \\ \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1 - \bar\alpha_t}\beta_t\; ; \; \tilde\mu(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\beta_t}{1 - \bar\alpha_t}x_0 + \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1 - \bar\alpha_t}x_t$$</p><p>Now, finally, we can substitute this and our reverse process Gaussians into the closed form for KL-divergence between two Gaussians and get our loss function. Once again, writing out the algebra is tedious, but if you do so, and you fix the variance to $\beta_t$ (instead of learning it), you arrive in the end at Eq. 
10 in <a href="https://arxiv.org/pdf/2006.11239.pdf">this paper</a>. From here on out, the reasoning is straightforward. Your network must learn $\mu_\theta$ to be</p><p>$$\mu_\theta(x_t) = \frac{1}{\sqrt{\alpha_t}} \left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\epsilon\right).$$</p><p>Since we have access to $x_t$ itself, we can have the network directly predict $\epsilon$. Theoretically, it doesn't really matter if the network is trained to predict $\mu$ or $\epsilon$ or even $x_0$, but some configurations may be worse in practice.</p><h3 id="conclusion">Conclusion </h3><p>In the first post, I described the overall structure of the DiffWave and WaveGrad models, but didn't explain how to choose the parameters $c_1$, $c_2$, and $\sigma$ in the model. In this post, we derived values for those parameters – the coefficients in the equation above for $\mu_\theta$. The derivation requires quite a bit of nasty algebra, which explains why it is not presented in full (unfortunately) in any of the linked papers. 
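If you'd rather not grind through the algebra yourself, the key identity – that the posterior mean and its $\epsilon$-parameterization agree – can be checked numerically (a sketch with a hypothetical three-step $\beta$ schedule; not code from either paper):

```python
import numpy as np

# Hypothetical beta schedule, just to exercise the identity.
betas = np.array([0.1, 0.2, 0.3])
alphas = 1 - betas
alpha_bars = np.cumprod(alphas)

t = 2  # check at the final step (0-indexed)
rng = np.random.default_rng(1)
x0, eps = rng.standard_normal(), rng.standard_normal()
x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps

# Posterior mean from the Bayes-rule simplification:
mu_tilde = (np.sqrt(alpha_bars[t - 1]) * betas[t] / (1 - alpha_bars[t]) * x0
            + np.sqrt(alphas[t]) * (1 - alpha_bars[t - 1]) / (1 - alpha_bars[t]) * x_t)

# The epsilon parameterization the network learns:
mu_eps = (x_t - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])

assert np.isclose(mu_tilde, mu_eps)
```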
According to the authors, using these precise values may be important for best performance, which, if true, would be a rare case of theory driving practice in deep learning.</p><h3 id="references">References</h3><ul><li><a href="https://arxiv.org/pdf/2009.00713.pdf">WaveGrad: Estimating Gradients for Waveform Generation</a></li><li><a href="https://arxiv.org/pdf/2009.09761.pdf">DiffWave: A Versatile Diffusion Model for Audio Synthesis</a></li><li><a href="https://arxiv.org/pdf/2006.09011.pdf">Improved Techniques for Training Score-Based Generative Models</a></li><li><a href="https://arxiv.org/pdf/2006.11239.pdf">Denoising Diffusion Probabilistic Models</a></li><li><a href="https://arxiv.org/pdf/1503.03585.pdf">Deep Unsupervised Learning using Nonequilibrium Thermodynamics</a></li></ul>]]></content:encoded></item><item><title><![CDATA[DiffWave and WaveGrad: Overview (Part 1)]]></title><description><![CDATA[DiffWave and WaveGrad propose a new neural vocoder model based on diffusion probabilistic processes, with several nice properties and a solid theoretical justification.]]></description><link>https://andrew.gibiansky.com/diffwave-and-wavegrad-overview/</link><guid isPermaLink="false">5f6cd6046cc48f0006cc9ba7</guid><dc:creator><![CDATA[Andrew Gibiansky]]></dc:creator><pubDate>Mon, 28 Sep 2020 03:14:12 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1565562591245-00f5d7c4bf57?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1565562591245-00f5d7c4bf57?ixlib=rb-1.2.1&q=80&fm=jpg&crop=entropy&cs=tinysrgb&w=2000&fit=max&ixid=eyJhcHBfaWQiOjExNzczfQ" alt="DiffWave and WaveGrad: Overview (Part 1)"><p><em>This is the first part of a two-part blog post. 
Once you've read this, move on to <a href="https://andrew.gibiansky.com/diffwave-and-wavegrad-theory/">Part 2</a>!</em></p><p>Two recent papers, <a href="https://arxiv.org/pdf/2009.09761.pdf">DiffWave</a> (NVidia) and <a href="https://arxiv.org/pdf/2009.00713.pdf">WaveGrad</a> (Google), propose a new neural vocoder model based on diffusion probabilistic processes. These vocoders have several nice properties – they achieve high quality synthesis, are non-autoregressive, are easy to train (no adversarial losses), do extremely well on unconditional audio synthesis, and are likely fast enough for eventual production deployment on GPUs.</p><p>These models are conceptually simple, but come with a fairly hard-to-parse theoretical justification. In this blog post, I'd like to go over how these models work at a high level, and then dive into the theoretical justification for them.</p><h3 id="training">Training</h3><p>The model itself is a neural denoising autoencoder.</p><p>To train it, start with a clean audio signal \(x\) and a sample of white noise \(\epsilon \sim N(0, I)\). Then, create a corrupted audio signal $\tilde x$ by scaling the noise to have variance \(\sigma^2\) and adding it to the clean audio:</p><p>$$\tilde x = \sqrt{1 - \sigma^2} x + \sigma \epsilon$$</p><p>Scaling $x$ by $\sqrt{1 - \sigma^2}$ keeps the variance of $\tilde x$ equal to that of the original audio (assuming the input is unit variance).</p><p>Then, train a neural network $f_\theta$ to predict the noise that was added with an L2 loss:</p><p>$$J(\theta) = \left(f_\theta(\tilde x, \sigma) - \epsilon\right)^2$$</p><p>The network $f_\theta$ is conditioned on the noise magnitude $\sigma$. (WaveGrad uses an L1 loss here, finding that it "offers better training stability".) This conditioning is important, as we will use different values of $\sigma$ throughout training. 
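A single training step can be sketched as follows (a toy NumPy version; `f_theta` is a hypothetical stand-in callable, not a real network):

```python
import numpy as np

def corrupt(x, sigma, rng):
    """Create the training example: scale the clean signal and add white
    noise so that unit-variance input stays unit variance."""
    eps = rng.standard_normal(x.shape)
    x_noisy = np.sqrt(1 - sigma**2) * x + sigma * eps
    return x_noisy, eps

def training_loss(f_theta, x, sigma, rng):
    """L2 loss between the network's noise prediction and the true noise.
    (WaveGrad substitutes an L1 loss here.)"""
    x_noisy, eps = corrupt(x, sigma, rng)
    return float(np.mean((f_theta(x_noisy, sigma) - eps) ** 2))
```

During training, sigma would be drawn anew for each example, which is exactly why `f_theta` must be conditioned on it.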
In a conditional synthesis setting, $f_\theta$ is also conditioned upon any input features such as linguistic information or mel spectrograms (but I won't write that explicitly here).</p><h3 id="inference">Inference</h3><p>To use this model for inference, we will perform $T$ steps of denoising with sampling. First, we sample a starting point $y_T$ of white noise. Then, each step of denoising is computed with:</p><p>$$y_{t-1} = c_1 y_t - c_2 f_\theta(y_t) + \sigma_t z,$$</p><p>where $z \sim N(0, I)$. In essence, each step reduces the magnitude of the signal (scaling by $c_1$), removes some noise (subtracting $f_\theta(y_t)$), and samples some noise to add to the signal of variance $\sigma^2$ (with gradually decreasing $\sigma^2$ and $\sigma_1 = 0$). </p><p>The result, $y_0$, is the final audio.</p><p>Choosing values for $c_1$, $c_2$, and $\sigma_t$ depends on the theoretical justification for this model. In the WaveGrad paper, Section 2 describes an interpretation based on sampling via Langevin dynamics, in which case $c_1 = 1$, $c_2 = \frac{{\sigma_t}^2}{2}$, and $\sigma_t$ is any sequence of magnitudes that is sufficiently long and gradually decreasing.</p><p>However, both WaveGrad and DiffWave choose these values based instead on the interpretation of diffusion probabilistic processes, and according to one of the authors of DiffWave, the <a href="https://twitter.com/ZhifengKong/status/1309355587776307203">exact values are important for high-quality synthesis</a>.</p><p>In <a href="https://andrew.gibiansky.com/diffwave-and-wavegrad-theory/">Part 2</a>, I'll dig into the somewhat gnarly math required to justify all of this, ignoring the Langevin dynamics interpretation entirely.</p><h3 id="references">References</h3><ul><li><a href="https://arxiv.org/pdf/2009.00713.pdf">WaveGrad: Estimating Gradients for Waveform Generation</a></li><li><a href="https://arxiv.org/pdf/2009.09761.pdf">DiffWave: A Versatile Diffusion Model for Audio Synthesis</a></li><li><a 
href="https://arxiv.org/pdf/2006.09011.pdf">Improved Techniques for Training Score-Based Generative Models</a></li><li><a href="https://arxiv.org/pdf/2006.11239.pdf">Denoising Diffusion Probabilistic Models</a></li><li><a href="https://arxiv.org/pdf/1503.03585.pdf">Deep Unsupervised Learning using Nonequilibrium Thermodynamics</a></li></ul>]]></content:encoded></item><item><title><![CDATA[WaveNet and Tacotron aren't TTS systems]]></title><description><![CDATA[Deep learning models for speech synthesis, such as Google's WaveNet and Tacotron, are not complete text-to-speech systems.]]></description><link>https://andrew.gibiansky.com/wavenet-and-tacotron-arent-tts-systems/</link><guid isPermaLink="false">5e8a82bc6cc48f0006cc9a7c</guid><dc:creator><![CDATA[Andrew Gibiansky]]></dc:creator><pubDate>Mon, 06 Apr 2020 02:09:45 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1552785903-9301946db8c1?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1552785903-9301946db8c1?ixlib=rb-1.2.1&q=80&fm=jpg&crop=entropy&cs=tinysrgb&w=2000&fit=max&ixid=eyJhcHBfaWQiOjExNzczfQ" alt="WaveNet and Tacotron aren't TTS systems"><p><em><strong>Summary: </strong>Deep learning models for speech synthesis, such as Google's WaveNet and Tacotron, are not complete text-to-speech systems. They are each just one part of a large pipeline of models and heuristics that together form a text-to-speech engine. WaveNet is not a text-to-speech engine. Tacotron isn't either. </em></p><p>In the past few years, researchers have designed many neural network architectures for synthesizing audio. The most commonly referenced ones are Google's WaveNet and Tacotron, but there are many, many others, such as Google's WaveRNN, Baidu's Deep Voice papers, NVidia's WaveGlow, SampleRNN, Microsoft's FastSpeech, and many others. 
</p><p>These papers are announced to great fanfare on company websites and tech news sites, with great audio samples and interesting demos. This gives the impression that the paper describes a <em>complete</em> text-to-speech system. Google even brands its neural vocoder voices in Google Cloud as WaveNet voices, which can obscure the fact that a significant part of the TTS engine pipeline is shared between the two engines. As a result, a lot of people online end up (understandably!) confused and refer to a "WaveNet TTS system" or "Tacotron TTS system", or assume that a Github repo with a re-implementation of one of these can be used to build a complete speech synthesis engine.</p><p>This blog post is an attempt to rectify this (slight) misconception.</p><h2 id="what-is-a-text-to-speech-engine">What is a text-to-speech engine?</h2><p>A text-to-speech engine is a piece of software which converts text into speech (audio). This process is typically separated into a pipeline, where each step in the pipeline is its own model or set of models. An example pipeline might include:</p><ul><li><strong>Normalization: </strong>Converting non-spoken tokens (numbers, dates, etc) to spoken words, such as "1901" to "nineteen oh one" or "5/12" to "may twelfth". </li><li><strong>Part-of-Speech Tagging: </strong>Labeling words by their part of speech.</li><li><strong>Phoneme Conversion: </strong>Converting words to a phonetic representation, such as IPA.</li><li><strong>High-Level Audio Synthesis: </strong>Converting the phonemes into a high-level representation of audio, such as mel spectrograms, F0, spectral envelope, LSP or LPC coefficients, etc.</li><li><strong>Waveform Synthesis: </strong>Converting the high-level representation into a final audio waveform.</li></ul><p>This list does not include miscellaneous things such as networking, request parsing, audio encoding, etc. 
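To make the pipeline shape concrete, here is a toy sketch of how the stages compose. Every function below is a hypothetical stand-in (a dictionary lookup, fake phonemes, zero-filled frames) rather than a real model; in an actual engine each stage would be its own learned model or rule system.

```python
def normalize(text):
    # Stand-in for a text normalizer; real systems handle numbers,
    # dates, abbreviations, etc. This expands exactly one token.
    replacements = {"1901": "nineteen oh one"}
    return " ".join(replacements.get(tok, tok) for tok in text.split())

def to_phonemes(words):
    # Stand-in for a grapheme-to-phoneme model (one fake phoneme per word).
    return ["/" + w + "/" for w in words.split()]

def synthesize_spectrogram(phonemes):
    # Stand-in for a Tacotron-like model: one 80-bin mel frame per phoneme.
    return [[0.0] * 80 for _ in phonemes]

def vocode(spectrogram):
    # Stand-in for a WaveNet-like neural vocoder: 256 samples per frame.
    return [0.0] * (len(spectrogram) * 256)

def tts(text):
    # The full engine is the composition of all stages.
    return vocode(synthesize_spectrogram(to_phonemes(normalize(text))))

audio = tts("born in 1901")
```

The point of the sketch is only the composition: swapping in WaveNet or Tacotron replaces one function here, not the whole pipeline.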
</p><p>A complete TTS engine has to do (more or less) <em>all</em> of these things and connect them all together.</p><h2 id="what-are-wavenet-and-tacotron">What are WaveNet and Tacotron?</h2><p>WaveNet and Tacotron are neural network models that address <em>one step</em> of the above pipeline. Specifically, WaveNet is a <em>neural vocoder</em>, and is responsible for the "waveform synthesis" step of the pipeline. Tacotron is a sequence-to-sequence model for spectrogram synthesis, and addresses the "high level audio synthesis" step.</p><p>Now that we have these distinctions, we can ask more specifically: What models have been developed for each stage of this pipeline? </p><ul><li><strong>Normalization: </strong>Normalization is tricky to do with machine learning approaches, but people are working on it. For example, <a href="https://arxiv.org/pdf/1712.06994.pdf">this</a> and <a href="https://arxiv.org/pdf/1611.00068.pdf">this</a>.</li><li><strong>Phoneme Conversion: </strong>Relative to the other problems on this list, this one isn't as hard, and so there's not as much research dedicated to it. You can find a few solid papers, such as <a href="https://arxiv.org/abs/1708.01464">this one</a> and <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43264.pdf">this one</a>.</li><li><strong>Audio Synthesis</strong>: In addition to <a href="https://arxiv.org/abs/1703.10135">Tacotron</a> (and <a href="https://ai.googleblog.com/2017/12/tacotron-2-generating-human-like-speech.html">Tacotron 2</a>), there's <a href="https://arxiv.org/abs/1710.07654">Deep Voice</a>, <a href="https://arxiv.org/abs/1905.09263">FastSpeech</a>, <a href="http://research.baidu.com/Blog/index-view?id=116">ParaNet</a>, and <a href="https://arxiv.org/abs/1809.08895">more</a>. 
</li><li><strong>Waveform Synthesis</strong>: In addition to <a href="https://arxiv.org/abs/1609.03499">WaveNet</a>, there's <a href="https://arxiv.org/abs/1811.00002">WaveGlow</a>, <a href="https://arxiv.org/abs/1802.08435">WaveRNN</a>, <a href="https://arxiv.org/abs/1807.07281">ClariNet</a>, <a href="https://arxiv.org/abs/1912.01219">WaveFlow</a>, <a href="https://arxiv.org/abs/1612.07837">SampleRNN</a>, and more.</li></ul><h3 id="disclaimer">Disclaimer</h3><p>I personally know the authors of many of the papers I listed above and worked on several of them myself. I'm not listing them in any order or trying to promote my own papers. These are just the systems that came to mind. Don't judge me for my choices here.</p><p>Additionally, I'm one of the founders of <a href="https://www.voicery.com/">Voicery</a>, where we build custom text-to-speech engines based on models similar to the ones above. If you have any questions or are looking to deploy your own models like these, check out our <a href="https://www.voicery.com/demos">demos</a> and <a href="https://www.voicery.com/#contact-us">get in touch</a>.</p>]]></content:encoded></item></channel></rss>