Efficient WaveRNN: Optimizing Arithmetic

In this series of posts, we're going to go through the WaveRNN neural vocoder for audio waveform synthesis, along with a variety of implementation details and commonly used extensions. For a real implementation, check out the gibiansky/wavernn repository.

Advanced Kernel Optimizations

Using WaveRNN in production relies on a host of optimizations which can accelerate the model by several orders of magnitude. As we discussed in previous posts, we can start by rewriting the inference kernel in C++, batching the GRU input matrix multiplies, and block-sparsifying and repacking the weight matrices for use with a custom matrix-vector multiply (GEMV) kernel. Although these optimizations drastically increase the performance of our model, there's still a lot we can do to squeeze speed out of our processors.

SIMD Intrinsics

Generally speaking, a computer processor reads and executes a stream of instructions, where each instruction operates on one or two values in memory or in processor registers. However, to accelerate applying the same operation across thousands or millions of values, most modern processors support some form of Single-Instruction Multiple-Data (SIMD) instructions. These instructions operate on vectors of a few contiguous values (and hence are often called vector instructions). Different processors use different SIMD instructions: for our purposes, we care about x86 AVX2 instructions (pre-2017 Intel), AVX-512 instructions (post-2017 Intel), and NEON instructions (ARM). NEON instructions operate on 128 bits of data, AVX2 on 256 bits of data, and AVX-512 on (you guessed it!) 512 bits of data.

In an ideal world, we would never have to think about what instructions our C++ compiler is generating to perform our arithmetic, and indeed, GCC and Clang try hard to auto-vectorize code and use SIMD instructions as much as they can. But look around you – the world is not ideal, not by a long stretch. For performance-sensitive parts of code, you can get significant speedups by directly writing SIMD instructions instead of relying on a compiler to guess what you mean. In fact, if you check out the kernel code in gibiansky/wavernn, you'll find direct SIMD implementations of almost every performance-sensitive part.

SIMD instructions are used in C / C++ through SIMD intrinsics, special functions which the compiler recognizes and converts to SIMD instructions. To give you a taste, let's go through how we would hand-vectorize a simple function which adds two float32 vectors:

// Computes out[i] = a[i] + b[i]
void elementwise_add(int size, float* out, float* a, float* b) {
    for(int i = 0; i < size; i++) {
        out[i] = a[i] + b[i];
    }
}

For a function this simple, you should not expect a huge performance increase for rewriting it with SIMD intrinsics; the compiler should do a good job auto-vectorizing this with -Ofast (though you may need to tell it these are not aliasing pointers with __restrict__). So treat this as an opportunity to look at some SIMD code, not as a real-world example.
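
For reference, this is roughly what giving the compiler that hint looks like; the __restrict__ keyword below is a GCC/Clang extension rather than standard C++, and the function is otherwise unchanged:

// Computes out[i] = a[i] + b[i]
// The __restrict__ qualifiers promise the compiler that out, a, and b
// never overlap, which makes the loop much easier to auto-vectorize.
void elementwise_add(int size, float* __restrict__ out,
                     float* __restrict__ a, float* __restrict__ b) {
    for(int i = 0; i < size; i++) {
        out[i] = a[i] + b[i];
    }
}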

When using AVX, we use __m256 and __m512 data types to represent vectors of 256 or 512 bits storing float data. Instructions for working with these are prefixed _mm256_ or _mm512_, respectively, and suffixed for the type of data they are working with (_ps for "packed single", _pd for "packed double", _ss for "scalar single", etc). For example, the AVX2 unaligned load intrinsic is _mm256_loadu_ps. (An unaligned load is a load from memory that may not be on a 32-byte boundary. Older processors execute aligned loads much faster than unaligned loads, though this penalty is lower on recent CPUs.)

Putting this all together, here is an equivalent function using AVX2 intrinsics:

#include <immintrin.h>

// Computes out[i] = a[i] + b[i]
void elementwise_add(int size, float* out, float* a, float* b) {
    int i = 0;
    for(; i + 7 < size; i += 8) {
        // Load 8 floats from a.
        __m256 x = _mm256_loadu_ps(a + i);
        
        // Load 8 floats from b.
        __m256 y = _mm256_loadu_ps(b + i);
        
        // Sum up the floats.
        __m256 sum = _mm256_add_ps(x, y);
        
        // Write out the 8 floats to out.
        _mm256_storeu_ps(out + i, sum);
    }    
    
    // In case size is not divisible by 8.
    for(; i < size; i++) {
        out[i] = a[i] + b[i];
    }
}

AVX-512 will look very similar, using __m512 and _mm512 instead of __m256 and _mm256, respectively. (AVX-512 adds lots of other functionality besides longer vector registers, but it's not very relevant for this simple function.)

#include <immintrin.h>

// Computes out[i] = a[i] + b[i]
void elementwise_add(int size, float* out, float* a, float* b) {
    int i = 0;
    for(; i + 15 < size; i += 16) {
        // Load 16 floats from a.
        __m512 x = _mm512_loadu_ps(a + i);
        
        // Load 16 floats from b.
        __m512 y = _mm512_loadu_ps(b + i);
        
        // Sum up the floats.
        __m512 sum = _mm512_add_ps(x, y);
        
        // Write out the 16 floats to out.
        _mm512_storeu_ps(out + i, sum);
    }    
    
    // In case size is not divisible by 16.
    for(; i < size; i++) {
        out[i] = a[i] + b[i];
    }
}

NEON intrinsics for ARM use 128-bit vector registers. The types are of the form {data}x{count}_t; for example, float32x4_t is a 128-bit register with 4 float32 values in it. Intrinsics start with "v" (for "vector") and end with a suffix indicating the data type, such as "_f32" for 32-bit floats. Intrinsics which operate on full 128-bit registers include a "q" in their name, just before the type suffix. For example, loading from memory is done with the vld1q_f32 intrinsic, storing to memory uses the vst1q_f32 intrinsic, and vaddq_f32 adds float32x4_t values. Putting it together, you get the following elementwise sum function:

#include <arm_neon.h>

// Computes out[i] = a[i] + b[i]
void elementwise_add(int size, float* out, float* a, float* b) {
    int i = 0;
    for(; i + 3 < size; i += 4) {
        // Load 4 floats from a.
        float32x4_t x = vld1q_f32(a + i);
        
        // Load 4 floats from b.
        float32x4_t y = vld1q_f32(b + i);
        
        // Sum up the floats.
        float32x4_t sum = vaddq_f32(x, y);
        
        // Write out the 4 floats to out.
        vst1q_f32(out + i, sum);
    }    
    
    // In case size is not divisible by 4.
    for(; i < size; i++) {
        out[i] = a[i] + b[i];
    }
}

Quantized Inference

SIMD registers fit a fixed number of bits (128, 256, or 512), but depending on the data type, they can hold different numbers of values. For example, a 256-bit register can hold 8 32-bit floats, 16 16-bit floats or ints, or 32 8-bit ints. Instructions for multiplication and addition generally have a throughput of one or more per cycle no matter what data type they are operating on, which means that reducing the bit precision of our operands is a great way to accelerate our inference kernels.

Unfortunately, unlike GPUs, most CPUs to date have poor support for float16 arithmetic. This means that to squeeze more speed out of our kernels, we're going to have to shift to quantized arithmetic and do our matrix-vector multiplies in int8 or int16.

Quantizing WaveRNN to 8 bits results in significant quality degradation unless it is trained in a quantization-aware way, but if we stick to 16-bit inference, we can accelerate inference while keeping audio quality high.

In order to do an int16 matrix-vector multiply, we can:

  1. Compute the maximum magnitude of each row, $\beta_r$.
  2. Rescale each row to the range [-8192, 8192] by multiplying by $\frac{8192}{\beta_r}$.
  3. Round each row to the nearest integer in int16.
  4. Compute the maximum magnitude $\alpha$ of the input vector.
  5. Rescale the input vector to the range [-8192, 8192] by multiplying by $\frac{8192}{\alpha}.$
  6. Round the input vector elements to the nearest integer in int16.
  7. For every row, compute the dot product with the input, doing multiplication in int16 and accumulation in int32.
  8. Scale the result of the dot product to undo the scaling done on the inputs, multiplying it by $\frac{\alpha \beta_r}{8192^2}$.

Storing a per-row maximum weight magnitude is convenient if the matrix-vector multiply is done row-wise; an alternative with slightly reduced precision is to store a single scaling factor for the entire matrix.

Since the weights are fixed, we can perform steps (1), (2), and (3) in advance. Storing the weights as int16 rather than float32 also halves the amount of data we load from RAM. In theory, we could get up to a 2x speedup, but in practice, a speedup of around 1.5x from quantization is more realistic.
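
To make the steps above concrete, here is a minimal scalar sketch of the per-row quantization and the quantized dot product. It is a reference for the arithmetic only, not the vectorized kernel from the repository, and the function names (quantize, quantized_dot) are illustrative:

#include <algorithm>
#include <cmath>
#include <cstdint>

// Quantize `values` to int16 in [-8192, 8192], returning the maximum
// magnitude (the scale) so the caller can undo the scaling later.
float quantize(const float* values, int size, int16_t* quantized) {
    float scale = 0.0f;
    for(int i = 0; i < size; i++) {
        scale = std::max(scale, std::fabs(values[i]));
    }
    if (scale == 0.0f) {
        scale = 1.0f;  // Avoid dividing by zero for an all-zero vector.
    }
    for(int i = 0; i < size; i++) {
        quantized[i] = (int16_t) std::lrintf(values[i] * 8192.0f / scale);
    }
    return scale;
}

// Dot product of one quantized weight row with the quantized input vector:
// multiply in int16, accumulate in int32, then undo both scalings. Scaling
// to 8192 (2^13) rather than 32767 leaves headroom in the int32 accumulator.
float quantized_dot(const int16_t* row, float row_scale,
                    const int16_t* input, float input_scale, int size) {
    int32_t accumulator = 0;
    for(int i = 0; i < size; i++) {
        accumulator += (int32_t) row[i] * (int32_t) input[i];
    }
    return accumulator * ((row_scale * input_scale) / (8192.0f * 8192.0f));
}

In a real kernel, the loop inside quantized_dot is where integer SIMD multiply-accumulate instructions pay off, since a 256-bit register holds 16 int16 values at a time, twice as many as it holds float32 values.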

Summary

WaveRNN inference can be fast, but making it fast requires a variety of low-level optimizations to the inference kernels. One crucial optimization is using SIMD instructions such as SSE, AVX, AVX-512, NEON, and SVE for arithmetic when the processor the kernel is running on supports them. Although compilers have auto-vectorizers to take advantage of these instructions, manually writing your arithmetic routines using compiler intrinsics or assembly can still provide a speed boost. A second optimization is int16 quantization – since twice as many int16 values fit in vector registers as float32 values, rewriting matrix-vector multiplies to operate primarily on int16 data can yield a further speedup. Together, these optimizations can speed up a WaveRNN kernel significantly, allowing you to synthesize audio faster than realtime.

Check out the implementation at gibiansky/wavernn or proceed to the subsequent posts in the series.