The naive approach to streaming audio synthesis using deep neural networks is to break up the input into chunks and then run synthesis on each chunk. Unfortunately, this introduces wasted computation and discontinuities. In this blog post, I present a simple and robust alternative.
Nowadays, there is a boatload of different deep neural networks for audio synthesis. There's Tacotron, Tacotron2, WaveRNN, WaveGlow, WaveFlow, LPCNet, MelGAN, MB-MelGAN, and fifty other networks handling every part of the text-to-speech audio synthesis pipeline.
When synthesizing audio in a text-to-speech scenario, you might want to synthesize minutes of audio at a time, for example, if you are reading a Wikipedia page or a news article. At the same time, you want the user to experience low latency (under 200 ms), so that they can start listening to the audio immediately. None of the aforementioned networks can synthesize several minutes of audio within 200 milliseconds, which means that to satisfy both of these constraints, you have to start streaming your output before you're done synthesizing.
In this blog post, I'd like to share an easy way to implement audio streaming for any network.
Naive Approach: Run Inference on Chunks
The naive approach to audio streaming is to break up the input into chunks, and then run synthesis on each chunk separately, gluing the results together to form the final audio stream. For RNNs (Tacotron, WaveRNN, etc.), pass the hidden state between the chunks.
For RNNs, this approach works fine – it's identical to just running inference on the entire utterance at once. With CNNs, however, you run into trouble.
For a non-causal convolution with width k, you need k-1 timesteps of padding ((k-1)/2 on each side) in order to output something of the same size as the input. This means that if you've broken your input into chunks, you now need to add padding on both sides of every chunk.
If you add zero-padding, then the outputs near the edges of your chunk may create a discontinuity, resulting in a periodic artifact or slight quality degradation. If you instead pad with real context from the original input, you have to pad by the total receptive field of the network, which ends up duplicating computation and slowing down your overall synthesis speed. (For example, for a 5-layer CNN with width-7 kernels, your receptive field is 1+5*(7-1)=31, so you need 30 timesteps of padding, 15 on each side of every chunk.)
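The receptive field arithmetic above generalizes to any stack of stride-1 convolutions. Here is a minimal sketch; `receptive_field` is a hypothetical helper (not from any library), and the dilation handling is an assumption that goes beyond the example in the text.

```python
def receptive_field(kernel_widths, dilations=None):
    """Receptive field (in timesteps) of a stack of stride-1 convolutions."""
    if dilations is None:
        dilations = [1] * len(kernel_widths)
    # Each layer widens the receptive field by (k - 1) * dilation timesteps.
    return 1 + sum((k - 1) * d for k, d in zip(kernel_widths, dilations))


rf = receptive_field([7] * 5)   # five layers of width-7 kernels
padding = rf - 1                # context needed around each chunk, split over both sides
print(rf, padding)              # 31 30
```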
One approach is to synthesize overlapping chunks and then average them to reduce the effect of the discontinuities; however, this just patches over the problem, rather than solving it.
When trying to stream through a model with transposed convolutions (such as MelGAN), you run into the same issue: the outputs near the edges of each chunk also depend on the neighboring chunks. If you mix convolutions and transposed convolutions, even computing the receptive field becomes a bit confusing!
In summary: although breaking your input into chunks and running inference on each chunk separately does work, it can introduce wasted computation and discontinuities near the chunk boundaries.
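To see the boundary problem concretely, here is a small sketch that compares one convolution over the whole input against the naive chunked version with zero-padding. The single-channel signal, the all-ones kernel, and the chunk size of 8 are illustrative assumptions, not values from any real model.

```python
import numpy as np

x = np.arange(1.0, 17.0)        # 16 timesteps, single channel
w = np.ones(5)                  # one width-5 kernel

# Reference: a single "same" convolution over the whole input.
full = np.convolve(x, w, mode="same")

# Naive streaming: convolve each zero-padded chunk of 8 timesteps on its own.
chunks = [x[:8], x[8:]]
naive = np.concatenate([np.convolve(c, w, mode="same") for c in chunks])

# Only the outputs within (k - 1) / 2 = 2 timesteps of the chunk boundary differ.
mismatched = np.flatnonzero(~np.isclose(full, naive))
print(mismatched)               # [6 7 8 9]
```

With a deeper network the mismatched region grows with the receptive field, which is why the errors show up as a periodic artifact at every chunk boundary.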
Alternative Approach: "Perfect Streaming"
The approach of chunking up your input, synthesizing the chunks, and then glueing them together leads to wasted computation and discontinuities at chunk boundaries. What can we do instead?
With a little bit of work, we can implement "perfect streaming". Perfect streaming results in an output that is identical (barring floating point error) to the same utterance synthesized without streaming: no wasted computation, no discontinuities at chunk boundaries.
To do this, we take the following approach:
- For every layer in your network, define an initial state, an update function, and a finish function. The update function incorporates new input into the state and returns any output that is ready; the finish function completes the synthesis for that layer, returning any leftover outputs.
- Compose those layers together: to initialize the full network, collect the initial states of each of the layers; to run an input through the network, run the update functions of the layers sequentially, feeding each layer's output to the next; to finish synthesis, go through the layers in order, calling finish on each one and passing its leftover outputs through update on the layers after it.
The result is a network that you can initialize, feed chunks of input with update for streaming synthesis, and finalize with finish to get the last bit of output.
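The composition step can be sketched as follows. `Lookahead` is a made-up toy layer (it needs one future sample, so it has something to flush in finish), and `StreamingSequential` is a hypothetical container, not an existing PyTorch or Keras class. Treat this as one possible reading of the recipe rather than a definitive implementation.

```python
import numpy as np


class Lookahead:
    """Toy layer y[t] = x[t] + x[t+1], treating the input past the end as zero."""

    def __init__(self):
        self.state = np.zeros(0)            # initial state: no pending input

    def update(self, x):
        buf = np.concatenate([self.state, x])
        self.state = buf[-1:]               # the last sample still needs its successor
        return buf[:-1] + buf[1:]

    def finish(self):
        out, self.state = self.state, np.zeros(0)
        return out                          # pending sample plus implicit zero padding


class StreamingSequential:
    def __init__(self, layers):
        self.layers = layers

    def update(self, x):
        for layer in self.layers:
            x = layer.update(x)
        return x

    def finish(self):
        tail = np.zeros(0)
        for layer in self.layers:
            # Leftovers from earlier layers are ordinary input to this layer;
            # only after consuming them does the layer flush its own state.
            tail = np.concatenate([layer.update(tail), layer.finish()])
        return tail


def full(v):
    return v + np.append(v[1:], 0.0)        # non-streaming version of Lookahead


x = np.arange(1.0, 11.0)
expected = full(full(x))

net = StreamingSequential([Lookahead(), Lookahead()])
pieces = [net.update(c) for c in (x[:3], x[3:4], x[4:])]
pieces.append(net.finish())
streamed = np.concatenate(pieces)
assert np.allclose(streamed, expected)
```

Note that finish has to interleave update and finish calls: flushing every layer independently would drop the leftovers that earlier layers hand to later ones.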
Next, we'll go through how to implement these for each of the common layers. Reading through these examples will hopefully make it clear how this ends up working.
LSTMs and GRUs
Streaming through RNNs such as LSTMs and GRUs is easy! The initial state is simply the initial state (of zeros) for the LSTM or the GRU.
The update function will take the input and run it through the network, outputting exactly as many timesteps as it had in the input while also updating the state with the final state of the RNN after it has run on all the inputs.
The finish function in this case does nothing – there are no leftover outputs to be emitted.
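As a minimal sketch of the RNN case, the example below uses a plain tanh RNN cell in NumPy as a stand-in for an LSTM or GRU (an assumption made to keep the code short; the streaming logic is the same). `StreamingRNN` and the weight shapes are hypothetical.

```python
import numpy as np


class StreamingRNN:
    """Plain tanh RNN cell standing in for an LSTM or GRU."""

    def __init__(self, w_in, w_rec, batch_size):
        self.w_in, self.w_rec = w_in, w_rec
        self.state = np.zeros((batch_size, w_rec.shape[0]))   # initial state: zeros

    def update(self, x):
        # x: [batch, time, in_channels] -> exactly one output per input timestep
        out = np.zeros((x.shape[0], x.shape[1], self.w_rec.shape[0]))
        h = self.state
        for t in range(x.shape[1]):
            h = np.tanh(x[:, t] @ self.w_in + h @ self.w_rec)
            out[:, t] = h
        self.state = h          # carry the final hidden state into the next chunk
        return out

    def finish(self):
        # No leftover outputs: return an empty [batch, 0, hidden] tensor.
        return np.zeros((self.state.shape[0], 0, self.state.shape[1]))


rng = np.random.default_rng(0)
w_in = rng.normal(size=(8, 4)) * 0.5
w_rec = rng.normal(size=(4, 4)) * 0.5
x = rng.normal(size=(2, 12, 8))

# Reference: the whole utterance in one call.
full = StreamingRNN(w_in, w_rec, batch_size=2).update(x)

rnn = StreamingRNN(w_in, w_rec, batch_size=2)
pieces = [rnn.update(c) for c in (x[:, :5], x[:, 5:6], x[:, 6:])] + [rnn.finish()]
streamed = np.concatenate(pieces, axis=1)
assert np.allclose(streamed, full)
```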
Streaming through Conv1D is slightly more complex than streaming through an RNN, because you need to manage extra state made up of past inputs to the layer.
The initial state for a Conv1D with (odd) width k is a tensor of zeros with (k-1)/2 timesteps. (That is, a tensor of shape batch_size x (k-1)/2 x num_input_channels.)
The update function will take the input, concatenate it with the state (state first, then the new input), and then run the Conv1D (in "valid" mode, with no extra padding), returning any resulting outputs. The new state is the last k-1 timesteps of the concatenated input tensor (or all of it, if it is shorter than that).
Finally, the finish function will take the state and pad it on the right with (k-1)/2 timesteps of zeros, run the Conv1D (with no extra padding), and return the resulting outputs.
To make this concrete, let's work through an example. Let's say we have a Conv1D layer with width 7 and 256 input channels, operating with batch size 16. We initialize the state to a tensor of zeros of shape [16, 3, 256]. We have three input chunks of 4 timesteps each (total input size 12). We start by running update with the first chunk, creating a tensor of 3 + 4 = 7 timesteps; after running the Conv1D, the layer produces 1 output. The last 6 timesteps of the concatenated input are kept as the state. When we run update with the second chunk, we create a tensor of 6 + 4 = 10 timesteps, which produces 4 outputs after we run the Conv1D. When we run update with the third chunk, we again produce 4 outputs. Finally, we run finish, which takes the state of 6 timesteps and pads it with 3 timesteps of zeros on the right, resulting in an input of 9 timesteps; after we run the Conv1D, we produce 3 outputs. In total, we produce 1 + 4 + 4 + 3 = 12 outputs, exactly as many outputs as we had inputs.
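The worked example can be reproduced with a short NumPy sketch. `conv1d_valid` and `StreamingConv1D` are hypothetical names, the 8 output channels are an arbitrary assumption, and the convolution is written out by hand so the example has no framework dependency.

```python
import numpy as np


def conv1d_valid(x, w):
    """'Valid' cross-correlation. x: [batch, time, in_ch], w: [k, in_ch, out_ch]."""
    k = w.shape[0]
    t_out = max(0, x.shape[1] - k + 1)
    if t_out == 0:
        return np.zeros((x.shape[0], 0, w.shape[2]))
    # Stack the k shifted views of the input: [batch, t_out, k, in_ch].
    windows = np.stack([x[:, i:i + t_out] for i in range(k)], axis=2)
    return np.einsum("btki,kio->bto", windows, w)


class StreamingConv1D:
    def __init__(self, w, batch_size):
        k, in_ch, _ = w.shape
        assert k % 2 == 1, "sketch assumes an odd kernel width"
        self.w, self.k = w, k
        self.state = np.zeros((batch_size, (k - 1) // 2, in_ch))   # left padding

    def update(self, x):
        buf = np.concatenate([self.state, x], axis=1)
        # Keep the last k - 1 timesteps (or everything, if buf is shorter).
        self.state = buf[:, max(0, buf.shape[1] - (self.k - 1)):]
        return conv1d_valid(buf, self.w)

    def finish(self):
        # Pad the remaining state with (k - 1) / 2 zeros on the right.
        pad = np.zeros((self.state.shape[0], (self.k - 1) // 2, self.state.shape[2]))
        return conv1d_valid(np.concatenate([self.state, pad], axis=1), self.w)


rng = np.random.default_rng(0)
w = rng.normal(size=(7, 256, 8))
x = rng.normal(size=(16, 12, 256))

# Reference: zero-pad (k - 1) / 2 = 3 timesteps on each side, then convolve once.
full = conv1d_valid(np.pad(x, ((0, 0), (3, 3), (0, 0))), w)

conv = StreamingConv1D(w, batch_size=16)
pieces = [conv.update(x[:, i:i + 4]) for i in (0, 4, 8)] + [conv.finish()]
print([p.shape[1] for p in pieces])     # [1, 4, 4, 3]
assert np.allclose(np.concatenate(pieces, axis=1), full)
```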
A causal Conv1D is very similar to a non-causal Conv1D. However, instead of the initial state having (k-1)/2 timesteps and finish padding the sequence with (k-1)/2 timesteps on the right, we start with an initial state of k-1 timesteps of zeros, and finish has nothing left to emit. Besides that, everything stays the same.
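A minimal sketch of the causal variant, under the same assumptions as any hand-written NumPy convolution: `conv1d_valid` and `StreamingCausalConv1D` are hypothetical names, and the kernel width, channel counts, and chunk sizes are arbitrary.

```python
import numpy as np


def conv1d_valid(x, w):
    """'Valid' cross-correlation. x: [batch, time, in_ch], w: [k, in_ch, out_ch]."""
    k = w.shape[0]
    t_out = max(0, x.shape[1] - k + 1)
    if t_out == 0:
        return np.zeros((x.shape[0], 0, w.shape[2]))
    windows = np.stack([x[:, i:i + t_out] for i in range(k)], axis=2)
    return np.einsum("btki,kio->bto", windows, w)


class StreamingCausalConv1D:
    def __init__(self, w, batch_size):
        k, in_ch, _ = w.shape
        self.w = w
        self.state = np.zeros((batch_size, k - 1, in_ch))   # k - 1 zeros of left context

    def update(self, x):
        buf = np.concatenate([self.state, x], axis=1)
        self.state = buf[:, x.shape[1]:]     # always the last k - 1 timesteps
        return conv1d_valid(buf, self.w)     # exactly one output per new input

    def finish(self):
        # Nothing is pending: return an empty [batch, 0, out_ch] tensor.
        return np.zeros((self.state.shape[0], 0, self.w.shape[2]))


rng = np.random.default_rng(0)
w = rng.normal(size=(5, 3, 2))
x = rng.normal(size=(2, 11, 3))

# Reference: pad k - 1 = 4 zeros on the left only, then convolve once.
full = conv1d_valid(np.pad(x, ((0, 0), (4, 0), (0, 0))), w)

conv = StreamingCausalConv1D(w, batch_size=2)
pieces = [conv.update(c) for c in (x[:, :1], x[:, 1:7], x[:, 7:])] + [conv.finish()]
print([p.shape[1] for p in pieces])     # [1, 6, 4, 0]
assert np.allclose(np.concatenate(pieces, axis=1), full)
```

Because every update returns as many timesteps as it receives, a causal layer adds no latency to the stream.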
A transposed convolution with a stride greater than one upsamples the input; for this example, though, we'll assume a stride of one for simplicity. The implementation for a transposed conv looks very similar to a standard convolution, but instead of the state holding past inputs, the state holds partial sums for future outputs.
The key observation to make here is that when we break up a transposed convolution into two chunks, the outputs near the edge have contributions from both chunks. Each input timestep contributes to k different outputs, which means the last input timestep in a chunk affects the first (k-1)/2 outputs of the next chunk. We need to make sure to accumulate those outputs and return them only when all of their inputs' contributions have been accounted for.
The initial state for a transposed Conv1D with (odd) width k is a tensor of zeros with (k-1)/2 timesteps. (That is, a tensor of shape batch_size x (k-1)/2 x num_output_channels. Output channels, not input channels!)
The update function for an input with n timesteps will take the input and run it through the transposed convolution, generating n + k - 1 output timesteps. Take the current state and add it to the left edge of the output; that is, if you have s timesteps of state ((k-1)/2 on the first call, k-1 on every later call), set the leftmost s timesteps to the elementwise sum of the state and the first s timesteps of the transposed convolution outputs. If this is the first time update is being called, throw away the first (k-1)/2 timesteps. Return all but the last k-1 timesteps, and keep those last k-1 timesteps of the output as your new state.
The finish function returns the first (k-1)/2 timesteps of the state as output.
To help understand this one, let's go through an example again; we'll use the same setup as we did with the non-causal Conv1D. Let's say we have a transposed Conv1D layer with width 7 and 256 output channels, operating with batch size 16. We initialize the state to a tensor of zeros of shape [16, 3, 256]. We have three input chunks of 4 timesteps each (total input size 12). We start by running update with the first chunk. With an input of 4 timesteps, the transposed Conv1D layer will create a tensor of 4 + 6 = 10 outputs. The first time we call update, we drop the left 3 timesteps and return the next timestep of output. The last 6 timesteps of output are kept as the state. When we run update with the second chunk, we again get a tensor of 4 + 6 = 10 timesteps, and we add the state (6 timesteps) to the left edge of that tensor. We return the first 4 timesteps and keep the last 6 timesteps as state again. When we run update with the third chunk, we again produce 4 outputs. Finally, we run finish, which takes the first 3 timesteps of the state and returns them. In total, we produce 1 + 4 + 4 + 3 = 12 outputs, exactly as many outputs as we had inputs.
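Here is a NumPy sketch of the stride-1 transposed convolution example. `conv_transpose1d_full` and `StreamingConvTranspose1D` are hypothetical names and the 8 input channels are an arbitrary assumption. One deliberate deviation from the description above: instead of a "first call" flag, the sketch keeps a counter of left-edge timesteps still to be dropped, so that chunks shorter than (k-1)/2 also work.

```python
import numpy as np


def conv_transpose1d_full(x, w):
    """Stride-1 transposed conv. x: [batch, n, in_ch], w: [k, in_ch, out_ch]."""
    n, k = x.shape[1], w.shape[0]
    out = np.zeros((x.shape[0], n + k - 1, w.shape[2]))
    for j in range(k):
        out[:, j:j + n] += x @ w[j]      # input t contributes to outputs t .. t+k-1
    return out


class StreamingConvTranspose1D:
    def __init__(self, w, batch_size):
        k, _, out_ch = w.shape
        assert k % 2 == 1, "sketch assumes an odd kernel width"
        self.w = w
        self.half = (k - 1) // 2
        self.state = np.zeros((batch_size, self.half, out_ch))
        self.to_drop = self.half         # left-edge outputs still to throw away

    def _trim(self, out):
        drop = min(self.to_drop, out.shape[1])
        self.to_drop -= drop
        return out[:, drop:]

    def update(self, x):
        n = x.shape[1]
        out = conv_transpose1d_full(x, self.w)          # n + k - 1 timesteps
        out[:, :self.state.shape[1]] += self.state      # add partial sums on the left
        self.state = out[:, n:]                         # last k - 1: still incomplete
        return self._trim(out[:, :n])                   # first n: complete

    def finish(self):
        return self._trim(self.state[:, :self.half])


rng = np.random.default_rng(0)
w = rng.normal(size=(7, 8, 256))
x = rng.normal(size=(16, 12, 8))

# Reference: full transposed conv, cropped by (k - 1) / 2 = 3 on each side.
full = conv_transpose1d_full(x, w)[:, 3:-3]


def stream(sizes):
    layer = StreamingConvTranspose1D(w, batch_size=16)
    starts = np.cumsum([0] + list(sizes[:-1]))
    pieces = [layer.update(x[:, s:s + n]) for s, n in zip(starts, sizes)]
    return pieces + [layer.finish()]


pieces = stream([4, 4, 4])
print([p.shape[1] for p in pieces])                     # [1, 4, 4, 3]
assert np.allclose(np.concatenate(pieces, axis=1), full)
# Chunks shorter than (k - 1) / 2 also work thanks to the to_drop counter.
assert np.allclose(np.concatenate(stream([1, 1, 10]), axis=1), full)
```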
Concat and Sum
For some models (MelGAN, ResNets), you will end up with the outputs of two different subnetworks being concatenated or summed. These subnetworks may have different inputs and thus might, during streaming, output different numbers of timesteps. To address this, you need to define update and finish for concat and sum operators, just like you do for compute layers such as convolutions and RNNs.
In MelGAN, you'll see this used for the generator's residual blocks. The residual blocks use convolutions, which means that they (initially) output fewer timesteps than their input, so naively summing the output of a residual block with its input doesn't work, since the two will have different numbers of timesteps.
For both concat and sum, the update function should take the minimum number of timesteps in the inputs and return the concatenation or sum of that many timesteps, storing all leftovers in the state. The finish function should – if everything works out – be left with nothing in the state, since by the time finish is called, all the inputs should be available with the same number of timesteps.
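A sketch of the sum operator under these rules is below. `StreamingSum` is a hypothetical class, and the two branches are simulated with random data that arrives in the same pattern as a residual block around a width-7 convolution (the skip path delivers 4, 4, 4, 0 timesteps; the convolution path delivers 1, 4, 4, 3). A concat operator would be the same except for joining along the channel axis instead of adding.

```python
import numpy as np


class StreamingSum:
    """Sums two streams that may deliver different numbers of timesteps per call."""

    def __init__(self, batch_size, channels):
        empty = np.zeros((batch_size, 0, channels))
        self.pending = [empty, empty]        # leftovers from each input branch

    def update(self, a, b):
        a = np.concatenate([self.pending[0], a], axis=1)
        b = np.concatenate([self.pending[1], b], axis=1)
        n = min(a.shape[1], b.shape[1])      # timesteps available on both branches
        self.pending = [a[:, n:], b[:, n:]]
        return a[:, :n] + b[:, :n]

    def finish(self):
        # By now both branches should have delivered the same total length.
        assert self.pending[0].shape[1] == self.pending[1].shape[1] == 0
        return self.pending[0]


rng = np.random.default_rng(0)
skip = rng.normal(size=(2, 12, 3))
branch = rng.normal(size=(2, 12, 3))

skip_sizes, branch_sizes = [4, 4, 4, 0], [1, 4, 4, 3]
add = StreamingSum(batch_size=2, channels=3)
pieces, i, j = [], 0, 0
for n_a, n_b in zip(skip_sizes, branch_sizes):
    pieces.append(add.update(skip[:, i:i + n_a], branch[:, j:j + n_b]))
    i, j = i + n_a, j + n_b
pieces.append(add.finish())

print([p.shape[1] for p in pieces])          # [1, 4, 4, 3, 0]
assert np.allclose(np.concatenate(pieces, axis=1), skip + branch)
```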
You can implement these in practice as methods on the relevant PyTorch or TensorFlow layers. Then, when you use torch.nn.Sequential or keras.Sequential, you can compose the different layers into a single network with the same methods.
In the end, you end up with a flexible structure where you can freely experiment with RNNs, convolutions (with stride, dilation, etc.), transposed convolutions, concat, resnets, and anything else, mixing the layers in any order and size, knowing that at inference your streaming implementation will match a non-streaming implementation (up to floating point error). To test your implementation, you can write a unit test for every layer or network: run inference on a whole utterance, run the same input again broken into chunks, and then use np.allclose to verify that the two outputs match.
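Such a unit test can be sketched as a small helper. `check_streaming` is a hypothetical function, and the two toy layers (a running sum and a deliberately broken variant that forgets its state) exist only to show the check passing and failing; in practice you would plug in your own layers and their non-streaming reference.

```python
import numpy as np


def check_streaming(make_layer, full_fn, x, chunkings):
    """Return True if streaming matches full_fn(x) for every chunking of x."""
    expected = full_fn(x)
    for sizes in chunkings:
        assert sum(sizes) == len(x)
        layer = make_layer()                             # fresh state per run
        chunks = np.split(x, np.cumsum(sizes)[:-1])      # empty chunks are allowed
        pieces = [layer.update(c) for c in chunks] + [layer.finish()]
        if not np.allclose(np.concatenate(pieces), expected):
            return False
    return True


class StreamingCumsum:
    """Toy stateful layer: running sum over time, with the total so far as state."""

    def __init__(self):
        self.state = 0.0

    def update(self, x):
        out = self.state + np.cumsum(x)
        if len(x):
            self.state = out[-1]
        return out

    def finish(self):
        return np.zeros(0)


class Forgetful(StreamingCumsum):
    """Deliberately broken: ignores the state, so chunk boundaries are visible."""

    def update(self, x):
        return np.cumsum(x)


x = np.arange(1.0, 13.0)
chunkings = ([12], [4, 4, 4], [1, 0, 11], [5, 7])
ok = check_streaming(StreamingCumsum, np.cumsum, x, chunkings)
broken = check_streaming(Forgetful, np.cumsum, x, chunkings)
print(ok, broken)    # True False
```

Including odd chunkings (a single chunk, a one-timestep chunk, an empty chunk) tends to catch state-handling bugs that evenly sized chunks hide.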
If you have any questions about this technique, feel free to reach out and email me or find me on Twitter!