### Brainstorming: Neural Transducers for Speech Synthesis

Neural transducers are commonly used for automatic speech recognition (ASR), often achieving state-of-the-art results for quality and inference speed; for instance, they power Google's offline ASR engine. In this post, I'd like to propose a neural transducer model for speech synthesis. I'm writing this idea down before trying this model, so the reality is that this is just a fun avenue for me to brainstorm – and maybe someday I'll try this, or it'll inspire someone else to go down a similar path!

A neural transducer models the alignment between an input and output sequence, but does so in a way that allows aggregating label probabilities over all possible (monotonic) alignments. Given that ability, we can maximize the probability of a sample label and marginalize over the latent alignment.

A transducer model has the following inputs:

- An input sequence $x_t$ of length $T$. For ASR, this is usually spectrogram frames or an analogous audio representation.
- A label sequence $y_u$ of length $U$. For ASR, this is a sequence of letters, phonemes, or word pieces.
- A character set of size $V$. The output sequence consists of these characters.

The model itself has three components:

- An encoder network, $E(x)$. Outputs an encoding vector $h_t$ for each input timestep.
- A decoder network, $D(y_{1:u})$. Outputs an encoding vector $g_u$ for each output prefix. (Similar to an autoregressive language model.)
- A joint prediction network, $P(h_t, g_u)$. Outputs a probability distribution over the model character set plus the blank token $\varnothing$. Let index zero in this distribution represent $\varnothing$.

The encoder network is usually a bidirectional network of some kind, the decoder network is a unidirectional RNN or causal convolution model, and the prediction network is a multilayer feedforward model. The encoder runs over all $T$ input timesteps, the decoder runs over all $U$ output timesteps, and the prediction network runs over all $T \cdot U$ pairs of encoding vectors from those two models.

The output of the prediction network $P(h_t, g_u)_k$ is the probability that, if input timestep $t$ is aligned to output timestep $u$, the next token is the $k$th character in the character set (if $k \geq 1$) or that the alignment should advance to the next encoder vector (if $k = 0$ for $\varnothing$).
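To make the shapes concrete, here is a minimal pure-Python sketch of a joint network evaluated on every pair of encoder and decoder vectors. It is an illustration only: a single linear layer followed by a softmax stands in for the multilayer feedforward model, and the names `joint_grid`, `w_enc`, and `w_dec` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def joint_grid(h, g, w_enc, w_dec):
    """Return grid[t][u], a distribution over [blank, char_1, ..., char_V].

    h: encoder vectors, one per input timestep.
    g: decoder vectors, one per output prefix.
    w_enc, w_dec: V + 1 weight rows each (assumed single linear layer).
    """
    return [
        [softmax([dot(we, h_t) + dot(wd, g_u) for we, wd in zip(w_enc, w_dec)])
         for g_u in g]
        for h_t in h
    ]

# Toy sizes: 3 encoder steps, 2 decoder prefixes, V = 2 characters plus blank.
h = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
g = [[1.0], [-1.0]]
w_enc = [[0.5, -0.5], [0.2, 0.1], [-0.3, 0.4]]
w_dec = [[0.1], [0.7], [-0.2]]
grid = joint_grid(h, g, w_enc, w_dec)
blank_prob = grid[0][0][0]  # index 0 is the blank token
```

Every `grid[t][u]` is a full distribution, so the cost of this step grows with the product of the two sequence lengths.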

Given this matrix of probability distributions, we can maximize the probability of a label while marginalizing (summing) over alignments. Given a specific alignment, the loss reduces to a cross-entropy next-character prediction loss, much like a language model's; introducing the blank character and joint network allows you to write out every possible alignment and its loss. Since the number of possible alignments is exponentially large, we sum over them with a forward-backward algorithm, which is a dynamic programming method.
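As a rough sketch of that dynamic program, assume the joint network outputs have already been reduced to two tables: `blank[t][u]`, the probability of $\varnothing$ at grid point $(t, u)$, and `emit[t][u]`, the probability of the correct next label at that point. The function names and the 0-based layout are my own choices for illustration. The brute-force version enumerates every monotonic alignment and is only there to check the recursion on a toy grid.

```python
from itertools import combinations

def rnnt_likelihood(blank, emit):
    """Probability of the label sequence, summed over all monotonic alignments.

    Both tables have T rows (encoder steps) and U + 1 columns
    (number of labels emitted so far). emit's last column is never read.
    """
    T, cols = len(blank), len(blank[0])
    alpha = [[0.0] * cols for _ in range(T)]
    alpha[0][0] = 1.0
    for t in range(T):
        for u in range(cols):
            if t > 0:  # arrived by emitting blank at (t - 1, u)
                alpha[t][u] += alpha[t - 1][u] * blank[t - 1][u]
            if u > 0:  # arrived by emitting the label at (t, u - 1)
                alpha[t][u] += alpha[t][u - 1] * emit[t][u - 1]
    # A final blank at the last grid point terminates the alignment.
    return alpha[T - 1][cols - 1] * blank[T - 1][cols - 1]

def brute_force_likelihood(blank, emit):
    """Same quantity by explicit enumeration (exponential; toy sizes only)."""
    T, U = len(blank), len(blank[0]) - 1
    steps = (T - 1) + U
    total = 0.0
    for emit_steps in combinations(range(steps), U):
        t = u = 0
        p = 1.0
        for s in range(steps):
            if s in emit_steps:
                p *= emit[t][u]
                u += 1
            else:
                p *= blank[t][u]
                t += 1
        total += p * blank[t][u]
    return total

blank = [[0.3, 0.5, 0.6], [0.2, 0.4, 0.7], [0.1, 0.3, 0.9]]
emit = [[0.6, 0.4, 0.0], [0.7, 0.5, 0.0], [0.8, 0.6, 0.0]]
likelihood = rnnt_likelihood(blank, emit)
```

Practical implementations typically run the same recursion in log space to avoid underflow on long sequences.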

The RNN-transducer (RNN-T) model and loss are a great fit for end-to-end ASR models. However, they're initially a poor fit for text-to-speech (TTS) models: the model relies on a single discrete output per timestep, but TTS models such as Tacotron output continuous-valued spectrograms.

To fix this, I'd like to propose a modified RNN-T model called the "*generative transducer*". This model has the following components:

- An encoder network, $E(x)$. Outputs an encoding vector $h_t$ for each input timestep.
- A decoder network, $D(y_{1:u})$. Outputs an encoding vector $g_u$ for each output prefix.
- A controller network, $C(y)$. Outputs an encoding vector $c_u$ for each output timestep.
- A joint prediction network, $P(h_t, g_u)$. Outputs whatever the output of the model should be; in the case of TTS, this could for example be an 80-dimensional log-mel spectrogram.
- A joint controller network, $\varnothing(h_t, c_u)$. Outputs the probability of advancing to the next encoder timestep instead of making this prediction.

In this model, we've decoupled the model output network $P(h_t, g_u)$ from the alignment generator network $\varnothing$; this model is strictly a generalization of the standard transducer model.

In the same way as before, we can use a forward-backward algorithm to sum over alignments. Instead of maximizing the probability of a particular label, we can maximize the expected value of the loss, as if we are sampling from the alignment distribution.

This setup provides us with an extra interesting modeling choice. If $C(y)$ is a bidirectional network, it can use future output context to improve its transition prediction. However, this means that we cannot use $C(y)$ at inference time. For TTS, this means that we must extract the most likely alignments once this model is trained, convert them to phoneme durations, and then at inference have a separate model to predict phoneme durations. If $C(y) = C(y_{1:u})$ is a unidirectional model, however, we don't need an external phoneme duration model; we can alternate running $D(y_{1:u})$ to predict a frame and sampling from $C(y_{1:u})$ to decide when to transition to the next phoneme.
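The unidirectional case could be sketched as the sampling loop below. Everything here is hypothetical scaffolding: `predict_frame` and `advance_prob` are placeholder callables standing in for the trained prediction and controller networks, and `max_frames` is a safety cap I added so the loop always terminates.

```python
import random

def generate(num_inputs, predict_frame, advance_prob, rng=None, max_frames=1000):
    """Alternate between emitting frames and advancing through the inputs.

    predict_frame(t, frames) -> next output frame given input index t
                                and the frames emitted so far.
    advance_prob(t, frames)  -> probability of moving on to input t + 1.
    Returns the emitted frames and the number of frames per input.
    """
    rng = rng or random.Random()
    frames = []
    durations = [0] * num_inputs
    t = 0
    while t < num_inputs and len(frames) < max_frames:
        if rng.random() < advance_prob(t, frames):
            t += 1  # transition to the next phoneme
        else:
            frames.append(predict_frame(t, frames))
            durations[t] += 1
    return frames, durations

# Deterministic stand-ins: every input gets exactly two frames.
def stub_advance(t, frames):
    return 1.0 if len(frames) >= 2 * (t + 1) else 0.0

def stub_frame(t, frames):
    return float(t)

frames, durations = generate(3, stub_frame, stub_advance, rng=random.Random(0))
```

The `durations` list is the same quantity a separate duration model would have to predict in the bidirectional case.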

Next, I'd like to go through the practical considerations of computing this loss function, similar to how the RNN-T paper derives forward and backward equations for the RNN-T loss.

**Forward Pass**

Similar to RNN-T, we'll compute transition probabilities $\varnothing(t, u)$ and outputs $P(h_t, g_u)$ for every point $(t, u)$. To compute the expected loss, we'll sum the loss for every $(t, u)$ weighted by the probability of encoder timestep $t$ being aligned to decoder timestep $u$.

So, let $\alpha(t, u)$ be the probability that encoder timestep $t$ is aligned with decoder timestep $u$. We can compute $\alpha(t, u)$ recursively via

$$\begin{align*}\alpha(t, u) &= \alpha(t - 1, u) \varnothing(t - 1, u) + \alpha(t, u - 1) (1 - \varnothing(t, u - 1)) \\ \alpha(1, 0) &= 1\end{align*}$$

The probability of emitting an output when encoder timestep $t$ is aligned with decoder timestep $u$ is then $\alpha(t, u) (1 - \varnothing(t, u))$.
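Here is a small pure-Python sketch of this recursion. It assumes a 0-based grid in which `blank[t][u]` holds $\varnothing(t, u)$, starts with `alpha[0][0] = 1`, and treats terms that fall outside the grid as zero; the function name is my own.

```python
def forward_alpha(blank):
    """Fill alpha[t][u] from the transition probabilities blank[t][u]."""
    T, U = len(blank), len(blank[0])
    alpha = [[0.0] * U for _ in range(T)]
    alpha[0][0] = 1.0
    for t in range(T):
        for u in range(U):
            if t > 0:  # advanced the encoder from (t - 1, u)
                alpha[t][u] += alpha[t - 1][u] * blank[t - 1][u]
            if u > 0:  # emitted an output from (t, u - 1)
                alpha[t][u] += alpha[t][u - 1] * (1.0 - blank[t][u - 1])
    return alpha

blank = [[0.2, 0.5, 0.7], [0.3, 0.6, 0.8], [0.4, 0.1, 0.9]]
alpha = forward_alpha(blank)

# Probability of emitting an output at each grid point.
emit_prob = [
    [alpha[t][u] * (1.0 - blank[t][u]) for u in range(3)]
    for t in range(3)
]
```

One useful sanity check: every step either advances $t$ or $u$, so `alpha` sums to one along each anti-diagonal $t + u = k$ that lies fully inside the grid.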

The prediction network $P(h_t, g_u)$ yields an output vector for every point $(t, u)$. Let the loss for that output vector be $L(t, u).$ To create a Tacotron-like model, we would use an autoregressive L2 loss such as

$$L(t, u) = \left(P(h_t, g_u) - y_{u+1}\right)^2.$$

We could similarly use a discrete cross-entropy loss if our outputs are discretized.

Finally, the overall loss $L$ will be the expected value of $L(t, u)$ when summed over all the timesteps:

$$L = \sum_{t=1}^T \sum_{u=1}^U \alpha(t, u) (1 - \varnothing(t, u)) L(t, u).$$
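As a sketch of how this sum might be evaluated, the snippet below takes an already-computed `alpha` table and weights a per-frame L2 loss by the emission probability at each grid point. Two assumptions of mine: the squared error is summed over the components of each frame, and `targets[u]` holds the frame to be predicted from decoder state `u` (the $y_{u+1}$ of the formula above) on a 0-based grid.

```python
def l2(pred, target):
    """Squared error summed over the components of one frame."""
    return sum((p - y) ** 2 for p, y in zip(pred, target))

def expected_loss(alpha, blank, preds, targets):
    """Sum of alpha * (1 - blank) * L over every grid point."""
    total = 0.0
    for t, row in enumerate(preds):
        for u, pred in enumerate(row):
            emit_weight = alpha[t][u] * (1.0 - blank[t][u])
            total += emit_weight * l2(pred, targets[u])
    return total

# Toy 2 x 2 grid with 2-dimensional frames.
blank = [[0.5, 0.5], [0.5, 0.5]]
alpha = [[1.0, 0.5], [0.5, 0.5]]  # what the alpha recursion gives for this blank table
preds = [[[1.0, 0.0], [0.0, 1.0]],
         [[1.0, 1.0], [0.0, 0.0]]]
targets = [[1.0, 0.0], [0.0, 0.0]]
loss_value = expected_loss(alpha, blank, preds, targets)
```

Only two grid points have a nonzero frame loss here, each with emission weight 0.25, so the total works out to 0.5.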

**Backward Pass**

Backpropagating through the tensor operations that yield $L$ (from $\alpha(t, u)$) is easy. However, we will need a custom op to compute the full partials with respect to $\alpha(t, u)$ as well as $\varnothing(t, u)$, since $\alpha(t, u)$ is used in computation of future $\alpha(t + 1, u)$ and $\alpha(t, u + 1).$ For notational simplicity, let $\delta(t, u) = \frac{\partial L}{\partial \alpha(t, u)}$. We can compute the $\delta(t, u)$ via the recurrence relation (with base cases)

$$\begin{align*}\delta(t, u) &= \delta(t, u)_\text{base} + \varnothing(t, u)\delta(t + 1, u) + (1 - \varnothing(t, u)) \delta(t, u + 1) \\ \delta(T, u) &= \delta(T, u)_\text{base} + (1 - \varnothing(T, u)) \delta(T, u + 1) \\ \delta(t, U) &= \delta(t, U)_\text{base} + \varnothing(t, U)\delta(t + 1, U) \\ \delta(T, U) &= \delta(T, U)_\text{base}\end{align*}$$

$\delta(t, u)_\text{base}$ is the contribution to the partial that was backpropagated from $L$ via tensor operations.
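To sanity-check a backward pass like this, one option is to compare it against finite differences on a toy grid. The sketch below makes several assumptions of its own: a 0-based grid, a precomputed table `loss[t][u]` of per-point losses, and coefficients read directly off the forward recursion, where $\alpha(t, u)$ feeds $\alpha(t + 1, u)$ with weight $\varnothing(t, u)$ and $\alpha(t, u + 1)$ with weight $1 - \varnothing(t, u)$. The gradient with respect to $\varnothing(t, u)$ is my own derivation from the same chain rule, not something taken from the RNN-T paper.

```python
def forward_alpha(blank):
    T, U = len(blank), len(blank[0])
    alpha = [[0.0] * U for _ in range(T)]
    alpha[0][0] = 1.0
    for t in range(T):
        for u in range(U):
            if t > 0:
                alpha[t][u] += alpha[t - 1][u] * blank[t - 1][u]
            if u > 0:
                alpha[t][u] += alpha[t][u - 1] * (1.0 - blank[t][u - 1])
    return alpha

def expected_loss(blank, loss):
    alpha = forward_alpha(blank)
    return sum(
        alpha[t][u] * (1.0 - blank[t][u]) * loss[t][u]
        for t in range(len(blank))
        for u in range(len(blank[0]))
    )

def backward(blank, loss):
    """Return delta = dL/dalpha and grad_blank = dL/dblank."""
    T, U = len(blank), len(blank[0])
    alpha = forward_alpha(blank)
    delta = [[0.0] * U for _ in range(T)]
    grad_blank = [[0.0] * U for _ in range(T)]
    for t in reversed(range(T)):
        for u in reversed(range(U)):
            down = delta[t + 1][u] if t + 1 < T else 0.0   # via alpha(t + 1, u)
            right = delta[t][u + 1] if u + 1 < U else 0.0  # via alpha(t, u + 1)
            base = (1.0 - blank[t][u]) * loss[t][u]        # direct term from L
            delta[t][u] = base + blank[t][u] * down + (1.0 - blank[t][u]) * right
            grad_blank[t][u] = alpha[t][u] * (down - right - loss[t][u])
    return delta, grad_blank

def finite_difference(blank, loss, t, u, eps=1e-6):
    """Central difference of the expected loss with respect to blank[t][u]."""
    hi = [row[:] for row in blank]
    lo = [row[:] for row in blank]
    hi[t][u] += eps
    lo[t][u] -= eps
    return (expected_loss(hi, loss) - expected_loss(lo, loss)) / (2 * eps)

blank = [[0.2, 0.5, 0.7], [0.3, 0.6, 0.8], [0.4, 0.1, 0.9]]
loss = [[1.0, 2.0, 0.5], [0.3, 1.5, 2.5], [0.7, 0.2, 1.1]]
delta, grad_blank = backward(blank, loss)
```

Because every $\alpha(t, u)$ is proportional to the starting value $\alpha = 1$ at the origin, `delta[0][0]` should equal the expected loss itself, which gives a second cheap check alongside the finite differences.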

This proposed model is an extension of the RNN-T model to situations with a continuous-valued output. We create an external alignment model which just models the transitions in the encoder, and then use its outputs to compute an expected value over the loss in all possible alignments.

As described, there's nothing penalizing the model for not using the encoder inputs; we in no way require the model to use all of its encoder inputs. Perhaps extra loss terms that encourage $\alpha(T, U)$ to be high and $\alpha(t, U)$ to be low for $t < T$ would be valuable.