### Implementing PQMF Filters for Neural Vocoders

This is Part 2 of a two-part series. In this post, I'll go over the implementation of PQMF filters in enough detail that you'll be able to use this technique in your own code.

In the previous post, I summarized by presenting a recipe for converting a neural vocoder into a sub-band vocoder:

1. **Design a Prototype Filter:** Choose a prototype filter $h[n]$ to use for your PQMF filter bank. The Kaiser window approach seems easiest here.
2. **Compute PQMF Filter Bank:** Choose the number of bands $K$ you plan to use. Then, compute your analysis and synthesis filters $h_k[n]$ and $g_k[n]$ using the equations above.
3. **Training Data Analysis:** Calculate sub-band signals for all your training data using the analysis filters.
4. **Vocoder Training:** Train your neural vocoders to predict the sub-band signals. Although you can train separate models per sub-band, to get inference speed improvements you must modify your model to output all sub-band signals in each timestep. This will reduce the number of output timesteps by a factor of $K$, which should reduce inference time by approximately that same factor. For example, for WaveRNN, you can have each timestep output $K$ values and input the previous samples for each sub-band. For MelGAN or WaveGlow, you can have the output consist of $K$ channels which get combined to create the final audio using your synthesis filters.
5. **Inference-time Synthesis:** After running your vocoder during inference, run synthesis on the outputs to get your new audio stream.
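
As a concrete sketch of the training-data analysis step, here's roughly how sub-band signals might be computed once you have an analysis filter bank. The function name, array shapes, and `mode="same"` padding choice are assumptions on my part, not a prescribed API:

```python
import numpy as np

def analyze(signal: np.ndarray, analysis_filters: np.ndarray) -> np.ndarray:
    """Split a waveform into K sub-band signals, each downsampled by K.

    signal: shape (T,); analysis_filters: shape (K, taps).
    Returns an array of shape (K, T // K).
    """
    bands, _ = analysis_filters.shape
    # Filter the signal with each of the K analysis filters.
    filtered = np.stack(
        [np.convolve(signal, f, mode="same") for f in analysis_filters]
    )
    # Critically downsample: keep every K-th sample of each band.
    return filtered[:, ::bands]
```

A vocoder would then be trained to predict these $K$ downsampled streams instead of the raw waveform.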

I left some of these pieces rather vague, and now it's time to fill in the details. Specifically, I'd like to address:

- **Designing a Prototype Filter:** How do we compute a prototype filter?
- **Computing Analysis-Synthesis Filters:** How do we create the necessary filters?
- **Implementation with Vocoders:** How do we use these analysis and synthesis filters in standard deep learning frameworks?

The bulk of this post will be addressing prototype filters, and we'll be using Python (mostly NumPy and SciPy) for the code samples.

In this section, we'll dive into a prototype filter design method based on Kaiser windows.

Prior to jumping in, let's go over some vocabulary from filter design. For me, all of these rang bells from my signal processing classes years ago, but it was valuable to go through each of them.

- **Window Function:** A window function is a function which is zero outside of a given interval and, usually, symmetric around zero. Window functions are multiplied with a signal before taking Fourier transforms to avoid discontinuities at the edges (since Fourier transforms assume the signal is periodic).
- **Bessel Function:** Bessel functions are a family of functions which solve Bessel's differential equation and appear in a variety of areas in physics, signal processing, and beyond. For our purposes, they're functions which can be used to create a bell-shaped window function.
- **Cutoff Frequency:** The cutoff frequency is the frequency at which the response of a filter begins to attenuate significantly; for example, the frequency at which the filter reduces the power by a factor of two.
- **Stopband Attenuation:** The stopband attenuation is the attenuation that must be attained in the stopband (the frequency range in which the filter is not supposed to let signal through).
- **Transition Bandwidth:** The transition bandwidth is (as the name implies) the width of the frequency range between the passband and the stopband; good filters minimize this width.
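
To make these terms concrete, here's a small sketch that measures the cutoff frequency and stopband attenuation of a simple lowpass filter numerically. The 101-tap length and 0.25 cutoff are arbitrary choices of mine:

```python
import numpy as np
from scipy import signal

# An arbitrary lowpass filter: 101 taps, cutoff at 0.25 (Nyquist = 1).
taps = signal.firwin(101, 0.25)
w, h = signal.freqz(taps, worN=8000)
freq = w / np.pi                      # normalized frequency, Nyquist = 1
mag_db = 20 * np.log10(np.maximum(np.abs(h), 1e-12))

# Cutoff frequency: first point where the response drops 3 dB (half power).
cutoff = freq[np.argmax(mag_db < -3.0)]
# Stopband attenuation: worst-case magnitude well past the transition band.
stopband_atten = -mag_db[freq > 0.35].max()
```

Plotting `mag_db` against `freq` makes the passband, transition band, and stopband visually obvious.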

**tl;dr:** You can do this with one call to `scipy.signal.firwin`.

Given a window length $N$ and shape parameter $\beta$, the Kaiser window $w(n)$ is the following:

$$w(n) = \frac{I_0\left(\beta \sqrt{1 - \left(\frac{2n}{N}\right)^2}\right)}{I_0(\beta)}, \qquad -\frac{N}{2} \le n \le \frac{N}{2}.$$

This defines a window of length $N+1$ using the modified Bessel function of the first kind, $I_0$. In practice, this can be computed using `scipy.signal.windows.kaiser` (which assumes $n$ is centered around zero by default).
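
For instance (the 101-point length and $\beta = 9$ below are arbitrary values I picked for illustration):

```python
import numpy as np
from scipy.signal import windows

# A 101-point Kaiser window with an arbitrarily chosen beta of 9.0.
w = windows.kaiser(101, beta=9.0)

# The window is symmetric, peaks at its center, and tapers toward zero
# at the edges; larger beta gives a narrower, more tapered bell.
```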

The Kaiser window has the smooth, bell-like shape you'd expect of a window function.

Given a stopband attenuation $A_s$ (in dB) and transition bandwidth $\Delta\omega$ (in radians per sample), the filter length should be approximately:

$$N \approx \frac{A_s - 7.95}{14.36\,\Delta\omega} \cdot 2\pi.$$

The derivation of this specific fact is supposedly available here, but for the time being I'm willing to take it on faith. Similarly, the needed value of $\beta$ is also a function of your stopband attenuation $A_s$.
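
Conveniently, SciPy packages both of these empirical formulas: `scipy.signal.kaiserord` returns the filter length and `beta` together, and `scipy.signal.kaiser_beta` gives $\beta$ alone. A quick sanity check against the length formula above (the 100 dB attenuation and 0.05 transition width are arbitrary example values):

```python
import numpy as np
from scipy.signal import kaiserord, kaiser_beta

attenuation = 100.0                   # stopband attenuation in dB
width = 0.05                          # transition bandwidth (Nyquist = 1)

numtaps, beta = kaiserord(attenuation, width)

# Compare with N ~= (A_s - 7.95) / (14.36 * dw) * 2*pi, where the
# transition bandwidth dw = width * pi in radians per sample.
manual = (attenuation - 7.95) / (14.36 * width * np.pi) * 2 * np.pi
```

The two lengths agree to within a couple of taps (`kaiserord` rounds up and adds one).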

Given your chosen window function $w(n)$, your final prototype filter can be computed from your desired cutoff frequency $\omega_c$:

$$h(n) = \frac{\sin(\omega_c n) w(n)}{\pi n}.$$

We have yet to compute the cutoff frequency $\omega_c$, which, according to our reference paper, is best computed by minimizing the objective function

$$\phi_{\text{new}}(\omega_c) = \max |h(n) * h(N - n)|.$$

(That is, given an $\omega_c$, compute the prototype filter, convolve it with its reverse, and take the maximum absolute value. More details discussed here, but this seems to be right.)

For computing this in practice, you can use `scipy.signal.firwin`. `firwin` allows you to specify a window function; additionally, if you specify the `width` parameter, you can directly set the transition bandwidth and it will calculate the appropriate value of `beta` for your Kaiser window.
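
Putting those pieces together, a prototype-filter call might look like the following. The band count, 100 dB attenuation, and the $0.9\pi/2K$ transition-bandwidth heuristic are assumptions on my part:

```python
import numpy as np
from scipy.signal import firwin

bands = 4
attenuation = 100                      # stopband attenuation in dB
width = 0.9 * np.pi / (2 * bands)      # transition bandwidth in rad/sample
numtaps = int(2 * np.pi * (attenuation - 7.95) / (14.36 * width))

# Passing fs=2*pi lets us express cutoff and width in radians per sample,
# so the Nyquist frequency is pi. Supplying width makes firwin pick a
# Kaiser window with the matching beta automatically.
proto = firwin(numtaps, cutoff=np.pi / (2 * bands), width=width, fs=2 * np.pi)
```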

Now that we have a prototype filter computed using `scipy.signal.firwin`, we need to create a set of $K$ analysis filters $h_k(n)$ and synthesis filters $g_k(n)$. Recall from the previous post that these are defined as follows in PQMF filter banks:

$$\begin{align*}h_k[n] &= h[n] \cos\left(\frac{\pi}{4K}\left(2k + 1\right)\left(2n - N + 1\right) + \Phi_k\right)\\g_k[n] &= h_k[N - 1 - n]\\\Phi_k &= (-1)^{k} \frac{\pi}{4}.\end{align*}$$

Computing this is not too hard, just a bit of arithmetic. The process of using these filters is then equivalent to standard convolution.

With the math out of the way, let's convert this to code. Warning: *don't* just copy this code and run it blindly. I haven't tested it, so it won't be useful to you that way. Instead, view it as a formalization of what I wrote above and trace through the logic to understand it.

```
import numpy as np
import scipy.signal


def create_prototype_filter(bands: int, cutoff: float) -> np.ndarray:
    """Create a prototype filter using the Kaiser window method."""
    transition_bandwidth = 0.9 * np.pi / 2 / bands
    attenuation = 100  # stopband attenuation in dB
    taps = int(2 * np.pi * (attenuation - 7.95) / (14.36 * transition_bandwidth))
    return scipy.signal.firwin(
        taps, cutoff, width=transition_bandwidth, fs=2 * np.pi
    )


def optimize_cutoff(bands: int) -> float:
    """Choose the best cutoff frequency by grid search."""
    # Exclude the endpoints: firwin requires 0 < cutoff < Nyquist (pi here).
    # (A 100,000-point grid is quite slow; shrink it while experimenting.)
    options = np.linspace(0.0, np.pi, 100000)[1:-1]
    best_cutoff, best_objective = None, 1e9
    for option in options:
        h = create_prototype_filter(bands, option)
        objective = np.abs(np.convolve(h, h[::-1])).max()
        if objective < best_objective:
            best_cutoff, best_objective = option, objective
    return best_cutoff


def create_filter_bank(bands: int):
    """Create the entire filter bank."""
    cutoff = optimize_cutoff(bands)
    proto = create_prototype_filter(bands, cutoff)
    taps = proto.size
    h = np.zeros((bands, taps))
    g = np.zeros((bands, taps))
    # One phase entry per tap: np.arange(taps), not np.arange(taps + 1).
    factor = (np.pi / (2 * bands)) * (np.arange(taps) - (taps - 1) / 2)
    for k in range(bands):
        scale = (2 * k + 1) * factor
        phase = (-1) ** k * np.pi / 4
        h[k] = 2 * proto * np.cos(scale + phase)
        g[k] = 2 * proto * np.cos(scale - phase)
    return h, g
```

All in all, as is often the case with mathematical ideas, the amount of *code* it takes to implement all of this is pretty small. Thanks also to @kan-bayashi, whose GitHub repo was hugely helpful in tracing through some of the logic required.
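
To touch on the remaining question of using these filters in standard deep learning frameworks: here's a hedged sketch of how they might be applied in PyTorch. The function names and padding choices are mine. The idea is that analysis is a strided `conv1d` over the $K$ filters, and synthesis is the matching `conv_transpose1d` (which zero-stuffs by the stride, i.e., upsamples by $K$) scaled by $K$:

```python
import torch
import torch.nn.functional as F

def pqmf_analysis(x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
    """x: (batch, 1, T) waveform; h: (K, taps) analysis filters.
    Returns (batch, K, T // K) sub-band signals."""
    bands, taps = h.shape
    x = F.pad(x, (taps - 1, 0))              # causal left-padding
    # conv1d computes correlation, so flip the filters to get convolution.
    weight = h.flip(-1).unsqueeze(1)         # (K, 1, taps)
    return F.conv1d(x, weight, stride=bands)

def pqmf_synthesis(subbands: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    """subbands: (batch, K, T // K); g: (K, taps) synthesis filters.
    Returns (batch, 1, ~T) audio; trim the filter delay as needed."""
    bands, taps = g.shape
    weight = g.unsqueeze(1)                  # (K, 1, taps)
    # conv_transpose1d upsamples by the stride and sums across bands.
    return bands * F.conv_transpose1d(subbands, weight, stride=bands)
```

Because both operations are ordinary (transposed) convolutions, they can sit directly on either end of a vocoder and run on the GPU alongside it.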

This approach definitely raises additional questions, which I look forward to answering or seeing answered:

- How does the speed and quality depend on the number of bands, the number of taps, and the prototype filter used?
- Instead of using this method to design the prototype filter, can we learn the prototype filter as part of vocoder training?
- Do we need to use PQMF filters? Could we instead learn both analysis and synthesis filters entirely from scratch?

Definitely looking forward to seeing where all of this goes! I'm consistently impressed with the rapid progress in neural speech synthesis, so I'm sure the answers will come quickly.