WaveNet and Tacotron Are Not Text-to-Speech Engines
Summary: Deep learning models for speech synthesis, such as Google's WaveNet and Tacotron, are not complete text-to-speech systems. They are each just one part of a large pipeline of models and heuristics that together form a text-to-speech engine. WaveNet is not a text-to-speech engine. Tacotron isn't either.
In the past few years, researchers have designed many neural network architectures for synthesizing audio. The most commonly referenced are Google's WaveNet and Tacotron, but there are many others, including Google's WaveRNN, Baidu's Deep Voice series, NVIDIA's WaveGlow, SampleRNN, and Microsoft's FastSpeech.
These papers are announced to great fanfare on company websites and tech news sites, with impressive audio samples and interesting demos. This gives the impression that each paper describes a complete text-to-speech system. Google even brands its neural-vocoder voices in Google Cloud as "WaveNet voices", which can obscure the fact that those voices share a significant part of the TTS pipeline with Google's non-WaveNet voices. As a result, a lot of people online end up (understandably!) confused and refer to a "WaveNet TTS system" or "Tacotron TTS system", or assume that a GitHub repo with a re-implementation of one of these models can be used to build a complete speech synthesis engine.
This blog post is an attempt to rectify this (slight) misconception.
A text-to-speech engine is a piece of software that converts text into speech (audio). This process is typically separated into a pipeline, where each step is its own model or set of models. An example pipeline might include:

- Text normalization: expanding numbers, dates, abbreviations, and symbols into spoken words.
- Linguistic analysis: converting the normalized text into phonemes or other linguistic features.
- High level audio synthesis: predicting an intermediate acoustic representation, such as a spectrogram, from the text or linguistic features.
- Waveform synthesis: converting that intermediate representation into an audio waveform (the component that does this is called a vocoder).
This list does not include miscellaneous engineering work such as networking, request parsing, and audio encoding.
A complete TTS engine has to do (more or less) all of these things and connect them all together.
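To make the idea of a pipeline concrete, below is a minimal sketch of how these stages might be chained together in code. All of the stage names and signatures are hypothetical placeholders (not the API of any real system), and the dummy stages just return zeros to show the wiring.

```python
from typing import Callable
import numpy as np

# Each pipeline stage is its own model or heuristic; here each one is just a callable.
TextNormalizer = Callable[[str], str]          # "Dr." -> "doctor", "3/14" -> "march fourteenth", ...
AcousticModel = Callable[[str], np.ndarray]    # normalized text -> intermediate representation (e.g. a spectrogram)
Vocoder = Callable[[np.ndarray], np.ndarray]   # spectrogram -> raw audio samples

def text_to_speech(text: str,
                   normalize: TextNormalizer,
                   synthesize_spectrogram: AcousticModel,
                   vocode: Vocoder) -> np.ndarray:
    """Sketch of a TTS engine: run the pipeline stages in order and connect them."""
    normalized = normalize(text)
    spectrogram = synthesize_spectrogram(normalized)
    waveform = vocode(spectrogram)
    return waveform

# Dummy stand-ins just to show the wiring; a real engine plugs trained models in here.
audio = text_to_speech(
    "Dr. Smith paid $30 on 3/14.",
    normalize=str.lower,
    synthesize_spectrogram=lambda text: np.zeros((100, 80), dtype=np.float32),  # 100 frames x 80 mel bins
    vocode=lambda spec: np.zeros(spec.shape[0] * 256, dtype=np.float32),        # ~256 samples per frame
)
```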
WaveNet and Tacotron are neural network models that each address a single step of the above pipeline. Specifically, WaveNet is a neural vocoder and is responsible for the "waveform synthesis" step, while Tacotron is a sequence-to-sequence model for spectrogram synthesis and addresses the "high level audio synthesis" step.
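Put differently, the hand-off between these two models is an intermediate acoustic representation, typically a mel spectrogram: a Tacotron-style model predicts it from text, and a WaveNet-style neural vocoder turns it into audio samples. Here is a toy illustration of that hand-off, with made-up dimensions and placeholder classes rather than the real models' APIs:

```python
import numpy as np

class SpectrogramSynthesizer:
    """Placeholder for the "high level audio synthesis" step (Tacotron's role in the pipeline)."""
    def synthesize(self, text: str) -> np.ndarray:
        n_frames, n_mels = 40 * len(text.split()), 80   # toy sizing: ~40 frames per word, 80 mel bins
        return np.zeros((n_frames, n_mels), dtype=np.float32)

class NeuralVocoder:
    """Placeholder for the "waveform synthesis" step (WaveNet's role in the pipeline)."""
    hop_length = 256                                    # audio samples generated per spectrogram frame
    def vocode(self, mel: np.ndarray) -> np.ndarray:
        return np.zeros(mel.shape[0] * self.hop_length, dtype=np.float32)

mel = SpectrogramSynthesizer().synthesize("hello world")   # shape: (frames, mel bins)
audio = NeuralVocoder().vocode(mel)                        # 1-D waveform, e.g. at a 22050 Hz sample rate
```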
Now that we have these distinctions, we can ask more specifically: What models have been developed for each stage of this pipeline?
I personally know the authors of many of the papers I listed above and worked on several of them myself. I'm not listing them in any order or trying to promote my own papers. These are just the systems that came to mind. Don't judge me for my choices here.
Additionally, I'm one of the founders of Voicery, where we build custom text-to-speech engines based on models similar to the ones above. If you have any questions or are looking to deploy your own models like these, check out our demos and get in touch.