Overview of text to speech
Here, we give some general information about TTS algorithms. We do not aim to cover the different components of the field thoroughly: that is a complex task that requires cross-domain knowledge in areas such as linguistics and signal processing.
We will stick to the following high-level questions: What makes a TTS system good or bad? How is it evaluated? What are some traditional techniques, and why does the field need to move toward deep learning? We will also prepare for the next sections by giving some basic background on spectrograms.
Naturalness versus intelligibility
The quality of a TTS system is traditionally assessed on two criteria: naturalness and intelligibility. The motivation is that listeners are sensitive not only to what the audio says, but also to how it is delivered. In short, we want a TTS system that produces clear audio content in a human-like way. More precisely, intelligibility...