This is the second post in a series on Text-to-Speech for eLearning written by Dr. Joel Harband and edited by me (which turns out to be a great way to learn). The first post, Text-to-Speech Overview and NLP Quality, introduced the text to speech voice and discussed issues of quality related to its first component – the natural language processor (NLP). In this post we’ll look at the second component of a text to speech voice: the digital signal processor (DSP) and its measures of quality.
Digital Signal Processor (DSP)
The digital signal processor translates the phonetic language specification of the text produced by the NLP into speech. The main challenge of the DSP is to produce a voice that is both intelligible and natural. Two methods are used:
- Formant Synthesis. Formant synthesis seeks to model the human voice with computer-generated sounds, using an acoustic model. Typically, this method produces intelligible, but not very natural, speech. These are the robotic voices, like MS Mike, that people often associate with text to speech. Although not acceptable for eLearning, these voices have the advantage of being small, fast programs. They therefore find use in embedded systems and in applications where naturalness is not required, such as toys and assistive technology.
- Concatenative Synthesis. Concatenative synthesis is what gives Paul and Heather their remarkable naturalness. A recording of a real human voice is broken down into acoustic units (phonemes, syllables, words, phrases and sentences), which are stored in a database. The processor retrieves acoustic units from the database in real time and connects (concatenates) them to best match the input text.
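To make the idea of concatenation concrete, here is a minimal Python sketch. It is a toy, not how any vendor's engine works: the "database" is just a set of text labels standing in for recorded units, and the names `select_units`, `count_joins`, `small_db` and `large_db` are hypothetical. It uses a simple greedy longest-match rule, whereas real systems weigh how well each unit fits and how smoothly it joins its neighbors.

```python
def select_units(text, database):
    """Cover the text with stored units, always taking the longest match first.

    `database` is a set of strings; each string stands in for one recorded
    acoustic unit (a word or a phrase). This is an illustrative assumption.
    """
    words = text.lower().split()
    units = []
    i = 0
    while i < len(words):
        # Try the longest span starting at word i, then shrink it.
        for j in range(len(words), i, -1):
            candidate = " ".join(words[i:j])
            if candidate in database:
                units.append(candidate)
                i = j
                break
        else:
            # No stored unit covers this word at all.
            raise KeyError(f"no unit covers {words[i]!r}")
    return units


def count_joins(units):
    """Each boundary between two units is a place where a glitch could occur."""
    return max(len(units) - 1, 0)


# A small database holds only single words; a larger one also holds phrases.
small_db = {"your", "own", "bright", "eyes"}
large_db = small_db | {"your own", "bright eyes"}

text = "your own bright eyes"
for name, db in (("small", small_db), ("large", large_db)):
    units = select_units(text, db)
    print(name, units, "joins:", count_joins(units))
```

With the word-only database the phrase needs four units and three joins; with the larger database it needs two units and one join, so there are fewer places for a glitch.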
Concatenative Synthesis and Quality
Concatenative synthesis joins many smaller sounds together to form the voice, which suggests where glitches can arise. A glitch occurs either because the database holds no recording of exactly the sound that is needed, or because two segments do not come together quite right where they are joined. The main strategy is to choose database segments that are as long as possible, such as phrases and even whole sentences, to minimize the number of joins and therefore the number of connection glitches.
Here is an example of a glitch in Paul when joining the two words “bright” and “eyes”. (It wasn’t easy to find a glitch in Paul – finally found one in a Shakespeare sonnet!)
- Mike - bright eyes
- Heather - bright eyes
- Paul - bright eyes
The output from the best concatenative systems is often indistinguishable from real human voices. Maximum naturalness typically requires a very large speech database: in general, the larger the database, the higher the quality. TTS voice databases that are acceptable for eLearning are typically on the order of 100-200 MB. For lower-fidelity applications such as telephony, the acoustic unit files can be made smaller by using a lower sampling rate without sacrificing intelligibility and naturalness, which yields a smaller database (a smaller footprint).
By the way, the database is only used to generate the sounds which are then stored as .wav, .mp3, etc. It is not brought along with the eLearning piece itself. So a large database is generally a good thing.
Here is a list of the TTS voices offered by NeoSpeech, Acapela and Nuance with their file sizes and sampling rates.
Voice | Vendor | Sampling rate (kHz) | File Size (MB) | Applications |
Paul | NeoSpeech | 8 | 270 (Max DB) | Telephone |
Paul | NeoSpeech | 16 | 64 | Multi-media |
Paul | NeoSpeech | 16 | 490 (Max DB) | Multi-media |
Kate | NeoSpeech | 8 | 340 (Max DB) | Telephone |
Kate | NeoSpeech | 16 | 64 | Multi-media |
Kate | NeoSpeech | 16 | 610 (Max DB) | Multi-media |
Heather | Acapela | 22 | 110 | Multi-media |
Ryan | Acapela | 22 | 132 | Multi-media |
Samantha | Nuance | 22 | 48 | Multi-media |
Jill | Nuance | 22 | 39 | Multi-media |
The file size is a combination of the sampling rate and the database size, where the database size is related to the number of acoustic units stored. For example, the second and third rows of the table (both Paul) have the same sampling rate, 16 kHz, but the third has a much bigger file size because of its larger database. In general, the higher sampling rates are used for multimedia applications and the lower sampling rates for telecommunications. Larger sizes often also indicate a higher price point.
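As a rough illustration of why the sampling rate affects size, the sketch below computes the storage needed for the same amount of recorded speech at two sampling rates. It assumes uncompressed 16-bit mono audio, which is an assumption made only for the arithmetic; the vendors' actual database formats are not described here and may well be compressed or contain more than raw samples. The function name `pcm_bytes` is hypothetical.

```python
def pcm_bytes(sample_rate_hz, seconds, bytes_per_sample=2):
    """Bytes needed for uncompressed mono audio (assumed 16-bit samples)."""
    return sample_rate_hz * seconds * bytes_per_sample


one_hour = 3600  # seconds of recorded speech, chosen only as an example

size_16k = pcm_bytes(16_000, one_hour)
size_8k = pcm_bytes(8_000, one_hour)

print(f"16 kHz: {size_16k / 1_000_000:.1f} MB")
print(f" 8 kHz: {size_8k / 1_000_000:.1f} MB")
```

Under these assumptions, halving the sampling rate halves the storage for the same recordings. The number of acoustic units stored is the other factor, which is why two voices at the same sampling rate can differ so much in size.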
The DSP voice quality is then a combination of two factors: the sampling rate, which determines the voice fidelity, and the database size, which determines the quality of concatenation and the frequency of glitches. The more acoustic units stored in the database, the better the chances of achieving a perfect concatenation without glitches.
And don’t forget to factor in Text-to-Speech NLP Quality. Together with DSP quality you get the overall quality of different Text-to-Speech solutions.
Fascinating stuff, Tony. I know our department longs for the day when TTS is viable, especially for localization. Our production costs would drop significantly.
Hmm, business is the focus here, I think...
I would like to have learned more about what the "glitch" was in Paul. His rendition of "bright eyes" sounded the more natural of the three examples offered.
Peggy