This post is a new kind of thing for me. Dr. Joel Harband wrote most of this post and I worked with him on the focus, the content and a little bit of editing - actually I couldn't help myself and I edited this a lot. So this is really a combined effort at this point.
As you know, text-to-speech is something that's very interesting to me, and Joel knows a lot about it as CEO of Tuval Software Industries, maker of Speech-Over Professional. This software adds text-to-speech voice narration to PowerPoint presentations and is used for training and eLearning at major corporations.
Joel was nice enough to jump in and share his knowledge of applying text-to-speech technology to eLearning.
Please let me know if this kind of thing makes sense and maybe I'll do more of it. It certainly makes sense given all that's going on in my personal life.
Text-to-Speech Poised for Rapid Growth in eLearning
Text-to-speech (TTS) is now at the point where virtual classrooms were about four years ago, when they reached the technological maturity to go mainstream. It took a couple more years before I could say (in 2009) that virtual classrooms had reached a tipping point.
Text-to-speech has reached the point of technical maturity. As such, we are standing at the threshold of a technology shift in our industry: text-to-speech voices are set to replace professional voice talents for adding voice narration in e-learning presentations. Text-to-speech can create professional voice narration without any recording which provides significant advantages:
- keeps narrated presentations continuously up to date (it's too time consuming/expensive to rerecord human narration)
- faster development - streamlined workflow
- lower costs.
It's being adopted today in major corporations, but it's still early in the adoption cycle. That said, at a developers' conference in 2004, Bill Gates observed that although speech technology was one of the most difficult areas, even partial advances could spawn successful applications. This is now the case for text-to-speech: it's not yet perfect, but it is good enough for a whole class of applications, especially eLearning and training. The reason is that most people learn out of necessity and will accept a marginal reduction in naturalness as long as the speech is clear and intelligible.
There's a lot going on behind the scenes to make text-to-speech work in eLearning. Like most major innovations it needs to be accompanied by a slew of minor supporting innovations that make it practical, easy to use and effective: modulating the voice with speed, pitch and emphasis, adding silent delays, adding subtitles, pronouncing difficult words and coordinating voice with visuals.
Over the course of a few posts, we will attempt to bring readers up to speed on different aspects of this interesting and important subject. The focus of this post is the quality of text-to-speech as determined by natural language processing.
To understand how to think about text-to-speech voices and how they compare, it's important to have some background about what they are. Text-to-speech (TTS) is the automatic production of spoken speech from any text input.
The quality criteria for text-to-speech voices are pretty simple. They are:
- Intelligibility - how clearly and accurately the speech can be understood.
- Naturalness - how closely the voice resembles a human speaker.
Due to recent improvements in processing speed, speech recognition and synthesis, and the availability of large text and speech databases for modeling, text-to-speech systems now exist that meet both criteria to an amazing degree.
A TTS voice is a computer program that has two major parts:
- a natural language processor (NLP) which reads the input text and translates it into a phonetic language and
- a digital signal processor (DSP) that converts the phonetic language into spoken speech.
Each of these parts has a specific role and by understanding a bit more about what they do, you can better evaluate quality of the result.
Natural Language Processor (NLP) and Quality
The natural language processor is what knows the rules of English grammar and word formation (morphology). The natural language processor is able to determine the part of speech of each word in the text and thus to determine its pronunciation. More precisely, here's what the natural language processor does:
- Expands abbreviations, acronyms, numbers, etc. to full text according to a dictionary.
- Determines all possible parts of speech for each word, according to its spelling (morphological analysis).
- Considers the words in context, which allows it to narrow down and determine the most probable part of speech of a word (contextual analysis).
- Translates the incoming text into a phonetic language, which specifies exactly how each word is to be pronounced (Letter-To-Sound (LTS) module).
- Assigns a “neutral” prosody based on division of the sentence into phrases.
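To make these stages concrete, here is a deliberately tiny sketch of a natural language front-end in Python. The abbreviation dictionary, the context rule, and the pronunciation spellings are all invented for illustration; a real NLP module uses large statistical models and a genuine phonetic alphabet rather than these toy lookups:

```python
# Toy sketch of a TTS natural language front-end (illustration only).
ABBREVIATIONS = {"Dr.": "Doctor", "etc.": "et cetera", "kg": "kilograms"}

# Heteronyms: words whose pronunciation depends on their part of speech.
HETERONYMS = {
    "record": {"noun": "REH-kerd", "verb": "rih-KORD"},
}

def expand_abbreviations(words):
    """Step 1: expand abbreviations to full text via a dictionary."""
    return [ABBREVIATIONS.get(w, w) for w in words]

def guess_pos(words, i):
    """Steps 2-3: crude contextual analysis -- after a determiner
    like 'the', a heteronym is probably a noun; otherwise a verb."""
    if i > 0 and words[i - 1].lower() in ("the", "a", "this"):
        return "noun"
    return "verb"

def to_phonetic(words):
    """Step 4: letter-to-sound -- map each word to a pronunciation."""
    out = []
    for i, w in enumerate(words):
        key = w.lower().strip(".,")
        if key in HETERONYMS:
            out.append(HETERONYMS[key][guess_pos(words, i)])
        else:
            out.append(key)  # real systems apply letter-to-sound rules here
    return out

words = "Record the record".split()
print(to_phonetic(expand_abbreviations(words)))
# ['rih-KORD', 'the', 'REH-kerd']
```

Even this toy version shows why context matters: the same spelling, "record", comes out with two different pronunciations depending on the word before it.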
This will make more sense by going through examples. And this also provides a roadmap to test quality.
We’ll compare the quality of three TTS voices:
- Mike - a voice provided by Microsoft in Windows XP (old style).
- Paul - a voice by NeoSpeech - the voice used in Adobe Captivate.
- Heather - a voice by Acapela Group.
Actually, let me have them introduce themselves. Click on the link below to hear them:
- I'm Mike, an old style robotic voice provided by Microsoft in Windows XP.
- I'm Paul, a state of the art voice provided by NeoSpeech.
- I'm Heather, a state of the art voice provided by Acapela-Group.
So, let's put these voices through their paces to see how they do. In this section, we are testing the natural language processor and its ability to resolve part-of-speech ambiguities in the text.
1. Ambiguity in noun and verb
“Present” can be a noun or a verb, depending on the context. Let’s see how the voices do with the sentence:
“No time like the present to present this present to you.”
Paul and Heather resolve this ambiguity with ease.
Another example: “record” can be a noun or a verb: “Record the record in record time.”
Again, Paul and Heather resolve this ambiguity with ease.
2. Ambiguity in verb and adjective
The word “separate” can be a verb or an adjective.
“Separate the cards into separate piles.”
Only Paul gets it right.
3. Word Emphasis (Prosody)
Another type of ambiguity is word emphasis in a sentence: the intended meaning of a spoken sentence often depends on the word that is emphasized, as in: “*He* reads well”, “He *reads* well”, “He reads *well*”. This is called prosody, and it is impossible to determine from plain text alone. The voices aim for a “neutral” prosody that tries to cover all possible meanings. A better way is to use modulation tags to emphasize a word directly. We'll discuss that in a later post.
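As a preview, modulation tags of this kind are typically written in the W3C Speech Synthesis Markup Language (SSML), which many TTS engines accept (support varies by engine). A small fragment like the following marks the word to stress and adjusts pacing and pitch:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  He reads <emphasis level="strong">well</emphasis>.
  <break time="500ms"/>
  <prosody rate="slow" pitch="+10%">He reads well.</prosody>
</speak>
```

Here `<emphasis>` stresses a single word, `<break>` inserts a silent delay, and `<prosody>` changes speed and pitch over a whole phrase.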
4. Abbreviations
Most voices are equipped to translate common abbreviations.
The temperature was 30F, which is -1C.
It weighed 2 kg, which is about 4.5 lb.
Let's meet at 12:00
Heather does the best job.
5. Technical Words
Unless they are equipped with specialized dictionaries, TTS voices will occasionally fail to read technical words correctly. However, they can always be taught to say them correctly by using a phonetic language. Here are some examples. Each voice says the word twice: first by itself (incorrectly) and then after being taught (correctly).
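In SSML-capable engines, this "teaching" is often done with the `<phoneme>` element, which supplies the pronunciation directly in a phonetic alphabet such as IPA. The word and IPA string below are just an illustration, not taken from any particular product's dictionary:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  The drug <phoneme alphabet="ipa" ph="əˌsiːtəˈmɪnəfən">acetaminophen</phoneme>
  reduces fever.
</speak>
```

The engine speaks the `ph` string instead of guessing from the spelling, so the correction is made once in the source text rather than re-recorded.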