This post is a new kind of thing for me. Dr. Joel Harband wrote most of this post and I worked with him on the focus, the content and a little bit of editing - actually I couldn't help myself and I edited this a lot. So this is really a combined effort at this point.
As you know, text-to-speech is something that's very interesting to me, and Joel knows a lot about it as CEO of Tuval Software Industries, maker of Speech-Over Professional. This software adds text-to-speech voice narration to PowerPoint presentations and is used for training and eLearning at major corporations.
Joel was nice enough to jump in and share his knowledge of applying text-to-speech technology to eLearning.
Please let me know if this kind of thing makes sense and maybe I'll do more of it. It certainly makes sense given all that's going on in my personal life.
Text-to-Speech Poised for Rapid Growth in eLearning
Text-to-speech (TTS) is now at the point where virtual classrooms were about four years ago, when they reached the technological maturity to go mainstream. It took a couple more years before I could say (in 2009) that virtual classrooms had reached a tipping point.
Text-to-speech has reached the point of technical maturity. As such, we are standing at the threshold of a technology shift in our industry: text-to-speech voices are set to replace professional voice talents for adding voice narration in e-learning presentations. Text-to-speech can create professional voice narration without any recording which provides significant advantages:
- keeps narrated presentations continuously up to date (it's too time consuming/expensive to rerecord human narration)
- faster development - streamlined workflow
- lower costs.
It's being adopted today in major corporations, but it's still early in the adoption cycle. That said, at a developers' conference in 2004, Bill Gates observed that although speech technology was one of the most difficult areas, even partial advances could spawn successful applications. This is now the case for text-to-speech: it's not yet perfect, but it is good enough for a whole class of applications, especially eLearning and training. The reason is that most people learn out of necessity and will accept a marginal reduction in naturalness as long as the speech is clear and intelligible.
There's a lot going on behind the scenes to make text-to-speech work in eLearning. Like most major innovations it needs to be accompanied by a slew of minor supporting innovations that make it practical, easy to use and effective: modulating the voice with speed, pitch and emphasis, adding silent delays, adding subtitles, pronouncing difficult words and coordinating voice with visuals.
Over the course of a few posts, we will attempt to bring readers up to speed on different aspects of this interesting and important subject. The focus of this post is the quality of text-to-speech as determined by natural language processing.
To understand how to think about text-to-speech voices and how they compare, it's important to have some background about what they are. Text-to-speech (TTS) is the automatic production of spoken speech from any text input.
The quality criteria for text-to-speech voices are pretty simple. They are:
- Intelligibility - how clearly and accurately the speech can be understood.
- Naturalness - how closely the voice resembles a human speaker.
Due to recent improvements in processing speed, speech recognition and synthesis, and the availability of large text and speech databases for modeling, text-to-speech systems now exist that meet both criteria to an amazing degree.
A TTS voice is a computer program that has two major parts:
- a natural language processor (NLP) which reads the input text and translates it into a phonetic language and
- a digital signal processor (DSP) that converts the phonetic language into spoken speech.
Each of these parts has a specific role and by understanding a bit more about what they do, you can better evaluate quality of the result.
Natural Language Processor (NLP) and Quality
The natural language processor is what knows the rules of English grammar and word formation (morphology). The natural language processor is able to determine the part of speech of each word in the text and thus to determine its pronunciation. More precisely, here's what the natural language processor does:
- Expands abbreviations, acronyms, numbers, etc. to full text according to a dictionary.
- Determines all possible parts of speech for each word, according to its spelling (morphological analysis).
- Considers the words in context, which allows it to narrow down and determine the most probable part of speech of a word (contextual analysis).
- Translates the incoming text into a phonetic language, which specifies exactly how each word is to be pronounced (Letter-To-Sound (LTS) module).
- Assigns a “neutral” prosody based on division of the sentence into phrases.
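To make these stages concrete, here is a deliberately tiny sketch of a natural language front-end in Python. The abbreviation dictionary, the context rule, and the pronunciation spellings are all invented for illustration; a real NLP module uses large statistical models and a genuine phonetic alphabet rather than these toy lookups:

```python
# Toy sketch of a TTS natural language front-end (illustration only).
ABBREVIATIONS = {"Dr.": "Doctor", "etc.": "et cetera", "kg": "kilograms"}

# Heteronyms: words whose pronunciation depends on their part of speech.
HETERONYMS = {
    "record": {"noun": "REH-kerd", "verb": "rih-KORD"},
}

def expand_abbreviations(words):
    """Step 1: expand abbreviations to full text via a dictionary."""
    return [ABBREVIATIONS.get(w, w) for w in words]

def guess_pos(words, i):
    """Steps 2-3: crude contextual analysis -- after a determiner
    like 'the', a heteronym is probably a noun; otherwise a verb."""
    if i > 0 and words[i - 1].lower() in ("the", "a", "this"):
        return "noun"
    return "verb"

def to_phonetic(words):
    """Step 4: letter-to-sound -- map each word to a pronunciation."""
    out = []
    for i, w in enumerate(words):
        key = w.lower().strip(".,")
        if key in HETERONYMS:
            out.append(HETERONYMS[key][guess_pos(words, i)])
        else:
            out.append(key)  # real systems apply letter-to-sound rules here
    return out

words = "Record the record".split()
print(to_phonetic(expand_abbreviations(words)))
# ['rih-KORD', 'the', 'REH-kerd']
```

Even this toy version shows why context matters: the same spelling, "record", comes out with two different pronunciations depending on the word before it.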
This will make more sense by going through examples. And this also provides a roadmap to test quality.
We’ll compare the quality of three TTS voices:
- Mike - a voice provided by Microsoft in Windows XP (old style).
- Paul - a voice by NeoSpeech - the voice used in Adobe Captivate.
- Heather - a voice by Acapela Group.
Actually, let me have them introduce themselves. Click on the link below to hear them:
- I'm Mike, an old style robotic voice provided by Microsoft in Windows XP.
- I'm Paul, a state of the art voice provided by NeoSpeech.
- I'm Heather, a state of the art voice provided by Acapela-Group.
So, let's put these voices through their paces to see how they do. In this section, we are testing the natural language processor and its ability to resolve part-of-speech ambiguities in the text.
1. Ambiguity in noun and verb
“Present” can be a noun or a verb, depending on the context. Let’s see how the voices do with the sentence:
“No time like the present to present this present to you.”
Paul and Heather resolve this ambiguity with ease.
Another example: “record” can be a noun or a verb: “Record the record in record time.”
Again, Paul and Heather resolve this ambiguity with ease.
2. Ambiguity in verb and adjective
The word “separate” can be a verb or an adjective.
“Separate the cards into separate piles.”
Only Paul gets it right.
3. Word Emphasis (Prosody)
Another type of ambiguity is word emphasis in a sentence: the intended meaning of a spoken sentence often depends on the word that is emphasized, as in: “*He* reads well”, “He *reads* well”, “He reads *well*”. This is called prosody, and it is impossible to determine from plain text alone. The voices aim for a “neutral” prosody that tries to cover all possible meanings. A better way is to use modulation tags to emphasize a word directly. We'll discuss that in a later post.
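As a preview, modulation tags of this kind are typically written in the W3C Speech Synthesis Markup Language (SSML), which many TTS engines accept (support varies by engine). A small fragment like the following marks the word to stress and adjusts pacing and pitch:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  He reads <emphasis level="strong">well</emphasis>.
  <break time="500ms"/>
  <prosody rate="slow" pitch="+10%">He reads well.</prosody>
</speak>
```

Here `<emphasis>` stresses a single word, `<break>` inserts a silent delay, and `<prosody>` changes speed and pitch over a whole phrase.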
4. Abbreviations
Most voices are equipped to translate common abbreviations.
The temperature was 30F, which is -1C.
It weighed 2 kg, which is about 4.5 lb.
Let's meet at 12:00
Heather does the best job.
5. Technical Words
Unless they are equipped with specialized dictionaries, TTS voices will occasionally fail to read technical words correctly. However, they can always be taught to say them correctly by using a phonetic language. Here are some examples. Each voice says the word twice: first by itself (incorrectly) and then after being taught (correctly).
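In SSML-capable engines, this "teaching" is often done with the `<phoneme>` element, which supplies the pronunciation directly in a phonetic alphabet such as IPA. The word and IPA string below are just an illustration, not taken from any particular product's dictionary:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  The drug <phoneme alphabet="ipa" ph="əˌsiːtəˈmɪnəfən">acetaminophen</phoneme>
  reduces fever.
</speak>
```

The engine speaks the `ph` string instead of guessing from the spelling, so the correction is made once in the source text rather than re-recorded.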