Using Punctuation and Mark-Up Language to Increase Text-to-Speech Quality

This post is part of the series on Text-to-Speech (TTS) for eLearning written by Dr. Joel Harband and edited by me. The other posts are: Text-to-Speech Overview and NLP Quality, Digital Signal Processor and Text-to-Speech, Using Text-to-Speech in an eLearning Course, Text-to-Speech eLearning Tools - Integrated Products, and seemingly the most popular of the series so far, Text-to-Speech vs Human Narration for eLearning.

One of the concerns raised in comments during the series has been the quality of Text-to-Speech (TTS) voices and whether it is good enough for eLearning. This issue was partly addressed in the previous post. In this post we’ll take a different cut at it by looking at how authors can use punctuation and mark-up language with TTS voices to bring out the meaning of the text more accurately and to make the voices more interesting. Using these techniques a voice can be made similar enough to human narration to hold a learner’s interest during an entire eLearning course - with a retention rate equivalent to that of a human voice.

Value and Concern Around Voice-Over

Before we jump into this specific topic, let’s look back at some of the specifics from last month’s Big Question - Voice Over in eLearning. Here’s a very quick summary of some of the responses regarding the added learning value of a voice-over as opposed to plain screen text:

Audio provides an additional channel of information which the brain can process in parallel with the visual information [Kapp].
A voice should not just read screen text [Kapp] but can optionally be supported by running subtitles at the bottom of the slide as in Captivate and Speech-Over [Joel].
A great deal more information per slide can be transferred with voice than with plain text. One minute of speech is equivalent to 125 words – which would crowd the slide considerably [Joel].
A lively and interesting voice can motivate learning and increase retention. [Mike Harrison]
A voice can often express the intended meaning more accurately than plain text by changing speed, volume and pitch, emphasizing words, and pausing for emphasis [Mike Harrison]. (This is the prosody that we discussed in the first post.) For example: HE reads well. He READS well. He reads WELL.

It’s these last two points that relate closely to this topic. Ultimately, we would like the voice (human or TTS) to be lively and interesting, help increase motivation and learning, and convey the meaning more accurately.

Some of the concern around the use of Text-to-Speech Voices in eLearning is whether you can achieve that level of voice use.

Making the Author into a Voice Talent

Today’s post aims to show that with state-of-the-art tools that simplify the use of markup language, like Speech-Over Professional, TTS voices can easily be made interesting as well as prosody-accurate (points 4 and 5 above).

The concept presented here is a bit of a change in thinking:

An author together with a TTS voice is equivalent to a voice talent!

While handling the grammar quite well, the TTS voice by itself cannot know the nuances and emphases (prosody) needed to bring out the intended meaning of the sentence and will produce a compromise prosody. Authors need to fill the gap. Some people in the world of TTS call them “Text Authors.” Throughout this post, we will refer to them simply as “authors” as they are likely also the course author. Authors know what the voice should sound like. They use punctuation and mark-up language to make the TTS voice achieve the intended meaning and clarity, and to enliven it.

In some ways this is not that new for people who have worked with voice talent before. If you’ve ever worked a recording session, you have sat there and listened to what’s being said and often corrected the phrasing, pronunciation, pacing, and other aspects of how the voice talent is handling the script that you have written. What we are saying is that there’s an equivalent operation when dealing with TTS voices. You need to listen to the results and make corrections. Of course, as we’ve pointed out in Using Text-to-Speech in an eLearning Course, the effort to make changes is likely substantially less.

The Basics

Let’s see an example of what we are talking about. Here is a clip of the TTS voice Heather reading Elizabeth Barrett Browning’s poem “How Do I Love Thee?” produced by Speech-Over Professional.

How Do I Love Thee?

How do I love thee? Let me count the ways.

I love thee to the depth and breadth and height

My soul can reach, when feeling out of sight

For the ends of Being and ideal Grace.

I love thee to the level of every day's

Most quiet need, by sun and candlelight.

I love thee freely, as men strive for Right;

I love thee purely, as they turn from Praise.

I love thee with the passion put to use

In my old griefs, and with my childhood's faith.

I love thee with a love I seemed to lose

With my lost saints, I love thee with the breath,

Smiles, tears, of all my life! and, if God choose,

I shall but love thee better after death.

As you listen, note that a few simple uses of punctuation and markup language, applied with Speech-Over Professional’s SAPI editor, improve on how the voice would have read the poem by default.

The Speech-Over SAPI editor lets authors apply markup language quickly and accurately with simple text symbols, which are as easy to use as ordinary punctuation. The symbols used in this example are the em-dash (—), which inserts a 0.5 sec silent delay, and the right and left arrows (⊳, ⊲), which decrease and increase the voice speed by one unit.
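To make the idea of symbols standing in for speech tags concrete, here is a minimal Python sketch of such a translation. It is not Speech-Over Professional's actual implementation: the function name, the symbol table, and the choice of empty SAPI 5 tags with a relative `speed` attribute are all assumptions, chosen only to match the behavior described in this post.

```python
import re

# Shorthand symbols as described in the post. The XML each one expands to is
# an assumption; the product's real output is not documented here.
EM_DASH = "\u2014"  # inserts a 0.5 sec silent delay
SLOWER = "\u22b3"   # right arrow: decrease speed by one unit
FASTER = "\u22b2"   # left arrow: increase speed by one unit

SYMBOL_TO_TAG = {
    EM_DASH: '<silence msec="500"/>',
    SLOWER: '<rate speed="-1"/>',  # assumed: relative rate change, empty tag
    FASTER: '<rate speed="1"/>',
}
_SYMBOLS = re.compile("|".join(re.escape(s) for s in SYMBOL_TO_TAG))


def symbols_to_sapi(text: str) -> str:
    """Replace each shorthand symbol with its assumed SAPI speech tag."""
    return _SYMBOLS.sub(lambda m: SYMBOL_TO_TAG[m.group(0)], text)


marked_up = (
    "How do I love thee? " + EM_DASH + " Let me count the ways: " + EM_DASH
    + " " + SLOWER + "I love thee" + FASTER + " to the depth and breadth and height"
)
print(symbols_to_sapi(marked_up))
```

The author only ever types the symbols; the tags are generated, which is what keeps the markup from being mistyped.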

Listen to the effect of ordinary punctuation on the voice in the example:

The question mark is obvious - Heather expresses it very nicely.
The colon after "Let me count the ways:" gives a feeling of expectation for what’s to come. Putting a comma or period there would not give the same flow. Colons are generally used to introduce sequences to good effect.
Commas are used to give phrasing and resolve ambiguous sentences. They are a powerful tool and are used more often than proper punctuation would require.

Listen also to the effect of the markup language:

A delay (—) was placed between “How do I love thee” and “Let me count the ways” to express a slight hesitation for thought and then again after “Let me count the ways” to further hesitate for thought before stating the reasons.
Delays are also inserted throughout to introduce the hesitations that make the voice more realistic.
The decrease and increase in speed for groups of words give them a slight accent and emphasis. For example, the words “I love thee”, “most quiet need”, etc., have a speed decrease before them and a return to normal speed afterwards to give them a slight accent, depth, and emotional content. The amount of accent is controlled by the amount of speed reduction: two units (⊳⊳) or one (⊳). A similar effect can be achieved by the emphasis tag (!!).
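The slow-down-then-restore pattern described above can also be written as a single enclosing tag. The sketch below is an illustration under assumptions: `accent` is a hypothetical helper, and it uses the content form of the SAPI Rate tag with a relative `speed` value rather than whatever the editor emits internally.

```python
def accent(phrase: str, units: int = 1) -> str:
    """Slow a phrase by `units` relative rate steps so it stands out.

    The enclosing tag restores the previous speed when it closes, which
    mirrors the decrease-before / increase-after pattern in the post.
    """
    if units < 1:
        raise ValueError("units must be a positive number of rate steps")
    return f'<rate speed="-{units}">{phrase}</rate>'


line = (
    accent("I love thee")
    + " to the level of every day's "
    + accent("Most quiet need", units=2)
    + ", by sun and candlelight."
)
print(line)
```

A two-unit reduction gives the stronger accent, matching the doubled arrow in the editor's shorthand.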

Also, Heather’s slight, natural Southern accent is there because the voice was made from a real Southerner’s voice!

Now let’s look at these concepts in more detail.

Using Punctuation

The judicious use of punctuation goes a long way towards making the voices more expressive and precise, especially the comma and the colon.

Let’s see how the prosody of the following sentence becomes clearer as we add punctuation:

A color is described in three ways by its name how pure it is and its value. (no punctuation; voice: Paul)
A color is described in three ways: by its name how pure it is and its value. (adding a colon for expectation; voice: Paul)
A color is described in three ways: by its name, how pure it is, and its value. (adding commas for phrasing; voice: Paul)

In our experience, the really good voices like Paul and Heather do quite well on their own most of the time with well-placed commas, colons, and silent delays only.

Mark-Up Language

As we mentioned in the first post, many “small” innovations are needed to make text to speech useful and practical. The most important of these is the programming standard Microsoft Speech Application Programming Interface (SAPI) for Windows. SAPI standardizes the way authors control TTS voices: starting and stopping the voice, controlling its speed, volume and pitch, and its flow with silent delays. Manufacturers of SAPI-standard voices implement the SAPI controls in the voice software and developers of speech applications program SAPI controls into their applications to let the user control any SAPI-standard voice.

To control the properties and flow of the voice, SAPI provides an XML markup language, also called speech tags. The tags are added to the input text to tell the voice processor what actions to take when converting the text to speech.

Some examples:

1. Volume - The Volume tag controls the volume of a voice on a scale of 0 to 100. The voice will change volume at the point it encounters the tag.

This text should be spoken at volume level 100.

<volume level="50">

This text should be spoken at volume level fifty.

</volume>

2. Rate - The Rate tag controls the rate (speed) of a voice on a scale of -10 to 10. The voice will change speed at the point it encounters the tag.

This text should be spoken at rate 0.

<rate absspeed="3"> This text should be spoken at rate 3.

<rate absspeed="-3"> This text should be spoken at rate -3.

</rate> </rate> Heather

The Pitch tag is used in the same way as the Rate tag, but it changes the pitch (frequency) of the voice rather than its speed.

3. Emphasis - The Emph tag instructs the voice to emphasize a word or section of text.

<emph> boo </emph>!

Use the Emph tag to determine the prosody of an ambiguous sentence, for example the one referred to in the first post.

“<emph>He</emph> reads well” (voice: Paul)
“He <emph>reads</emph> well” (voice: Paul)
“He reads <emph>well</emph>” (voice: Paul)

4. Silence - The Silence tag inserts a specified number of milliseconds of silence into the output audio stream.

Five hundred milliseconds of silence <silence msec="500"/> just occurred.

This is a very important tag for the naturalness of the voice.

5. Pronounce - The Pron tag inserts a specified pronunciation using the SYM phonetic language. Here is “Hello world” in SYM.

<pron sym="h eh 1 l ow & w er 1 l d "/>

This tag lets you instruct the voice how to say highly technical words and company slogans. See the first post for an example.

6. Part of Speech - The PartOfSp tag lets you resolve the part of speech of a word.
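The tag examples above can be generated rather than typed. The following Python sketch uses hypothetical helper names and assumes the enclosed text contains no `<` or `&` characters. It builds the Volume, Rate, Silence, and Emph tags with the ranges quoted in this post, and produces the three readings of “He reads well”.

```python
def volume(text: str, level: int) -> str:
    """Wrap text in a Volume tag; the scale is 0 to 100."""
    if not 0 <= level <= 100:
        raise ValueError("volume level must be between 0 and 100")
    return f'<volume level="{level}">{text}</volume>'


def rate(text: str, absspeed: int) -> str:
    """Wrap text in a Rate tag; the scale is -10 to 10."""
    if not -10 <= absspeed <= 10:
        raise ValueError("rate must be between -10 and 10")
    return f'<rate absspeed="{absspeed}">{text}</rate>'


def silence(msec: int) -> str:
    """Return a Silence tag for the given number of milliseconds."""
    if msec < 0:
        raise ValueError("silence length cannot be negative")
    return f'<silence msec="{msec}"/>'


def emphasis_variants(sentence: str) -> list[str]:
    """Return one copy of the sentence per word, emphasizing that word."""
    words = sentence.split()
    return [
        " ".join(f"<emph>{w}</emph>" if i == j else w for j, w in enumerate(words))
        for i in range(len(words))
    ]


for variant in emphasis_variants("He reads well"):
    print(variant)
print(rate("This text should be spoken at rate 3.", 3))
print("Five hundred milliseconds of silence " + silence(500) + " just occurred.")
```

Because each helper closes its own tag, the output stays balanced even when one helper's result is passed as the text of another.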

Notes:

· Not all voices have all the tags implemented; for example, Heather does not have an emph tag.

· The NeoSpeech voices in Captivate do not use the SAPI tags but rather a proprietary markup language, VTML. Speech-Over works with SAPI-standard voices only.

· For more info about SAPI and its markup language, download sapi.chm from here.
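Since not every voice implements every tag, a script may need adjusting before it is sent to a particular voice. Here is a hedged Python sketch of one way to do that. The capability table is hypothetical apart from the Heather entry, which reflects the note above, and dropping a tag while keeping its text is just one possible policy.

```python
import re

# Hypothetical capability table. Only the Heather entry comes from the post;
# a real table would have to be built by testing each voice.
UNSUPPORTED_TAGS = {"Heather": {"emph"}}


def strip_unsupported(markup: str, voice: str) -> str:
    """Remove tags the voice does not implement, keeping the enclosed text."""
    for tag in UNSUPPORTED_TAGS.get(voice, set()):
        markup = re.sub(rf"</?{re.escape(tag)}\b[^>]*>", "", markup)
    return markup


script = "He <emph>reads</emph> well"
print(strip_unsupported(script, "Heather"))
print(strip_unsupported(script, "Paul"))
```

For a voice without the Emph tag, the speed-reduction technique described earlier in this post is the alternative way to accent a word.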

Automating the markup language – SAPI editor

Clearly, having to type in or even paste these XML tags into the input text is time-consuming and error-prone. This is another case where a small innovation is called for: as discussed above, Speech-Over Professional has a SAPI editor that represents XML tags with simple text symbols - which makes it very easy and error-proof to insert and manipulate speech tags in the input text. Speech-Over Professional also automates the Pron tag with its pronunciation lexicon, which you can use to add highly technical terms and company slogans.
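A pronunciation lexicon of this kind can be pictured as a lookup table applied to the script before it is spoken. The Python sketch below is an assumption about how such automation might work, not a description of Speech-Over Professional. The function name is hypothetical, and the two lexicon entries are simply the “Hello world” SYM string from this post split at its word-boundary marker.

```python
import re

# Hypothetical lexicon mapping lowercase terms to SYM phoneme strings.
# Both entries are derived from the "Hello world" example in this post.
LEXICON = {
    "hello": "h eh 1 l ow",
    "world": "w er 1 l d",
}


def apply_lexicon(text: str, lexicon: dict[str, str]) -> str:
    """Replace each lexicon term in the text with an empty Pron tag."""
    if not lexicon:
        return text
    # Try longer terms first so a phrase wins over a word it contains.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(
        lambda m: f'<pron sym="{lexicon[m.group(0).lower()]}"/>', text
    )


print(apply_lexicon("Hello there, world.", LEXICON))
```

Keeping the phoneme strings in one table means a technical term or slogan is corrected once and then pronounced consistently across the whole course.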

Bottom Line

You may be thinking that some of the cost savings that you get from using TTS as compared to human voice talent are lost in this effort, and that’s true. However, the rework effort is still substantially less. Again, the best comparison is that of going through a recording session with a script. That process is very similar to the punctuation and markup work you do on the text to make the TTS voice much better for eLearning.

For me personally, this is still not the same quality as a good voice talent, but it definitely costs less, and MUCH less when the content changes. It’s a good balance in many situations.

8 comments:

Anonymous said...: Looks like this might be interesting, but all of the audio links go to "page not found".; 10/20/2010 03:45:00 PM
Tony Karrer said...: My original post had incorrect links. This page should be correct now - but emails and RSS feeds will still be wrong. Sorry.; 10/20/2010 03:57:00 PM
Unknown said...: I have used synthetic sound at two places

a. Structural Eng video

and here

b. "Degree of Freedom E-Lecture with Animated Agent"

Computer generated sound is OK but as you mentioned it is no replacement for real human voice yet. These sounds were created without Sapi mark ups. I have tried some with some Sapi mark ups but the improvement was not worth the efforts so I have stayed away from this approach.; 10/20/2010 05:40:00 PM
Tony Karrer said...: Javed - interesting examples. Thanks for sharing. What voices did you use for that?; 10/23/2010 05:39:00 AM
Unknown said...: Tony

It was done few years ago and I used neospeech version 1.0

Now I have Neospeech 2.0 that is I think Sapi 5 compliant

Here is another example

Animated Characeter; 10/24/2010 04:47:00 PM
zillustration said...: I am curious if there is an XML tag to change the voices as with the Apple/Windows VoiceOver. I would like to have a Announcer tag (deep voice), Female 1 (mom), Female 2 (daughter), Male 1 (father), and Descriptive Narrator (sample; "Father walks to table and picks up his coffee cup...")

Are there XML tags to do such switches within the OS from a text doc? Just wishful thinking?

continuing to dig.; 11/09/2010 07:21:00 AM
Mike Harrison said...: Perhaps eventually TTS can be made to sound more human. But adding commas and other punctuation plus editing html code to vary the volume of syllables and words is a whole lot of extra work to produce what an experienced human speaker can do right off the bat, in far less time.

I don't see how anyone could sit still and listen intently (which is what learning requires) for more than just a couple of minutes to a 'voice' that has either little inflection or inflection on the wrong words and syllables. And even if someone was able to listen to synthesized speech for any length of time, how much of what was heard will be retained?

It is my opinion - especially for the U.S. with its alarming education scores, that far too much fiddling around trying to 'perfect' TTS technology for instruction is placing more emphasis on the 'e' rather than on 'learning.'

Too many people (beyond those with a microscopic attention span) are not motivated to learn new things even in a real classroom because there is nothing to create and maintain a genuine interest. Are we to understand that dull, uninspired synthesized speech is better?

Adding punctuation and futzing with html code in hopes of making synthesized speech sound more appealing is akin to reinventing the wheel. If I'm hungry, I know there are plenty of frozen foods available. But if I want a satisfying meal, I'm not going to the freezer.

We should all remember from school: a boring subject coupled with an even less interesting teacher was the formula for failure. Rather, if we improve the quality of the writing of the subject matter (lessons) to the point where it is actually interesting, and have the lessons delivered by an experienced speaker with the ability to reach and hold an audience, we'd be on the road to making education better.; 1/19/2011 06:05:00 AM
Morrison said...: Interesting article, however have seen an error, and a proposed change:

1) "The Pitch tag works the same as the Rate tag." is incorrect. Pitch is frequency, rate is speed. Try set oTTS = createobject("SAPI.SpVoice") : oTTS.Speak "This is a test" : oTTS.Speak "This is a test" : oTTS.Speak "This is a test" (VBS) as a demonstration.

2) The tags, as shown above, should probably be ended in the same tag, instead of nested like the examples.; 10/31/2012 05:44:00 AM