Tony Karrer's eLearning Blog on e-Learning Trends eLearning 2.0 Personal Learning Informal Learning eLearning Design Authoring Tools Rapid e-Learning Tools Blended e-Learning e-Learning Tools Learning Management Systems (LMS) e-Learning ROI and Metrics

Tuesday, September 14, 2010

Text-to-Speech vs Human Narration for eLearning

Some challenging questions are being raised in this month’s Big Question - Voice Over in eLearning.  Some of the key questions:

  • Given the range of solutions for voice-over from text-to-speech, home-grown human voice-over, professional voice-over: how do you decide what's right for your course?
  • How do you justify the budget and how does that factor into your choice of solution?
  • Are there places where text-to-speech makes sense?

This post is part of the series on Text-to-Speech (TTS) for eLearning written by Dr. Joel Harband and edited by me.  The other posts are: Text-to-Speech Overview and NLP Quality, Digital Signal Processor and Text-to-Speech, Using Text-to-Speech in an eLearning Course, and Text-to-Speech eLearning Tools - Integrated Products.  

We attacked these questions a little differently than the big question.  We particularly focused on:

  1. Why use Text to Speech (TTS)?
  2. Why not use human voice-over? Or just use text on the screen?
  3. How will the quality of the voice affect the quality of the learning? How will the students accept the voices?

To best answer these questions, we asked professionals who have had actual experience in the field: people who have produced eLearning courses with text-to-speech tools (Speech-Over Professional) and have received feedback from learners.

You can think of this as four case studies of Text-to-Speech.  The case studies come from:

  • Case Study A. Company-wide training modules by an IT Process & Quality Manager at a Large Global Communications Corporation. 
  • Case Study B. Global web training by a Systems Engineering Manager at a Large Product Corporation.
  • Case Study C. Company-wide training modules by a Lead Courseware Developer at a Security Products Corporation.
  • Case Study D. Support for live presentations by a Process Design Consultant.

Why Use Text-to-Speech?

There was a range of answers to the question:

A. Our company has a prior background in TTS - our phones use TTS - and we've tried TTS before for training. This time it is succeeding because of the price, the voice quality, and the integration with PPT. I think it will only get better with time.

The reasons we use TTS are three fold:

* E-learning with voice-over is a preferred training approach within our company. This allows for people to take their training at their own pace; when and where they want to take it. Voice over is very helpful for our associates that English is not their first language.

* Using human voices makes it more difficult to create and maintain the training. Only a few people have the quality voice with minimal accent to perform the recordings. This creates a resource constraint for the creation and maintenance of e-learning material. Usually, the e-learning was out-of-date with the subject of the training and quickly became obsolete.

* Voice over, especially computer voice, has proven to be helpful to associates that English is not their first language. The computerized voice is more consistent in pronunciation and speaks at a more steady pace. Thus, allowing people to understand the material more easily.

B. It offers a significant advantage over other methods of providing audio with PowerPoint.

C. We were looking for something that provided us with a short production and turn-around time, that our small development team could do in-house. Something easy to edit and change on the fly, without having to send it out, or schedule lengthy voiceover work.

These responses echo what we generally expect: Text-to-Speech offers a solution that is much faster to produce and 100x faster to modify as changes occur.  This means a faster time-to-market and lower cost than human narration.  There are obviously ways to keep human voice-over costs down by using in-house talent, but it still takes significantly more time.  That is especially true when changes occur.

If you think about a simple spectrum of solutions:

Text on Screen, No Voice-Over    ......    Human Narration
Lowest Cost                      ......    Highest Cost

Certainly there’s a balance to be found.  We’ll consider other factors below.

I thought the response from Case Study D was particularly interesting:

D. Initially, I experimented with TTS as a way to add content to a presentation that I as a presenter could use to refresh myself before presenting. I found that the act of adding TTS made me aware of a number of design issues with the presentation. Then I thought: wouldn't this be great as a way for participants to refresh their knowledge after the training.

One recommendation for Text-to-Speech is to use it while authoring any course that will eventually be recorded by humans, as a way to prepare the script.  That way, you have a good idea what it will sound like once it’s recorded.  In this case, the consultant was using Text-to-Speech to prepare for a presentation.  Instead of recording themselves against the slides, they used TTS and could easily listen to their script.  That’s actually a fantastic idea.  And it led this person to eventually use TTS as the basis for creating courses that could be used after the training sessions.

Why didn't you use human voice-over?

Obviously, cost and time are major factors here.  But many of the specific reasons have more to do with the hassle factor of using voice talent.  Here are the responses:

A. Mainly for updating where I don't have to look for the original voice talent who can now charge more. We don't have voice talent available internally.

B. Publishing a straight recording keeps all of the errors of the subject matter expert, speaking too fast, low sound quality, running on or off topic. Maintaining the recorded voice requires an entire rerecording and production where TTS is much simpler.

C. For our first project, we did use human voice-over as well as text. We found that the added production time, and having to schedule around voice over, plus re-doing entire segments for one small correction, to get the sound to match, was prohibitive both cost and time-wise.

D. I don't have a particularly great voice for adding to the slide so that's a factor. But the other factor is that it's 100 times easier to change text than re-record speech. Even if I were to record speech I would first do a TTS and then only after I believed it to be final, might I record.

Anyone who has used in-house or professional talent knows about the hassle factor of getting things done.  You often find yourself not doing retakes when something is wrong or there are changes just because it’s too much work.  Even when you do your own voice-over, there’s still more time involved.  So adding to the spectrum above:

Text on Screen, No Voice-Over    ......    Human Narration
Lowest Cost                      ......    Highest Cost
Easy to Change                   ......    Hard to Change
Lowest Hassle                    ......    Biggest Hassle

Why didn’t you just use Text on Screen?

I think some of the other responses to the Big Question address this much better – why use voice-over at all?  But a couple of the reasons from these case studies have to do with providing support to ESL learners:

B. We asked our students which helps them learn; subtitles only or subtitles with speech. They agreed that subtitles with speech are better. English as second language students even said it helped them learn English.

C. Since our training modules are used world-wide, in English, we wanted voice as well as text (all our training modules have both). Many foreign students have much better vocal/listening comprehension vs. just reading comprehension, if English is not their first language, so having voice as well as text was important to us.

I would highly recommend looking at some of the specific answers to Voice Over in eLearning that speak to when to use voice-over in eLearning.  For example:

  • Learning environment – in some environments, audio is not appropriate.  In other cases, it’s great to have audio to add engagement.
  • To support graphics or animations on screen – large amounts of text would be distracting.

I will caution you that some of the responses suggest that voice-over roughly equates to slower learning with no improved effect, and that it limits your cultural appeal.

There’s also some suggestion that the script should be available with a mute button to be read by learners who prefer that modality.  I would claim this would definitely argue for Text-to-Speech.

Others argue that to capture emotion and to engage, voice-over is very important.

So, my spectrum table becomes woefully inadequate to capture all of this. Anyone want to take a shot?

Concerns About Quality?

In each case, there was concern about quality, but the result was good enough, especially when accompanied by a caveat for learners.  I think the responses speak (pardon the pun) for themselves.

A. For many English speaking associates, the computerized voice can be very boring and mundane. When we researched TTS about 5 years ago the higher quality voices were too expensive. Today, those same voices are much less expensive and have broken that barrier of being too "computerish". Training the voices is an important issue. The support provided by Speech-Over for modulation and pronunciation is good.

B. We were concerned that it would be too mechanized sounding. It turned out not to be and was well accepted by students.

C. Yes, we were concerned that the slight robotic cadence might detract from the training, just because it does not come out completely natural all the time. The Paul voice is very good, but still recognizable as mechanical. To counter this, we put a statement up front in our training introduction about the narration being computer generated, so an awareness and expectation of this is set with the students before they even begin the training. With this disclaimer in place, we have had no complaints at all about the "voice" in the presentations, and our technical training modules using this TTS have been successfully taken by hundreds of students world-wide as part of their technical service training with us. As we worked with the TTS, we quickly developed a style of writing the scripts that really worked well with TTS, and minimizes the difference between using a computer generated voice, vs. human voice-over. In fact, we received complaints about our first human voice-over training for a few pronunciation gaffes, and some pacing issues, where we have received none at all on our subsequent TTS developed training modules.

D. The voice quality is extremely important. As soon as people hear what sounds like a robot voice they tend to immediately believe the presentation to be cheap like the voice. So voice quality is the key. The current voices although very good are more monotonous than a human voice. I know that there are some tools for changing Paul's voice, for example, but I haven't tried them.

Results?  Acceptance by Students?

Again, the responses are somewhat self-evident:

A. Yes. The TTS technology coupled with the software allowed us to create e-learning material in about half the time as human voice over. The maintenance of the e-learning material takes 75% less time than maintaining material with human voice over. This allows us to create and maintain material much faster with less resources and without needing specialized resources that have voices specialized for recording.

We have produced courses for 6000 people in the company and we are getting good feedback: 80% are satisfied, 10% love it and 10% feel offended. My conclusion is that the voices are "good enough" for training applications.

B. Yes. It actually helped us reduce the length of training by having the subject matter experts edit their transcripts and eliminate extra unnecessary speech.

C. Yes and more. The ease of converting the text to voice, coupled with the ability to go back and instantly change / edit / correct narration on a single slide, and have it exactly match the voice, volume, timbre, etc. of every other slide, recorded days or weeks or months earlier is invaluable. Short technical/repair training modules that took us a month or more to develop and schedule voice-over and re-voice-over to correct and edit, now literally take us just days to develop start to finish, right on the desktop. Acceptance by the students has been 100%. All the students taking our TTS based training are required to pass a Certification test after they complete those training modules. Our first-time pass rates are identical for our earlier human voice-over training, vs. our current TTS based training - so if outcomes are the measure, for us, there is no difference between the two as far as their functional performance, and the Return On Investment is much higher for us with the TTS. In surveying students who completed our TTS based training, they all said the same thing, that at first it was a bit different, being computer generated narration, but after they were into the training their ear became tuned to the voice, and it really wasn't any different than listening to someone talk who had a particular regional or foreign accent to their speech.

The comment about learners getting used to the voice is interesting.  I think putting a caveat up front and then learners getting used to the voice is an important take-away.
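
Response A's figures lend themselves to a quick back-of-the-envelope calculation. The sketch below is only illustrative: the baseline hours and the number of update cycles are hypothetical placeholders, not data from the case study. Only the "about half the time" and "75% less time" ratios and the 6000-learner survey split come from the response itself.

```python
def lifecycle_hours(create_hours, update_hours, updates):
    """Total authoring effort: one initial build plus a number of updates."""
    return create_hours + update_hours * updates


# Hypothetical baseline for a human-narrated course (assumed values).
HUMAN_CREATE = 40.0   # hours to build the course once
HUMAN_UPDATE = 8.0    # hours per later revision
UPDATES = 5           # revisions over the life of the course

# Ratios reported in response A.
TTS_CREATE = HUMAN_CREATE * 0.5    # "about half the time"
TTS_UPDATE = HUMAN_UPDATE * 0.25   # "75% less time"

human_total = lifecycle_hours(HUMAN_CREATE, HUMAN_UPDATE, UPDATES)
tts_total = lifecycle_hours(TTS_CREATE, TTS_UPDATE, UPDATES)

# Survey split reported in response A, applied to 6000 learners.
LEARNERS = 6000
survey_percent = {"satisfied": 80, "love it": 10, "offended": 10}
headcount = {label: LEARNERS * pct // 100 for label, pct in survey_percent.items()}

print(f"human narration: {human_total} h, TTS: {tts_total} h")
print(headcount)
```

With these assumed inputs, the more often a course is revised, the more the maintenance ratio dominates the total, which matches the emphasis the respondents put on updates.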


Obviously, there are complex questions around the use of voice-over at all.  These are hard to capture in the simple kind of spectrum table that I attempted above.  Some specific things that jump out at me:

  • The TTS voice quality was acceptable for eLearning applications and did not detract from learning effectiveness.
  • High Emotion - Clearly if you have sensitive material with high emotion, using actual voices (key executives or employees) might be best.  Professional talent can also help with this. 
  • Text-to-speech shortens development time vs. human voice-over.  And keeping the voice consistent across later edits is possible.
  • Much of the comparison of Text-to-speech vs. Human narration focuses on the hassle factor more than cost.
  • Text-to-speech makes it easy to keep the material up-to-date and accurate vs human recordings that can become obsolete and would need to be re-recorded.
  • Caveat Text-to-Speech – Put a note up front so that learners are more open to the voice.
  • Use Text-to-Speech to prepare your scripts.
  • If you expect change, don’t use human narration.

I welcome your thoughts and comments.

Tuesday, September 07, 2010

Text-to-Speech eLearning Tools - Integrated Products

This is the fourth post in a series on Text-to-Speech (TTS) for eLearning written by Dr. Joel Harband and edited by me (which turns out to be a great way to learn).  The other posts are: Text-to-Speech Overview and NLP Quality, Digital Signal Processor and Text-to-Speech, and Using Text-to-Speech in an eLearning Course.  

If this topic is of interest, then also check out the Big Question this month: Voice Over in eLearning.

In this post we’ll discuss some really useful stuff: text-to-speech tools that are integrated with an authoring solution.  These products promise to automate the process of adding audio to eLearning, thereby streamlining and accelerating the production of eLearning courses. We’ll look at two such products, Captivate and Speech-Over Professional.

Requirements for a TTS Product

First, let's set down the requirements that eLearning professionals would expect from a production TTS tool and see how these two products fulfill them.

The first requirement is obvious:

  1. TTS voices with audio distribution license, which are of acceptable quality for eLearning applications, should be provided.

As we mentioned in the first post, the TTS voice is a major advance in audio technology but it needs a host of minor innovations to make it usable and efficient, which lead to the further requirements:

  2. TTS operations should be integrated with an authoring tool so that it is easy to add voice content to a visual presentation and have it spoken when individual slides are displayed, or spoken in synch with screen object animations such as successive bullets appearing. Sound file operations should be transparent to the user.
  3. Subtitles should be automatically created from the input text, formatted and coordinated with the speaking voice. Subtitles are important both for accessibility requirements and to enhance understanding of the voice content.
  4. Easy to update and change the voice content and subtitles to keep presentations up-to-date. This is important for retaining the value of the presentation.
  5. Easy to modulate the voice. Voice modulation adds clarity and realism by introducing silent delays, word emphasis and speed and pitch changes that can make a monotonous voice come alive. Voice modulation is achieved by introducing modulation instructions (tags) into the text flow. The tool must make this very easy and intuitive. We'll discuss this point in the next post.
  6. Support for correctly pronouncing highly technical words or company slogans or expressions.
  7. Background music. Adding suitable background music can support and enliven the TTS voices.
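
To make the idea of modulation tags in the text flow more concrete, here is a minimal sketch that builds tagged narration using SSML, the W3C Speech Synthesis Markup Language. This is an assumption for illustration only: the products discussed in this post may use a different tag syntax, and the helper name `build_ssml` and the sample sentences are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_ssml(before, emphasized, after, pause_ms=500, rate="slow"):
    """Return narration text wrapped in SSML tags: a silent delay,
    an emphasized phrase, and a closing phrase spoken at a different rate."""
    speak = ET.Element("speak")
    speak.text = before + " "

    # Silent delay between the opening sentence and what follows.
    pause = ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
    pause.tail = " "

    # Word emphasis.
    emph = ET.SubElement(speak, "emphasis", {"level": "strong"})
    emph.text = emphasized
    emph.tail = " "

    # Speed change for the rest of the sentence.
    slow = ET.SubElement(speak, "prosody", {"rate": rate})
    slow.text = after

    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("Welcome to the course.", "Safety",
                  "comes first in every procedure.")
print(ssml)
```

Typing such tags by hand is tedious and error-prone, which is why the requirement asks the tool to make modulation easy and intuitive.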

Let's look at how the tools stand up to requirements 2-4.


Captivate

Captivate 4 introduced a TTS feature for adding slide narration and had NeoSpeech’s Paul and Kate voices built-in. Captivate 5 added several Loquendo voices as well as access to any voices that are installed on the computer.

Adding voice content

Captivate lets you enter narration text for TTS voices through its slide notes pane. Each line of notes is entered and stored separately. Any note line can be associated with a TTS voice and a narration sound file generated from it. Multiple note lines on the slide can be associated with different TTS voices and the narration sound files generated will play in sequence when the slide is displayed. In case you need to coordinate the voice sound with screen animation, a time-line editor is provided.


Captivate lets you create and display subtitles (closed captions) from the same notes text lines you used for the TTS narration. You need to manually synch the duration of the subtitles display with the voice sound. Long subtitles would need to be broken up manually and entered as separate note lines.

Changing voice content

To make changes in the voice content, change the notes text lines and regenerate the sound files. If the sound length changes, you will need to re-synch the voice, the subtitles, and the screen animation.
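
The re-synch step is easier to picture with numbers. The sketch below is not how Captivate works internally; it only assumes that narration clips play back to back and shows how every later start time moves when one regenerated clip changes length. The clip durations are hypothetical.

```python
def start_times_ms(durations_ms):
    """Start time of each clip when clips play one after another."""
    starts = []
    cursor = 0
    for duration in durations_ms:
        starts.append(cursor)
        cursor += duration
    return starts


# Hypothetical durations (milliseconds) of three narration clips on one slide.
clips = [8600, 4000, 5500]
before = start_times_ms(clips)

# The first note line is edited and its sound file regenerated, now longer.
clips[0] = 9400
after = start_times_ms(clips)

# Every subtitle or animation tied to a later clip must move by this much.
shift = [new - old for old, new in zip(before, after)]
print(before, after, shift)
```

With manual time-line editing, each of those shifted start times has to be corrected by hand after every text change.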


The screen shot below shows a Captivate slide with three lines of note text. Each line has been used to produce narration using TTS and to produce a closed caption subtitle (1st and 2nd check boxes respectively), that is, three separate sound files play with subtitles as this slide is displayed. The lower text animation box is the screen title that appears in synch with the second sound file. The timing was determined by the time-line editor and set manually.


The screen shot below shows the Speech Management panel. It shows how each note line can be associated with a different voice to produce a separate sound file.


The screen shot below shows the Closed Captioning panel, which lets you use time-line editing to synch the duration of the closed captions (subtitles) with the speaking voices, as indicated. This time-line editor was also used to determine the start time, 8.6 secs, for the screen title animation.


In summary, it is possible to use Captivate to achieve a combination of multiple screen animations, TTS sounds, and subtitles on a slide, with a process of manual synchronization using time-line editing.

Speech-Over Professional

Tuval Software's Speech-Over Professional 4 works with Microsoft PowerPoint as an add-in. PowerPoint is the most popular tool for producing e-learning presentations, either by itself or together with other e-learning tools.

Speech-Over comes bundled with NeoSpeech Paul and Acapela Heather or with NeoSpeech Paul and Kate and will recognize any voice installed on the computer.

Speech-Over is well-integrated with PowerPoint and creates, combines and synchronizes voice media effects, subtitle effects and screen object animation effects by working directly with PowerPoint APIs. Synchronization is automatic; time-line editing is not required.

Adding voice content

The narration text for the TTS voices is input directly through a dialog box within PowerPoint. The text can be spoken when individual slides are displayed or spoken in synch with screen object animations like successive bullets. Speech-Over adds the screen object animations, if none have been defined. TTS voices are selected by a pre-defined voice scheme so there is no need to choose the TTS voice for each text input. Speech-Over creates slide notes from the TTS text.


Speech-Over automatically produces subtitle effects from the input text, formats and synchronizes them with the speaking voice. Long subtitles are automatically broken up and displayed in succession.
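
As a rough illustration of what automatic subtitle subdivision involves, the sketch below breaks narration text into short chunks on word boundaries and gives each chunk a share of the audio duration proportional to its length. This is an assumed heuristic, not Speech-Over's actual method; the function name, the 40-character limit, and the 7-second duration are all hypothetical.

```python
import textwrap


def subtitle_schedule(text, audio_ms, max_chars=40):
    """Split text into subtitle chunks of at most max_chars characters and
    return (start_ms, end_ms, chunk) tuples covering the whole audio clip."""
    chunks = textwrap.wrap(text, width=max_chars)
    total_chars = sum(len(chunk) for chunk in chunks)

    schedule = []
    start = 0
    for index, chunk in enumerate(chunks):
        if index == len(chunks) - 1:
            end = audio_ms  # last chunk runs to the end of the audio
        else:
            # Assumption: display time is proportional to chunk length.
            end = start + audio_ms * len(chunk) // total_chars
        schedule.append((start, end, chunk))
        start = end
    return schedule


text = "Long subtitles are automatically broken up and displayed in succession."
schedule = subtitle_schedule(text, audio_ms=7000)
for start, end, chunk in schedule:
    print(f"{start:>5} - {end:>5} ms  {chunk}")
```

A real tool can presumably take timing from the speech engine rather than estimating it, but the bookkeeping is the same kind of work that otherwise has to be done manually.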

Changing voice content

The text content is edited through the same type of dialog by which it was entered. Alternatively, you can edit all text on a slide on a single dialog. The sound media effect, subtitle effect and animation effect are all regenerated and automatically synchronized without any need for time-line editing. You can also re-order narration clips and copy and paste them between screen objects.


Let’s see how the same example is done using Speech-Over without any time-line synchronization.

The screen shot below shows how the first text line is entered in the dialog. The screen background was selected previously so that a “slide” narration clip is created which will play when the slide is displayed. The Acapela Heather voice is used. The third text line is entered in the same way.


The screen shot below shows how the second text line is entered in the dialog. This time the screen title was selected previously so that the sound file will automatically play when the screen title animates, where the title animation effect is added by Speech-Over. The Paul voice is used for this text.


The screen shot below shows some useful Speech-Over dialogs: the Slide Clip Content Editor, which lets you edit the text content of all narration clips on the slide, and the Clip Organizer, which displays narration clips as rows. The row order is the clip playing order, which can be easily changed by the up/down arrows.

All changes are automatically re-synched.


In summary, it is possible to use Speech-Over to achieve a combination of multiple screen animations, TTS sounds, and subtitles on a slide, with automatic synchronization.


Both Captivate and Speech-Over fulfill the requirements 2-4. For the examples given, Speech-Over is more efficient, especially for updates and maintenance, because it synchronizes voices, subtitles and screen object animations without time-line editing and automatically subdivides subtitles. For the simple case of one text line for a static slide, where no synchronization or subtitle division is required, the two tools would be similarly effective for these requirements.

In the next post we'll discuss the requirements of voice modulation and pronunciation for these products.