Tony Karrer's eLearning Blog on e-Learning Trends

Tuesday, September 07, 2010

Text-to-Speech eLearning Tools - Integrated Products

This is the fourth post in a series on Text-to-Speech (TTS) for eLearning written by Dr. Joel Harband and edited by me (which turns out to be a great way to learn).  The other posts are: Text-to-Speech Overview and NLP Quality, Digital Signal Processor and Text-to-Speech, and Using Text-to-Speech in an eLearning Course.

If this topic is of interest, then also check out the Big Question this month: Voice Over in eLearning.

In this post we’ll discuss some really useful stuff: text-to-speech tools that are integrated with an authoring solution.  These products promise to automate the process of adding audio to eLearning, thereby streamlining and accelerating the production of eLearning courses. We’ll look at two such products: Adobe Captivate and Tuval Software's Speech-Over Professional.

Requirements for a TTS Product

First, let's set down the requirements that eLearning professionals would expect from a production TTS tool and see how these two products fulfill them.

The first requirement is obvious:

  1. TTS voices should be provided with an audio distribution license and at a quality acceptable for eLearning applications.

As we mentioned in the first post, the TTS voice is a major advance in audio technology, but it needs a host of minor innovations to make it usable and efficient, which leads to the further requirements:

  2. TTS operations should be integrated with an authoring tool so that it is easy to add voice content to a visual presentation and have it spoken when individual slides are displayed, or spoken in synch with screen object animations such as successive bullets appearing. Sound file operations should be transparent to the user.
  3. Subtitles should be automatically created from the input text, formatted and coordinated with the speaking voice. Subtitles are important both for accessibility requirements and to enhance understanding of the voice content.
  4. It should be easy to update and change the voice content and subtitles to keep presentations up-to-date. This is important for retaining the value of the presentation.
  5. It should be easy to modulate the voice. Voice modulation adds clarity and realism by introducing silent delays, word emphasis, and speed and pitch changes that can make a monotonous voice come alive. Voice modulation is achieved by introducing modulation instructions (tags) into the text flow. The tool must make this very easy and intuitive. We'll discuss this point in the next post.
  6. Support for correctly pronouncing highly technical words, company slogans, or expressions.
  7. Background music. Adding suitable background music can support and enliven the TTS voices.
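To make the modulation and pronunciation requirements above concrete, here is what modulation tags can look like. This fragment uses SSML, the W3C Speech Synthesis Markup Language; the engines mentioned in this series use their own tag sets (NeoSpeech's VTML, for example), so treat this as an illustrative sketch rather than the syntax of any particular product:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  Welcome to the course.
  <break time="500ms"/>                                  <!-- silent delay -->
  This point is <emphasis level="strong">essential</emphasis>.
  <prosody rate="slow" pitch="+10%">Read this part carefully.</prosody>
  <!-- pronunciation control for an acronym and a technical term -->
  Brought to you by the <sub alias="World Wide Web Consortium">W3C</sub>.
  <phoneme alphabet="ipa" ph="ˈdeɪtə">data</phoneme>
</speak>
```

Which tags an engine honors, and how faithfully, varies; authoring tools typically hide this markup behind dialogs and toolbar buttons.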

Let's look at how the tools stand up to requirements 2-4.


Adobe Captivate

Captivate 4 introduced a TTS feature for adding slide narration, with NeoSpeech’s Paul and Kate voices built in. Captivate 5 added several Loquendo voices as well as access to any voices installed on the computer.

Adding voice content

Captivate lets you enter narration text for TTS voices through its slide notes pane. Each line of notes is entered and stored separately. Any note line can be associated with a TTS voice and a narration sound file generated from it. Multiple note lines on a slide can be associated with different TTS voices, and the narration sound files generated will play in sequence when the slide is displayed. If you need to coordinate the voice sound with screen animation, a time-line editor is provided.


Adding subtitles

Captivate lets you create and display subtitles (closed captions) from the same notes text lines you used for the TTS narration. You need to manually synch the duration of the subtitle display with the voice sound, and long subtitles need to be broken up manually and entered as separate note lines.

Changing voice content

To make changes in the voice content, change the notes text lines and regenerate the sound files. If the sound length changes, you will need to re-synch the voice, the subtitles, and the screen animation.


The screen shot below shows a Captivate slide with three lines of note text. Each line has been used to produce narration using TTS and to produce a closed caption subtitle (1st and 2nd check boxes respectively), that is, three separate sound files play with subtitles as this slide is displayed. The lower text animation box is the screen title that appears in synch with the second sound file. The timing was determined by the time-line editor and set manually.


The screen shot below shows the Speech Management panel. It shows how each note line can be associated with a different voice to produce a separate sound file.


The screen shot below shows the Closed Captioning panel, which lets you use time-line editing to synch the duration of the closed captions (subtitles) with the speaking voices, as indicated. This time-line editor was also used to determine the start time, 8.6 secs, for the screen title animation.


In summary, it is possible to use Captivate to achieve a combination of multiple screen animations, TTS sounds, and subtitles on a slide, with a process of manual synchronization using time-line editing.

Speech-Over Professional

Tuval Software's Speech-Over Professional 4 works with Microsoft PowerPoint as an add-in. PowerPoint is the most popular tool for producing e-learning presentations, either by itself or together with other e-learning tools.

Speech-Over comes bundled either with NeoSpeech Paul and Acapela Heather or with NeoSpeech Paul and Kate, and it will also recognize any voice installed on the computer.

Speech-Over is well-integrated with PowerPoint and creates, combines and synchronizes voice media effects, subtitle effects and screen object animation effects by working directly with PowerPoint APIs. Synchronization is automatic; time-line editing is not required.

Adding voice content

The narration text for the TTS voices is input directly through a dialog box within PowerPoint. The text can be spoken when individual slides are displayed or spoken in synch with screen object animations like successive bullets. Speech-Over adds the screen object animations, if none have been defined. TTS voices are selected by a pre-defined voice scheme so there is no need to choose the TTS voice for each text input. Speech-Over creates slide notes from the TTS text.


Adding subtitles

Speech-Over automatically produces subtitle effects from the input text, formats them and synchronizes them with the speaking voice. Long subtitles are automatically broken up and displayed in succession.
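Speech-Over's actual algorithm isn't published, but the idea of automatic subtitle subdivision is easy to sketch: split the narration text into caption-sized chunks and estimate each chunk's display time from an assumed speaking rate. A minimal illustration in Python (the `max_chars` and `wpm` defaults are arbitrary assumptions, not Speech-Over's values):

```python
import textwrap

def split_subtitles(text, max_chars=60, wpm=150):
    """Split narration text into caption-sized chunks and estimate
    each chunk's display duration from an assumed speaking rate."""
    chunks = textwrap.wrap(text, width=max_chars)
    seconds_per_word = 60.0 / wpm
    return [(chunk, round(len(chunk.split()) * seconds_per_word, 2))
            for chunk in chunks]

narration = ("Text to speech lets you generate narration directly from "
             "your slide notes, and subtitles can be produced from the "
             "same text automatically.")
for caption, duration in split_subtitles(narration):
    print(f"{duration:5.2f}s  {caption}")
```

In a real tool the durations would come from the generated sound files rather than a words-per-minute estimate, which is why tools that regenerate audio must also re-synch the subtitles.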

Changing voice content

The text content is edited through the same type of dialog by which it was entered. Alternatively, you can edit all text on a slide on a single dialog. The sound media effect, subtitle effect and animation effect are all regenerated and automatically synchronized without any need for time-line editing. You can also re-order narration clips and copy and paste them between screen objects.


Let’s see how the same example is done using Speech-Over without any time-line synchronization.

The screen shot below shows how the first text line is entered in the dialog. The screen background was selected previously so that a “slide” narration clip is created which will play when the slide is displayed. The Acapela Heather voice is used. The third text line is entered in the same way.


The screen shot below shows how the second text line is entered in the dialog. This time the screen title was selected previously so that the sound file will automatically play when the screen title animates; the title animation effect is added by Speech-Over. The Paul voice is used for this text.


The screen shot below shows some useful Speech-Over dialogs: the Slide Clip Content Editor, which lets you edit the text content of all narration clips on the slide, and the Clip Organizer, which displays narration clips as rows. The row order is the clip playing order, which can be easily changed by the up/down arrows.

All changes are automatically re-synched.


In summary, it is possible to use Speech-Over to achieve a combination of multiple screen animations, TTS sounds, and subtitles on a slide, with automatic synchronization.


Both Captivate and Speech-Over fulfill requirements 2-4. For the examples given, Speech-Over is more efficient, especially for updates and maintenance, because it synchronizes voices, subtitles and screen object animations without time-line editing and automatically subdivides subtitles. For the simple case of one text line on a static slide, where no synchronization or subtitle division is required, the two tools would be similarly effective for these requirements.

In the next post we'll discuss the requirements of voice modulation and pronunciation for these products.


Anonymous said...

To this you might add Adobe Presenter. Although I am a mere school district curriculum director in social studies (ret.), Presenter has provided one of the best ways to create powerful and versatile district-wide review materials. Each slide is narrated on its own and it is easy to correct narration errors. Digital video can be easily inserted, as can video of the narrator (or better yet a digital avatar). Finally, a quick quiz can be easily built.

For me, it's the combination of Captivate and Presenter that is my favorite choice. The Flash and the PDF work across platforms, a necessity in a district with Macs in elementary and PCs in secondary.

Anonymous said...

Hi Tony,

I'm an English teacher at HCT, UAE, and I'm looking for speech recognition software with which I could record vocabulary for the students to repeat; they in turn could record themselves and receive feedback on the correctness of their pronunciation.

The only similar 'pronunciation gauge' that I have come across is the speech analysis tool in the Spanish version of the Learn To Speak DVD-ROM, which gives feedback to students at three levels: incomprehensible, tourist, or native-like.

Does anyone else in the community know of a similar, preferably open source, tool available online? Audacity's interface is too complicated, and evaluating their pronunciation from the sound wave of a recording is more than I think they'd be able or care to do.

I'd appreciate any suggestions.

Thanks, Eva