(Updated in 2009 to list a Google Voice example in the references)
We at TypeWell are tremendously interested in speech recognition as a possible way of providing communication access.
A readily available, simple tool for general use would perfectly fit our goal of providing the best possible services for our customers.
And of course our technical people, like techs everywhere, think the technology is neat and would love to use it, just because.
You can be sure that as soon as speech recognition is finally good enough for real-time use in classes and meetings, we'll have a TypeWell product based on it!
We tell you this so you can put the information below in context.
We wish it were otherwise, but unfortunately speech recognition is not yet ready for most of us.
The Promise
Automatic Speech Recognition (ASR) seems the perfect technology for classroom communication access.
An instructor would wear a small microphone, and a computer would automatically transcribe his or her speech into text on a screen for the student to read. What could be simpler?
Unfortunately, for now, the reality is far from this promise.
Here we discuss the current reality of ASR for classroom use.
The Classroom is a Tough Environment
As of 2007, the best speech recognition systems are based on two "speech engines":
Dragon NaturallySpeaking, and
ViaVoice.
The systems you've heard of in the classroom speech-recognition-to-text field are based on one or the other of these two systems.
Specific classroom systems add some features on top for convenience, but they don't change the fundamental accuracy (or inaccuracy) of the underlying speech-analysis engine.
Dragon claims an accuracy of 99%.
That sounds pretty good!
Does that mean these systems are accurate enough for classroom use?
Unfortunately not.
They can achieve the advertised accuracy under the best conditions: in a quiet room, where a male speaker with good diction reads prepared text at a measured pace.
The classroom environment, with spontaneous speech, background noise, and movement of the speaker, is much more challenging for ASR.
Teacher to Transcript
The simplest solution for providing communication access would be if we could put a mic on the teacher and transcribe his or her speech directly.
Unfortunately, this doesn't work, for two reasons.
First, for best results one must speak to the computer in full sentences with an even tone (i.e., like a robot rather than expressively).
But a teacher speaks for the students, not the computer.
The teacher's emotion, colloquialisms, and the natural ungrammaticality of spontaneous speech combine to make ASR accuracy very low on this type of speech.
Second, to make a readable transcript one must speak the punctuation, pronouncing "period," "question mark," and "new paragraph" at the right points.
Here is an actual ASR example, taken from an excellent article by Ross Stuckless (see the reference below).
For this presentation all word errors have been fixed, so that we can focus on the issue of formatting.
Note that even with artificially high 100% word accuracy, this transcript is hard to understand:
why do you think we might look at the history of the family history tends to dictate the future okay so there is some connection you're saying what else evolution evolution you're on the right track which changes faster technology or social systems technology.
Here's the same passage with the missing formatting information added (by a transcriber):
Instructor: Why do you think we might look at the history of the family?
Student: History tends to dictate the future.
Instructor: Okay.
So there is some connection, you're saying.
What else?
Student: Evolution.
Instructor: Evolution.
You're on the right track.
Which changes faster, technology or social systems?
Student: Technology.
Voicewriters
The only way to get a transcript like the one above out of a speech recognizer is to use a voicewriter: a person who repeats the words of the teacher and students in the way ASR needs them spoken.
This greatly improves recognition accuracy, as the voicewriter can speak carefully, in full sentences, at a measured pace, and with an even tone, and can add formatting commands.
Voicewriting isn't easy.
One reason is that one can't just speak out loud in the classroom.
So, voicewriters must speak into a muffler, called a stenomask.
Furthermore, they must speak in a special way, not a whisper but at low volume, sort of an intelligible mutter.
This prevents the buzzing of speech that escapes the muffler from bothering the class.
Another reason voicewriting is hard is that the stenomask distorts the sound and muttered speech is less intelligible, so speech recognition accuracy drops significantly.
The solution is to learn to mutter clearly, and to teach the speech recognizer to translate mutterings more accurately.
Both are possible, but doing so takes months or years of practice, and many speakers can never overcome the inherent difficulties of the process.
To achieve the necessary accuracy, voicewriters must become skilled in using a special vocabulary of spoken "words." For example, saying "their-po" for the word "their" makes it clear to the speech recognizer which spelling of "there/their" is intended (the "po" is short for "possessive"). Saying "spee-one" signals a formatting change, namely that speaker #1 is about to speak (the "spee" is short for "speaker").
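As a rough illustration of how such a spoken vocabulary could be turned into clean text, here is a minimal Python sketch. Only the two spoken forms "their-po" and "spee-one" come from the description above. The table name `MACROS`, the function `expand_macros`, and the "Speaker 1:" output format are assumptions made for this example, and real voicewriting software may work quite differently.

```python
# Hypothetical macro table. Only the spoken forms "their-po" and "spee-one"
# come from the text; the replacement strings are assumed for illustration.
MACROS = {
    "their-po": "their",          # possessive spelling of there/their
    "spee-one": "\nSpeaker 1:",   # formatting change: speaker #1 starts talking
}


def expand_macros(recognized: str) -> str:
    """Replace special spoken tokens in the recognizer output with their meaning."""
    tokens = [MACROS.get(token, token) for token in recognized.split()]
    text = " ".join(tokens)
    # Remove the space left in front of an inserted line break.
    return text.replace(" \n", "\n").strip()


print(expand_macros("spee-one their-po books are over there"))
```

A lookup table keeps the sketch short. A real system would also need to handle capitalization, punctuation commands, and many more speaker labels.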
There is a small group of professional voicewriters who have been able to develop the special speaking skills necessary, and achieve speed and accuracy results similar to stenotypy (court reporters).
These professionals have typically spent years training themselves and their speech recognizer programs to reach acceptable accuracy despite the handicap of the stenomask. Many others have tried, but had to give up because their accuracy did not improve enough.
The bottom line is that it is possible for some people to achieve 98% accuracy, after a year or more of persistent training.
In effect these skilled professionals are equivalent to stenotypists.
Their prices remain comparable to CART (stenotypists) because of the length of their training.
Attempts by some of these successful voicewriters to teach others the "tricks of the trade" and significantly shorten training time have not been widely successful. Training and staffing with one's own "casual" voicewriters is also currently ineffective, because acceptable error rates are achievable only with extended training, and only by some people.
When Does 1 Error = 5 Errors?
Why is it necessary to push for 98% accuracy?
If we could settle for, say, 92% accuracy, it might take months rather than years to learn to voicewrite, and cost of the service could come down.
At 98%, only 2 words are wrong out of every 100 words.
That's way better than one really needs, isn't it?
The reason every error is significant is that a speech recognition error is not like a typing mistake.
Rather than a missing or extra letter, a completely different word is substituted.
Some real-life examples:
| What was Said | | What was Recognized |
| that's speech recognition | => | that's peach wreck in kitchen |
| senior years | => | seen your ears |
| today it's not lawful | => | today it's awful |
| doing cell addressing | => | doing Excel addressing |
| it can't work | => | it can work |
| it's biological | => | it's a biologic call |
Some errors have a high "giggle factor", such as the first two, above.
Surprisingly, such errors are not the most troublesome, since the recognition is so wrong that the reader can be expected to ignore the sentence entirely.
More serious are the common errors where the recognition is not obviously wrong, such as the "Excel addressing" example, above.
The reader might not realize that the information is incorrect.
The worst errors are those where the transcription is the opposite of the intended meaning, such as the "it can't work" example above.
Incorrect and opposite information impair student learning.
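To see why word-level scoring understates the damage, the examples in the table above can be scored with a standard word-level edit distance (the count of substituted, inserted, and deleted words). The Python sketch below is only an illustration. The function name `word_errors` is made up for this example, and this is not necessarily how any particular ASR vendor measures accuracy.

```python
def word_errors(said: str, recognized: str) -> int:
    """Count word substitutions, insertions, and deletions (word-level edit distance)."""
    ref = said.split()
    hyp = recognized.split()
    # prev[j] holds the distance between the reference words seen so far
    # (before the current one) and the first j recognized words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # a spoken word was dropped
                curr[j - 1] + 1,     # an extra word was inserted
                prev[j - 1] + cost,  # words match, or one was substituted
            ))
        prev = curr
    return prev[-1]


# One word error out of three, yet the meaning is reversed.
print(word_errors("it can't work", "it can work"))
```

By this count, "it can't work" and "doing cell addressing" each contain a single word error, even though the first reverses the meaning and the second silently changes it.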
Like all transcription systems, ASR should be judged by how much meaning is preserved.
The measurement standard for sign-language interpreters and TypeWell transcribers is based on meaning.
Meaning is the measure of how much a transcription will help the student.
On the meaning scale ASR fares poorly.
One ASR error can destroy the meaning of an entire sentence, which makes it very different from a typing error.
With a typo, the meaning of the affected word can usually be determined from context.
An ASR word error, on the other hand, is often undecipherable even with context.
Each ASR word error can obliterate the meaning of the entire 10-15 word sentence it is in.
It's not always that bad, but on average, each ASR word error damages the meaning of about 5 words.
This means that one word error in a 100-word paragraph damages about five words.
A 1% word error rate turns into a 5% meaning error rate.
Similarly, 8 ASR word errors in a 100-word paragraph damage the meaning of 5 x 8 = 40 words of that 100-word paragraph.
Therefore, an ASR word accuracy of 92% (8% error) corresponds to a meaning accuracy of only 60% (40% error). This seems unbelievably low unless you've seen ASR for yourself in the classroom.
It really is that bad.
An ASR word accuracy of 98% is necessary in order to preserve a reasonable minimum of 90% of the meaning, and not have too many sentences with misleading information.
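The arithmetic above can be written as a small Python sketch. It assumes, as the text does, that each word error damages about 5 words on average. The function name `meaning_accuracy` and the cap at 100% meaning error are choices made for this example rather than part of any standard measure.

```python
def meaning_accuracy(word_accuracy_pct: float,
                     words_damaged_per_error: float = 5.0) -> float:
    """Estimate meaning accuracy (in percent) from word accuracy (in percent).

    Assumes each word error damages the meaning of about 5 words on average,
    which is the rule of thumb used in the text.
    """
    word_error_pct = 100.0 - word_accuracy_pct
    # Meaning error cannot exceed 100%, so cap it.
    meaning_error_pct = min(100.0, word_error_pct * words_damaged_per_error)
    return 100.0 - meaning_error_pct


for accuracy in (99, 98, 92):
    print(f"{accuracy}% word accuracy -> about "
          f"{meaning_accuracy(accuracy):.0f}% meaning accuracy")
```

With these assumptions, 99%, 98%, and 92% word accuracy correspond to about 95%, 90%, and 60% meaning accuracy, matching the figures above.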
How Much Longer?
ASR technology is improving rapidly.
Won't ASR accuracy be good enough for voicewriting by "regular folks" within a year or two?
Although we wish that were so, the answer is no. That's because the need for a stenomask won't be going away, so extensive special-speaking training will continue to be required to work in classes and meetings.
There is one way around the problems of the stenomask, and the special training it requires.
That solution is to have the voicewriter in a different room from the speakers, listening in via microphone to the class or meeting room.
This solution is being used successfully in some settings, but not without problems.
We'll discuss this "remote" voicewriting in the next section.
As for the much-anticipated "promise" of ASR, transcribing the teacher's speech directly by computer, most speech scientists think it will take at least another 10 years. Why so long?
We've seen above that it's not just one small step away, but several major advances away: it must handle spontaneous expressive speech by multiple speakers who are speaking for humans rather than a computer, and automatically add punctuation and formatting.
Remote Voicewriting
Getting rid of the stenomask can greatly improve ASR accuracy, but the only way to do that without bothering those sitting near the voicewriter is to remove the voicewriter from the classroom or meeting room. Such "no-stenomask" voicewriting is increasingly successful for providing remote services, where the voicewriter is in a separate room or hallway and listens to the teacher by telephone.
In this arrangement, the voicewriter can use a regular microphone and speak in a regular voice, making high accuracy much easier to achieve.
This approach serves a real need, but isn't a perfect solution for providing high-quality communication access services. The remote voicewriter is at a disadvantage in gathering the information being spoken because he or she is not in the class and thus not seeing the context of what the teacher and class are doing. Information on the board or other visual displays is usually not accessible to the voicewriter, and thus cannot be added parenthetically when needed to clarify what the instructor means by sentences like: "This picture here illustrates my point."
Also, remote voicewriters often cannot hear clearly what is said by others in the room who are far from the pick-up microphone(s).
So, while remote voicewriting can result in higher word recognition accuracy for what the remote voicewriter can hear, the non-spoken aspects of the communication, and the comments of everyone in the room, are usually not accessible. This loss of context and completeness can make many of the correctly-recognized words difficult or even impossible for the student to understand as intended by the speaker.
Summary
In summary, speech recognition is currently unsuitable for classroom transcription because of high error rates.
Very high accuracy is required to preserve adequate sentence meaning.
A technological solution is still years away.
In the meantime, acceptable levels of accuracy and completeness are achievable by a small group of dedicated professional voicewriters using stenomasks in the classroom environment. Remote voicewriting, without a stenomask, can also reach acceptable levels of recognition accuracy, although the completeness of the message delivered to the reader is usually compromised by the physical absence of the voicewriter.
More Information
If you have any questions that aren't answered here, please contact us.
An example transcript from a hot new Internet transcription system, Google Voice.
Stuckless, Ross: "Recognition Means More Than Just Getting the Words Right" in Speech Technology, Oct/Nov 1999, p. 30.
A good discussion of problems with ASR in the classroom by a professor at the National Technical Institute for the Deaf (NTID) in Rochester, NY.
Automatic Speech Recognition discusses ASR in the specific context of communication access.
From Hearing Loss Journal, 2001.
Speak to Me is a web page with a typical user review of ASR.
Her conclusion, like so many others, is that ASR is really neat, but not as effective as the trusty keyboard.
Speech on Your Computer discusses why ASR isn't perfect yet, and laments typical 90% accuracy rates.
Article (PDF format) by Ben Shneiderman, an expert on human-computer interaction at the University of Maryland.
Describes studies by users of IBM speech recognition, and by the US military, that discovered that it's much harder to edit while dictating than to edit while typing.