VOICE RECOGNITION 2005

LILLIAN BROWN

Is America ready for the next generation
of voice-operated communications?


The revolution in digital communication is proceeding at breakneck speed, bringing new technologies and new devices into everyday life. It is estimated that by 2005, a billion portable Internet connections will be deployed worldwide, removing virtually all technical limits on the retrieval and sharing of information. Voice recognition technology, already a pillar of the technology industry, will become an even more vital component of the next generation of communications products.

To quote Bill Gates, the individual voice print is the key to entering “a new kind of software system harnessing large computing systems, desktop personal computers, and a proliferating array of consumer electronic devices, all connected to the Internet.” Gates also predicts that computers a decade from now will be controlled with voice commands, rather than keystrokes or mouse clicks. This speed and convenience, however, will rely on clearly spoken commands, as technological advances will be unable to compensate fully for broken English or garbled speech.

In the not-too-distant future, voice recognition technology will find numerous applications in the latest communications wizardry. Already in the trial stage are “smart houses,” where voice commands are used to control lighting fixtures and security systems, as well as toasters, refrigerators, dishwashers, microwaves, VCRs, and other appliances. Cars also will become “smart.” Future models will respond to voice commands to send e-mail, check the stock market, avoid traffic jams, and guide the driver to his or her destination, as well as to handle more mundane chores like playing a CD, starting the engine, or turning up the heater.

As both chips and processors continue to shrink in size, an increasing number of devices will be worn on the wrist. Imagine the possibilities afforded by a wrist-speakerphone that uses voice recognition technology to “dial” a number, and even provides a video image of the person who answers. This would be hands-free technology at its finest. Wrist devices could also be manufactured to include a mini-recorder to record messages and memos, or operate as a digital camera, radio, data bank, or pager. For athletes, or for those with specific medical needs, wrist devices could monitor blood pressure and heart rate. There would never be an excuse to be late, as the simple wristwatch of today would become a powerful, multi-task device, capable of delivering accurate time synchronized with the hourly signals sent out by the US government’s atomic clock.

Even eyeglasses will become multi-purpose and high-tech. Future video technologies may include an eyepiece, held in place with a headband, which will float an image (such as an HDTV broadcast) right in front of the wearer’s eyes. Voice commands would be used to access the data or select programming.

And computers will continue to offer incredible possibilities. Using voice recognition technology, computers will issue cash, tickets, and credit cards, and help users obtain passports and drivers’ licenses. Computers responding to the user’s voice print will also provide immediate and secure access to health records, financial data, and other personal information. Internet browsers will respond to the user’s voice, making accessing and using the Web a hands-free endeavor.

Computer processing of the human voice will become more streamlined and accurate. Children’s toys, already capable of rudimentary speech, will be capable of more complex interactions, such as guiding the child to the right way to spell a word or complete a math question. And voice communications via computer may soon become more common, as voice technologies improve. One day, a family dispersed around the world could gather on-line to chat as a group, or even sing or play instruments in unison. Audio and video conferencing will enter a new age, as software and hardware developments make it possible for participants to hear instantaneous, computer-voice translations of contributions made by non-English speakers.

Voice recognition technology will also have many applications in the security field. Police departments, for example, could use voice-chat technology to communicate with residents who wish to remain anonymous. Voice prints, rather than fingerprints, could be used to identify criminals, or control access to and from prisons.

THE CLARITY OF THE SPOKEN WORD

These technologies of the not-too-distant future will employ global digital transmission of the spoken word based on the International Phonetic Alphabet (IPA). American English is written with 26 letters but spoken with about 44 distinct sounds, and the IPA is designed to represent each of these sounds with a unique symbol, like those found in dictionary pronunciation guides.

While the possibilities for voice recognition applications seem endless, there are some potential limitations that hardware and software may be unable to overcome. The clarity of each individual’s spoken English will be essential. In listening to its user’s voice, for example, the computer will have to be trained to recognize the individual’s voice print. This will require the user to speak clearly into the computer’s microphone, and use terms and a technical vocabulary that correspond to the computer’s stored template. Thus, someone who speaks with an accent, too fast, or off pitch, or who mumbles or omits syllables, will create a jumbled “voice salad.” Indeed, users of the current versions of dictation software complain about finding nonsensical words or sentences when the computer fails to “hear” the correct words.

Each person’s voice print speaks volumes about the individual. Voice prints can reveal approximate age, sex, birthplace, education, regional dialect and, possibly, the speaker’s original ancestral language. As voice recognition becomes a more exact science, each spoken word can be broken down into five-millisecond intervals, and analyzed for average tonal content, shape, duration, elements of wave form, and amplitude. Testing a voice print involves sampling an individual’s speech pattern thousands of times, resulting in a spectrogram as distinctive as the individual providing the sample.
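As a rough illustration of the interval-by-interval analysis described above, the following Python sketch cuts a digitized signal into five-millisecond frames and measures two simple properties of each frame: its average amplitude (root mean square) and the number of times the wave form crosses zero. The function name, the 8,000-samples-per-second rate, and the synthetic tone standing in for recorded speech are all assumptions made for this example; real voice print systems measure far more than this.

```python
import math


def frame_features(samples, sample_rate, frame_ms=5):
    """Split samples into fixed-length frames and return one
    (rms_amplitude, zero_crossings) pair per complete frame.

    The name and the choice of features are illustrative assumptions,
    not the method of any particular voice recognition product.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len < 1:
        raise ValueError("sample rate too low for this frame length")
    features = []
    # Step through the signal one whole frame at a time; a trailing
    # partial frame is ignored.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Root-mean-square amplitude: a simple loudness measure.
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        # Sign changes between neighboring samples: a crude pitch cue.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        )
        features.append((rms, crossings))
    return features


# A synthetic 50-millisecond, 440 Hz tone stands in for real speech.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(400)]
features = frame_features(tone, sample_rate)
```

Each entry in the resulting list describes one five-millisecond slice of sound. Lining up thousands of such slices, with richer measurements than these two, is the general idea behind the spectrogram mentioned above.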

People for whom English is a second or third language are well aware of the challenge and necessity of speaking clearly. Air traffic controllers, for example, use radar to locate and track airplanes throughout the world, but must guide or otherwise communicate with the pilots by voice in English. Other occupations increasingly rely on English, the universal language of commerce. In the United States and abroad, business executives are flocking to voice and diction classes that help them to keep their individual personae and cultural identities while perfecting their spoken English.

Even native speakers of English often seek to improve their pronunciation, as they wish simply to be understood clearly by their audiences. Those who teach voice and diction classes, for example, find politicians, diplomats, business executives, doctors, and lawyers eager to improve their speech. Speaking for good recognition, whether by an audience or a computer, involves pronouncing every consonant, vowel, and syllable. It is important to exaggerate the consonant or consonants at the end of a word, as in “first choice,” “the cat in the hat,” “dad,” “bob,” and “chug a lug.” These are the sounds most frequently distorted or lost by microphones or telephones.

It is also important to use American vowels, not the French, Italian, or British ones. “I cawn’t pak the cah” will not be translated correctly by voice recognition devices. And, as English is spoken more slowly than Spanish, French, or other languages, the speed at which the speaker talks also will be important. Fast talkers, in general, tend to think extremely fast, and have a hard time slowing down. But even the speediest of speakers can slow down, if instructed to enunciate every consonant, vowel, and syllable.

All of these requirements for clear speech suggest that the global communications revolution could benefit from a global effort to improve the clarity of the spoken word. After all, think how often it is necessary to replay the messages on an answering machine just to be able to write down the right telephone number. Or how frequently it proves useful to spell out names and addresses slowly, using “B as in baby, D as in David,” or the internationally recognized code: “Alpha, Bravo, Charlie, Delta, Echo....”
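The spelling-out habit described above is simple enough to hand to a machine. The short Python sketch below converts a name into the code words of the international spelling alphabet. The function name is hypothetical, and the word list uses the "Alpha" spelling quoted above rather than the official "Alfa."

```python
# Code words of the international spelling alphabet, using the
# "Alpha" spelling quoted in the article (the official form is "Alfa").
WORDS = [
    "Alpha", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot", "Golf",
    "Hotel", "India", "Juliett", "Kilo", "Lima", "Mike", "November",
    "Oscar", "Papa", "Quebec", "Romeo", "Sierra", "Tango", "Uniform",
    "Victor", "Whiskey", "X-ray", "Yankee", "Zulu",
]

# Look up each code word by its first letter.
CODE_WORDS = {word[0]: word for word in WORDS}


def spell_out(text):
    """Return the code word for each letter of text.

    Spaces are skipped; digits and punctuation pass through unchanged.
    The function name is an assumption made for this example.
    """
    return [
        CODE_WORDS.get(ch, ch) for ch in text.upper() if not ch.isspace()
    ]


print(" ".join(spell_out("Brown")))
```

A caller leaving a name on an answering machine could read the printed words aloud, one per letter, instead of repeating the name itself.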

A number of universities throughout the world have established programs concerned with voice recognition and its technological applications. At the University of Bonn, the Institute for Communication Research and Phonetics specializes in speech synthesis. University College London, Edinburgh University, and the universities of Amsterdam, Birmingham, Cambridge, Ghent, Oxford, Sydney, and Sheffield all have centers dedicated to language, speech, and phonetics. Here at home, the University of California at Santa Cruz, the University of Pennsylvania, Rice, Princeton, Yale, and other universities have specific departments engaged in speech-recognition research. At Michigan State University, the Pattern Recognition and Processing Laboratory is part of the Department of Computer Science and Engineering. At the University of Oregon, the Center for Spoken Language Understanding focuses on spectrographic reading technologies. At AT&T Bell Laboratories, one of the research programs focuses on the “obstacles to human-to-machine voice communication systems.”

Communications industry researchers, along with researchers at these and other universities, are no doubt concerned about some of the potential problems inherent in the new voice recognition technologies. Protecting intellectual property and preventing fraud, copyright violations, and product piracy are among these problems. For those with smart houses, for example, there must be a way to ensure that personal safety and the house’s functions are not compromised during a power outage. And security measures, using the speech spectrogram as an identifying tool, must be developed to make sure that the person using the device is authorized to do so.

But many of these new programs are no doubt designed with a potentially larger problem in mind: how to improve both the speech-recognition technology and the clarity of common speech, so that the new devices will work properly. The wireless, keyboard-free “Internet in the Sky” of tomorrow will be of little use if it fails to understand the user’s vocal instructions.

Mankind has had the gift of speech for 100,000 years. More than 2,000 years ago, the Greeks had the right idea in requiring citizens to become competent public speakers. Demosthenes, a famous orator, trained himself by putting pebbles in his mouth and trying to speak over the sound of the ocean waves. Today, a wine cork between the teeth and a tape recorder will do as well for anyone seeking to be understood more clearly. Speaking clearly, in a low-pitched and resonant voice, is a common courtesy to any listener. In the future, even casting any thoughts of courtesy aside, proper enunciation may well be necessary to use the next generation of voice recognition devices and all their functions.


Lillian Brown (CC ’95) is a media consultant specializing in image and voice coaching for members of Congress, broadcast journalists, and other public speakers. Formerly a TV documentary and radio producer, she is an adjunct professor at Georgetown University, teaching voice, diction, and public speaking. She also is the author of Your Public Best, a self-help guide to improving media appearances and public speaking.


[back]Return to COSMOS 2000 Table of Contents