We held our breath, waiting for the first utterance of our robot voice. It wasn’t going to be “Mama” or “Daddy.” No. This was our mechanical baby, talking with circuits and software, its digital phonemes assembled at the speed of light like a phonetic jigsaw puzzle.
We’d recorded thousands of human-spoken lines, which were chopped into phonemes to complete boring sentences like, “In a half mile turn right on….” Now we were waiting for the phoneme-aggregating voice engine to combine the sounds that would speak “Gramercy Place.”
The complete sentence would start with the human voice, and then the robot voice would fill in the street name by converting a text-to-speech (TTS) line into a human-sounding “Gramercy Place.” Like the Wright Brothers waiting for their plane to crash, we winced as the “Gramercy Place” part of the sentence approached. Would there be a shift in the basic characteristics: volume? presence? pitch? tone? diction? enunciation? pacing? Add to this another tier of twenty-four vocal qualities, like “flutter” (a bleat like a lamb’s cry), “honky” (excessive nasality), or my favorite, “ventricular” (a Louis Armstrong type of voice).
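The splice described above, a human-recorded carrier phrase with a machine-assembled street name dropped into the slot, can be sketched in a few lines. Everything here is illustrative: the phoneme labels, the carrier text, and the string “audio” are stand-ins, not the actual Alpine/IBM engine.

```python
# Toy sketch of slot-filling concatenative synthesis. Illustrative only:
# strings stand in for audio clips, and the phoneme labels are assumed,
# not the real engine's inventory.

# Carrier phrase recorded whole by the human voice talent.
CARRIER = "In a half mile turn right on"

# Harvested phoneme units: label -> "audio" (here, a spelled-out sound).
UNITS = {
    "G": "g", "R": "r", "AE": "a", "M": "m", "ER": "er", "S": "s",
    "IY": "ee", "P": "p", "L": "l", "EY": "ay",
}

def synthesize(words):
    """Concatenate stored units word by word, like a phonetic jigsaw."""
    return " ".join("".join(UNITS[p] for p in word) for word in words)

def speak(slot_words):
    """Splice the human-recorded carrier with the machine-built slot."""
    return f"{CARRIER} {synthesize(slot_words)}"

# "Gramercy Place" as a rough (assumed) phoneme transcription.
street = [["G", "R", "AE", "M", "ER", "S", "IY"], ["P", "L", "EY", "S"]]
print(speak(street))  # human start, robot finish
```

A real engine splices waveform segments and smooths the joins; the seam between carrier and slot is exactly where the “essence” tends to leak out.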
And there is yet another tier: “listener interpretation,” what a listener reads between the lines of a voice they hear. Does the speaker seem “sincere,” “caring,” “intelligent,” “comfortable,” “certain,” “empathetic,” “interested,” “apathetic,” “cold,” or “faking it”?
As any actor will tell you, a line can be read in perhaps a hundred ways. How could we expect a machine to jump through all the performance hoops of a trained actor? So, how did our computerized baby sound? IBM had finished its work. Alpine installed the “Super Voice” in the Honda Odyssey, and we sat in the parking lot, waiting for the first utterance. It was fascinating and eerie.
The voice sounded just like our voice talent, but an “essence” was missing. With all the chopping and recombining of phonemes, the “person” had vaporized. It was a “Stepford Wives” moment: synthetic caring. The voice sounded just like her, but “she” wasn’t there. She had slipped away during all the phoneme harvesting…escaping the harvest…somehow dodging the harvester’s blades as they chopped up the elements of human speech.
Perhaps some engineer is working on a new algorithm that can inject “realness” into the synthetic personas we create. We need a “ghost” to live in the voice files. But so far, there is no ghost in the machine. Perhaps soon there will be. Or we may hear the machine show its first chilling signs of being a real person when it quotes the line from 2001: “I’m sorry, Dave, I’m afraid I can’t do that.”
When you’re going for a ghost in the machine, you never know what you’re going to get.