Humans still beat voice-recognition technology

Friday, October 9, 2015

Humans recognize a familiar voice much more readily than speech-recognition technology does.

Voice-recognition technology is getting better all the time. With short vocalizations, it’s already much better than the human ear and brain at recognizing the sound of a human voice. However, once a few syllables come out, humans are better than machines.

A new study out of the University of Montréal shows that people can pick out the voice of a friend amongst a crowd in as little as two words. Voice-recognition technology isn’t quite that good yet. Here’s the press release from the university.

A related study shows how the ability to recognize a person’s voice can be helpful for students when it comes to learning and understanding lessons.

Published in October 2014 in the Journal of Child Language, this study shows that familiar voices can improve spoken language processing among school-age children.

Researchers at New York University found that hearing a familiar voice helps children process and understand only words they already know well, not new words that aren’t in their vocabularies.

“Adults and children can process language really well in quiet environments or with headphones on. But most of life, including classroom learning, is done in environments that aren’t silent,” says Susannah Levi, assistant professor of communicative sciences and disorders and the study’s lead author.

“This study shows that children were able to integrate knowledge of what a person sounds like and use this to their advantage. A potential benefit is that when there’s background noise and kids are listening to a familiar voice, like a teacher’s, kids use the familiarity to their advantage.”

More than 99 percent of the time, two words are enough for people with normal hearing to distinguish the voice of a close friend or relative amongst other voices, says the University of Montréal’s Julien Plante-Hébert, via press release.

His study, presented at the 18th International Congress of Phonetic Sciences, involved playing recordings to Canadian French speakers, who were asked to recognize on multiple trials which of the 10 male voices they heard was familiar to them. “Merci beaucoup” turned out to be all they needed to hear.

Mr Plante-Hébert is a voice recognition doctoral student at the university’s Department of Linguistics and Translation. “The auditory capacities of humans are exceptional in terms of identifying familiar voices,” he said. “At birth, babies can already recognize the voice of their mothers and distinguish the sounds of foreign languages.”

To evaluate these auditory capacities, he created a series of voice “lineups,” a technique inspired by the well-known visual identification procedure used by police, in which a group of individuals sharing similar physical traits is placed before a witness.

“A voice lineup is an analogous procedure in which several voices with similar acoustic aspects are presented,” he explained. “In my study, each voice lineup contained different lengths of utterances varying from one to eighteen syllables. Familiarity between the target voice and the identifier was defined by the degree of contact between the interlocutors.”

Forty-four people aged 18-65 participated. Mr Plante-Hébert found that the participants were unable to identify short utterances regardless of their familiarity with the person speaking. However, with utterances of four or more syllables, such as “merci beaucoup,” the success rate was nearly total for very familiar voices.

“Identification rates exceed those currently obtained with automatic systems,” he said. Indeed, in his opinion, the best speech recognition systems are much less efficient than the human auditory system: at best, they reach a 92 percent success rate, compared with over 99.9 percent for humans.

Moreover, in a noisy environment, humans can outperform machine-based recognition because of the brain’s ability to filter out ambient noise.

“Automatic speaker recognition is in fact the least accurate biometric factor compared to fingerprints or face or iris recognition,” Mr Plante-Hébert said. “While advanced technologies are able to capture a large amount of speech information, only humans so far are able to recognize familiar voices with almost total accuracy,” he concluded.