This computer isn’t perfect. So it understands speech as well as you

In a significant breakthrough for artificial intelligence, voice recognition software can now understand language as accurately as humans, although grasping the context behind it remains elusive.

Researchers at Microsoft have created software that has a word error rate of 5.9%, which is about the same as a human transcriber.

“The research milestone doesn’t mean the computer recognized every word perfectly. In fact, humans don’t do that, either. Instead, it means that the error rate – or the rate at which the computer misheard a word like ‘have’ for ‘is’ or ‘a’ for ‘the’ – is the same as you’d expect from a person hearing the same conversation,” Microsoft said in a blog post.
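For readers curious how such a figure is calculated, word error rate is conventionally the number of word substitutions, deletions and insertions needed to turn the machine's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only; it is not Microsoft's evaluation code, and the function name and sample sentences are invented for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative helper (hypothetical name), not taken from the article.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped (deletion)
                curr[j - 1] + 1,     # extra word heard (insertion)
                prev[j - 1] + cost,  # match or misheard word (substitution)
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: 'have' misheard as 'is' and 'a' misheard as 'the'.
rate = word_error_rate("we have a plan for the week",
                       "we is the plan for the week")
print(f"{rate:.1%}")
```

In this made-up example two of the seven reference words are misheard, so the rate is two in seven, or roughly 29%. A 5.9% rate corresponds to about six such errors per hundred words.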

The result has been edging closer for many years and comes just weeks after the same team reported that they had got the error rate down to a tantalising 6.3%.

Twenty years ago, the best published research system had a word error rate above 43%.

Both IBM and Microsoft cite the advent of deep neural networks, which are inspired by the biological processes of the brain, as a key reason for advances in speech recognition.

Computer scientists have for decades been trying to train computer systems to do things like recognize images and comprehend speech, but until recently those systems were plagued with inaccuracies.

The new Microsoft programme relies on these deep neural networks as well as specialized graphics processing units that allow the software to learn at speeds not previously possible.

The milestone has far-reaching implications.

Recent research by Tractica forecast that voice recognition software licenses will pass 550 million worldwide by 2024. Consumer and healthcare uses are the strongest growth sectors but the technology has implications across multiple industries.

[Chart: Annual voice and speech recognition licences by region, 2015-2024]

Researchers say more work is needed to improve the system in real-life settings, such as places where there is a lot of background noise. Identifying individual speakers when several people are talking is also part of longer-term research efforts.

And, as anyone who has spent time shouting at Siri, Cortana or Google Assistant will testify, there is still a lot of work needed to enable computers to not just understand which words are being spoken, but their meaning and context too. It will still be some time before computers can answer questions or follow instructions with the same accuracy as humans.

Harry Shum, who heads the Microsoft Artificial Intelligence and Research group, said: “It will be much longer, much further down the road until computers can understand the real meaning of what’s being said or shown.”
