Speech recognition software enables phones and computers to understand human utterances - be it a question, a command, or a general exclamation. And while a few decades ago this would have been in the realm of science fiction, nowadays it's a firmly established part of everyday life.
From checking the weather forecast and picking a playlist to sending texts and verifying your identity, the use of speech recognition is already so ingrained in society that we rarely give it a second thought!
But where did this technology come from? When did it all begin? And what does the future look like? Let's take a look at the history of speech recognition, how it's used today, and what the future has in store.
The first ever speech recognition system was built in 1952 by Bell Laboratories. Nicknamed 'Audrey', the clever system could recognize the sound of a spoken digit (zero to nine) with more than 90% accuracy - but only when spoken by its developer. It was much less accurate with unfamiliar voices.
IBM showcased the Shoebox at the 1962 World's Fair in Seattle. The device could understand 16 spoken English words. Later in the 1960s, the Soviets created an algorithm that could recognize 200 words. Both systems worked by matching individual spoken words against stored voice patterns.
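The idea behind these early systems can be illustrated with a toy sketch (not a reconstruction of the actual 1960s algorithms): each stored "template" below is a short, made-up feature vector standing in for a word's acoustic pattern, and an incoming utterance is matched to whichever template it is closest to.

```python
import math

# Made-up feature vectors standing in for stored voice patterns.
templates = {
    "yes":  [0.9, 0.1, 0.4],
    "no":   [0.2, 0.8, 0.5],
    "stop": [0.5, 0.5, 0.9],
}

def distance(a, b):
    # Euclidean distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(features):
    # Pick the stored word whose template is nearest to the input.
    return min(templates, key=lambda word: distance(features, templates[word]))

print(recognize([0.85, 0.15, 0.35]))  # closest to the "yes" template
```

Real systems extracted acoustic features from audio rather than using hand-written numbers, but the nearest-match principle - and its fragility with unfamiliar voices - is the same.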
In the 1970s, a US Department of Defense-funded program at Carnegie Mellon University developed the Harpy, which had a vocabulary of over 1,000 words. The biggest breakthrough here was that it could recognize not only words, but whole sentences.
IBM was back at the forefront in the 1980s with a voice-activated typewriter called Tangora. It had a 20,000-word vocabulary and used statistics to predict and identify words.
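"Using statistics to predict words" essentially means counting which words tend to follow which. Here's a minimal, illustrative bigram sketch of that idea - not IBM's actual model, and the tiny corpus is invented for the example:

```python
from collections import Counter, defaultdict

# A tiny corpus standing in for real training data.
corpus = "please send the letter please send the memo please file the memo".split()

# Count bigrams: how often each word follows each other word.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    # Return the word that most often followed `word` in the corpus.
    return bigrams[word].most_common(1)[0][0]

print(predict_next("send"))  # "the" always follows "send" in this corpus
```

Scaled up to millions of sentences, this kind of statistical prediction lets a recognizer prefer likely word sequences over acoustically similar but improbable ones.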
In the early 90s, Dragon Systems released the first consumer speech recognition product, called Dragon Dictate. In 1997, an upgrade called Dragon NaturallySpeaking was released. This was the first continuous speech recognition product, able to recognize speech at a rate of 100 words per minute. The technology is still used today - in fact, Nuance, the company behind Dragon, was acquired by Microsoft in 2021!
AI speech-to-text technology has come on in leaps and bounds in the past couple of decades. Google has led the way with its voice search product, and the likes of Apple, Amazon and Microsoft are all key players too.
There are two types of speech recognition: speaker-dependent and speaker-independent.
Speaker-dependent speech recognition software is trained to recognize a specific voice, in a similar way to voice recognition software.
New users have to 'train' the program by speaking to it - which often involves reading a few pages of text. This way, the computer can analyze the voice and learn to recognize it.
Speaker-dependent speech recognition generally provides very high accuracy.
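The 'training' step can be pictured as building per-word templates from a user's own voice. This is a toy sketch, assuming enrollment produces a few feature vectors per word (the numbers are invented): the samples are simply averaged into a template for that speaker.

```python
# Toy sketch of speaker-dependent training: the enrollment samples a
# user reads aloud are averaged into per-word templates for that voice.
def enroll(samples):
    # samples: word -> list of feature vectors from the training session
    return {
        word: [sum(vals) / len(vals) for vals in zip(*vectors)]
        for word, vectors in samples.items()
    }

training_session = {
    "open":  [[0.8, 0.2], [0.9, 0.1], [0.7, 0.3]],  # made-up features
    "close": [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7]],
}

user_templates = enroll(training_session)
print(user_templates["open"])  # averaged template, roughly [0.8, 0.2]
```

Because the templates are built from one person's voice, matches against that voice are tight - which is why speaker-dependent systems are accurate for their owner but struggle with everyone else.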
Speaker-independent software is designed to recognize anyone's voice, which means no training is involved. The software is focused on word recognition rather than a specific voice.
This type of speech recognition is generally less accurate, but it's the only real option for interactive voice response (IVR) applications, such as those used by call centers, as businesses can't ask callers to read pages of text before using their systems.
Here are some of the ways that speech recognition software is now used in everyday life:
Whenever you say "Hey Siri", it's speech recognition software that powers virtual assistants like Siri and allows us to use our devices just by talking!
Smart speakers like Amazon Echo and Apple HomePod also have virtual assistants built into them. 320 million smart speakers were in use in 2020, and this is set to double by 2024!
Speech recognition is at play every time you call a call center and a recorded voice asks you to state your name, reference number, or a summary of your query. This is known as Interactive Voice Response.
Many security systems, like those used by banks, use voice biometrics as a means of verifying a customer's identity.
Automatic transcription services, like Transcribe, use speech recognition to convert speech into text, providing you with transcripts within minutes, if not seconds.
Speech recognition is set to become more and more widely used. For example:
The global voice recognition market size is forecast to grow from $10.7 billion in 2020 to $27.16 billion by 2026
The number of smart speakers is set to double (from 320 million) by 2024
The number of digital voice assistants used in devices worldwide will also double, from 4.2 billion units in 2020 to 8.4 billion units in 2024 - a number higher than the world's population.
The more it's used, the more speech data is collected and the more investment is pumped into it, the more accurate speech recognition software will become. It will get better at understanding different accents, differentiating between speakers, and even recognizing emotions. Eventually it may also learn to understand different languages and dialects simultaneously.
No one can know for sure exactly what the future holds, but speech recognition software is on track to get better, more accurate, and more useful than ever before.
If you found this interesting, you might like to learn more about AI transcription, including how it works, how it's used today, and what you can expect from it in the future.
Written by Katie Garrett