How Does Speech Recognition Work?

More of us are getting used to telling Siri, Alexa, Cortana, and other digital assistants to set reminders, add items to lists, and tell us news, traffic, and more. Have you ever wondered how speech recognition* works?

The quick, basic answer is that a microphone captures the sound, and the system then turns the analog signal into a digital signal, analyzes it in small pieces, puts the pieces back together, and produces an output based on what it heard. Let’s take a look at each of these stages, as well as the hurdles speech programs overcome.

* Most people use the terms voice recognition and speech recognition interchangeably. Speech recognition identifies words. Voice recognition identifies words and sometimes involves the additional task of identifying one person from others by his or her unique voiceprint.

What Can Go Wrong?

Speech recognition software overcomes many challenges when figuring out what you said.

  • Background noise: What did you say and what was background noise? If other people are talking, can it tell which voice and words are yours?
  • Speaker differences: Voice qualities and pronunciations vary with age, gender, and regional accent, and each variation makes it harder to tell what you said.
  • Continuous versus discrete speaking: Can you speak your words all at once or do — you — have — to — say — each — word — separately?
  • Sentence parsing: Can the software separate continuous speech into individual words and build them into sentences? Advanced programs that use language modeling, statistical analysis, artificial intelligence, or a combination of these methods do this better. Even humans sometimes have difficulty, and misheard lyrics or sentences can be quite funny.

Analog to Digital

The first step of the process is to convert the utterance, or sound, from an analog signal into a digital signal.

An analog-to-digital (A/D) converter samples the sound wave and turns it into a stream of numbers. The software then analyzes the intensity of the sound at various frequencies using a mathematical process called the Fast Fourier Transform (FFT). This turns the waveform we usually associate with recordings (bottom graph of the picture) into a spectrogram (top graph of the picture).

Picture: voice spectrogram (top) and waveform (bottom) of the sentence “Cottage cheese with chives is delicious.”

The spectrogram is then broken into bits of sound called acoustic frames that last 1/25 to 1/50 of a second. The acoustic frames are then analyzed for phones and phonemes.
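The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular product does it: the sample rate, the frame size, and the synthetic two-tone "recording" are made up so the numbers work out evenly, and the small FFT routine is a textbook version written here for the example.

```python
import cmath
import math

SAMPLE_RATE = 6400   # samples per second (assumed; chosen so the math is tidy)
FRAME_SIZE = 128     # 128 / 6400 = 1/50 of a second per acoustic frame


def sample_tone(freq_hz, n_samples):
    """Stand-in for the A/D converter: sample a pure tone at SAMPLE_RATE."""
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE)
            for t in range(n_samples)]


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def spectrogram(samples):
    """Split samples into frames and return one magnitude spectrum per frame."""
    frames = [samples[i:i + FRAME_SIZE]
              for i in range(0, len(samples) - FRAME_SIZE + 1, FRAME_SIZE)]
    # Keep the first half of each spectrum; the rest mirrors it for real input.
    return [[abs(c) for c in fft(frame)[:FRAME_SIZE // 2]] for frame in frames]


def dominant_frequency(spectrum):
    """Frequency in Hz of the strongest bin in one frame's spectrum."""
    peak_bin = max(range(len(spectrum)), key=lambda k: spectrum[k])
    return peak_bin * SAMPLE_RATE / FRAME_SIZE


# A fake recording: 0.1 s of a 400 Hz tone followed by 0.1 s of an 800 Hz tone.
signal = sample_tone(400, 640) + sample_tone(800, 640)
spec = spectrogram(signal)
peaks = [dominant_frequency(frame) for frame in spec]
```

Each entry in `spec` is one acoustic frame, and `peaks` shows the strongest frequency switching from 400 Hz to 800 Hz halfway through. Real speech has energy at many frequencies at once, which is why the whole spectrum of each frame is kept rather than just the peak.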

In this case, phones are bits of sound, not the devices we use to talk and hold conversations. For example, dog has three phones: [d] [o] [g]. Phonemes are language-specific categorizations of sound. For instance, madder and matter have different phonemes but in American English often have the same phones. Or the word “bird” may have different phones when spoken by someone from Seattle (bird) and someone from Brooklyn (boid), depending on their accents, but it has the same phonemes.

Beads on a String

Once the software has converted the sound to digital input, broken it into bits, and analyzed it for phones and phonemes, it then puts the bits of sound back together as words like beads on a string. Since languages have thousands of words but only a few dozen phonemes (English and French have about 40, Spanish has around 25), it’s far easier to use a phonetic dictionary than a vocabulary dictionary to look up which words were possibly being said.
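As a rough sketch of the beads-on-a-string idea, the Python below looks up runs of phonemes in a tiny phonetic dictionary and returns every way the sequence can be split into known words. The dictionary entries and the simplified phoneme spellings are invented for illustration; a real phonetic dictionary is far larger and uses a standard phoneme set.

```python
# Hypothetical mini phonetic dictionary: phoneme sequence -> word.
PHONETIC_DICT = {
    ("d", "o", "g"): "dog",
    ("k", "a", "t"): "cat",
    ("s", "i", "t"): "sit",
    ("s", "i", "t", "s"): "sits",
    ("th", "e"): "the",
}
MAX_WORD_LEN = max(len(key) for key in PHONETIC_DICT)


def string_beads(phonemes):
    """Return every way to split the phoneme list into dictionary words."""
    if not phonemes:
        return [[]]
    results = []
    for end in range(1, min(len(phonemes), MAX_WORD_LEN) + 1):
        word = PHONETIC_DICT.get(tuple(phonemes[:end]))
        if word is not None:
            # Try to explain the remaining phonemes the same way.
            for rest in string_beads(phonemes[end:]):
                results.append([word] + rest)
    return results


heard = ["th", "e", "d", "o", "g", "s", "i", "t", "s"]
candidates = string_beads(heard)
```

Here the only complete reading is "the dog sits": the split "the dog sit" is dropped because it leaves a stray [s] that matches no word. With a realistic dictionary there are usually several candidate readings, which is where the methods in the next section come in.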

So What Did I Say?

Speech recognition software uses one or more of four approaches:

  1. Simple pattern matching
  2. Pattern and feature analysis
  3. Language modeling and statistical analysis
  4. Artificial neural networks

Simple pattern matching speech recognition is most often used by phone answering systems that recognize a small vocabulary of single words, such as numbers and simple yes or no answers.
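One way to picture simple pattern matching is nearest-template matching: the program stores one reference pattern per word and picks whichever is closest to what it heard. The sketch below assumes each utterance has already been reduced to a short list of numbers; the three-number templates and the distance cutoff are made up for illustration.

```python
import math

# Hypothetical stored templates: one short feature vector per vocabulary word.
TEMPLATES = {
    "yes": [0.9, 0.2, 0.1],
    "no": [0.1, 0.8, 0.3],
    "one": [0.4, 0.4, 0.9],
}


def recognize(features, max_distance=0.5):
    """Return the closest template's word, or None if nothing is close enough."""
    best_word, best_dist = None, float("inf")
    for word, template in TEMPLATES.items():
        dist = math.dist(features, template)  # straight-line distance
        if dist < best_dist:
            best_word, best_dist = word, dist
    return best_word if best_dist <= max_distance else None


answer = recognize([0.85, 0.25, 0.15])
unknown = recognize([0.0, 0.0, 0.0])
```

The cutoff is what lets a system like this say "Sorry, I didn't get that" instead of guessing. It also shows why the approach only suits small vocabularies: every new word needs its own stored template.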

Pattern and feature analysis allows a larger vocabulary by comparing the key features, or phones, against a much larger phonetic dictionary. It helps if you speak each word separately with pauses between words. Many speech-to-text programs use this method, and they have come a long way in the last ten to twenty years.

Some speech-to-text programs are a little smarter and use rules of language and grammar to be more accurate about which word you said. Did you say to, too, or two? When different words sound the same (homophones), language modeling uses statistical analysis to help figure out whether you’re using a verb, noun, or other part of speech. Did you say, “I wanted to dance, sing, and have fun” or “I wanted to dancing and have fun”? Statistically, “wanted to” is followed by a base verb like “dance,” not a present participle like “dancing.” The program can more accurately parse the sentence, or split a string of continuous sounds into separate words and put them back together based on structure. You can speak continuously rather than pausing between each word.
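The to/too/two choice can be sketched with simple word-pair counts. This is a toy, assuming a tiny made-up training text and a plain count-based score; real language models are trained on very large collections of text and use proper probabilities.

```python
from collections import Counter

# Tiny made-up training text; a real system would use a very large corpus.
CORPUS = (
    "i want to dance . i want to sing . i have two dogs . "
    "she has two cats . i want to dance too . he sings too ."
).split()

# How often each pair of neighbouring words appears in the training text.
BIGRAMS = Counter(zip(CORPUS, CORPUS[1:]))
HOMOPHONES = ["to", "too", "two"]


def pick_homophone(previous_word, next_word):
    """Choose the spelling whose neighbours were seen most often in the corpus."""
    def score(candidate):
        return (BIGRAMS[(previous_word, candidate)]
                + BIGRAMS[(candidate, next_word)])
    return max(HOMOPHONES, key=score)


after_want = pick_homophone("want", "dance")
before_dogs = pick_homophone("have", "dogs")
end_of_sentence = pick_homophone("sings", ".")
```

The sound is identical in all three calls; only the neighbouring words differ, and that is enough to pick "to", "two", and "too" respectively.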

Note: The statistical analysis often uses what’s called a hidden Markov model (HMM), which assigns probabilities to sequences of words and sounds. The Viterbi algorithm then finds the most probable sequence of words for the sounds that were heard.
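For the curious, here is a minimal sketch of the Viterbi algorithm on a toy hidden Markov model. The hidden states are words, the observations are rough sound labels, and every probability in the tables is invented for illustration. Note that "to" and "two" produce exactly the same sound, so only the word-to-word probabilities can tell them apart.

```python
STATES = ["want", "to", "two", "dance"]

# P(first word); all numbers below are made up for the example.
START = {"want": 0.7, "to": 0.1, "two": 0.1, "dance": 0.1}

# P(next word | current word); each row sums to 1.
TRANS = {
    "want":  {"want": 0.05, "to": 0.60, "two": 0.30, "dance": 0.05},
    "to":    {"want": 0.10, "to": 0.05, "two": 0.05, "dance": 0.80},
    "two":   {"want": 0.40, "to": 0.30, "two": 0.10, "dance": 0.20},
    "dance": {"want": 0.30, "to": 0.30, "two": 0.30, "dance": 0.10},
}

# P(sound | word); "to" and "two" sound the same.
EMIT = {
    "want":  {"wont": 0.90, "tu": 0.05, "dans": 0.05},
    "to":    {"wont": 0.05, "tu": 0.90, "dans": 0.05},
    "two":   {"wont": 0.05, "tu": 0.90, "dans": 0.05},
    "dance": {"wont": 0.05, "tu": 0.05, "dans": 0.90},
}


def viterbi(observations):
    """Return (probability, path) of the most likely hidden word sequence."""
    # best[state] = (probability of the best path ending in state, that path)
    best = {s: (START[s] * EMIT[s][observations[0]], [s]) for s in STATES}
    for obs in observations[1:]:
        step = {}
        for s in STATES:
            # Pick the best previous word to arrive at s from.
            prob, path = max(
                ((best[p][0] * TRANS[p][s], best[p][1]) for p in STATES),
                key=lambda item: item[0],
            )
            step[s] = (prob * EMIT[s][obs], path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])


probability, words = viterbi(["wont", "tu", "dans"])
```

With these tables the best path is "want to dance" rather than "want two dance", because "to" is far more likely than "two" to be followed by "dance". The algorithm keeps only the best path into each word at every step, so it never has to list every possible sentence.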

Artificial neural networks (ANNs) are one aspect of artificial intelligence (AI). They work by being fed many examples (input) along with the correct answers (output). The program analyzes multiple elements of each input and eventually begins to “learn” how to give the correct answer on its own. Some software personalizes itself to your speech patterns and word preferences, and can guess more accurately what you said (especially when used together with statistical analysis). ANNs are also better at working out not just what was said, but the meaning of the words. Programs such as Siri, Alexa, Cortana, and Google Now go beyond writing down your words: they interpret what you said, perform commands, and return search results.
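To make "learning from examples" concrete, the sketch below trains a single artificial neuron (a perceptron), the simplest building block of a neural network. It is a toy under stated assumptions: the two-number inputs and their yes/no labels are invented, and real speech systems use networks with many layers and vastly more examples.

```python
# Made-up training examples: (feature_1, feature_2) -> label (1 = "yes", 0 = "no").
EXAMPLES = [
    ((0.9, 0.1), 1), ((0.8, 0.3), 1), ((0.7, 0.2), 1),
    ((0.2, 0.9), 0), ((0.1, 0.7), 0), ((0.3, 0.8), 0),
]


def predict(weights, bias, features):
    """Fire (1) if the weighted sum of the inputs is positive, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0


def train(examples, epochs=20, learning_rate=0.1):
    """Perceptron learning rule: adjust the weights only after a wrong answer."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict(weights, bias, features)
            # Nudge each weight toward the correct answer (no change if error is 0).
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


weights, bias = train(EXAMPLES)
new_yes = predict(weights, bias, (0.85, 0.15))
new_no = predict(weights, bias, (0.15, 0.85))
```

Nobody tells the program which feature matters. It starts with all weights at zero, makes mistakes, and adjusts until it answers the training examples correctly, after which it also handles the two new inputs it has never seen.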