Hi, I'm Hannes. Hey, I'm Rodolphe. And last week we gave a workshop together about speech
recognition at our job, and one of the
parts of that workshop was about the
technology behind it. To be honest,
it's not our field of expertise at all,
so we had a hard time reading a lot of
stuff, a lot of difficult stuff, and we
never found a video that explains it
easily. Now that we've put a lot of
effort into it, maybe it's time we put
this video online, so here you have it:
Speech Recognition for Dummies.
Rodolphe: Oh yeah, and by the way last week we had the help
of two experts called e-Rodi and e-Hanni 
and those two guys are also gonna come
back to help us explain that to you
today.
Enjoy!
All right, so it all starts with us
humans making sound, making noise in
a normal environment. Technically,
we call that an analog environment, and
the thing is that a computer cannot
work with analog data. It needs digital
data to be able to work, so
the first thing we need
is what we call an analog-to-digital
converter. Actually, we can
also call that a microphone. Hey, e-Rodi,
can you help the people here understand
how a microphone works?
E-Rodi: Okay sure I'll help.  I'll pretend to be a microphone
converting a sentence from analog to
digital. Which phrase do you want me to
use? 
Rodolphe: How you doing? 
E-Rodi: All right, let's go! In
order for you humans to see the
conversion, we computers use a
visualization called a spectrogram. To
create the spectrogram, three main steps
are needed.
First, I capture the sound wave and place
it in a graph showing its amplitude over
time. As you can see, the amplitude units
are decibels, and we can even guess here
the three words that you just said. Second, I chop this wave into blocks of
approximately a second. I'm not really
that good of a microphone, to be honest,
but I have colleagues that can make these
blocks much thinner. As you can see, the
height of each block determines its state. To each state we can allocate a
number, and a number being something
digital, we have successfully converted
this sound from analog to digital. Two
steps down, one more to go! Even if the
data is digitized, we are still missing
something. In the speech recognition
process we actually need three elements
of sound: its frequency, its intensity and
the time it took to make it. Therefore,
we will be using a super complex formula
called the Fast Fourier Transform to convert
the graph you're currently seeing into
what we call a spectrogram. To ease your
understanding,
I'll show you here both a handwritten
version and a computer-made version of
the spectrogram. As you can see, the
spectrogram shows the frequency on the
vertical axis and the time on the
horizontal axis. And the colors
actually represent the energy that
you used to make the sound: the brighter
the color, the more energy was used. The
last interesting fact about the
spectrogram is the time scale. As you can
see, it's way more precise: each vertical
line is between 20 and 40 milliseconds
long and is called an acoustic frame.
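For readers who want to see this in code, the sampling-and-FFT story above can be sketched in a few lines of Python. This is a simplified illustration: NumPy, the sample rate, the 25 ms frame length, and the 440 Hz test tone are all our own choices, not something from the video.

```python
import numpy as np

SAMPLE_RATE = 8000                                   # samples per second (illustrative)

def digitize(analog_wave, duration_s, levels=256):
    """Analog-to-digital: sample a continuous wave and quantize to integer levels."""
    t = np.arange(int(SAMPLE_RATE * duration_s)) / SAMPLE_RATE
    samples = analog_wave(t)                         # analog amplitudes in [-1, 1]
    return np.round((samples + 1) / 2 * (levels - 1)).astype(int)

def spectrogram(digital, frame_ms=25):
    """Chop the signal into ~25 ms acoustic frames and FFT each one."""
    frame_len = int(SAMPLE_RATE * frame_ms / 1000)   # samples per acoustic frame
    n_frames = len(digital) // frame_len
    frames = digital[:n_frames * frame_len].reshape(n_frames, frame_len)
    # magnitude of the FFT: the frequency content (energy) of each frame
    return np.abs(np.fft.rfft(frames, axis=1))       # shape: (time, frequency)

# a pure 440 Hz tone stands in for speech
digital = digitize(lambda t: np.sin(2 * np.pi * 440 * t), duration_s=1.0)
spec = spectrogram(digital)
print(spec.shape)                                    # (40, 101): 40 frames, 101 frequency bins
```

Each row of `spec` is one acoustic frame; plotting the array with time horizontal, frequency vertical, and magnitude as color gives exactly the spectrogram picture described above.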
Okay, now it's time to really be honest. I'm a
good analog to digital converter but
actually my job stops there. Although I
managed to have a digital version of the
sounds you made, I have no idea
whatsoever what they're supposed to mean, if they even mean something. So I
suggest that my colleague e-Hanni
tells you all about how the computers
can understand the meaning of sound. But
before that our real-life versions have
to do a little work. Let's split the work
like this: you guys explain the concept
of phonemes and we take over the rest of
the heavy explanation.
Rodolphe: You got yourself a deal. All right, it's time for a little introduction to linguistics.
What is a phoneme? A phoneme is as short as
20 to 40 milliseconds, so it's
super short, and it's a unit of
sound that distinguishes one word from
another in a particular language. To put
it differently, it's the tiniest part of
the word that you can change and that
also makes the meaning of that word
change. For instance, the word thumb and
the word dumb are two different words
that you can distinguish by the
substitution of one phoneme 'th' by another
phoneme 'd'. Those phonemes can be spoken
differently by different people but it's
always the same phoneme that is meant.
Those variations are what we call allophones.
And the reasons for those
variations are the speaker's accent,
age, gender, the position of the
phoneme within the word, or even the
speaker's emotional state. Those
phonemes are important because they
are the very basic building blocks that
the speech recognition software can
put in the right order
to form first a word, then
a sentence, and so on. The speech
recognition software does that by using
two techniques: the Hidden Markov Model
and Neural Networks.
All right, so I'll explain those two. And
let's maybe first start with the Hidden
Markov Model. So as Rodolphe just said, it's actually the purpose to reconstruct the
phrase that has just been said so by
putting the right phonemes after each
other and the Hidden Markov Model does
that by using statistical probabilities.
So we will check how probable it is that
one phoneme follows
after the other and so on. To be precise:
the Hidden Markov Model does that using
three different layers and maybe, e-Hanni,
you can help us by visualizing this?
E-Hanni: Okay sure, I'll help. I'll pretend to be a
speech recognition system using a Hidden
Markov Model. Which phrase do you like me to use?
Hannes: Dolphins swim fast.
E-Hanni: All right, let's go.
So first of all the model has to
check on an acoustic level the
probability that the phoneme it has heard
really is that phoneme. That means, as
Rodolphe just said, we pronounce phonemes in very different ways according to emotion,
position in the phrase, and so on. So
the system first needs to check whether
the variation it has heard in a phoneme
really is that phoneme.
E-Hanni: Okay, so the first
utterance I recorded in Hannes'
phrase was 'd'. Statistically speaking, Hannes
could have said 't', 'th', or 'd'. But most likely,
most probably, it was a 'd'. So let's take that one.
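A toy sketch of this first, acoustic layer (all probabilities here are invented for the illustration):

```python
# Layer 1: given what was heard, how probable is each candidate phoneme?
# These numbers are made up for the 'd' / 't' / 'th' example.
ACOUSTIC_PROBS = {"d": 0.6, "t": 0.25, "th": 0.15}

def most_likely_phoneme(probs):
    """Pick the phoneme with the highest acoustic probability."""
    return max(probs, key=probs.get)

print(most_likely_phoneme(ACOUSTIC_PROBS))  # 'd'
```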
So, once the software has
reached a decent probability of what the
most likely said phoneme is, then it is
time to go to the second layer. In
the second layer, the Hidden Markov Model
will check how probable it is
that two phonemes stand
next to each other.
So maybe an example in English: if
you have the sound 'st', then it's most
likely that a vowel will follow for
example an 'a', such as in stable and it's
less likely, or maybe not even
possible in English, to have the sound 'n'
after it, because 'stn'... I don't think
that exists, and if it does, it's not probable.
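The same kind of toy sketch works for the second layer: a table of transition probabilities between neighbouring phonemes (again, all numbers invented for the illustration):

```python
# Layer 2: how probable is it that one phoneme follows another?
TRANSITIONS = {
    ("st", "a"): 0.30,   # 'st' + vowel, as in 'stable': probable
    ("st", "n"): 0.0,    # 'stn': (near) impossible in English
    ("d", "o"): 0.20,
    ("o", "l"): 0.25,
}

def transition_prob(prev_phoneme, next_phoneme):
    # unseen pairs default to a tiny probability rather than zero
    return TRANSITIONS.get((prev_phoneme, next_phoneme), 0.001)

print(transition_prob("st", "a") > transition_prob("st", "n"))  # True
```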
E-Hanni: Ok, so after the 'd' I heard an 'o'.
Statistically speaking, it is actually quite
probable that an 'o' follows after a 'd', so
let's keep it that way. After that I've also
heard an 'l' and again it's quite probable
that an 'l' follows after an 'o'. So, I think
I've put together the first phonemes to
make a word. The word 'doll'.
Hannes: Let's see about that, e-Hanni! Because in the third layer,
the software will now check on the word
level: it will check whether it's
probable for words to stand next to
each other and whether that makes sense.
So, for example, it will also check whether there are
too many or too few verbs in the phrase,
whether it needs adverbs, whether there are
enough subjects in it, and so on...
E-Hanni: Well, I think I already have to go back to the
second layer again because while you
were talking I've put together the second
word, and it's 'fins'. But 'doll fins', it
doesn't really make sense. So let me go
back to the second layer and reassess...
Ah, you probably said dolphins. Alright! Now the next phonemes I've put
together made the word 'swim' and the
word 'passed'. But now my phrase doesn't
really make sense, because I have two
verbs. So let me maybe check 'passed' again.
I need to find an adverb that sounds like
'passed' so that my phrase is grammatically
correct. So let me go back to the
previous layers again and... I already see it.
It seems like the 'p' in the first
layer could also be 'f' and then it makes
'fast'. 
Dolphins swim fast!
Hannes: That's right, e-Hanni. Now, people who
sometimes dictate to their phone
may already have seen this
happening: the more input you give
your phone, the more it may happen that
words at the beginning of your phrase
start changing, because the system has
become wiser. It knows what you're trying
to say, or not trying to say, and that's
why it changes some words.
E-Hanni: So, in short,
about the Hidden Markov Model: it
fits the sequential nature of
speech well. However, it's not that flexible.
Also, it cannot really grasp all the
varieties of the phonemes;
it's too much.
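For the curious: combining the acoustic scores (layer 1) with the transition scores (layer 2) to pick the most probable hidden sequence is what the Viterbi algorithm, the standard HMM decoding method, does. Here is a compact sketch, with the phoneme set, transitions, and probabilities all invented for the illustration:

```python
def viterbi(observations, phonemes, start_p, trans_p, emit_p):
    """Return the most probable hidden phoneme sequence for the observations."""
    # best[p] = (probability, path) of the best sequence ending in phoneme p
    best = {p: (start_p[p] * emit_p[p][observations[0]], [p]) for p in phonemes}
    for obs in observations[1:]:
        new_best = {}
        for p in phonemes:
            # combine the previous score, the transition, and the acoustic score
            prob, prev = max(
                (best[q][0] * trans_p[q][p] * emit_p[p][obs], q) for q in phonemes
            )
            new_best[p] = (prob, best[prev][1] + [p])
        best = new_best
    return max(best.values())[1]

# a tiny two-phoneme world with made-up numbers
phonemes = ["d", "t"]
start_p = {"d": 0.6, "t": 0.4}
trans_p = {"d": {"d": 0.1, "t": 0.9}, "t": {"d": 0.8, "t": 0.2}}
emit_p = {"d": {"lo": 0.7, "hi": 0.3}, "t": {"lo": 0.2, "hi": 0.8}}
print(viterbi(["lo", "hi"], phonemes, start_p, trans_p, emit_p))  # ['d', 't']
```

Re-running the decoding when later words make an earlier choice implausible is exactly the back-and-forth between layers that e-Hanni acted out with 'doll fins' and 'dolphins'.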
Hannes: All right, next to the Hidden Markov Model,
we also have the Neural Network.
So let's maybe talk a bit about that one.
And a good thing about the Neural
Network is that this one is flexible. As
the name itself says, the way a Neural
Network works is
based on how our brain works: a
lot of nodes that are all
connected to each other.
And maybe let's visualize again, 
so e-Hanni, can you help us?
E-Hanni: Yes!
So a Neural Network is built up by an input layer, a
hidden layer, and an output layer.
The hidden layer in the middle can itself be composed of many
layers. Now, as you can see, the
connections all have different weights,
so that only information that passes a
certain threshold will be sent through
to the next node. It also means
that if a node has to choose between two
inputs, as here, where C has to choose between
the input of A or B, it will choose
the input of the node with which it has
the strongest connection. So in this case,
it will take the information from A.
Sometimes, in some systems, it can also
take both inputs and make a ratio
of them. So here you can see that it takes
most of the input of A, but also a
little bit of the input of B.
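A tiny sketch of the weighted-connection idea e-Hanni just described (the weights and the threshold are invented for the illustration):

```python
def node_c(input_a, input_b, weight_a=0.8, weight_b=0.2, threshold=0.5):
    """Node C takes a weighted sum of A and B and only fires above a threshold."""
    total = weight_a * input_a + weight_b * input_b
    return total if total > threshold else 0.0

print(node_c(1.0, 1.0))  # 1.0: above the threshold, mostly A's contribution
print(node_c(0.0, 1.0))  # 0.0: B alone (weight 0.2) stays below the threshold
```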
Hannes: The interesting thing about a Neural Network
is that it's flexible, so it can change
over time. This means that in the
beginning we have to train the Neural
Network, which also means that in the
beginning all the different connections
have the same weight.
E-Hanni: Yes, indeed! So here
you can see an empty neural network so
that means that everything has the same
weight. So we will give a certain input
to the neural network, and we will say
what the desired output is. Then we
will let the neural network do its thing,
and it will come up with a certain
output, which is of course not the same as the
desired output, because it's still young; it
still needs to be trained. That difference
is what we call the error. We also
tell the Neural Network that there
is an error. From that point, the Neural
Network can start adapting itself so
that it can make the error smaller.
Now, for the Neural Network to
keep improving, it needs a lot of
inputs to make the error go away.
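The train-and-reduce-the-error loop can be sketched like this. It's a minimal illustration with one linear neuron, simple gradient descent, and made-up training samples, not the video's actual system:

```python
def train(samples, epochs=50, lr=0.1):
    """One linear neuron with two inputs, trained by simple gradient descent."""
    w = [0.5, 0.5]                          # all connections start with equal weight
    for _ in range(epochs):
        for (x1, x2), desired in samples:
            output = w[0] * x1 + w[1] * x2
            error = desired - output        # the "error" we tell the network about
            w[0] += lr * error * x1         # adapt the weights to shrink the error
            w[1] += lr * error * x2
    return w

# desired output = the first input only: the network should learn w ≈ [1, 0]
samples = [((1.0, 0.0), 1.0), ((0.0, 1.0), 0.0), ((1.0, 1.0), 1.0)]
w = train(samples)
print([round(x, 2) for x in w])  # [1.0, 0.0]
```

With only these three samples the weights converge; with fewer or noisier inputs the error shrinks more slowly, which is exactly the "it needs a lot of inputs" downside mentioned above.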
Hannes: And that's a downside. Another downside is
that it has a bad fit with the sequential
nature of speech. But on the plus side,
as already said, it's flexible, and it
can also grasp the varieties of the
phonemes. By that I mean it
can see the difference between unique
voices, emotions, phonemes at the
beginning or at the end of the phrase,
and so on. So that's really good. Now I think it's time for
e-Hanni to do the conclusion. Right, e-Hanni?
E-Hanni: These upsides and
downsides are very complementary to
those of the Hidden Markov Model. That is why the Hidden Markov
Model and Neural Networks are often
combined nowadays. So that's why we talk
about a... hybrid!
So that was it. We
tried to put all the difficult parts
into a coherent story, and we hope you
enjoyed it. If you're interested
in it, please go look it up on the Internet.
Ciao! Bye bye!
