Let's take a few minutes to talk about
sound.
We've mostly been talking about text and
tokens, and how we can use tokens in
algorithms and neural networks; in other
words, how we can use writing. But how
could we take data from spoken language,
from speech, and put it into a neural
network?
In order to do that, we're going to have
to take a brief look at how humans
produce sound. Speech is, essentially, a
signal that travels through the air when
your lungs push air out. The compression
wave that leaves your lungs passes
through several filters: first it passes
through your pharynx, and in particular
through your vocal cords.
Your vocal cords are flaps controlled by
muscles and cartilages. I recommend that
you look for videos of how they work
online, because they're really cool. They
are flaps that can be very open or very
closed, and they always offer a certain
resistance to the outgoing stream of
air.
So when the air hits the cords, they open
and close, open and close, periodically.
By doing this, they introduce a certain
frequency of vibration into the signal.
By the way, you can feel them vibrating
if you put your hand against your throat
and say a vowel, for example,
ah ah. Pause the video and try it yourself
with any of the vowels.
Ah.
Can you feel how your throat is
vibrating? That's the vibration of the
vocal cords. This vibration then passes
on to your mouth and your nose, and the
shape of your mouth changes the signal.
Have you ever been in a cave and noticed
how caves can have an echo, echo, echo,
with different echoes? This is
essentially what your mouth is doing.
The air comes out of your lungs through
your vocal cords and passes into the
cave that is your mouth, and the shape
of your mouth changes the different
echoes of the vibration of the vocal
cords.
For example, your mouth is wide open
when you say ah, and a little more
closed when you say ee.
These changes in shape alter the echoes
of the original vibration, and what
comes out of your mouth is a fairly
complex signal. As you can see here, the
signal is not a perfect sine wave, but
something fairly jagged that looks like
several frequencies superimposed on one
another, which is precisely what
happens: it picks up a lot of echoes on
its way out of your mouth.
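A quick sketch of this idea in Python (the frequencies and weights here are purely illustrative, not measured from real speech): adding just a few sine waves of different frequencies already produces the kind of jagged waveform described above.

```python
import numpy as np

# One second of "audio" sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate

# A pure tone is a single sine wave: smooth and perfectly periodic.
pure_tone = np.sin(2 * np.pi * 220 * t)

# A speech-like signal is many frequencies superimposed: a base
# vibration (the vocal cords) plus "echoes" added on top (the mouth).
# These frequencies and weights are illustrative, not real formants.
complex_signal = (1.0 * np.sin(2 * np.pi * 220 * t)
                  + 0.5 * np.sin(2 * np.pi * 880 * t)
                  + 0.3 * np.sin(2 * np.pi * 1760 * t))

# The sum is no longer a clean sine wave: its shape is jagged,
# even though each component by itself is smooth.
```

Each component alone stays within amplitude one, but where the components line up they add constructively, which is what gives the sum its jagged shape.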
What we can do is capture this signal
with, for example, a microphone, and
then decompose it into its constituent
frequencies. This is done by performing
a fast Fourier transform over successive
slices of the signal, and the result of
the transformation is a spectrogram.
As you can see here, this is the
amplitude of the signal, and this is its
spectrogram. What these energy ribbons
tell us is that there was some energy at
the lower frequencies, some energy in
this range of frequencies, and almost no
energy in this range. So the spectrogram
gives us information about the vibration
of the different parts of your mouth as
the signal comes out.
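As a rough sketch of what a tool does internally when it builds a spectrogram, here is how you could compute one yourself with NumPy: slice the signal into short windows and take an FFT of each one. The synthetic two-tone signal below stands in for recorded speech.

```python
import numpy as np

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
# A synthetic stand-in for speech: a strong 300 Hz component
# plus a weaker 2000 Hz component.
signal = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 2000 * t)

# Slice the signal into short overlapping windows (25 ms, every 10 ms).
win_len = int(0.025 * sample_rate)   # 400 samples
hop = int(0.010 * sample_rate)       # 160 samples
window = np.hanning(win_len)

frames = []
for start in range(0, len(signal) - win_len, hop):
    frame = signal[start:start + win_len] * window
    # FFT of one window: how much energy at each frequency, right now.
    frames.append(np.abs(np.fft.rfft(frame)))

# Rows = time steps, columns = frequency bins: this is the spectrogram.
spectrogram = np.array(frames)
freqs = np.fft.rfftfreq(win_len, d=1 / sample_rate)

# The strongest bin in a frame should sit near the 300 Hz component.
peak_freq = freqs[spectrogram[0].argmax()]
```

Plotting this array as an image, with time on one axis and frequency on the other, gives exactly the kind of energy-ribbon picture described above.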
The signal is very rich in cues. As you
can see, there are larger concentrations
of energy here than here, there are
regions of almost no energy, sometimes
the curve of energy falls off very
smoothly, and sometimes it cuts off
drastically. So there are a ton of cues
we could use regarding the frequencies,
the intensity of the signal, and changes
in the pitch of the signal, for example.
And I'm going to show you a quick
example. To your lower left you have the
URL for a program called Praat that
records audio and then decomposes the
frequencies to give you a nice
spectrogram, where we're going to see
this: different vowels have different
energy imprints. This is something we
could use to easily tell vowels apart,
and I want you to remember this chart as
we go along.
Different vowels have different energy
imprints: their energy maxima occur at
different frequencies. So, for example,
a vowel like a is going to have its
largest concentrations of energy here,
here, and here, and that is very
different from the concentrations of
energy here, here, and here. So, Praat!
Let's record some sound of me saying the
vowels of Spanish. We're going to go to
View & Edit, and here we have the audio
signal; I'm going to play it for you.
And these red ribbons here represent the
distribution of energy. For example, ah
has energy peaks at 851 hertz, 1400
hertz, and 2500 hertz; those are its
first three peaks, in order from
smallest to greatest. The first three
energy peaks of the next vowel are at
different locations: 405, 2321, and 2600
hertz. The configuration of the first
three echoes, the first three
concentrations of energy, is different:
here, peaks one and two are close
together and peak three is far above;
there, peak one is down below by itself
while peaks two and three are close
together higher up.
So as you can see, the energy imprints
of the two vowels are very different,
and they're different for all of them.
Just as a final demonstration, I'm going
to record a phrase: "Las frecuencias de
las vocales son todas diferentes." The
frequencies of the vowels are all
different.
As you can see, the signal has a certain
energy imprint that shifts back and
forth over time. So what we could do is
sample the signal here, then here and
here, and then, for example, one
millisecond later, or five milliseconds
later, and as we go along we could
measure the energy at each point of the
signal. Indeed, each point of the signal
is going to have a slightly different
energy imprint.
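A minimal sketch of that sampling idea, assuming we already have the signal as an array and using a 5 millisecond step: measure the energy of each short frame as we move along the signal.

```python
import numpy as np

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
# Synthetic stand-in: a tone that grows louder over the second.
signal = t * np.sin(2 * np.pi * 440 * t)

frame_len = int(0.005 * sample_rate)  # one measurement every 5 ms

# Energy of each frame: sum of squared amplitudes within the frame.
energies = [float(np.sum(signal[i:i + frame_len] ** 2))
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

# Each point in time gets its own energy measurement, and because
# the signal changes, neighbouring frames differ slightly.
```

Here the later frames carry more energy than the early ones, because the synthetic tone grows louder, which is the "imprint that shifts over time" from the paragraph above.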
And this is what we can feed to a
classifier. For example, we could record
the first energy peak from the bottom,
which we'll call F1 (the formal name for
these peaks is formants), then the
second energy peak and the third as
features, and then we could associate
those features with a label: for
example, the vowel ee, the vowel ih, or
the vowel eh. This is how we could
provide input to a neural network, or to
any classifier, so that it can take an
audio signal as input and give us a
sound as output.
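Here is a toy sketch of that feature-and-label setup. The formant triples below are illustrative placeholders, not measured values, and the classifier is the simplest possible one: label a new (F1, F2, F3) triple by its nearest training example.

```python
import math

# Illustrative (not measured) formant triples (F1, F2, F3) in Hz,
# each paired with a vowel label, as described above.
training_data = [
    ((300, 2300, 2900), "ee"),
    ((400, 2000, 2600), "ih"),
    ((550, 1800, 2500), "eh"),
]

def classify(formants):
    """Label a new (F1, F2, F3) triple by its nearest training example."""
    nearest = min(training_data,
                  key=lambda item: math.dist(item[0], formants))
    return nearest[1]

print(classify((320, 2250, 2850)))  # lands closest to the "ee" example
```

A neural network would learn a richer decision boundary from many such labelled triples, but the input/output contract is the same: formant features in, vowel label out.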
By the way, what many contemporary
systems actually do is take not three
features but 39: thirteen basic
measurements at each point (typically
twelve cepstral coefficients plus an
overall energy term), together with the
velocity of change of each one and its
acceleration. But either way, we use
these spectrographic features to get a
label for the sound that each energy
pattern might be representing.
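A sketch of how those 39 features come about, assuming we already have 13 basic measurements per time frame: append each frame's velocity (delta) and acceleration (delta-delta). Real systems usually compute the deltas by regression over several neighbouring frames; simple frame-to-frame differences are used here for clarity.

```python
import numpy as np

# Pretend we measured 13 basic features (e.g. 12 cepstral
# coefficients plus energy) for each of 100 time frames.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 13))

# Velocity: how fast each feature changes from frame to frame.
delta = np.diff(base, axis=0, prepend=base[:1])
# Acceleration: how fast the velocity itself changes.
delta2 = np.diff(delta, axis=0, prepend=delta[:1])

# 13 + 13 + 13 = 39 features per frame.
features = np.hstack([base, delta, delta2])
```

The point of the extra 26 columns is that a sound is characterised not only by its energy pattern at one instant, but by how that pattern is moving.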
In summary, spoken language is
transmitted through audio waves. The
waves that come out of your lungs are
shaped by your vocal cords; they acquire
a vibration frequency there, pick up
even more echoes and frequencies from
your mouth and your nose, and then come
out of your mouth. This produces a very
complex and rich signal that we can
split apart into a spectrogram.
We can then see where the highest
concentrations of energy are, and use
that as input for systems like automated
speech recognition.
In week nine we're going to look at the
many intricacies involved in doing
speech recognition, but I just wanted to
get us started on the topic.
