[MUSIC]
Stanford University.
>> I am very excited to
introduce to you Navdeep.
Navdeep did his PhD at Toronto with
Geoff Hinton, the godfather of deep learning.
He's done some really exciting work
on end-to-end speech recognition.
Really his name is on most of the exciting
breakthrough papers of the last couple of
years when it comes to speech.
So very excited to have him here.
He's now at Nvidia and I'm guessing,
continuing to work on speech recognition.
Okay, is that working?
Okay, hi everyone.
So, today I thought I'd give
a high-level overview of methods
that we're looking at for
end-to-end speech processing.
So here's the plan for the lecture today.
I'll start by taking a brief look at
traditional speech recognition systems.
Then I'll give a little motivation and
a description of what I mean
by an end-to-end model.
Then I'll talk about two different models
for end-to-end speech recognition
systems, one by the name of
Connectionist Temporal Classification.
And another recent one based on Listen,
Attend and
Spell which is
a sequence-to-sequence model,
something I believe you guys
are familiar with at this point.
Then I'll talk about some of the work
we've been doing on making improved
versions of these end-to-end models,
and end with talking a little
bit about how language models
influence speech recognition, and
some efforts at improving decoding which
is an important part of these models.
Okay, so starting with
the basic definition of what is
Automatic Speech Recognition,
in the era of OK Google,
I guess I don't really need to
describe it, but here it goes.
You have a person or
an audio source saying something, some piece of text.
And you have a bunch of microphones
which are receiving the audio signals.
And what is received at the microphone of
course depends on how it is oriented
with respect to the person.
And you get these signals from
the devices, one or many devices.
And you pass it to
an Automatic Speech Recognition system,
whose job is now to infer
the original source
transcript that the person spoke or
that the device played.
So why is ASR so important?
Well firstly it's a very
natural interface for
human communication,
you don't need a mouse or a keyboard,
so it's obviously a good way
to interact with machines.
You don't even really need to learn
a new technique, because most people
have learned how to speak by a certain age,
and of course it's a very natural
interface for talking with simple
devices such as cars or handheld phones.
Or even more complicated and
intelligent devices such as your
call center people or chatbots, and
eventually our robotic overlords.
So I think we need a good
speech recognition system.
Okay so how is this done classically?
I'll be focusing mostly on
stuff we've been doing lately,
which are all neural inspired models, but
I thought it would be nice to start by
talking about how these things
have been built classically and
see how they've been replaced with
a single neural net over time.
Okay, so the classic model for
speech recognition is to build
something called a generative model,
I don't know how many people are familiar
with generative models here?
Perfect, so the classic way of building
a Speech Recognition System is just to
build a generative model.
You build the generative
model of language.
On the left, you produce a certain
sequence of words from the language model.
And then for each word you have
a pronunciation model which says, hey,
this is how this
particular word is spoken.
Typically it's written out as a sequence
of phonemes, which are basic units
of sound, but for our vocabulary we'll
just say it's a sequence of tokens,
where the tokens are units
that have been defined by a linguistic expert.
So you have the pronunciation models which
now convert the sequence of text into
a sequence of pronunciation tokens.
And then the pronunciations feed
into an acoustic model,
which basically says what
a given token sounds like.
Typically these were built
using Gaussian mixture models, at least
in the past, and these Gaussian
mixture models had a very specific
sort of architecture
associated with them:
you'd have three-state, left-to-right
models that would
output frames of data.
And these models were now used
to describe the data themselves.
So here the data would be x,
which is the sequence
of frames of audio features x1 to xT.
Typically these features are something
that signal processing experts
have defined things like features
that look at frequency components of
the audio wave forms that are captured.
Things called spectrograms and
mel filter bank spectrograms,
which are loosely
matched to human hearing.
So the classical pipeline
proceeds as described.
Now each of these different components
in this pipeline uses a different
statistical model.
In the past, language models
were typically N-gram models,
they served very well.
So here obviously in this class
I don't really need to define
language models for you,
you have essentially tables describing
probabilities of sequences of tokens.
The pronunciation models were just
simple lookup tables with probabilities
associated with pronunciations
most words will have at least
a couple of pronunciations, and
if you have an accent it changes further.
So these pronunciation tables would
just be very large tables of
different pronunciations.
Acoustic models as I said would
be Gaussian Mixture Models and
the speech processing was predefined.
So once you have this model built,
you can do recognition by just doing
inference on the data you receive.
So you get some waveform,
you compute the features for it and
you get x, you look into your model.
And then using some fancy search
procedure, you figure out, okay,
what's the sequence of y's that
would give rise to this sequence
of x with the highest probability?
So in a nutshell, that's basically
the way classical speech
recognition systems work, and
all the magic is in how these
little pieces are refined.
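Here's a minimal sketch in Python of that inference idea: score each candidate transcript y by log P(x | y) + log P(y) and take the best one. The scoring functions and numbers below are toy stand-ins for illustration, not a real acoustic or language model.

```python
# Minimal sketch of classical decoding: pick the transcript y that maximizes
# log P(x | y) + log P(y). The scores here are toy stand-ins.
def log_p_language(words):
    # toy unigram language model over a tiny vocabulary
    unigram = {"the": -1.0, "cat": -2.0, "sat": -2.5, "mat": -3.0}
    return sum(unigram.get(w, -10.0) for w in words)

def log_p_acoustic(x, words):
    # stand-in for the pronunciation + acoustic model score of audio x given words;
    # here we just pretend longer audio should correspond to more words
    return -0.1 * abs(len(x) - 30 * len(words))

def decode(x, candidate_transcripts):
    scored = [(log_p_acoustic(x, y) + log_p_language(y), y) for y in candidate_transcripts]
    return max(scored)          # (best score, best transcript)

x = [0.0] * 90                  # 90 frames of "audio features"
print(decode(x, [["the", "cat", "sat"], ["the", "mat"], ["cat"]]))
```

In a real system the candidate transcripts come from a search procedure rather than a fixed list, but the scoring decomposition is the same.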
Okay, so now welcome to
the neural network invasion.
Over time people started noticing
that each of these components could
be done better if we
used a neural network.
So if you take an N-gram language model,
build a neural language model instead, and
feed that into a speech recognition system
to rescore things that were produced
by a first-pass speech recognition system,
then results improve a lot.
Also they looked into pronunciation
models and figured out, hey, how do
we do pronunciation for a new sequence of
characters that we've never seen before?
Most pronunciation tables will not
cover everything that you could hear.
They found that they could
use a neural network to learn to predict
token sequences from character sequences,
and that improved pronunciation models.
So more production-oriented speech recognition
systems, such as the Google one,
will build pronunciation
models just from RNNs.
Acoustic models, also the same story.
People used to use Gaussian
mixture models, and
they found that if they used DNNs,
or Deep Neural Networks, or
LSTM-based models, then you could
actually get much better scores for
whether frames were right or wrong.
Interestingly enough,
even the speech processing front end, which was
built with an analytical thought process
in mind about the production of speech,
was found to be
replaceable with convolutional
neural networks on raw speech signals.
So each of the pieces over time,
people have found that neural
networks just do better.
However, there's still a problem.
There are neural networks
in every component, but
the errors each one makes are different,
so they may not play well together.
So that's the basic motivation for
trying to go to a process where you train
the entire model as one big model itself.
And so the stuff I'll be talking about
from here is basically an attempt or
different attempts to do the same thing.
And we call these end-to-end models
because they try and encompass more and
more of the pipeline
that I described before.
And the first of these models is called
Connectionist Temporal Classification, and
it's in wide use these days at Baidu and
even at Google in their production systems.
However, it requires a lot of training.
And recently,
the trend in the area has been to try and
build an end-to-end model that does
not require hand customization.
And sequence-to-sequence models
are very useful for that, and
I'll talk about Listen, Attend and
Spell, which is one of these models.
Okay, so the basic motivation is we want
to do end-to-end speech recognition.
We're given some audio x.
It's a sequence of frames x1 to xt.
And we're also given during training
the corresponding output text y, y1 to yL.
And each of these y's is
one of whatever, 27, 28,
some number of tokens,
letters a, b, c, d, e, f.
Not sounds; we're trying to go straight
towards a model that goes from audio straight
to text, so we don't wanna use
any pre-defined notion of what it
means to be a particular phoneme.
Instead these models will start with x and
they have a goal to try and model y.
So y is just the transcript, and
x is the audio possibly processed
with some very minimal amount
of frequency based processing.
So now, what we want to do is
perform speech recognition
by just learning a very,
very powerful model, P of Y given X.
So the first model, describing the classical way of
doing these things, is the one at the top:
you start with y from the language model,
you look into the pronunciation model,
then you look into the acoustic model, and
you get some scores.
The end-to-end models actually
just collapse all of these into one big model and
reverse the flow of the arrows,
so they're discriminative.
You start with data,
which is X and the features, and
your goal is to directly predict
the target sequences, Y, themselves.
Obviously, this requires a very
powerful probabilistic model
because you're doing a very
difficult inversion task.
And I'd say, the only reason this is
possible now is because we have these
very strong probabilistic
models that can do that.
Okay, so the first of these models is
Connectionist Temporal Classification.
This is a probabilistic model p(Y|X),
where again,
X is a sequence of frames of data,
X1, X2, through XT.
Y itself is the output token sequence of length L,
Y1 to YL.
We require, because of the way the model's
constructed, that T be greater than L.
And this model has a very specific
structure that makes it suited for speech,
and I'll describe that in a second.
So again, X is the spectrogram shown here,
and Y is the corresponding output transcript.
Okay, so
the way this model works is as follows.
You get the spectrogram at the bottom, X.
You feed it into
a recurrent neural network.
You'll notice that the arrows
point in both directions.
This is just my way of drawing
out a bidirectional RNN.
I'm assuming everybody knows
what a bidirectional RNN is.
Okay, so it's a bidirectional RNN.
As a result,
the hidden state at any time step
depends on the entire input data.
So it can compute a fairly complicated
function of the entire data X.
Now, this model at the top has softmaxes
at every time frame corresponding to
the input.
And the softmax is on
a vocabulary which is the size of
the vocabulary you're interested in.
Say, in this case,
you had lowercase letters a to z and
some punctuation symbols.
So the vocabulary for connectionist,
for CTC, would be all that and
an extra token called a blank token.
And I'll get into the reason for
why the blank token exists in a second.
I think I forgot to point out
one important thing here.
Each frame of the prediction
here is basically
producing a log probability for
a different token class at that time step.
And we'll call that a score.
In this case, a score s(k,t) is
the log probability of category k,
not the letter k, but a category k
at time step t given the data x.
So let's say you took x4 here and look at
its softmax: the first index corresponds to
the probability of character a,
the second to the probability of character b,
then c, and so forth.
And the last entry in this softmax
corresponds to the blank symbol itself.
So when you look at
the softmax at any frame,
you can get a probability
of the class itself.
Okay, so what CTC does: you look at
the softmaxes that are produced by
the recurrent neural network
over all the time steps,
and you're interested in
finding the probability of
the transcript through these
individual softmaxes over time.
What it does is,
it says I can represent all these paths.
I can take a path through
the entire space of softmaxes,
and look at just the symbols that
correspond to each of the time steps.
So if you take the third symbol,
that's a C in the first time step.
If you take the third symbol
again at the second time step, that's another C, and
then you go through a blank symbol.
It's essential that every symbol
go through a blank symbol;
that's a constraint that the model has.
So you go through a blank symbol,
and then you produce the next character,
A, and then you produce A again for
another frame.
And then you produce a blank,
and you can transition to a T, and
then you have to produce a blank again.
So you go through these paths
with the constraint that, from one time step
to the next, you can only either
stay on the same label,
or move from that label to a blank symbol.
You end up with different ways of
representing an output sequence.
So here we had the output
sequence represented
in these frames as c c
blank a a blank t blank.
There are many other paths that also
correspond to the character sequence cat.
So for example, if you wanted to produce
cat from the sequence of tokens here,
you could also have produced it from
the way the second line here maps it out
where you would say cc blank blank,
and then produce an a,
then you produce a blank, and then you
produce a t, and then you produce a blank.
So all this sounds complicated,
but really all it is, is saying,
there's some paths you can take
through this sequence of softmax and
it's got a certain constraint that you
have to follow, namely, you can only
transition between yourself and the same
symbol again or yourself and a blank.
Given these constraints, it turns out that even
though there's an exponential number of
paths that produce the same
output sequence, you can actually compute
its probability exactly, because there exists
a dynamic programming algorithm.
I'll spare you the details of that
dynamic programming algorithm today,
but it's not as complicated as it sounds,
I'd refer you to the paper if you're
interested in finding
out what that is about.
Anyhow, the nice thing about this model is
that you're able to take in inputs,
produce output tokens, and learn this
mapping because the probabilities can be
computed exactly.
And not only can the probabilities for
an output sequence be computed exactly,
you can also get a gradient,
which is the learning signal that
you require to learn the model.
And once you get the gradient from that
learning signal, you can back-propagate
that into the recurrent neural network and
learn the parameters of the model.
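Here's a minimal Python sketch of the idea for a toy four-frame example: a frame-level path collapses to an output by merging repeats and dropping blanks, and the probability of a transcript is the sum over all paths that collapse to it. The per-frame probabilities are made up, the collapse rule is the standard CTC one, and the real algorithm uses dynamic programming rather than the brute-force enumeration shown here.

```python
import itertools
import numpy as np

BLANK = "-"

def collapse(path):
    # merge repeated symbols, then drop blanks
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

# toy per-frame softmax outputs over the vocabulary {'c', 'a', 't', blank}
vocab = ["c", "a", "t", BLANK]
probs = np.array([
    [0.6, 0.2, 0.1, 0.1],   # frame 1
    [0.1, 0.7, 0.1, 0.1],   # frame 2
    [0.1, 0.1, 0.2, 0.6],   # frame 3
    [0.1, 0.1, 0.7, 0.1],   # frame 4
])

# brute-force sum over all frame-level paths that collapse to "cat"
total = 0.0
for path in itertools.product(range(len(vocab)), repeat=probs.shape[0]):
    if collapse([vocab[i] for i in path]) == "cat":
        total += np.prod(probs[np.arange(probs.shape[0]), list(path)])
print("P('cat' | x) =", total)
```

Because this sum is differentiable in the per-frame probabilities, you get exactly the gradient signal described above.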
Feel free to ask any
questions at any time.
I'm happy to answer questions.
Sure, mm-hm?
So the question is,
are we using processed audio, or
can we actually do raw audio, or
maybe even, what do we do in practice?
So we found some years ago that
we could actually use raw audio,
however it didn't actually beat the best
way of processing the raw audio minimally,
which would be just
computing a spectrogram and
then adding the sort of bias that
the human hearing apparatus has.
It turns out the human hearing
apparatus doesn't have a linear
resolution on frequencies.
We're able to separate frequencies that
are fairly close by in lower ranges and
then it becomes logarithmic and
you have to be really far apart in
frequency space to tell them apart.
So if you impose that bias,
you get a log mel spectrogram instead
of just a linear spectrogram.
People have tried to improve
upon the log mel spectrogram but
the attempts have been not very good
when it comes to single channel.
There's the case where you
might have multiple devices,
where you have multiple microphones
that are recording things, and
there you can actually exploit subtleties such as
one microphone being closer to the person
than the other, and it turns out
that you can actually use raw waveforms
to produce better
results than spectrograms.
So I haven't talked about that much here,
but what you feed in,
it's really not that important for
the sake of calling it end to end.
Let's just say it's a little
convolutional model that works on
frames of raw waveforms.
So you just take the raw waveform,
you split it up into little frames,
and that works just as well.
Unfortunately, it doesn't work better yet
as we had originally hoped unless you
have multiple microphones
in which case it does.
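For concreteness, here's a minimal sketch of the front end described above, computing a log mel spectrogram from a waveform. It assumes the librosa package is available; the file name and parameter values are just placeholders.

```python
import librosa

audio, sr = librosa.load("utterance.wav", sr=16000)   # raw waveform
mel = librosa.feature.melspectrogram(
    y=audio, sr=sr,
    n_fft=400, hop_length=160,    # 25 ms windows, 10 ms hop -> ~100 frames/sec
    n_mels=80,                    # 80 mel filter banks
)
log_mel = librosa.power_to_db(mel)  # log compression, roughly matching the ear's response
print(log_mel.shape)                # (80, num_frames): the x_1..x_T fed to the model
```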
So here's some results for CTC,
how it functions on a given piece of audio.
This audio stream corresponds
to the phrase "his friends", and
in the figure I've lifted from the paper,
we have aligned the raw waveform at the bottom and
the corresponding predictions at the top.
So you'll see, it's producing
the symbol H, at a certain point,
it gets a very high probability,
it goes straight from 0 to 1 so
it's confident it's heard
the sound corresponding to H.
There's a faint line here which
corresponds to the blank symbol, and
you'll see that when you
want to emit symbols,
the blank symbol probability
starts to dip down to zero.
So they swap in probability
space depending on whether the model is
confident it wants to produce an H symbol.
Over time, you can see this
H now dies down to zero and
of course, as the audio proceeds,
you start getting high probabilities for
the other characters
corresponding to the sounds.
To give you some examples of
what this looks like when you just
take about 81 hours of audio and
train on the corresponding text.
So imagine you're a child.
You are born, you listen for 10 days,
eight hours a day, which most kids don't,
and you start producing text like this.
So the target is:
"To illustrate the point,
a prominent Middle East analyst in
Washington recounts
a call from one campaign."
And the model produces:
"Two alstrait the point
a prominent Midille East analyst
im Washington Recouncacacall
from one campaign."
Here's another one,
I'll let you read that yourself.
I'll just point out boutique,
which it spells "bootik";
it's kind of cute, and
sometimes it gets things quite right.
So it's pretty interesting
that it produces text that very much is
like the output text that you desire.
Of course,
it turns out these sound very good:
if you read the transcript out loud,
it sounds like what you've heard.
So, clearly something is missing.
What's missing is correct spelling and
also a notion of grammar.
So if you had some way of ranking
these different hypotheses
that you produce from your model and
re-ranking them with a language model,
you should get much better results.
It turns out in this case, the original
base model had a word error rate of 30%.
That means of the words that it was
producing, 30% were wrong which seems
like a very big number, but even a small
spelling mistake will cause that error.
Now if you use a language model to
re-rank the different hypotheses,
you can get that word error rate down to
8.7%, and this is just using 81 hours of data.
Subsequent to this work, Google looked
into using CTC where they actually
have a language model as part of
the model itself during training, and
that's kinda the production models you use
now when you call in with Ok Google and
that fixed a lot of these issues.
And of course they used thousands of hours
of data instead of 81 hours of data.
There's no such thing as Big Data,
or big enough.
So if you look at their paper, there are
some interesting results when you change
the targets: instead of using characters,
you can actually use words.
You can have a different
vocabulary size for words and
see how the recognition system performs.
So here the top panel is where
there are 7,000 words in the vocabulary,
and the bottom panel is where there
are 90,000 words in the vocabulary.
The actual text to be produced is
"to become a dietary nutritionist,
what classes should I take for
a two year program in a community college?"
If you look carefully in the panel, and
I'm not sure if it's entirely visible all the way
in the back, these things are color-coded
in terms of the different words
that were produced in this window, and
the curves correspond to their probabilities.
So here, there's the blank
symbol which again, goes up and
down, depending on whether symbols
are being produced or not.
It has this word to which is
the first word here in green,
so it's got a high probability.
Also, note that it also is confused
a little bit about the word do at
the same time.
It's either to or do.
Do has a much lower probability.
So it produces to and then it
produces nothing, but a blank symbol.
And then it gets to the next word and
it produces the word become,
and then the word a shows up,
and then dietary.
Turns out dietary is not in
the vocabulary size of 7000, so
it just produces diet which
is in the vocabulary and
you'll see this row here
corresponding to diet being yellow.
It also produces Terry, the name.
It doesn't have dietary, but
just produces Terry as a response.
It also produces military and
some other targets.
But overall, it gets most of the words
that it's expecting correct.
If you increase the vocabulary
size to be large enough, so
that dietary is in the vocabulary,
you find that it actually also produces
dietary as an output in there,
although it also produces diet.
So, a language model would
fix that kind of an issue.
So, that's what I have to say about CTC.
I'm afraid I have to switch here and.
Let's see, there we go.
I promise to manage with four switches,
yeah.
So now,
switching gears in terms of the models.
The CTC model is interesting.
But if you were paying attention very
carefully from a modeling perspective,
you'd find that the model makes
predictions just based on the data.
And once it's done with making
those predictions for each frame,
there's no way of
adjusting that prediction.
It has to do the best it can
with those predictions.
An alternative model is
the sequence-to-sequence models,
which you guys have been reading
about from looking at your lectures.
So, I won't talk too much about
the basics and jump straight in.
We have a model here,
which basically does next step prediction.
You're given some data x and
you've produced some symbols y1 to yi,
and your model is just going to predict the
probability of the next symbol yi+1,
and your goal is basically to
learn a very good model for p.
And if you can do that, then you have
a model for p of any arbitrary y given x.
So for a model that does speech recognition
within the sequence-to-sequence framework,
what changes is x.
In translation,
x would be a source-language sentence.
In speech, x itself is
this huge sequence of audio.
That is now encoded with
a recurrent neural network.
What it needs to function is
the ability to look at different parts
of temporal space,
because the input is really, really long.
I think if you look at
translation results,
you'll find that translation gets worse
as the source sentence becomes larger.
It's because it's really hard for
the model to look precisely
at the right place.
Turns out that problem is aggravated
a lot more when you have audio streams.
Audio streams are much longer:
typically you have a hundred frames per second,
and when you wanna transcribe
something that's ten seconds long,
that's about a thousand input time steps,
as opposed to translation, where you might have
like 40 word tokens that you're gonna translate.
So it's a very aggravating problem and
you need to do attention if you
wanna make this model work at all.
Whereas with translation,
you can get by without attention.
So, how exactly does attention work here?
You're trying to produce
the first character C and
you have this way of producing
an attention vector.
I'll go into that shortly how that's done,
but it's fairly standard.
This attention vector,
essentially looks at different parts
of the time steps of the input.
Here it's saying,
I want to produce the next token.
And to produce that token, I should really
look at the features that were over here
and the features that were over here.
So once it looks at the features
corresponding to those time steps, it's
able to produce a character of C and then
it feeds in the character C to itself and
then produces the next character which
is A after changing the attention.
The attention now looks further down
from where it was looking at the first
time step and then you feed in
the character A into your model.
And again, you recompute attention and
it automatically just moves
forward once its learned.
So if you keep doing this
over the entire input stream,
you get this forward-moving attention,
which is just learned by the model itself.
So here, it's producing the output
sequence cancel, cancel, cancel.
The question was, are we no longer
predicting the previous token or a break symbol?
So this is a different model,
which is a sequence to sequence model,
you feed in the entire data as an input
conditioning and there is no notion of
consuming a little bit of the input and
then producing output.
Instead, the entire input is
looked at every time step.
And so you don't really need
to add breaks anywhere.
You just produce one token,
then you produce the next one condition
on the last token you produced.
So going back, it's essentially
doing next-step prediction.
In this model you have a neural net,
which is the decoder
in a sequence-to-sequence model,
that looks at the entire
input through the encoder.
You feed in the past
symbols that you've produced,
because it's a recurrent neural network,
so you can just keep feeding symbols and
the length issue does not arise.
So you feed the past symbols into
the recurrent neural network, and
then you're just predicting the next
token as the output.
I need to switch a third time.
This is the second last one, I promise.
So, how does this attention model work?
So firstly,
you have an encoder on the left-hand side.
It seems to have a special structure.
I'll go into that in a few slides.
For now, just forget the fact
that it has a special structure.
And just remember that for every time step
of the input, it's producing some vector
representation which encodes the input and
that's represented as ht at time step t.
So you have hidden vector, ht.
At time step t and
you were generating the next character
at every time step with the decoder.
So what you do is you take
the state vector of the decoder.
So that's the bottom layer of the recurrent
neural network that is the decoder.
And you now compare
the state vector against
each of the hidden time
steps of the encoder.
And so semantically what that means is
you kind of have this query in mind
which is this state S.
And you have places ht that you're
looking at as possible places where
the information is present.
So you take this query and
you compare it to every ht.
You could have done something very simple
like take a dot product, in which case
the vectors have to
be the same size.
Or you could have done something
much more sophisticated,
which is you take the hidden state
that you want to compare the query
against concatenate them into a vector.
And then put them in a neural network
which produces a scalar value, and
turns out that's what we do.
So you basically have this function f,
which takes in a concatenation
of the hidden state at time step t
with the state of the recurrent neural
network, the decoder state, and
then produces a single number e of t.
Now, you do that for
every time step of the encoder, and so
you have a curve over time
in the encoder space.
And that's kind of like
a similarity between your query and
your source from the encoder.
So you get this curve of e of t values, and
of course these are just scalars.
Yeah, you want to keep these
magnitudes under control.
So you can pass them through a softmax
which normalizes across the timesteps.
And so you get something that sums to one.
And that's what's called the attention
vector, and the figure shows you
the trend of these attention
vectors as the query changes over time.
So every timestep,
you got an attention vector which shows
you where you look at for that timestep.
Then you move to the next timestep, you
recompute your new attention vector, and
you do that over and over again.
So now that you have an attention vector,
what you can do is
use these probabilities over timesteps
to blend the hidden states together.
And get one context value which is this
representation that is of interest to you
in actually doing the prediction for
that time step.
So here,
you would take all the hidden states and
the corresponding attention value.
And just multiply them and
add them together, and
that gives you a context vector.
And this context vector is really
the content that will guide the prediction
that you make.
So you take this context vector,
you concatenate it with
the state of your RNN,
and you pass that through a neural net and
you get a prediction at that time step.
And this prediction of course is the
probability of the next token, given all
the past tokens you produced and all the
input that was fed into the encoder.
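Here's a minimal numpy sketch of that attention step: score each encoder state against the decoder state with a tiny MLP, softmax over time, and blend the encoder states into one context vector. The weights and dimensions are random placeholders, not the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att = 50, 32, 32, 16

h = rng.normal(size=(T, d_enc))     # encoder states h_1..h_T
s = rng.normal(size=(d_dec,))       # current decoder state (the "query")

W = rng.normal(size=(d_enc + d_dec, d_att)) * 0.1
v = rng.normal(size=(d_att,)) * 0.1

# e_t = f([h_t; s]) for every encoder time step
e = np.tanh(np.concatenate([h, np.tile(s, (T, 1))], axis=1) @ W) @ v

a = np.exp(e - e.max())
a = a / a.sum()                     # attention weights, sum to 1 over time
context = a @ h                     # blend of encoder states

# the context vector is concatenated with s and fed to a softmax over characters
print(a.shape, context.shape)       # (50,), (32,)
```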
This is exciting for me,
I don't have to switch after this.
Okay, so now what's this funny
business with the encoder?
You're used to seeing
a recurrent neural network,
which basically proceeds at
the same rate as the input.
So you get an input, you process it
through some hidden state,
you pass it through one recurrent step,
and you move on.
We found that when we did that for
audio sequences that are really long,
such as realistic speech in
Wall Street Journal, which is
one of the speech corpora,
it was just not able to learn
a very good attention model.
Things just wouldn't get off the ground,
and
that makes sense because you have
a lot of timesteps to look over.
So typically, you'll get something like
7 seconds which would be 700 frames and
you're doing a softmax over 700 timesteps.
It's hard for you to sort of
initially learn where to propagate
the signal down to,
to predict what the token is.
So it never really catches on very fast.
So this hierarchical encoder
is a replacement for
a plain recurrent neural network:
instead of processing one frame
at every time step,
you collapse neighboring frames
as you feed them into the next layer.
What this does is, at every layer,
it reduces the number of timesteps that
you have to process, and
it also makes the processing faster.
So if you do this a few times, by the time
you get to the top layer of the encoder.
Your number of timesteps has
been reduced significantly and
your attention model is
able to work a lot better.
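Here's a minimal sketch of that pyramidal reduction: concatenate pairs of neighboring frames before feeding the next layer, halving the number of time steps each time. The sizes are illustrative, and in the real model a recurrent layer runs over the reduced sequence at each level.

```python
import numpy as np

def pyramid_step(h):
    """h: (T, d) hidden states -> (T // 2, 2 * d) by stacking adjacent frames."""
    T, d = h.shape
    T = T - (T % 2)                 # drop a trailing frame if T is odd
    return h[:T].reshape(T // 2, 2 * d)

h = np.random.randn(700, 256)       # ~7 seconds of encoder states
for _ in range(3):                   # three pyramid layers
    h = pyramid_step(h)
    # in the real model a (bidirectional) RNN layer would run over h here
print(h.shape)                       # (87, 2048): far fewer steps for attention
```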
So here's some example
output that this model produces.
And I specifically want to point out
that the outputs are very multimodal.
And what do I mean by that?
So you have an input where the truth is
"call a a a roadside assistance".
The model produces call a a a roadside
assistance as the first output.
But it also produces call
triple a roadside assistance.
So the same audio can be used to produce
very different kinds of transcripts.
Which is really the power of the model and
says how this model can actually
learn very complex functions.
And actually solve this task with just
one model instead of requiring many.
So interestingly if you look down,
you'll see
the reason why this model is able to
produce call a a a, and call triple a.
Because the training set
has a lot of "call xxx" examples,
the model learns a very
specific pattern and is able to sort of
transfer that to a different domain.
Another aspect of the model is causality,
and what do I mean by that?
Here, you have an output which is St.
Mary's Animal Clinic
which is the transcript.
So if you look at the attention over
time: when you produce the first token s,
it just looks at this little blob at
the top, and then when you get to t,
the attention moves forward to
pretty much the center of the character.
And the model is multimodal, so
instead of just producing "st marys",
it can also produce "saint mary",
which is a truly different
transcript from the same audio.
And now look at where the attention
goes: before, it would produce "st"
and then start moving
forward when "marys" came along.
Here, it's the same word, "saint",
so it actually dwells at the same
timesteps in attention space.
So that's a notion where,
whatever symbol you've produced,
really affects how the neural network
behaves at the next few timesteps.
And that's really a very strong
characteristic of this model.
Okay, the question is:
is the fact that the model produces
two different transcripts
a result of there being ambiguity in
the pronunciation model itself?
So I think that is essentially
what the model tries to capture.
The fact that the same word can
be pronounced multiple ways or
that the same pronunciation can be
written out multiple ways are sort of
two different, but related, problems.
One is different pronunciation
producing the same token,
which does not require
multimodality in a model
as long as one source of sound
only produces the same token.
What's happening here is
more interesting in that
the same sound can be
written out in multiple ways.
During training, clearly it must have
heard both sides, it's heard saint written
out as S-T and another time it must have
heard saint written out as an S-A-I-N-T.
So you need it in the training data but
what's nice is the model's really powerful
enough to realize that the same
sound can actually do this and
it can actually produce
very different transcripts.
So when we did this about a year and
a half ago,
these are old results,
things are more exciting now.
We found that our model, when
you didn't use a language model,
produced a word error rate of
around 14%, whereas the best system
that we had at Google at the time was around 8%.
At this point I should say that when I was
an intern at Google many years ago,
the word error rate was 16%, and
that was the result of 45 years of work,
where people had customized all the speech
recognition components very carefully for
all those years.
And this model, just one
single model that goes straight
from audio to text, gets a lower word error
rate than what we were getting in 2011.
So that, I think, is something to write
home about, which is pretty exciting.
Turns out it still benefits from
having a language model, so
if you feed in a language model the 14%
word error rate comes down to 10.3.
So obviously, there's no substitute for
billions and billions
of written text sentences in trying to
disambiguate speech recognition better.
But just the basic model
by itself does very well.
So, now what are the limitations
of this model?
One of the big limitations preventing
its use in an online system is that
if you notice, the output is produced
conditioned on the entire input.
So you get this next-step prediction
of y at step t plus 1 given all the input x and
all the tokens you've produced so
far, which is y1 to yt.
So it's really just doing
next step prediction but
the next step prediction is
conditioned on the entire input.
So if you were gonna try and put it
in a real speech recognition system,
you'd have to first wait for
the entire audio to be received before
you can start outputting the symbol.
Because, by definition,
the mathematical model conditions the next
token on the entire
input and the past tokens.
Another problem is that the attention
model itself is a computational bottleneck:
for every output time step, you have to
look at the entire input sequence.
So there's a comparison whose cost scales
with the length of the input,
which makes it a lot slower and
harder to learn as well.
Further, as the input becomes
longer, the word error rate gets worse.
This is really an old slide,
this doesn't happen much anymore.
But I'll talk about the methods we came
up with to prevent this a little later.
So I'm gonna now switch gears to
another model, which is called
the online sequence-to-sequence model.
This model was designed to try and
overcome the limitations of sequence to
sequence models where we don't want to
wait till the entire input has arrived
before we start producing the output.
And we also wanna try and avoid computing
the attention over the entire sequence,
because that seems to
be overkill as well.
So you want to produce outputs as inputs
arrive and it has to solve this problem
which is, am I ready to produce an output
now that I've received this much input?
So, the model has changed a little bit,
not only does it have to produce a symbol,
it has to know when it has enough
information to produce the next symbol.
So, this model is called the neural
transducer, and in essence,
it's a very simple idea.
You take the input as it comes in, and
every so often at regular intervals,
you run a sequence to sequence model
on what you received in the last block.
And so you have this situation here
where you basically have the encoder.
And now instead of the encoder
attention looking at the entire input,
it just looks at this little block, and
this decoder which we call the transducer
here will now produce the output symbols.
Now notice that since we've chunked up
the input into blocks, we have this situation where
you may have received some input,
but you can't yet produce an output.
And so now we're sort of, we need this
blank symbol back again in this model.
Because it really can be the situation
where you got a long pause,
you haven't heard any words, so you really
shouldn't be producing any symbols.
So we reintroduce this symbol called
the end-of-block symbol here in this model
to sort of encapsulate the situation that
you shouldn't be producing any outputs.
One nice thing about this model now,
is that it maintains causality.
So if you remember the CTC model,
it also had this notion of
not producing any outputs.
But when you produce these symbols,
you did not feed back what
you had produced in the past.
And so it didn't have these notions
where the same input can produce
multiple outputs.
And it didn't have the notion of causality,
where depending on what you've produced so
far, you change
the computation from there on.
So here in the neural transducer,
it preserves this advantage of
a sequence-to-sequence model.
And it also, of course,
now introduces an alignment problem,
just like these slides have
an alignment problem too.
So, in essence,
what you want to know is you have
to produce some symbols as outputs.
But you don't know which chunk
should these symbols be aligned to.
And you have to solve that
problem during learning.
I won't describe this very carefully,
but I'll make a go of it.
You have some output symbols,
y1 to yS, that have to be produced, and
an input that consists of B blocks.
You can now output these S symbols along
with B end-of-block markers, interleaved
in different ways, to describe the actual
alignment of how the symbols are produced.
And of course there's a combinatorial
number of ways in which you can align
the original input to
the actual block symbols.
So the probability distribution
turns out to be: the probability
of y1 to yS given x is modeled as
the sum over all the different
ways in which you can align y1
to yS to the original B blocks.
And the extra B in the alignment length comes
from the fact that there are B blocks and
each block ends with
an end-of-block symbol.
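Here's a small sketch of that alignment space for a toy example: the S output tokens are split across B blocks, each block ends with an end-of-block marker <e>, and every way of splitting them is one alignment of length S + B. The token names and marker are just for illustration.

```python
from itertools import combinations_with_replacement

def alignments(tokens, num_blocks):
    S = len(tokens)
    # choose where the S tokens are cut into num_blocks (possibly empty) groups
    for cuts in combinations_with_replacement(range(S + 1), num_blocks - 1):
        bounds = (0,) + cuts + (S,)
        blocks = [list(tokens[bounds[i]:bounds[i + 1]]) + ["<e>"]
                  for i in range(num_blocks)]
        yield [sym for block in blocks for sym in block]

all_a = list(alignments(["c", "a", "t"], num_blocks=3))
print(len(all_a))    # 10 alignments, each of length S + B = 6
print(all_a[0])      # ['<e>', '<e>', 'c', 'a', 't', '<e>']
```

The model's probability of the transcript is the sum of the probabilities of all of these alignments.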
So now it's similar to CTC,
you have some output sequence.
You can produce them from
a bunch of different ways.
And all of those ways
have some probability, and
you have to learn in
spite of that.
Unlike CTC, of course, this model is
not independent at every time step.
Once you make a prediction, you feed back
your previous tokens, and that changes
the entire probability distribution
at the next timestep.
What this means is that there's no decomposition
between different parts of the input
given the data,
so you can't really do a dynamic
programming algorithm that simplifies
this computation.
So, we came up with a simple way of doing
this approximation of the sum, which was,
let's just find the best possible
alignment given your current model.
So, you basically try and
do some kind of a beam search.
And you find the best path as the output,
and then you use that during training.
Okay, so
sorry one more point that I should make.
That's the same process
we used during inference.
The model itself is you want to
produce these symbols, y1 to ys,
you can do it in a variety of ways.
During inference you find the best one and
you go with that one as
being the actual transcript.
During learning, if you take
the gradient of that combinatorial sum,
it comes down to a particular
form which boils down to saying:
give me a sample from the posterior probability
of aligning the outputs given the input,
and then train the log
probability of that sample.
If that sounds like Greek, I wouldn't
worry too much about it, but
I'll say it one more time.
This basically happens in
cases where the model is a sum
over a combinatorial number of terms.
If you want to optimize such a model you
basically have to take the gradient of the
log prob, and the gradient of the log prob
turns out to be a sum of the gradients of the
log probs of the individual alignments,
weighted by their posteriors.
And that's the case in this model as well.
Of course this is really
hard to optimize, and so
we replaced this entire really
complicated optimization procedure with:
given an output sequence,
find the best alignment and
just make that alignment better.
It's kind of like a Viterbi sort of trick.
So I'm just gonna skip this part.
Okay, the finding of the best
path is kind of interesting.
So I'll cover this part.
Turns out if you want to find the best
path there's also a combinatorial
number of ways, and so what you can
do is kind of do a beam search,
where you keep a bunch of candidates
around, and then you extend them, and
as you extend them you re-rank
them and throw away the worst ones.
However, it turns out that if you do beam search,
it doesn't really work.
And so what we came up with
was a dynamic programming procedure,
which is approximate, that works very well.
And the way this works
is you consider the best
candidates that are produced
at the end of a block, from
producing either j-2 tokens,
j-1 tokens, or
j tokens at the end of block b-1.
So you know: if I wanted
to produce j-2 tokens by
the end of the previous block,
what's the best probability?
And now,
that corresponds to this dot here.
So from that dot you can now extend
by one symbol, or by two symbols, or
by three symbols, and you get
different paths that reach the same node.
And so when you're considering
the different ways of entering a node,
you just find the best one and
you keep that around.
And you then now extend those
ones to the next time step.
It's kind of an approximate procedure,
because this ability to extend
a symbol is not Markovian.
And so if we take this max as the max
of the previous step extended by one,
that may be wrong, because the correct
path may be two steps away and
that would have been better.
However, it seems to work very
well to find an alignment that
trains the online
sequence-to-sequence model properly.
So here are some results on this model.
If you change the window size,
that's how the block is constructed.
You find that with different block
sizes, when there's an attention model
within the blocks, the model works very well.
So within these blocks we can have
attention instead of just running
a plain sequence-to-sequence model,
and then it's not affected by
the window size of the blocks.
Those are these lines at the bottom.
It also turns out that you
don't really need attention
if the window size is small,
which sidesteps this problem of
doing attention over
the entire input sequence.
And that's where we're really trying
to get: a model that can output symbols
sequence-to-sequence when it needs to,
but doesn't have to do all this
computation that scales with the input length.
So that was basically the online
sequence-to-sequence model.
I wanna touch a little bit on how you
can make the sequence-to-sequence models
themselves better.
One of the things that people are doing
these days borrowing from vision is to use
convolutional neural networks.
So in vision related tasks, convolutional
neural networks have been very powerful.
Some of the best models for
object detection and
object recognition use
convolutional models.
They are also very effective in speech,
so we tried this kind of architecture
in speech, for the encoder side of things.
So, you take the traditional pyramid model,
and instead of building the pyramid by
just stacking two things together,
you can actually put a fancier
architecture in when you do the stacking.
So don't just stack two time steps
and feed them to the next layer;
instead, stack them as feature maps and
put a convolutional neural network on top.
I think you guys have not been exposed
to convolutional neural networks yet,
but let's just say it's a very
specific kind of model that looks
at some subset of the input
instead of the entire input.
And so the subset that it looks at
has to be matched to the structure.
So if you are in a task such as vision,
an image patch is natural structure or
substructure to look at
instead of the entire image.
For speech, also if you look at
the frequency bands and the time stamps of
the features, that corresponds to
a natural substructure to look at.
So the convolutional model will
just look at that substructure.
So what we did in this work was to say,
okay,
now we're gonna change this
computation, which is a pyramid, and
add a lot of depth to it by adding these
convolutional architectures at every step.
So in the past it was just a simple
linear projection of two time steps
stacked together, but now it's a very deep
convolutional model, which of course for
deep learning experts is great, because
the belief is that the more nonlinearities
and the deeper the model, the better it is,
and this model has a lot of that.
When we did that,
we found very good results.
If you take a baseline on a task
called Wall Street Journal,
it goes from something like 14.76
word error rate down to 10.5
just by using this very specific trick
on how to do these convolutions.
So deeper continues to be a good model.
Okay, now switching to the question of
what output space is an
appropriate one for speech.
So in translation,
what happens is there's multiple
ways people have discovered on how
to encode the output sequence.
You might produce character sequences,
you might produce words and
character sequences, or you might
produce subsequences of characters and
use those as the output vocabulary.
In speech, that seems not natural because
what you want to do is you want to
produce output tokens that corresponds
to some notion of the sound that was
being produced in the input.
So what you would like to do
is change the vocabulary.
So it's not just characters, but
maybe bigrams or
trigrams of characters that
correspond to some audio token.
So basically these are,
I guess the slide is talking about
the different ways to do it.
As I said you can either represent
the word or the characters or
the word and characters, but for
speech you want to use N-grams.
However, there's a problem here: should we
decide the N-grams beforehand and
then just use those during training?
That kind of defies the end-to-end model,
where we want to actually learn
the entire process as one big model.
So we decided okay,
what we could do is build this
vocabulary which is unigrams,
bigrams, trigrams, and
whatever number of N-grams of characters,
and put them in a softmax.
And now a problem arises:
if you have a word like hello,
it can be decomposed in multiple ways.
You might spell it as H-E-L-L-O, or
you might spell it as HE-L-L-O
if HE happens to be in the target
set that you've chosen.
So it's really an underdefined problem,
or a problem where
you now have to deal with multiple
ways of outputting the same sequence.
So how should we make this choice?
One way of making this choice is,
you get a phrase such as CAT SITS and
you just look in your token space.
If you have CA in your tokens, you just
say I'm gonna choose CA.
Then you do T, and then SI, and then TS.
So this is just very greedy in terms
of how you produce the tokens.
Another way is to look at the compression:
the sequence of tokens that
has the highest probability.
Here, it's basically about reuse;
it's like encoding,
and you would use something like
a minimum description length.
So AT happens to be a lot
more frequent than CA, so
you would rather choose AT as a token.
In this case, the decomposition for
CAT SITS would be C, AT, SI, TS.
That would be another way of course,
it's not clear for
the audio which is the best way.
So our approach was to try to
learn this automatically.
So you have some output y*,
which is the correct output sequence,
and you try out all the possible
decompositions of the same output.
So you basically look at all possible
ways of producing the token.
And when you do the learning, you take
the gradient of all possibilities of
producing the output sequence and
propagate that error signal down.
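Here's a small sketch of the greedy, fixed decomposition just mentioned, using a toy vocabulary of character n-grams; the learned approach instead sums (and takes gradients) over all valid decompositions rather than committing to one.

```python
def greedy_left_to_right(text, vocab, max_len=3):
    pieces, i = [], 0
    while i < len(text):
        for n in range(max_len, 0, -1):          # longest match first
            if text[i:i + n] in vocab:
                pieces.append(text[i:i + n])
                i += n
                break
        else:                                     # unknown character: emit it as-is
            pieces.append(text[i])
            i += 1
    return pieces

vocab = {"c", "a", "t", "s", "i", " ", "ca", "at", "si", "ts"}
print(greedy_left_to_right("cat sits", vocab))   # ['ca', 't', ' ', 'si', 'ts']
```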
Just so I know, when does this class end?
If you look at how this model performs,
turns out it helps to use
larger N-gram pieces.
So if you take a character
based model which was just CTC,
with no language model,
it would have 27% word error rate.
If you take the sequence-to-sequence model,
the LAS model, with character outputs,
it produces a 14.7% word error rate.
If you use 2-grams it does better,
and with 3-grams it does even better.
The 4-gram model does better on training, but
generalizes worse, presumably because
our dataset was really limited.
It's just 81 hours of data, and once you
start using larger and larger tokens,
you can imagine there isn't
enough evidence for
a lot of the longer tokens;
similarly for 5-grams.
To show you an example, the actual
text is "shamrock's pretax profit
from the sale was $125
million a spokeswoman said".
The character model produces
shamrock as C-H-A-M-R-O-C-K.
The 4 gram model will take the sh sound
straight up as SH, which is nice.
And then it does a lot of these
things with single characters,
but you can see common bigrams and
trigrams being used as well.
If you look at whether or
not the model is actually using these pieces,
numerically, you find that if you had
trained the model with just one kind
of decomposition,
where you just took the maximal extension,
it used the N-grams more, because it
was trained to use these longer N-grams.
However, if you look at
the results that come out
when you actually learn
the decomposition,
it still does a good job of
learning to use the N-grams, and
it gets a lower word error rate.
So that's quite promising in terms
of achieving what we wanted to do.
So now I'm switching gears here and going
into some of the final shortcomings of
sequence to sequence models when they're
applied into speech recognition.
If you look at the transcripts
that are produced
in terms of the probabilities of every
token, you find an interesting pattern.
Here, at the top is the actual sequence.
So South Africa the Solution by
Francis Candold and Leoguen Low.
It's actually, this is
not the right transcript,
this is just the highest probability one.
Below each token you have the alternate
tokens that had a probability within
some threshold of
the probability of the best token.
So if there's no tokens below a token,
that means there's no ambiguity.
It's really sure that that's the token
that's got the right answer.
If there are many, that means the model
is a little confused at this point
when it's producing the next token.
So you find a very interesting trend:
there's a lot of ambiguity at the start,
at the first characters of a word.
But as soon as you produce the first few
characters there's very little ambiguity
as to what the next characters are.
Unless it's things like names, so
Francis here, there's some confusion
on how to sound it out and Candold
has some probability issues as well.
So this might seem surprising but
it's natural.
If you're doing language
modelling on a character level.
Once you have the first
few characters of a word,
you pretty much know what the word is.
And so what you really wanna do is
be much more discriminative
at the starts of words,
because that's where you'll make an error.
If you make an error at the start of a
word, you're never gonna recover from it.
And so if we want to fix this problem,
we need to sort of address this issue.
The repercussion of this problem is that
if you're over confident about the wrong
word, not even a language
model can help you.
Because you've basically decided early
on what the word is going to be.
And you need very precise language
model probabilities to kind
of get you out of the rut.
So we've found that there's this little
technique called entropy regularization,
which prevents your softmax from ever
being too confident, and that really just
solves this problem.
Every time you predict
the next character,
you make sure you're not putting
a probability of one on a single symbol.
Instead, if you're getting too
confident you penalize it, so you're
forced to spread the probability
distribution over the other characters.
Once you do that,
this problem really just goes away.
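Here's a minimal sketch of the idea: the usual cross-entropy loss minus a small bonus for keeping the output distribution spread out, so the model is penalized for becoming overconfident. The logits and the weight beta are toy values.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def loss_with_entropy_bonus(logits, target_idx, beta=0.1):
    p = softmax(logits)
    cross_entropy = -np.log(p[target_idx])
    entropy = -(p * np.log(p + 1e-12)).sum()
    return cross_entropy - beta * entropy      # high entropy lowers the loss

logits = np.array([5.0, 0.1, 0.1, 0.1])        # a very confident prediction
print(loss_with_entropy_bonus(logits, target_idx=0))
```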
And the baseline model that we
had just improved massively.
So if you remember,
we had CTC on an end-to-end task on
Wall Street Journal, which
had a word error rate of something like 27.3%.
Then there was a baseline sequence-to-sequence
model that wasn't ours,
but from Yoshua Bengio's group, that
had an 18.6% word error rate.
Our baseline for some reason was 12.9%
word error rate, and once we applied
this technique of entropy regularization,
that error rate went down to 10.5%.
There are different ways by which
you can do this regularization.
One is, you just use the entropy.
Another is, you say the probability
distribution must look like
the unigram probability
distribution of the output tokens.
And that seems to work better than just
doing full entropy regularization.
So that's one problem.
Another big problem that arises
during decoding is this lack of
generative penalty.
I think there was a slide
in your translation
lecture which talked about
this in a different setting.
But I'll talk about it
in the context of audio.
When you have a very
long input sequence and
you're decoding it one character
at a time, what you're
doing is comparing your hypothesis
against all alternative hypotheses.
So every time you produce a new token
you pay a cost for that extra token.
If your input is very long,
then the transcript you're producing
is probably also very long,
so you have to produce a lot of tokens.
So let's say you have to
produce a transcript of 100 tokens and
you're paying an average cost of one per token.
That means you're gonna pay a cost of
100 even for a correct transcript.
Now in your beam search,
you'll probably get some hypotheses
that terminate early with a total
cost much less than 100.
I think this is a very subtle point but
the upshot of it is,
your model thinks it's okay
to terminate an output
without even looking at the rest
of the input when it's not.
And the reason this happens is,
the model has no notion of
explaining the entire input.
And because it doesn't have to explain the
entire input, it just terminates early.
And very often you'll produce very short
output sequences when you should be
producing very large output sequences.
So to give an example,
if the output transcript is
"chase is nigeria's registrar and
the society is an independent organization
hired to count votes", the language
model cost is minus 108,
and the model cost, this is just
the LAS model, is minus 34.
And if you look at the other alternatives,
you get "chase is nigeria's registrar",
which has a low cost of minus 31.
So it's just happy to produce a short
transcript instead of this really long one.
If you look at "chase is nigeria's" or
"chase's nature is register",
those also have small costs.
In fact the best-scoring option is just
to produce nothing, which is minus 12.5.
So this ties into
the issue that discriminative
models don't have to explain the entire input,
and so they can make such errors.
What we found worked quite
well was simple: train as usual, but
during test time add a little
penalty which says, I'm gonna try and
make sure the total attention probability
that each input frame receives, summed over
the output time steps, is greater than a threshold.
So look at all the frames of the input:
has some output step looked at them or not?
If there are frames that nobody has looked at,
at least up to some threshold,
then you pay a cost for that.
And when you do that, all these other
hypotheses that terminate early are now
paying a lot of cost for every frame they
did not explain, and so they fall down and
are re-ranked out.
So we need do this little trick,
our model now really performs quite well.
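A minimal sketch of what such a test-time coverage penalty could look like, assuming we keep the attention weights for each hypothesis (the threshold `tau` and weight `alpha` are illustrative names, not values from the lecture):

```python
import numpy as np

def coverage_penalty(attention, tau=0.5):
    # attention: array of shape (num_output_steps, num_input_frames) holding
    # the attention weights produced while decoding one hypothesis.
    # A frame counts as explained once the attention it received, summed over
    # output steps, reaches tau; frames below the threshold contribute a cost.
    total_per_frame = attention.sum(axis=0)
    shortfall = np.maximum(tau - total_per_frame, 0.0)
    return shortfall.sum()

def rescore(log_prob, attention, alpha=1.0, tau=0.5):
    # Test-time re-ranking score: model log probability minus the coverage
    # penalty, so hypotheses that terminate early fall down the beam.
    return log_prob - alpha * coverage_penalty(attention, tau)
```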
This sequence-to-sequence model is now able to get a 6.7 word error rate on Wall Street Journal, which is the lowest for end-to-end models, including CTC with a language model. It's very, very promising that sequence-to-sequence can achieve such low numbers with such small amounts of data.
I should point out that this really is changing the model. You train a model with one objective, but during test time you fiddle with the loss that you claimed was the best one. So, technically, there's something wrong with the model that needs fixing.
So finally, something very relevant to an NLP class: what do you do about language models here? No matter how much audio you have, you'll always have more text, and that text can be really useful in correcting errors that this model will make. I could train on 2,000 hours of data, but that's just about 3 million small utterances, and the language model you learn from that is not gonna be very powerful. So the speech recognizer is going to make mistakes no matter what.
Since I'm at Nvidia now, I can't take the blame for that not working.
>> [LAUGH]
>> So the question arises, how can we do better language model blending in these models? Because they're really end-to-end models, you're basically training the entire task at once: you're doing the acoustic model and the language model all in one model. And now suddenly you're saying, hey, can I please revert back and find ways to add my language model in here.
Well, one obvious way is to add the log probabilities every time you do a next-step prediction. You make a prediction for the next time step, then you blend in a language model prediction with it, and hopefully that fixes a lot of the errors.
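A minimal sketch of that simple blending, where the language model weight is an assumed tuning knob rather than a value from the talk:

```python
import numpy as np

def fused_next_step_scores(asr_log_probs, lm_log_probs, lm_weight=0.3):
    # asr_log_probs: log p(next token | audio, tokens so far) from the
    # sequence-to-sequence speech model, one entry per vocabulary symbol.
    # lm_log_probs: log p(next token | tokens so far) from a language model
    # trained on text only. The blend just adds the two streams of log
    # probabilities, with a tunable weight on the language model.
    return np.asarray(asr_log_probs) + lm_weight * np.asarray(lm_log_probs)

# Hypothetical usage inside beam search: extend hypotheses using the fused
# scores instead of the speech model's scores alone.
# step_scores = fused_next_step_scores(asr_step_log_probs, lm_step_log_probs)
```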
There's some cool work from Yoshua Bengio's group which tries a bunch of other tricks to do this, which they call shallow and deep fusion models. It's basically an extension of the simple idea I've just described, but it does a little bit more in how the blending is done. Instead of just blending the actual log probabilities, it uses a model that does a learned linear projection, and it learns that jointly.
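As a very rough, hypothetical sketch of that learned blending (the gate parameters and shapes are assumptions for illustration, not the paper's exact formulation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fused_features(decoder_state, lm_state, w_gate, b_gate):
    # Instead of summing log probabilities, compute a gate from the language
    # model's hidden state and feed the gated LM state alongside the decoder
    # state into the output layer; w_gate and b_gate are learned jointly
    # with the rest of the network.
    gate = sigmoid(w_gate @ lm_state + b_gate)
    return np.concatenate([decoder_state, gate * lm_state])
```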
I think fundamentally this is an interesting approach, where you basically have two streams of predictions and you combine them. However, one of the things that's lacking about this model, which in their case is for translation, is that the translation model's predictions don't actually affect the internals of the language model, and vice versa. What you would like is a model where the fusion actually means changing all of the features that are computed, rather than just summing the logits, but it's a pretty good start.
I apologize, I forgot what this slide was. So when you're producing very long sequences with next-step prediction, what the model is doing is just looking at the next token, and I highlighted why that can be a problem. For example, when you're producing words, if you've produced two characters, the next one is almost necessarily decided right away. So when you have a loss function that's just looking at the next step, it doesn't have a very long horizon. When it makes a mistake, it might go down a wrong path that it can never correct out of.
So, people have been looking
at how to fix this problem.
One of the ways is scheduled sampling.
And I think you guys
have looked at this or
at least talked about
this in the last lecture.
But what it does is, instead of feeding the model the ground truth at every time step, you feed in the predictions of the model itself, or you sample from the model. So what you're learning is: during test time I'm going to screw up and feed in something wrong, so at training time let me do the same, so that I'm resilient to that kind of mistake. What that does is actually change your model so that it respects long-range structure much better.
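A minimal sketch of scheduled sampling at a single decoder time step (names are illustrative; the sampling probability is typically annealed upward over training):

```python
import numpy as np

def choose_next_input(ground_truth_token, model_probs, sampling_prob, rng):
    # With probability sampling_prob, feed the decoder a token drawn from its
    # own predictive distribution (what it would see at test time); otherwise
    # feed the ground-truth token, as in ordinary teacher forcing.
    if rng.random() < sampling_prob:
        return rng.choice(len(model_probs), p=model_probs)
    return ground_truth_token

# Hypothetical usage inside the training loop, once per time step:
# rng = np.random.default_rng(0)
# next_input = choose_next_input(y_true[t], softmax_probs[t], p_sample, rng)
```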
There are also other methods, based on reinforcement learning, that optimize the word error rate directly. I won't talk about these other than to say it's another way of letting your model roll forward, generate a bunch of different candidates, and then compute an error based on the candidates you've generated.
A very cool paper is this one,
called Sequence-to-Sequence
as Beam Search Optimization.
This one also rolls the model forward, but with some interesting tricks: it runs the model forward until it has made some mistakes, then stops and does it again.
So finally, what can we do with this model that we couldn't have done before? There are some things you can do, one of which is the multi-speaker, multi-channel setup that people don't really do yet. So, flashback to some years ago: the motivating problem for speech recognition was this thing called the cocktail party problem. You wanted to be in a room, walking around, listening to a bunch of things happening, and get your recognizer to produce the entire output. Somewhere along the way, people forgot about that particular problem, and they've been focused on single-microphone setups where you're stationary with respect to another speaker and you produce only one output as a transcript.
I think the reason this happened is because traditionally you have a generative model, which has some transcript in mind and then generatively describes the data. And that's not a very natural way to handle the inverse problem, where a bunch of people can be mixed together in many different ways. The inverse problem, just tracking one particular individual, is much easier than trying to invert a generative model where you have a bunch of sources. So a model such as this sequence-to-sequence model should work very well for a multi-speaker, multi-channel setup.
And then there's this really cool paper that came out recently, which does direct translation and transcription at the same time. It takes in audio in French and produces English text as output with just one sequence-to-sequence model, which blends the speech recognition model and a translation model together, which is quite exciting.
If you look at their paper, it has this really nice attention plot. If you take the neural machine translation attention model, they're translating "how much is the breakfast" to "combien coûte le petit déjeuner", which in French is literally "how much does the little breakfast cost", or whatever. If you look at the attention, it's looking at the right words: "how" pairs with "combien", which means "how much" in French, and "much" pairs with it as well. If you look at the corresponding attention model between the waveform and the text, it also focuses on the same sort of parts, even though it was trained in a different way, with a different modality as input. So it's really cool that it's able to learn to focus on the right parts of the audio, even though it's translating at the same time.
And so, I'd like to end here. There are some acknowledgements. Most of this work was done at Google Brain, at least the parts I was involved in, and I've had the luck to work with some phenomenal interns. Google Brain gets a fantastic bunch of students coming through, and I've been lucky to work with them. The Google Brain team is a phenomenal place to work, so I want to thank them.
>> [APPLAUSE]
