[MUSIC]
Stanford University.
>> Network, there's actually a whole
class, I think next quarter,
on just networks for computer vision.
Where they've really
changed the entire field.
In NLP, they've had some impact but
not as much,
which is why we don't have
the entire lecture on just CNNs.
But they are an interesting model family.
They are parallelizable, they're very good on GPUs, and so we'll look into them in detail today, and hopefully by the end you'll understand why they're so useful and fast to implement on GPUs. But I really also want to give you at least some intuition; to be honest, there's much less intuition behind some of these very advanced CNN architectures compared to even some of the recurrent networks and LSTMs that we had.
So we'll actually start today with
a mini tutorial of Azure and GPUs,
we wanna encourage you all to really get
started on that as soon as possible.
Also, thanks everybody for filling out the survey. I think this is one of the important takeaway messages, which is: overall, what do you think of the pace?
We're very happy to see that the majority
are quite happy with the pace.
It's kind of impossible with such
a large class to not be too fast and
not too slow for 100% of everybody, since
people have vastly different backgrounds.
Very sorry for the little less than a third, I think, for whom it's too fast. I hope today will not be quite as fast.
And hopefully, in office hours and so
on we can make up for some of that.
So we'll talk about a couple
of different CNN variants.
We'll have a fun research highlight on
character-aware neural language models.
And then, we'll actually also look
a little bit into tips and tricks that
are slightly more practical, and you'll observe that the practical details and tricks for actually making this particular CNN architecture work are super important, and without them you really lose 10% or so of accuracy.
We'll also look a little critically at some of the evaluations that are going on in the field, and then I will compare a couple of different models, which will lead us to a very new model called the quasi-recurrent neural network, which just came out a couple of months ago.
With that, I'll do one organizational slide before we go on to Azure.
So project advice office hours, I would
really encourage everybody who’s doing
a project to now come to project
advice office hours every week.
I've asked groups that I'm mentoring personally to also come, as a sort of requirement. Not all the groups were able to come every week. I encourage you all to come.
I am keeping track of
whether you're there.
So also for everybody who basically is
still undecided whether they should
move on with their project, you'll
see kind of where PA4 folks should be
at in the next week or so, where you
have to have run some baselines on
your data set by now if you're
doing your final project.
If you don't even have your dataset ready yet and you can't even run a simple, let's say, bag-of-vectors kind of baseline, it's starting to be really worrisome, so definitely make sure you start running your experiments. Some simple baselines could just be a logistic regression. You can download some code somewhere, but you need to make sure you have your dataset ready. Otherwise, it'll be too late.
And for PA4 folks, we actually enforce that with a little additional deadline, just to make sure you're really all going to be able to run it. Cuz this is not one of those things where you can cram really hard, work 10x as hard, and make 10x the progress, because your experiments will take a day to run. So you run for one day, it turns out at the end you had a bug, and then the deadline was there and you have nothing.
So it happens every year.
And we really want to make sure
it doesn't happen this year even
though we're a bigger class.
So we'll talk about that soon.
Also, in terms of positive motivation, there's actually going to be a really awesome poster session that we're putting together. We have corporate sponsors that give us some money, and that will allow us to basically give out prizes for you.
We'll make it public, so
hopefully a lot of folks will show up and
check out your research.
There's a lot of excitement from various companies and VCs, so if you have a really awesome poster, who knows, at the end you may have some funding for your startup.
And we'll have food also, very nice
catering, so should be really fun poster
session, so hopefully you can be very
excited about that and your projects.
Yeah?
Will there be enough food for everybody?
>> [LAUGH]
>> It's a good question.
We'll spend thousands and thousands of dollars on food, so we hope there will be enough food for everybody.
Schein is organizing it, and she's nodding.
Yes.
Any other organizational questions around the posters or the projects?
All right,
then take it away on the GPU side.
>> [INAUDIBLE].
>> Nope, you're good.
>> All right, everyone.
This is just intended to be a short
public service announcement basically
about how to get started with Azure and
why you should get started with Azure.
By now, every team should have received
an email to at least one of your team
members, probably to your Stanford, one of
your Stanford emails, and you'll have this
message which is basically an invitation
to join our CS224N subscription.
And by following the instructions in this email, you should sign up for GPU access.
So far only 161 people, or rather teams, have signed up out of the 311, and essentially we want this number to increase, because everyone should be using GPUs, for reasons that we'll cover very shortly.
And if you have any issues signing up, then please report the problems that you have to Piazza post 1830, which has the form, also screenshotted there, and we'll help you, essentially, through any of the problems that you have with your subscriptions. Cool.
So then, the more important question that we're gonna go over is why you should really care about GPUs.
Well, first, yesterday we actually
announced the milestone for
the final project and the homework.
It's intended to be something
very quick and easy,
just a paragraph of what you've done.
But we expect you to have at least experimented with running some code on a GPU by then, and this will be worth essentially 2% of your final grade, just based on whether you do it or not.
But really, the better reason why you should be using GPUs is that GPUs will train your models much, much faster over a much, much larger data set. And specifically, Microsoft has offered us, I think, 311 NV6 instances on their Azure cloud.
These use Tesla M60 GPUs, if you're interested in the model.
The specifications are,
they have a huge number of CUDA cores,
a huge amount of graphics memory, and
they cost a huge amount of money each.
You also get a nice CPU,
as well as a lot of system memory,
to go along with your instance. And the key takeaway here is that this is not your average hardware that you have in your local machine. There's gonna be way more power, in terms of the CPU and in terms of the GPU, than even whatever gaming hardware you have at home.
And the speed-ups will be 10 to 20x, maybe even 100x, depending on the libraries that you're running.
So in conclusion, please do get started
on Azure as soon as possible, fill out
the form if you run into subscription
issues, come to office hours or
file support tickets if you have technical
problems such as not being able to etc.
And then, also see our step-by-step guide
to just get started with the process.
For Homework 4, the full assignment handout will go over essentially all the details.
But decent models will
take a long time to train.
They'll take one hour plus per epoch,
even on a strong GPU,
such as the previously described ones.
If you don't train on a GPU, you'll be spending a week, basically, just to train a baseline model.
And for the final project, if you have enough data or your problem is sufficiently challenging, you really do want to use a GPU if you want to receive a good score in this class.
And that would be all.
>> Cool.
Thanks, James.
And by decent model he also means decent implementation, so if your implementation isn't super well-optimized, it will take you even longer.
So again, not something you can cram on
in the last couple of days of the class.
All right, any questions about Azure or
[INAUDIBLE]?
What if we're not in the same group
between homework three and four?
So recurrent neural networks were pretty awesome, and are pretty awesome actually, and a lot of times the default model, but they have some issues. Namely, they can't really capture phrases in isolation. They can really only capture a phrase given its left-side context.
So what do we mean by this?
If I want to have just a representation of 'my birth' in this whole sentence, well, a recurrent network will always go from left to right. And so, that phrase vector up there isn't going to just capture 'my birth', it will also capture 'the country of'.
And so, sometimes when you have simple classification problems, you might actually just want to identify that there's a certain word or phrase somewhere in the overall document, and just pass the fact that that phrase exists in your overall document up to the final classifier that actually needs to classify this.
But here,
you will always go from left to right or
even if you have a bidirectional one,
you go from right to left.
But then you have the same problem,
but on the other side.
Namely, the intermediate,
the words in the center of
a longer document might get lost.
You really have to keep track
of them through every iteration.
And, of course, LSTMs are better at doing that; they're better able to say, don't turn on the forget gate, keep some things around, keep certain units on when you see something. But it requires a lot of the model to be able to do that perfectly.
And so, in many of the cases you will have your classifier only at the very end, once it has read the whole sentence, and that is an issue cuz now, again, the gradient has to flow through all of this. And despite all the [INAUDIBLE] and LSTM gates,
it's even hard for them to keep very
complex kinds of relationships alive
over many, many different time steps.
So that's one issue with RNNs
that CNNs are trying to resolve.
Now, the main idea here is: instead of computing a single vector representation at every time step that captures basically the context on the left so far, what if we could just compute a phrase vector for every single phrase that we have in this sentence?
So if we have here the phrase 'the country of my birth', we might compute, in the very first step of these kinds of convolutional networks, a vector for 'the country', just those two words in isolation. Then 'country of', 'of my', 'my birth', so basically compute a vector for all the bigrams in the sentence, and then another one maybe for all the trigrams, 'the country of', 'country of my', and so on. And then one for all the 4-grams, 'the country of my', 'country of my birth', and so on.
The hope is that if this was, for instance, sentiment classification, one of these phrases said 'not very good', for instance. And then, if we captured that vector, we would try to eventually push that vector all the way to a softmax, through some other transformations that I'll describe soon.
But that is basically the idea
of the very first layer
of a convolutional network for NLP.
And this will basically compute
these phrase vectors regardless of
whether that is a grammatical phrase.
So we know from parsing, for instance, that certain phrases like 'the country of' are not really proper noun phrases; it's sort of an odd, ungrammatical chunk, but this model really doesn't care about linguistic or cognitive plausibility in any kind of way for language.
And so, people don't read sentences
that way, but you might be
able to eventually compute several of
these representations in parallel.
And that's going to be a big advantage.
So once we compute all these vectors,
we'll group them after, but
we'll get to that in a second.
So you might ask,
what is convolution, anyway?
And so, here is a very simple definition
for any convolutional operator.
So let's look at the simplest
case of a 1d discrete convolution
of a filter over another function,
at a specific point in time.
You'll basically have a filter size,
here M, and
you'll basically multiply just a filter
at different locations of this input.
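To write down what was just described (a sketch of the standard definition, using M for the filter's half-width as discussed later): the 1D discrete convolution of an input f with a filter g, evaluated at time step n, is

(f * g)[n] = \sum_{m=-M}^{M} f[n - m] \, g[m]

so at each position n you multiply the filter values against the input values around n and sum them up.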
And so, in computer vision, that will
help us to extract very meaningful
features such as edges from an image and
eventually more complex features.
And for 2d example, which you'll observe
a lot in computer vision, we have this
really great animation here from the
Stanford Unsupervised Feature Learning and
Deep Learning wiki page.
So imagine you had an image
that you see here in green.
And that image, let's say, is only binary.
The first row of this image is 1, 1,
1, 0, 0 and the second row of pixels
of this binary image is 0, 1, 1, 1,
0 and so on, and you have a filter.
And this filter here has numbers that you'll see in the small red font here, and I'll turn the animation off for a second so we can look at it without it moving.
Now, the filter here is basically 1,
0, 1, 0, 1, 0, 1, 0, 1.
And at every time step of the convolution, we're going to multiply the numbers of the filter with the numbers of the image. We multiply, again, the red numbers from the filter with the image values; we basically multiply all of them and then we sum them up. So it's a very simple inner product: if we were to vectorize these three-by-three blocks into a nine-dimensional vector, we'd just have a simple inner product between those two vectors, or we just multiply them here element-wise and then sum them up. So one times one plus one times zero plus one times one and so on will sum to four.
And we'll basically move
this filter one time step,
one pixel at a time across the image.
So let's look again, this looks like
basically multiply all the numbers and
then sum them up.
And then, we'll move one down, and
again move from left to right.
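As a small illustrative sketch (not from the lecture; the last three image rows are my assumption based on the standard UFLDL animation), here is that sliding-window computation in NumPy, with the filter applied without flipping, as in the animation:

```python
import numpy as np

# Binary 5x5 image; the first two rows (1 1 1 0 0 and 0 1 1 1 0) are from the lecture,
# the remaining rows are assumed for illustration.
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])

# 3x3 filter with the "red numbers": 1 0 1 / 0 1 0 / 1 0 1.
filt = np.array([[1, 0, 1],
                 [0, 1, 0],
                 [1, 0, 1]])

# Slide the filter one pixel at a time and take an inner product at each position.
out = np.zeros((3, 3), dtype=int)
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * filt)

print(out[0, 0])  # 4, matching the example: 1*1 + 1*0 + 1*1 + ... = 4
```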
Any questions, yeah?
That's a great question.
What would be the equivalent of a pixel in NLP? And yes, you're exactly right, it will be a word vector.
Before I jump there,
are there any more questions about the
general definition of convolution, yeah?
How do we decide on the convolution?
So how do we decide what matrix it is?
The matrix of the convolutional filter here, these red numbers, is actually going to be learned. So you have an input and then you do backpropagation through a network; we'll get to it eventually, but it'll have the same kind of cross-entropy error that we have for all the other ones. It'll have a softmax, and we're going to basically backpropagate through this entire architecture, and then we'll actually update the weights, here in this particular example in red, after they started with some random initialization. And then we'll update them and they'll change.
And what's kind of interesting
in computer vision,
which I won't go into too many details in
this class, but in computer vision they
learn eventually to detect
certain edges in the first layer.
In the second layer they'll learn to
detect certain combinations of edges like
corners, and in the third layer they will learn to basically have a very high activation, these filters here, when they see more complex patterns like stripes and things like that.
And as you go higher up through
convolution networks and computer vision,
you can actually very nicely
visualize what's going on, and
you identify that, in, say, the fifth layer, some neurons actually fire when they see a combination of eyes and a nose and a mouth.
Sadly, for NLP we don't have any of that.
It's one of the reasons they're not quite as popular in NLP.
Yep.
Sure, so you have here m, your filter index. So in the 1D case, f and g are just single numbers at each position, and now you're going to move over f. So imagine this was just one dimension. And so you move from minus M to M, as in, for the nth time step, you're going to multiply the filter, g[m] here, with this function input, basically going one time step at a time, and you sum up the product between the two numbers at each step.
Does that make sense?
So you go from minus M, meaning you start ahead of n, M time steps away, and then you keep multiplying the numbers until you have the whole sum. And then you have your convolution at that discrete time step n.
>> [INAUDIBLE]
>> That's right, m is your window size.
And we'll go over the exact examples for
NLP in much more detail.
Yeah.
How do we figure out the window size? We'll actually have a bunch of window sizes, so maybe this is a good segue to talk about the actual model that we'll use for NLP. So this is going to be the first and most simple variant of a convolutional network for NLP. You can [INAUDIBLE] go to town, and towards the end I'll show you some examples of how we can embellish this architecture a lot more.
This one is based on a really seminal
paper by Collobert and Weston from 2011.
And then this very particular model, in its various details, came from Kim from just three years ago. The paper title is basically Convolutional Neural Networks for Sentence Classification.
All right, so as with every model out
there, whenever you wanna write down your
equations, no worries, we're not gonna
go into a lot more derivatives today.
Actually no derivatives,
cuz all the math is really similar
to math we've done before.
But it's really still important to
identify very clearly your notation.
So let's start.
As the question was correctly asked,
we'll actually start with word vectors.
So at every time step i, we will have a word vector x_i. And that will be, here for us now, a k-dimensional vector.
And then we'll represent the entire
sentence through a concatenation.
So we'll use this plus and
circle symbol, for concatenating
the vectors of all the words.
And so we'll describe the entire sentence,
which we'll have for
our definition here,
n-many words to be X from one to n.
And that will be the concatenation
of the first to the nth word vector.
Yeah.
Great question,
are word vectors concatenated length-wise?
Yes.
For now we'll assume they're
all concatenated as a long row.
All right now we'll introduce
this additional notation here, so
we don't just go from one to n, but
we might actually want to extract specific
words in the range,
from time step i to time step i plus j,
or in general some other
number of time steps.
So if I have, for instance x two to four,
then I'll take the second, the third,
and the fourth word vector, and
I just have a long vector with just those
three word vectors concatenated together.
I'll let that sink in,
cuz it's all very simple but
we just need to make sure we
keep track of the notation.
So in general, our convolutional
filter here will be a vector w
of parameters, that we're going
to learn with our standard
stochastic gradient descent-type
optimization methods.
And we'll define this
convolutional filter here,
in terms of its window size and
of course the word vector size.
So it has h times k dimensions, so this is just a vector, it's not an h-by-k matrix. There's no matrix product between the two.
But let's say we want to have
a convolutional filter that at
each time step, looks at three different
word vectors and tries to combine
them into a single number, or
some kind of feature representation.
What we'll then do is,
basically have a three
times number of dimensions
of each word vector filter.
So I have a very simple example here.
Let's say we have two dimensional
word vectors, of course just for
illustration they'll usually
be 50 dimensional or so.
Let's say we have two
dimensional word vectors, and
we look at three different words in
concatenation at each time step,
we'll basically have
a six dimensional w here.
All right.
So now how do we actually compute
anything and why is it a neural network?
We'll have some non-linearity
here eventually.
Okay, but before we get there, let's look again: we have our convolutional filter, which looks at h words at each time step.
And again note that w here is
just a single vector; just as
our word vectors are also
concatenated into a single vector.
And now, in order to compute a feature at one time step, what we're going to do is basically just take an inner product of this w vector of parameters with the word vectors from the i-th time step up through i plus the window size minus one. So in this case here, for c one for instance, we'll have w times x one to three, so words one, two, three. We have h equals three here, so one plus three minus one gives three. So we basically just take the inner product with the concatenation of those word vectors: a simple element-wise multiplication and then a sum over the elements of these vectors. Then usually, we'll have our standard bias term, and we'll add a non-linearity at the end.
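Putting the notation together (a sketch consistent with the Kim 2014 formulation): with word vectors x_i in R^k, the concatenated window x_{i:i+h-1}, a filter w in R^{hk}, a bias b, and a nonlinearity f, each feature is

c_i = f\left(w^\top x_{i:i+h-1} + b\right)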
Any questions about that? That's a great question. So, as you do this, the question is: don't the words in the middle appear more often? So here, I actually show this example, and I actually have an animation, so you are jumping a little bit ahead. So what happens, for instance, at the very end here? And the answer will be that we just have zeros there at the end.
We'll actually call this a narrow convolution, whereas you can actually have wide convolutions, which we'll get to later. But yes, you're right, the center words will appear more often, but really the filters can adapt to that, because you learn sort of how much you want to care about any particular input position in the filter.
Okay, so
let's define this more carefully so
we can think through the whole process,
yeah?
So the question is,
rephrase it a little bit, what happens
when we have different length sentences?
And there will actually be in two
slides a very clever answer to that.
Which is at some point we'll
add a pooling operator,
which will just look at the maximum
value across everything.
We'll get to that in a second.
And it turns out the length
of the sentence doesn't matter that
much once we do some clever pooling.
How does the size of the filter affect the learning? Actually quite significantly. One, the longer your filter is, the more computation you have to do and the longer context you can capture. So for instance, if you just had a window size of one, the filter would just multiply a matrix with every word vector, and you wouldn't gain much, because it would just transform all the word vectors, and you may as well store the transformed word vectors.
As you go to longer filters, you'll actually be able to capture more phrases, but now you're also more likely to overfit your model. So the size of your filter will actually be a hyperparameter, and there are some tricks, namely having multiple filters of multiple lengths, which we'll get to in a second, too, that will allow you to get around that.
Alright, so let's say again here, we have our sentence, and now we have all these possible windows of length h, starting at the first word vector, and so on. And now what that means is, since we do this computation here at every time step, we'll basically have what we call a feature map. And we will write this here as a vector of lots of these different c values. And again, each c value was just taking that same w and computing inner products with a bunch of the different windows at each time step. Now, this c vector is going to be a pretty long, n-h+1 dimensional vector.
And it's actually going to
be of different length,
depending on how many words we have.
Which is a little odd, right?
Because in the end, if we want to plug it into a softmax classifier, we would want to have a fixed-dimensional vector. But intuitively here, we'll just, again, multiply our w with each of these concatenations, and move along.
Turns out we'll zero pad.
And if you now think carefully, you'll
actually realize, well, I kind of cheated
because really that's what we really
should've done also on the left side.
So on the left side we will actually
also zero pad the sentence.
So we do exactly the same at the beginning and at the end of the sentence.
All right, now, because we have a variable
length vector at this point, and we want
to have eventually a fixed dimensional
feature vector that represents that whole
sentence, what we'll now do is introduce a
new type of building block that we haven't
really looked at that much before, namely,
a pooling operator or pooling layer.
And in particular, what we'll use
here is a so-called max-over-time or
max pooling layer.
And it's a very simple idea,
namely that we're going to capture
the most important activation.
So as you have different elements computed for every window, you have the hope that the inner product would be particularly large for that filter if it sees a certain kind of phrase, all right?
So, namely, if you have, let's say your
word vectors are relatively normalized,
if you do an inner product,
you would want to have a very large cosine
similarity between the filter and the
certain pattern that you're looking for.
And that one filter would only be
good at picking up that pattern.
So for instance, you might hope all your positive words are in one part of the vector space, and now you have, sorry, a filter of length two that looks at bigrams, and you ideally want that filter to be very good and have a very large inner product with all the words that are positive.
And that would then be captured by having
one of these numbers be very large.
And so what this intuitively allows you to do is, as you move over the sentence and then in the end max pool, if you just have one word pair, one bigram, that has a very large activation for that particular filter w, you will basically get that as your c hat here. And it can ignore all the rest of the sentence. It's just going to be able to pick out one particular bigram very, very accurately, or a type of bigram.
And because word vectors cluster and
where similar kinds of words
have similar kinds of meaning,
you might hope that all the positive words
will activate a similar kind of filter.
Now the problem with this is, of course,
that that is just a single number, right?
C hat is just the maximum here of all the elements in this vector, so it would just be five. So that could be one activation. If we use a ReLU nonlinearity here, this will just be a single number. So c hat is just that.
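So, written out for one filter, the feature map and the max-over-time pooling are simply

c = [c_1, c_2, \ldots, c_{n-h+1}], \qquad \hat{c} = \max_i c_i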
Now of course, we want to be able to do
more than just find one particular type of
bigram or trigram, we want to have many
more features that we can extract.
And that's why we're going
to have multiple filters w.
So instead of just convolving a single filter w, we'll convolve multiple of them. And as we train this model, eventually we hope that some of the w filters will fire, be very active, and have very large inner products with particular types of bigrams or trigrams, or even 4-grams. So it's also very useful to have some filters that only pick out bigrams, and you can actually get quite far with that.
But then maybe you have some examples where you say, for sentiment again, a very simple example, it's 'not very good', or 'risks missing much originality', and now you want to have trigrams and filters of length k times 3. And so, we can have multiple different window sizes, and at the end, each time we convolve that filter and do all these inner products at each time step, we'll basically max pool to get a single number for that filter for that sentence.
If we have different filters of different lengths, how do we make sure they learn different features? Of the same length or different lengths? Yeah. Of the same length, how do we make sure they learn different features?
Well, they all start at different
random initializations, so
that helps to break up some symmetry.
And then actually we don't have
to do anything in particular
to make sure that happens,
it actually just happens.
So as we do SGD,
from the random initializations,
different filters will move and start
to pick up different patterns in order
to maximize our overall
objective function.
Which we'll get to,
it'll just be logistic regression.
They would probably still
learn different values, yeah.
So in the beginning, well, if they're exactly the same, basically, as you pool, right, you will eventually pick, during backpropagation, the max value here. The max value will come, eventually, from a specific filter. And if they have the exact same values, one, that would basically never happen. But two, if it did, they would have the exact same value, and then your computer will have to choose, randomly, one to be the max. And if they're just the same, whatever, it'll pick one and then it'll backpropagate through that particular filter. And then they're also going to be different in the next iteration of your optimization algorithm.
Yeah?
>> Is there a reason why
we do the max [INAUDIBLE]?
>> Is there a reason why we do the max?
So in theory nothing would prevent us from using the min too. Though in many cases we use rectified linear units, which will be max(0, x). And so max pooling makes a lot more sense, cuz the min will often just be 0. And so, with ReLUs together, it makes the most sense to use the max pooling layer also.
Could we use average pooling?
It's actually not totally crazy,
there are different papers that explore
different pooling schemes and there's no
sort of beautiful mathematical reason
of why one should work better but
intuitively what you're trying to do here is really just fire when you see a specific type of n-gram. And when you see that particular type of n-gram, cuz that filter fired very strongly for it, then you wanna say this happened.
And you want to give that signal
to the next higher layer.
And so that is particularly easy if you
choose a specific single value versus
averaging, where you kind of
conglomerate everything again.
And the strong signal that you may get
from one particular unigram, or bigram, or
trigram, might get washed
out in the average.
Great question, so once we have a bunch of
different c hats from each of the filters,
how do we combine them?
And the answer will be,
we'll just concatenate all them.
We'll get to that in a second.
Yeah, so the main idea is once you do
max pooling one of the values will
be the maximum and then all of the other
ones will basically have 0 gradients cuz
they don't change the layer above,
and then you just flow your gradients
through the maximum value
that triggered that filter.
So the question is, doesn't that make
our initialization very important, and
lead to lots of downstream problems?
And the answer is yes. So likewise if you, for instance, initialize all your filter weights such that your rectified linear units all return zero, then you're not gonna learn anything. So you have to initialize your weights such that, in the beginning, most of your units are active and something will actually happen. And then the main reason why it doesn't hurt a ton to have these different random initializations is that you have lots of filters. And each filter can start to pick up different kinds of signals during the optimization.
But, in general, yes,
these models are highly non-convex and
if you initialize them incorrectly,
they won’t learn anything.
But we have relatively
stable initialization
schemes at this point that
just work in most cases.
Great questions, all right I like it.
All right, so we basically now have, we're almost at, the final model. But there's another idea here, and it combines what we've learned about word vectors, but extends it a little bit.
And namely, instead of representing the
sentence only as a single concatenation
of all the word vectors, we'll actually
start with two copies of that.
And then we're going to backpropagate into
one of these two channels and
not into the other.
So why do we do this?
Remember we had this lecture where
I talked about the television and
the telly, and
as you back-propagate into word vectors,
they start to move away from their
Glove or word2vec initialization.
So again, just quick recap,
word vectors are really great.
We can train them on a very large unsupervised corpus so they capture semantic similarities. Now, if you start backpropagating your specific task into the word vectors, they will start to move around whenever you see that word vector in your supervised classification problem, in that dataset.
Now what that means is as you
push certain vectors that you see
in your training data sets somewhere else,
the vectors that you don't see in your
training data set stay where they are and
now might get misclassified if
they only appear in the test set.
So by having these two channels, we'll basically try to get some of the goodness of really training the first copy of the word vectors to be really good on that task, but have the second set of word vectors stay where they are, with the nice, general semantic similarities in vector space that we get from large, unsupervised word vectors. And in this case here, both of these channels are actually going to be added into each of the c_i's before we max-pool, so we will pool over both of those channels.
Now, the final model, and this is the simplest one, I'll get to fancier ones in a second, is basically just concatenating all these c hats. So remember, each c hat was one max-pooled filter output, and we have in this case here, say, m many filters. And so our final feature vector z for that sentence is just an m-dimensional vector, where we have m many different filters that we convolved over the sentence. And then we'll just plug that z directly into the softmax, and train this with our standard logistic regression cross-entropy error.
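Here is a minimal sketch of this single-layer CNN classifier in PyTorch; this is not code from the lecture, and the single-channel simplification, layer names, and default sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceCNN(nn.Module):
    def __init__(self, vocab_size, k=300, window_sizes=(3, 4, 5),
                 num_filters=100, num_classes=2, p_drop=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, k)           # word vectors x_i in R^k
        # One Conv1d per window size h; each holds `num_filters` filters w in R^{h*k}.
        self.convs = nn.ModuleList(
            [nn.Conv1d(k, num_filters, h) for h in window_sizes])
        self.dropout = nn.Dropout(p_drop)
        self.fc = nn.Linear(num_filters * len(window_sizes), num_classes)

    def forward(self, word_ids):                            # word_ids: (batch, n)
        x = self.embed(word_ids).transpose(1, 2)            # (batch, k, n)
        pooled = []
        for conv in self.convs:
            c = F.relu(conv(x))                             # feature map: (batch, filters, n-h+1)
            c_hat = c.max(dim=2).values                     # max-over-time pooling
            pooled.append(c_hat)
        z = torch.cat(pooled, dim=1)                        # concatenate all the c hats
        return self.fc(self.dropout(z))                     # logits for the softmax / cross-entropy
```

Note that PyTorch's Dropout uses inverted scaling at training time, so no explicit rescaling of the softmax weights is needed at test time, which is a slight difference from the scheme described later in this lecture.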
All right, we had a question?
By having two copies of the word vectors, are we essentially doubling the size?
Well, we're certainly doubling
the memory requirements of that model.
And you could think of it as doubling the size of the word vectors, and then the important part is that you're only going to backpropagate into the second half of the word vectors for that task.
That's right, we can use the same
convolutional weights, or
you can also use different convolutional
weights, and then filter, and
you can have multiple, and this model
will have many of them actually.
It could have 100 bigram filters, 100 trigram filters, maybe some 4-gram filters, and maybe even some unigram filters.
So you can have a lot of different hyper
parameters on these kinds of models.
So quickly.
For a given sentence does
the convolutional matrix stay the same?
So this matrix is the only matrix that we have. This is just our standard softmax matrix, and then before that we had these w filters, these vectors.
And yes each w is the same as you convolve
it over all the windows of one sentence.
So lots of inner products for
a bunch of concatenated word vectors,
and then you max pool, find the largest
value from all the n-grams.
And that's a CNN layer and
a pooling layer.
Now, here's a graphical description of that. Here, instead of concatenating them, this is just kind of simplified, so imagine here you have the same notation with n. We have n many words in that sentence, and each word has a k-dimensional feature vector, or word vector, associated with it. So these could be our GloVe or other word2vec initializations, and now this particular model here shows us two applications of a bigram filter and one of a trigram filter.
So here, this bigram filter looks at the concatenation of these two vectors and then combines them into a single number. And as you go through this, you'll basically get lots of different applications. And basically, for each of the filters, you'll get one long set of features, and then you'll get a single number after max pooling over all these activations from the [INAUDIBLE] n-gram positions.
So you see here, for instance, the bigram filter is this channel, and then we'll basically max pool over that. Again, notice how here they indeed use the same filter on the second word vector channel, the one we might backpropagate into. But they will all basically end up in here. So just, again, inner products plus bias and non-linearity, and then we'll max pool all those numbers into a single number up there.
And now, a different filter, namely this guy up there, the trigram filter, also convolves over that sentence and basically computes a bunch of different numbers here, which then get max pooled into a single number there.
Great question.
So do we always max pool over just the set of features that are all coming from the same filter? And the answer is, in this model we do, and it's the simplest model that actually works surprisingly well.
But there are going to be,
right after our quick research highlight,
a lot of modifications and
tweaks that we'll do.
If there are no more questions, let's do the research highlight, and then we'll get to how to tune that model.
>> Hello?
>> Yeah.
It's cool.
>> So hi, everyone, my name's Amani and
today I thought I would share with you
a very interesting paper called
Character-Aware Neural Language Models.
So on a high level as the title implies
the main goal of this paper is to come up
with a powerful and
robust language model that effectively
utilizes subword information.
So to frame this in a broader context,
most prior neural language models do not
really include the notion that words that
are structurally very similar should have
very similar representations in our model.
Additionally, many prior neural language
models suffered from a rare-word problem.
Where the issue is that if we don't really
see a word that often or at all in our
dataset then it becomes very hard to come
up with an accurate representation for
that word.
And this can be very problematic
in languages that have long tail
frequency distributions or in domains
where vocabulary is constantly changing.
So to address some of these problems,
the authors propose the following model,
where essentially we will read in
our inputs at the character level,
but then we will make our
predictions still at the word level.
So let's dive a little bit
deeper into the model and
see exactly what's happening here.
So the first thing we do
is that we take our input,
and we break it apart
into a set of characters.
Where for each character,
we associate it with an embedding
that we learned during training.
We then take a convolutional network, take its filters, and convolve them over the embeddings to produce a feature map.
And finally, we apply max pooling over
time, which intuitively is selecting out
the dominant n-grams or substrings
that were detected by the filters.
We then take the output of
the convolutional network, and
pipe it into a highway network.
Which we're going to use to essentially
model the interactions between various
n-grams.
And you can think of this layer as being
very similar to an LSTM memory cell,
where the idea is that we want to
transform part of our input, but
also keep around and
memorize some of the original information.
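For reference, a standard highway layer computes roughly

y = t \odot g(W_H x + b_H) + (1 - t) \odot x, \qquad t = \sigma(W_T x + b_T)

where t is the transform gate that decides how much of the input to transform versus carry through unchanged.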
We then take the output of the highway network and pipe it into a single time step of an LSTM, which is being trained to predict the next word in the sequence given the current inputs.
And the only thing different to note here
is that we're using hierarchical softmax
to make predictions due to
the very large output vocabulary.
So let's analyze some of the results.
So as we can see here from
the table on the right,
the model is able to obtain comparable
performance with state of the art
methods on the data set while utilizing
fewer parameters in the process.
What's also really remarkable is that it was able to outperform its word-level and morpheme-level counterparts across a variety of other morphologically rich languages, such as Arabic, Russian, Chinese, and French.
While using, again, fewer parameters
in the process because now
we don't have to have an embedding for
every single word in our vocabulary but
now only for
every single character that we use.
We can also look at some of the qualitative results to see what it is the model is exactly learning.
So in this table, what we have done is extract the intermediate representations of words at various levels of the network and then compute their nearest neighbors.
And what we find is that, after applying
the CNN, we are grouping together words
with strong sub-word similarity and
that after applying the highway network,
we are also now grouping together words
that have strong semantic similarities.
So now the word Richard is close to other first names.
We can also look and
see how it handles noisy words.
So in this case, the model is able
to effectively handle the word look
with a lot of O's in between,
which it has never seen before.
But it is now able to assign it
to reasonable nearest neighbors.
And on the plot on the right, what we see is that if we take the n-gram representations learned by the model and plot them with PCA, we see that it is able to isolate the ideas of suffixes, prefixes, and hyphenated words.
Which shows that, at its core, the model
really is learning something intuitive.
So in conclusion, I wanna sort of
highlight a few key takeaway points.
The first is that this paper shows that
it is possible to use inputs other than
word embeddings to obtain superlative
performance on language modeling.
While using fewer
parameters in the process.
Second, it demonstrates the effectiveness of CNNs in the domain of language modeling, and shows that, in this case, the CNN and highway network are able to extract rich semantic and orthographic information from the character-level inputs.
And finally, what's most important is that this paper is combining the ideas of language modeling, CNNs, LSTMs, hierarchical softmax, and embeddings all into one model.
Which shows that basically we can treat
the concepts that we've learned over
the course of the quarter
as building blocks.
And learn to compose them in
very interesting ways to produce
more powerful or more nuanced models.
And that is a very useful insight to have
as you approach some of your own projects,
not only in the class, but also beyond.
And with that, I would like to conclude
and thank you for your attention.
>> [APPLAUSE]
>> Character models are awesome.
Usually when you have a larger downstream
task, like question answering or
machine translation, they can give
you 2 to 5% boost in accuracy.
Sadly, when you run any kind of model over characters, think about it: you have a sentence or document with 500 words. Well, now you have a sequence of 500 times maybe 5 or 10 characters, so now you have a roughly 5,000-step time sequence.
And so when you train your
model with character levels,
think extra hard about how long it
will take you to run your experiments.
So it's kind of a very clear,
sort of accuracy versus time
tradeoff in many cases.
All right, so I mentioned the super
simple model where we really just
do a couple of inner products,
over a bunch of these filters,
find the max, and
then pipe all of that into the softmax.
Now, that by itself doesn't work to get you to state-of-the-art performance, so there are a bunch of tricks that were employed by Kim in 2014.
And I'm going to go through a couple
of them since they apply to a lot of
different kinds of models that
you might wanna try as well.
The first one is one that I think we've already covered, dropout, we did, right? But it's a really neat trick and you can apply it in lots of different contexts. And it's actually applied differently for convolutional networks and recurrent neural networks.
So it's good to look at
another application here for
this particular convolution
neural network.
So just to recap, the idea was to essentially randomly mask, or drop out, or set to 0 some of the features that you have in your final feature vector. And in our case, that was z; remember, z was just a concatenation of the max-pooled filter outputs.
And another way of saying that is that we're going to create a mask vector r of basically random Bernoulli-distributed variables, which with probability p are set to 1, and with probability 1 minus p are set to 0.
And so what this ends up doing
is to essentially delete
certain features at training time.
So as you go through all your filters, you might actually have had a great bigram and another good bigram, and it might accidentally or randomly delete one of the two bigrams. And what that essentially helps us do is to have the final classifier not overfit to say, for instance, it's only positive if I see these exact two bigrams together. Maybe it's also positive if I see just one of the bigrams.
So another way of saying that is
that it will prevent co-adaptation
of these different kinds of features.
And it's a very, very useful thing.
Basically every state-of-the-art
model out there that you'll observe,
hopefully somewhere in its experimental
section it will tell you how much it
dropped out of weights and
what exactly the scheme of dropout was.
Cuz you can dropout, for instance,
through recurrent neural network,
you can dropout the same set of
features at every time step or
different sets of features
at every time step.
And it all makes a big
difference actually.
So this is a great paper by Geoff Hinton
and a bunch of collaborators from 2012.
Now, if you carefully think through what happens here, well, at training time, let's say probability p is 0.5. So half of all the features are randomly getting deleted at training time. Well then, the model is going to get used to seeing a feature vector z with a much smaller norm, or a smaller product here of W times z. And so, basically at test time, when there's no dropout, of course we don't want to delete any features; we want to use all the information that we have from the sentence, so our feature vector z is going to be too large. And so what we'll do is, in this case here, we'll actually scale the final weights by the Bernoulli probability p. So our Ws here, the softmax weights, are just going to be multiplied by p, and essentially halved. And that way, we'll end up in the same order of magnitude as we did at training time.
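In symbols, a sketch of the scheme just described: at training time the features are masked,

z_{\text{drop}} = r \circ z, \qquad r_i \sim \text{Bernoulli}(p),

and at test time no features are dropped, but the softmax weights are rescaled as \hat{W} = p\,W so that W z stays on the same order of magnitude.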
Any questions about dropout?
What's the intuition?
So some people liken dropout to ensembling models. And intuitively here you could have, let's say, deterministically dropped out the first half of all your filters, trained one model on just the first half of the filters, and trained a second model on the second half of the filters, and then in the end averaged the two. Dropout is kind of similar to that, but a very noisy variant of what you end up doing.
And so, in many cases this can give
you like 2%-4% improved accuracy.
And when we look at the numbers, you'll notice that it's those 2%-4% that get your paper published and people looking at your method. Whereas if it's 4% below, it's getting closer and closer to a very simple bag-of-words model with discrete counts.
Is it possible to dropout
the link instead of the node?
So you could actually dropout some
of the weight features as well.
And yes, there is actually
another variant of dropout.
There's the filter weight dropout and
there is the activation dropout.
So in this case here we
have activation dropout.
And they have different advantages and
disadvantages.
I think it's fair to say that,
especially for NLP,
the jury is still out on which
one should you always use.
I think the default, and the original dropout, is just to randomly set the activations to 0, and not the filter weights.
All right, now, one last question.
So basically,
this will have a certain norm.
And at training time, the norm of this is essentially, say, halved if you have a probability p of multiplying the features by zero. And so what that means is that, overall, this matrix-vector product will have a certain size, and a certain certainty also, once you apply the softmax. And if you don't wanna basically be overly confident in anything, you wanna scale your W, because at test time you will not drop out anything. You will have the full z vector, not half of all the values of the z vector.
And so at test time you wanna use as much information as you can get from z.
And because of that,
you now have a larger norm for z.
And hence,
you're going to scale back W, so
that the multiplication of the two
ends up in roughly the same place.
Very good question, so
what's the softmax here?
So, basically z was our vector for
some kind of sentence.
And I use the example sentiment because
that is one of the many tasks that
you could do with this.
So, generally sentence classification,
or document classification,
are the sort of most common task
that you would use this model for.
We'll go over a bunch of examples in three
slides or so and a bunch of data sets.
Awesome, so
now there's one last regularization trick
that this paper by Kim used in 2014.
It's actually not one that
I've seen anywhere else.
And so I don't think we'll have to
spend too much time on it but they
essentially also constrain the l2 norm of
the wave vectors of each of the classes.
So we have here, remember this is
our softmax weight matrix W and
c dot was the row for the cth class.
And they basically have this additional
scheme here where whenever the norm
of one of the rows for one of these
classes is above a certain threshold, S.
Which is another
hyperparameter they'll select,
it will rescale it to be exactly S.
So basically they'll force the model
to never be too certain and
have very large weights for
any particular class.
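A minimal NumPy sketch of that rescaling rule (variable names are my own, not from the paper):

```python
import numpy as np

def clip_row_norms(W, s=3.0):
    """After a gradient step, rescale any softmax row whose L2 norm exceeds s."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)   # one norm per class row
    scale = np.where(norms > s, s / norms, 1.0)        # shrink only rows above the threshold
    return W * scale
```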
Now it's a little weird cuz in general
we have l2 regularization on all
the parameters anyway.
But they saw a minor improvement.
It's actually the only paper that I can
remember in recent years that does that,
so I wouldn't overfit
too much on trying that.
Now, it's important to step back and think carefully.
I described this model and I described it
very carefully but when you think about
it, you now have a lot of different
kinds of tweaks and hyperparameters.
And you have to be very conscious in all
your projects and every application in
industry and research and everywhere of
what your hyperparameters really are.
And which ones actually matter to your
final performance, how much they matter.
And in an ideal world,
you'll actually run an ablation where
maybe you have these two word vectors.
The ones you back propagate into and then
the ones you don't back propagate into.
How much does that actually help?
Sadly, in very few examples,
people actually properly ablate,
and properly show you all
the experiments they ran.
And so let's go over the options and the
final hyperparameters that Kim chose for
this particular convolutional
neural network model.
The one amazing thing is they actually
had the same set of hyperparameters for
a lot of the different experiments.
A lot of these are sentiment analysis and subjectivity classification, as are most of the experiments here. But they had the same set of hyperparameters, which is not very common.
Sometimes you also say, all right,
here are all my options.
And now, for every one of my datasets,
I will run cross-validation over
all the potential hyperparameters.
Which, if you think about it,
is exponential, so it would be too many.
So then, the right thing to often do
is actually to set boundaries for
all your hyperparameters.
And then, randomly just sample
in between those boundaries.
So for instance, let's say you might
have 100 potential feature maps for
each filter, for each window size.
Now you say, all right,
maybe I'll have between 20 and 200.
And you just say, for each of my cross-validation experiments, I will randomly sample a number between 20 and 200. Then, I'll run experiments with this number of filters on my development split and I'll see how well I do. And let's say I have maybe 100 experiments of this kind.
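A toy sketch of that random-search loop; the ranges and the train_and_eval function are placeholders, not from the lecture:

```python
import random

best_config, best_acc = None, -1.0
for trial in range(100):                                   # e.g. 100 experiments
    config = {
        "num_filters": random.randint(20, 200),            # between 20 and 200 feature maps
        "dropout": random.uniform(0.3, 0.7),
        "lr": 10 ** random.uniform(-4, -2),
        "window_sizes": random.choice([(3, 4, 5), (2, 3, 4)]),
    }
    dev_acc = train_and_eval(config)    # hypothetical helper: train, then score on the dev split
    if dev_acc > best_acc:
        best_config, best_acc = config, dev_acc
print(best_config, best_acc)
```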
You'll quickly notice that, again,
why you need to start your project early.
Cuz your performance will also depend
highly on your hyperparameters.
And if you don't have
time to cross-validate,
you may lose out on some of the accuracy.
And especially as you get closer to potentially state-of-the-art results, which I think some of the groups will, that last couple percent that you can tweak and squeeze out of your model with proper hyperparameter search can make the difference between having a paper submission and a lot of people being very excited about your model, or ignoring it, sadly.
Yep.
Great question.
Do you do that sampling for
one hyperparameter at a time or not?
In the end, in the limit,
it doesn't matter which scheme you use.
But in practice, you set the ranges for
all your hyperparameters and
then you sample all of them for
each of your runs.
And it's very,
very counterintuitive that that would
work better than even a grid search.
Where you say, all right, instead of having 100 feature maps or randomly sampling between 20 and 200, I'm gonna try 20, 50, 75, 100, 150, or 200, or something like that.
And then I just multiply all these six or
so
options with all the other
options that I have.
It quickly blows up to a very,
very large number.
Let's say for each of these you try five different options; that's five to the power of the number of hyperparameters that you have. Which, if you have ten hyperparameters or so, that's 5 to the 10, and it's impossible to run a proper grid search on all these hyperparameters.
It turns out, computationally and through a variety of different experiments, I think in papers by Yoshua Bengio and some of his students a couple years ago, that random hyperparameter search works surprisingly well, and sometimes even better than a relatively fine grid search.
Until you run out of money on your GPU.
Or until the paper
deadline comes around or
the class project deadline comes around,
yeah.
In the perfect setting,
you have the final model that you think
has all the ingredients that you'd want.
And then you can let it run
until you run out of resources.
Either time, or money, or GPU time, or you annoy all your co-PhD students, such as I did a couple years ago. [LAUGH] Fortunately, we learned at some point to have, what is it, preemptible jobs, so that I could use the entire cluster. But then, when somebody else wants to use a machine, it'll just put my job into the cache, onto memory, or even save it to disk, and anybody else can kinda sort of come in.
But yeah, ideally you’ll run with all
the computational resources you have.
And, of course, this depends on how much you care about that last bit of accuracy. For some of the papers it can really matter, and for some applications, if you work in medicine and you try to classify breast cancer or something really serious, of course you want to squeeze out as much performance as you possibly can, and use as many computational resources as you can to run more hyperparameter searches.
So there are actually some people who have tried various interesting Bayesian models, like Gaussian processes, to try to model the overall objective as a function of the hyperparameter space. So you basically run a meta-optimization, where instead of optimizing over the actual parameters w of your model, you run an optimization over the hyperparameters of those models. You can do that; it turns out the jury is sort of out, but a lot of people now say just do a random hyperparameter search. It's very surprising, but that is, I think, the current hypothesis for the best way to do it.
You can't. So the question is, how do we make sure the same set of hyperparameters ends up with the same results? They never do, and some people, this gets, we could talk a lot about this, this is kind of fun, this is like the secret sauce in some ways of deep learning, but some people also say, I'm going to run the same model with the same hyperparameters five times, and then I'm going to average and ensemble those five models, because they all end up in a different local optimum, and that ensembling can also often help.
So at the end of every project that you're
in, if you have 100 models that you've
trained, you could always take the top 5
models that you've had over the course of
your project, and ensemble that, and you'd
probably squeeze out another 1 or 2%.
But again, you probably don't have to go that far for your class projects. It's only if it really matters, in some application, medical applications or whatever, that you need to do that.
And in many cases, what you'll observe in papers and competitions is that people write, this is my best single model and this is my best ensemble model. And then with the best ensemble model you can claim state of the art, and the best single model might sometimes also be part of the ensemble, but sometimes you also have a more diverse set of models. So all right, last question about ensembling and crazy model tuning.
Great question.
Why does ensembling still work?
And shouldn't we just have
a better single model?
You're totally right.
There are various ML researchers who say,
I don't like ensembling at all.
We should just work harder
on better single models.
And dropout is actually one such idea you
can use there, and there are other optimization
ideas that try to incorporate that;
yeah, you're right.
In some ways, what that means is we still
don't have the perfect optimization
algorithms that properly explore the
energy landscape of our various models.
All right, so let's go over the extra
hyperparameters that they used here.
So basically we want to find all
these hyperparameters on the dev set.
Super important:
you get minus 10% if I see you run
all your hyperparameter optimization
on the final test set, because that means you're
now overfitting to your final test set.
It's one of the number-one
Machine Learning 101 rules:
never run all your hyperparameter
cross-validation on your final test set.
That's the one thing you want to run maybe
once or twice, it can ruin your entire
career if you do that and
you publish it, it's never worth it.
Don't do it.
All right, so on the development set,
sometimes called the dev set,
Kim here
tried various different nonlinearities and
in the end chose the rectified linear unit.
So that's actually very common,
and nowadays you almost don't
even have to try that.
You just use ReLU as the default.
He tried trigrams, 4-grams, and
5-grams for the various filter sizes.
Somewhat surprisingly,
no bigrams; that surprised me.
He had 100 different feature maps for
each of these sizes.
So 100 tri-gram filters, 100 4-gram
filters, and 100 5-gram filters.
So the more you have,
the more likely each of them can
capture different kinds of things.
So if you have, for instance, just
a simple sentiment classifier, you can
have 100 shots at trying to capture
various types of negation, for instance.
Then dropout, quite simply:
in the middle, half of the time
all the features are set to zero.
And for this funky
regularization trick he chose s equals three.
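To make those two tricks concrete, here is a minimal sketch, assuming PyTorch; my reading of the s = 3 trick is an l2 max-norm constraint on the weight rows, applied after each gradient step, so treat that detail as an assumption rather than the paper's exact code.

```python
import torch

def dropout_mask(features, p=0.5, training=True):
    """Inverted dropout: zero each feature with probability p during training, rescale the rest."""
    if not training:
        return features
    mask = (torch.rand_like(features) > p).float()
    return features * mask / (1.0 - p)

def maxnorm_renorm(weight, s=3.0):
    """Rescale any weight row whose l2 norm exceeds s back down to norm s."""
    with torch.no_grad():
        norms = weight.norm(dim=1, keepdim=True).clamp(min=1e-12)
        weight.mul_(norms.clamp(max=s) / norms)   # rows with norm <= s are left unchanged
```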
Somewhat surprisingly, your mini-batch
size will often also change
your performance significantly.
You don't usually want to have
gigantic mini-batch sizes for
most NLP models.
So here he had a mini-batch size of 50.
During mini-batch SGD training he used
word vectors pre-trained on a large corpus,
300-dimensional word vectors.
So that was a lot of hyperparameter
search that you can imagine was going on in
the background here, yeah.
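Putting those hyperparameters together, here is a rough sketch of the single-channel version of this architecture, assuming PyTorch; pretrained_vectors is a (vocab_size, 300) tensor of pre-trained embeddings, and the details are an approximation for illustration, not Kim's original code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, pretrained_vectors, num_classes,
                 filter_sizes=(3, 4, 5), num_filters=100, dropout=0.5):
        super().__init__()
        dim = pretrained_vectors.size(1)                              # 300-d word vectors
        self.embed = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)
        self.convs = nn.ModuleList(
            [nn.Conv1d(dim, num_filters, kernel_size=k) for k in filter_sizes])
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, token_ids):                    # token_ids: (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)    # (batch, 300, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values  # ReLU, then max-pool over time
                  for conv in self.convs]
        return self.out(self.dropout(torch.cat(pooled, dim=1)))  # logits for the softmax
```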
Wouldn't a higher mini batch size
guarantee that we have less noise
while training?
The answer is yes, but
you actually want the noise.
You have a very nonconvex
objective function here.
And, what dropout does in SGD is
to actually introduce noise so
that you're more likely to explore the
energy landscape, instead of just being
very certain about being stuck in one
of the many local optima that you have.
So, a lot of optimization tricks and
training tricks of neural networks
in the last couple years can
be described as adding noise
into the optimization process.
And now, this one here's also
super important during training,
how do you select your best model?
So one option is you just let it run, and
at the very end you take that last output.
But what you'll often observe
is a pattern like this.
Let's say you start training.
So these are your iterations, and this
might be your accuracy or your F1 score,
your BLEU score, or whatever metric
you're using for your model.
And now, you'll often observe something
like this, as you train over time.
And now, if you take just
the very last one, maybe here.
Maybe that wasn't as good as this
random spot where, as it did stochastic
gradient descent, it just happened to find
a really good spot on your dev set.
So what Kim does, and
what actually a lot of people do,
is during training you keep
checking the performance here.
Again this is your
development accuracy and
then you pick the one with
the highest accuracy and
you set those weights to be the weights
for that particular experiment.
And another side trick, just because
they're fun, you can also sample multiple
times, and then ensemble those weights or
average those weights.
And that sometimes also works better,
all right.
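Here is a minimal sketch of that model-selection loop, assuming PyTorch; train_one_step, evaluate, and the data objects are hypothetical helpers standing in for your own training code.

```python
best_dev_acc, best_state = float("-inf"), None
for step in range(num_steps):
    train_one_step(model, optimizer, next(train_batches))       # assumed training helper
    if step % eval_every == 0:
        dev_acc = evaluate(model, dev_data)                      # assumed dev-set evaluation
        if dev_acc > best_dev_acc:
            best_dev_acc = dev_acc
            best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}

model.load_state_dict(best_state)   # keep the best dev checkpoint, not the last iterate
```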
Still some black magic.
All right, so that was a lot of good
detail on how you tune your neural models.
So here now are some of the results.
Sadly, in this whole paper we basically
only have ablations for four of
the options that are really carefully
outlined, and for all the other hyperparameters
we don't know how important they were or
what their variance was.
Yeah?
You can do both.
So you can average the weights,
which is very counter-intuitive,
but also often works.
If you average the predictions,
now you have to keep around and
say you have an ensemble
of the four top models,
you have to keep around your model size
times four, which is very slow and
not that great, so it's less commonly
done, especially not in practice.
Sorry, can I define ablation?
So an ablation study is essentially
a study where you start with
your fancy new model that you
described in all its details.
And then you say: how much did each of
the components actually matter to the final
accuracy that I describe in my abstract?
So let's say we have a cool deep learning
model with five-layer LSTMs, and
some attention mechanisms,
and some other clever ways of
reversing the input in machine
translation, for instance, and so on.
And now you say, all right, overall
this model got me a score of 30 or 25.
Now, as a practitioner,
when I read that paper, and
even as a researcher, I want to know: well,
you mentioned five tricks,
which of the five were actually
the ones that got you to 25?
Yeah?
You want to know because you might
not want to use that entire model but
you want to use that one trick.
So in this case here for instance,
he has this regularization trick.
And he has the dropout.
How much did the dropout actually help,
versus this trick?
Which one should I now use
in my downstream task?
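To make the idea concrete, here is a minimal sketch of an ablation loop, assuming Python; the configuration keys and train_and_evaluate are hypothetical, not the paper's actual code.

```python
full = {"dropout": 0.5, "max_norm": 3.0, "pretrained_vectors": True}

ablations = {
    "full model":            full,
    "- dropout":             {**full, "dropout": 0.0},
    "- max-norm constraint": {**full, "max_norm": None},
    "- pretrained vectors":  {**full, "pretrained_vectors": False},
}

for name, config in ablations.items():
    dev_acc = train_and_evaluate(config)        # assumed: trains and returns dev accuracy
    print(f"{name}: {dev_acc:.3f}")
```

Each row removes exactly one component from the full model, so the drop in dev accuracy tells you how much that component actually contributed.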
In this paper here, and
I don't want to single him out,
this is really, sadly,
very common in the field.
Nobody gives you a nice plot for
every single hyperparameter that says: for
this hyperparameter, let's say
dropout p between zero and one,
this was my accuracy.
Yeah, so the ablation would say: I will
take out one particular modeling decision.
So let's say in his case, for instance,
the ablation here is: do we need
the multichannel setup with two sets of word
vectors, one you backprop into and one you don't?
Or do we only have a static channel,
which he calls static here, just keeping
the word vectors fixed from the beginning;
versus having only a single channel where
we backpropagate into the word vectors;
versus having both of
the word vector sets.
And this is the ablation he does here.
The ablation over: should we have
two sets of word vectors or not?
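A rough sketch of those setups, assuming PyTorch and a pre-trained (vocab_size, 300) tensor called pretrained_vectors; how the two channels then feed the convolutions is simplified here.

```python
import torch
import torch.nn as nn

# Static: pre-trained vectors, frozen (no backprop into them).
static = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)

# Non-static: a separate copy of the pre-trained vectors that is fine-tuned during training.
non_static = nn.Embedding.from_pretrained(pretrained_vectors.clone(), freeze=False)

# Random: no pre-training at all, learned from scratch (only sensible with lots of task data).
random_init = nn.Embedding(pretrained_vectors.size(0), pretrained_vectors.size(1))

def two_channel_embed(token_ids):
    # Multichannel: stack the frozen and fine-tuned copies as two input channels.
    return torch.stack([static(token_ids), non_static(token_ids)], dim=1)
```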
And as you can see here, well,
it actually, sometimes it buys you 0.7 and
here 0.9 or so, but
sometimes it also hurts.
So here having the two
channels actually hurt by 0.4.
These are relatively small data sets,
all of them, so
the variance is actually relatively high.
And the first, the simplest one,
is you actually just have random word
vectors, and you back-propagate into them,
and you just learn the word
vectors as part of your task.
So no pre-training of
word vectors whatsoever.
That's actually fine,
if you have a gigantic training dataset.
So if you do machine translation
on five gigabytes of,
don't do it for your project.
I hope we discouraged
everybody from trying that.
But if you do machine
translation on a very,
very large corpus, it turns out you
can just have random word vectors.
And you have so much data on that task,
that as you back propagate into them and
update them with SGD,
they will become very good as well.
That's totally right.
So, Arun correctly points out
that these are actually very
small datasets, and so there's very
little statistical significance
between most of these results here.
Maybe SST, 0.9.
I forgot the various statistical
significance thresholds for
the various datasets.
I think this one might be significant.
But certainly MPQA, for instance,
is a very small dataset,
so this 0.1 difference is not
statistically significant.
Great question, so
the question is, as you do this,
let's say you had your full data set.
You say, this is my training data.
This is my development split and
this is my final test split, and
we had properly randomized
them in some way.
Now, if I choose based on my development
split and this development accuracy here,
then this model is only trained on this.
And only in sort of some meta kind of way,
used that dev split.
Now, what you could also do, and
what you should always do if you have an
actual convex problem, which, sadly,
we don't have in this class very much:
you find your
best hyperparameter setting and
then you actually retrain on that
whole thing all the way until convergence.
If you have a convex problem,
that's great.
You know you have a global optimum, and
you probably have a better global optimum
because you used your
entire training data set.
Now, it turns out in many cases,
because there can be a lot of variance
in your training for these very
non-convex neural network models,
it helps a little bit to
just not train on that
dev part of your data set, and just
use it only to choose the highest point.
But yeah, it's largely because it's
a non-convex problem; great question.
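Here is a minimal sketch of the convex-case recipe for contrast, assuming hypothetical helpers; for the non-convex models in this class you would usually skip the retraining step and just keep the best dev checkpoint as described above.

```python
# Pick the hyperparameters that do best on the dev split...
best_config = max(candidate_configs,
                  key=lambda cfg: train_and_evaluate(cfg, train_data, dev_data))

# ...then, for a convex problem, retrain once on train + dev with that setting.
final_model = train_to_convergence(best_config, train_data + dev_data)

# Only now touch the test set, once.
test_score = evaluate(final_model, test_data)
```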
All right, we have four more minutes.
So, one of the problems with this
comparison here, was actually that
the dropout for instance gave it two
to four percent accuracy improvement.
And overall, and you'll see this
in a lot of deep learning papers,
they make claims about,
this is the better model.
Sadly, when you look at it, there are some
models here that they're comparing to
that came out before dropout was invented.
So we can be very certain that some
models from pre-2014 didn't use any of
the kinds of tricks like dropout, and hence
the comparisons are actually kind of flawed.
And sadly,
you'll observe this in most papers.
Very, very few people in the
community will go and re-run with the newest
and fanciest optimization tricks like
dropout, or add in better optimizers and
so on, and reimplement all the baseline
models of previous authors,
then run the same amount of cross-validation
on the second-best model and
other people's models, and
then have a really proper scientific study
to say: this is the actually better model;
versus: this model came out later, had the
benefit of a lot of optimization tricks,
and hence came out on top.
So you'll see that a lot,
and it's in some ways understandable
because it takes a long time to reproduce
ten other people's results and
then start tuning them.
But you have to take a lot of
these with a grain of salt,
because the optimization,
as we see here, makes a big difference.
So two to four percent; when you look
at even some of my old papers, four
percent is the difference between whether
this model is the better one or not.
Still, it is kind of a very cool
architecture, this convolutional network.
The fact that it can do so well overall
is something quite remarkable.
It's relatively simple, and
the nice thing is, with these filters,
each of the filters is
essentially independent, right?
We run max pooling over each
of the filters independently,
so each filter can be run
on one core of your GPU.
And so
despite having 300 different filters,
you can run all of those 300 in parallel,
maximally parallel.
And then very quickly it can
compute that one feature vector back there and
pipe it into the softmax.
So that is actually a huge
advantage of these kinds of models.
Now, we don't have that much time left, so
I'm not going to go into too many details,
but you can really go to town and
put together
lots of convolutions on top of pooling
layers in a variety of different ways.
We spent a lot of time trying to
gain intuitions about why this LSTM
gate has the effect it has.
I don't think we have a good chance of
going into the third CNN layer here and
having some intuition for why it's
this kind of layer versus another.
They really get quite unwieldy.
You can have various kinds of convolutions,
so
those are in some sense hyperparameters.
You can basically
zero-pad the outside, or
you can just not compute anything that would
require multiplying with the outside and
only run convolutions for
the inside.
Basically this is the narrow
versus the wide convolution.
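A quick sketch of that difference, assuming PyTorch: with a width-5 filter over 7 time steps, the narrow convolution gives 3 output positions, while zero-padding the outside gives the wide convolution with 11.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 300, 7)                              # batch 1, 300-d word vectors, 7 time steps

narrow = nn.Conv1d(300, 100, kernel_size=5)             # no padding: 7 - 5 + 1 = 3 positions
wide = nn.Conv1d(300, 100, kernel_size=5, padding=4)    # zero-pad both sides: 7 + 5 - 1 = 11 positions

print(narrow(x).shape)   # torch.Size([1, 100, 3])
print(wide(x).shape)     # torch.Size([1, 100, 11])
```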
You can eventually run the convolution
also, not over the time steps, but
in later layers over the feature maps.
So there are a lot of different options.
And at some point there's
no more intuition of
why you should do this in the third
layer of these text CNNs.
One of the most exciting applications
was actually to take such a CNN,
have various pooling operations,
and in the end,
take that as input to a recurrent
neural network for machine translation.
So this was one of the first deep learning
machine translation models from 2013 that
actually combined these
fast parallelizable CNNs,
with a recurrent neural network to do
the machine translation that we've seen.
So, we've essentially described this
model entirely in a lecture before,
but now we're replacing the encoder part:
instead of having an LSTM
here, we have a CNN here, and
we give that as an input to all the time
steps of the decoder part of the model.
Very cool model.
Now, I'll probably end on this slide, and
we'll maybe save
the quasi-recurrent neural network,
which combines the best of both recurrent and
convolutional models,
for another lecture. But basically, you
now know some of the most important and
most widely used models for
deep learning for NLP.
We have the bag of vectors,
which surprisingly works quite well when you
combine it with a couple of ReLU layers,
and which can actually, on some benchmarks,
even beat this convolutional network
that we just described.
So it's a very good baseline to run for
a variety of the different projects
that we're discussing.
We've discussed the window model
already where we have basically
a very clean model to classify
words in their context.
Now we know
the Convolutional Neural Networks and
we had a lot of variants of
Recurrent Neural Networks.
So, hopefully,
you now have most of the tools.
On Thursday, Chris will talk about
Recursive Neural Networks, or
tree-structured Recursive Neural Networks,
which are much more grammatically and
linguistically plausible, but
also have some downsides.
All right, thank you.
