[MUSIC]
Stanford University.
>> What an exciting ride it has been.
Before we tackle the limits of deep
learning for natural language processing.
Some organizational things first.
I know some of you are down to the wire.
So we wanna say first and foremost, sorry
for some of the craziness around PA4.
It's a very large class.
It's a very useful class for
a lot of your careers.
So it will be useful,
even if you lose a point
here and there.
It's a very useful class because it is so
cutting edge, with that cutting-edge
research vibe and new models.
And with this class size and
excitement, it is very hard to make
everything perfect the first time.
So thanks a lot for all your
feedback on the situation with PA4,
there is a lot of internal discussion
in the TA staff and between Chris and
me and we are trying to make it as fair as
possible and help you get off the ground.
The main thing that is, I think,
straightforward and
that everybody's happy about is that we'll
give you a short 33-hour extension for
assignment 4 as well as the final project.
So the new deadline that does not use
any late days is 9:00 AM on Sunday.
That's this upcoming Sunday.
And then the hard deadline that sadly we
cannot push any further because we'll have
to actually grade the almost 700 students'
projects, is 9:00 AM on Wednesday.
We have to submit the grades just a few
days later to let people graduate and
all that, so that is the hard deadline,
there's no extension.
Yes?
>> [INAUDIBLE]
>> How do you submit it?
The submission instructions
should be on the.
Ideally, if you do PA4,
you submit to CodaLab to get
the official number as well.
You must, not ideally.
Totally required.
It looks like there are at
least a dozen or two groups for
whom it would be ideal, and
hopefully they will get something.
So we'll go a little bit into
how to help those folks.
All right,
then don't forget the poster session.
It is now actually just slightly
before that final deadline.
But really at the poster session
we wanna be able to get a sense of
what your project is about.
Really the last nine hours or
so that you have of mental time
between the poster session and
the very final deadline, you should just
be spending on writing a nice report.
Editing, nicer looking plots and
things like that.
And maybe finishing that last
cross validation experiment
to tweak your performance by 1 or 2%.
So we don't expect too many
excuses at the poster session of the form,
"this is just a poster,
but in nine hours it will be much,
much better and different."
We will really be looking at that
poster as your main project output.
So the session itself is 5% of your grade,
the final PA4, and
the final project are 27% of your grade.
Any questions around the poster session?
Organizational things?
All right, so another update.
I'll get back to the poster session in a bit.
Another update on PA4.
First we thought, okay, there are
a couple of groups really struggling hard.
We'll give them some more helper code.
It's not really starter
code at this point anymore.
It's just helping you out.
Even the modifications to the starter
code were pretty minor.
Then, there was a huge backlash from
all the students who did put in
all the work to get to that
baseline model themselves, and
that backlash seemed to be larger than
the excitement from the other students.
And so again we're trying to
balance things out a lot.
In general, I hope you appreciate the hard
work that all the TAs are doing.
Back when I was an undergrad in Germany,
people were just like, you're 10 minutes
late with your assignment submission,
you get zero on the assignment.
If you can't make the final project or
the final deadline for the midterm or
something, you just take
the class next year.
So, [LAUGH] hopefully we're making
everybody a lot happier than those times,
and we're trying really to be really fair.
So with that said,
we'll give you some starter pseudo-code;
that is our way of trying to balance
the two sides so the fewest people are unhappy.
Really the starter pseudo-code is super
simple, I've given it to a couple of
people who were struggling with QA and
who came to my office already before.
But it's something that you should all
be able to implement fairly quickly at
this point.
And so I'll walk you through a little bit.
This starter code, implemented properly and
tuned well,
with the right hyperparameters and so on,
should get you at least above 50% F1.
And the code is essentially: you just
pipe your question through an LSTM, and
you get the final hidden
state of the question, q.
You pipe your input through an LSTM.
You get an output hidden state,
let's call it x_i, at
each word in the input.
And then you just take a neural
network to classify, where
the two inputs are the question vector,
that is the final hidden state,
and each hidden state at
a certain time step, x_i.
And then you predict the start token,
and you can either use the same network or
probably a different one to predict
the end token for each question.
So something as simple as that should
get you something like a 50% F1 score.
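Here is a minimal sketch of that baseline, written in PyTorch rather than the class's actual starter code; the layer sizes, module names, and the assumption of pre-embedded inputs are illustrative.
```python
import torch
import torch.nn as nn

class BaselineQA(nn.Module):
    """Sketch of the simple baseline: question LSTM, document LSTM,
    then per-word start/end classifiers over [x_i ; q]."""
    def __init__(self, embed_dim=100, hidden_dim=200):
        super().__init__()
        self.q_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.d_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.start_clf = nn.Linear(2 * hidden_dim, 1)  # scores each word as the start
        self.end_clf = nn.Linear(2 * hidden_dim, 1)    # a separate classifier for the end

    def forward(self, question_emb, document_emb):
        # question_emb: (batch, q_len, embed_dim); document_emb: (batch, d_len, embed_dim)
        _, (q_final, _) = self.q_lstm(question_emb)
        q = q_final[-1]                                      # final hidden state of the question
        x, _ = self.d_lstm(document_emb)                     # x_i at every document word
        q_tiled = q.unsqueeze(1).expand(-1, x.size(1), -1)   # repeat q at every position
        features = torch.cat([x, q_tiled], dim=-1)           # [x_i ; q] at each time step
        return self.start_clf(features).squeeze(-1), self.end_clf(features).squeeze(-1)
```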
And then on top of that,
you can do all the bells and
whistles that the TAs
have talked about before.
You can take all the elements of
the hidden states of the LSTM,
and you do an inner product
with the inputs and
you compute this co-attention or context
matrix, and lots of other extensions.
But really, we hope that this is something
that's possible for everybody, but
the groups who have already put in all the
work, that should not be a big surprise.
And they may have some version of these,
and
probably more advanced
versions than that already.
All right, any questions about
the starter code, the project?
So, I guess, the question is: any advice
on whether you should stick to what you have or
use this simple baseline?
I guess it depends on where
you are with your F1.
If you're much above that, then you
probably don't have to get back to this,
and you probably already capture something
of that sort in your current model.
In general, these first two steps are good
steps for pretty much every model, so
if you haven't done that
just throw that in there.
For these, you probably have done
something more advanced by now,
and if you have, then that's fine.
Sometimes, there's always a fine balance,
and you might be really
annoyed with how hard this is.
But this is really also what we would like
you to teach and learn about the field,
and sometimes it's frustrating,
and sometimes you're really stuck.
And then learning exactly how to deal with
this is actually a super valuable skill,
both for Academic research as
well as industrial research.
Sometimes it's very hard
to set up your problem and
know where to get started from.
And so as you put these together,
sometimes you'll have a trade off.
You can tune a baseline more and
get higher, or
you can take a not-so-well-tuned baseline and
add some more complex model variants
to that baseline and also get better.
And so it's always a fine balance.
I think the default is
just make sure you have a baseline
that is set up correctly.
And that kind of simple baseline
should get you at least 50%.
Really if you tune that a lot, with lots of
crazy interesting dropout over recurrent
nets and so on, you could get up to
60% F1 with this kind of simple model.
Now, you don't need to tune it to death.
Sometimes, you basically get sort
of diminishing returns, right?
If you tune it a little bit, you get
a couple of percent improvements, and
then the last couple of improvements of
the baseline might be harder and harder.
And it might be faster for
you to just implement a slightly
more sophisticated model.
And that's true for generally all
sort of people running NLP systems.
Great question.
All right,
now one last note on the poster session
before we get on to the last limits.
Some of those limits actually
involve question answering.
So we will talk about the dynamic
co-attention network which some
of you may know now.
But, again, everybody is expected
to attend the poster session.
If you can not attend,
you have to submit a small video and
ask for an exception,
especially SCPD students.
Everybody is in two blocks.
We hope that in the block that you're not
assigned, you can actually walk around and
see other students' projects.
I can guarantee you that there's
some really exciting and
interesting projects out there.
And it'll be just I think fun to talk to
students even if you're a little sleep
deprived, maybe just before.
I'm sure I was in most of mine.
You will have a very nice lunch,
lots of food.
And because it's public there's
a lot of excitement around this.
That's kind of what I meant, too:
yes, this class, and
especially this PA4, is much harder.
But it is also a lot more useful
than a lot of other classes.
I personally know many dozens of people
who took many versions of this class
before and they got job offers
just because of that class and
what they've done and
their projects in this class.
So there will be lots of companies and
representatives from those companies,
there will be VCs and who knows you
might even get some seed funding just
because you have an awesome project.
So hopefully that will make you less
upset about the struggle of
the last week for this project.
All right, any last questions
about the poster session?
All right so let's talk about
the limits of single task learning and,
in general, deep learning for
natural language processing.
I think so far the field of deep
learning and NLP has gotten very good at
taking a single dataset task and model and
metric and then optimizing that setting.
That's kind of what we've also gone
through a lot of examples in this class.
And thanks to these end to end trainable,
deep learning models,
the speed of these improvements has
also gotten better and better over time.
Which is really exciting to see,
especially if you followed the field for
a long time.
However, if we continue to start all
these projects from random parameters,
which we mostly do,
except maybe the word vectors
(word vectors are a great way to
pre-train a lot of your models),
we won't ever obtain a single natural
language understanding system
that we can just kind of converse with,
one that understands language
in all of its complexity.
And so
I personally don't think that a single
unsupervised task can fix that either.
In fact you'll hear some
people talk about this and
this is certainly a point of contention.
Can we have a single unsupervised task and
just solve that really well and
then get to some kind
of better AI systems?
I don't think NLP will fall into that
category because largely language
has actually a lot of supervision and
different kinds of feedback.
And requires you in the end to
solve a lot of different tasks.
In language if you want to have a proper
language system, you may have to do
some sentiment understanding of what
you're getting, given this input.
But sometimes you also have to logically
reason over certain kinds of facts.
And other times you have to retrieve
some different facts from a database or
maybe logically reason over facts in
the database and do some memory retrieval.
And yet again other times you have to
ground whatever you're talking about in
the visual or physical world.
And so there are a lot of
different kinds of components, and
if we want to have a system that
understands language better and better,
ideally that system can incorporate
lots of different things.
And so in a more scientific way,
and the way we kind of described in
a lot of tasks, we have different kinds
of frameworks for sequence tagging,
sentence level kinds of classification or
two sentence kinds of classification.
Like understanding entailment,
logical entailment and things like that.
And we have a lot of different kinds
of sequence to sequence models.
And so
as I mentioned a couple of slides ago,
we have a bunch of obstacles
to get towards such a system.
And here's just a couple
of very recent papers.
Several of which I've been involved with,
so I'm very excited about them.
And then some also, one from Google.
Where basically,
we're trying to tackle that limit,
the limits that we have in natural
language processing, especially deep NLP.
The first one is one that we actually
already talked about which is we didn't
have a single architecture
let alone a single model.
Again, architecture might have different
hyper-parameters, different weights for
the different tasks that you work on.
And we already basically talked
about this dynamic memory network,
which could also be used for
question answering.
And some form of that you might even
be able to use for your question answering.
But we already talked about that.
So I want to talk about the next obstacle
which we didn't get to last time.
And that is to actually jointly
learn many tasks in a single model.
Now, fully joint multitask
learning is really, really hard.
What do I mean by this?
So basically so far when people talk about
multi-task learning or many-task learning,
they assume there's a source task and
then there's a target task.
And they just kind of hope that
the pre-training your neural network
on the source task will
improve another target task.
But in this case, I'd ideally have
both of them be trained jointly.
So instead of having separate decoders,
for instance for different languages or
different classification problems,
ideally we have just a single,
very large set of different classes we
might wanna predict about a certain
text input,
and really the exact same decoder.
So if we have a
sequence-to-sequence model and
we have a question about each sequence.
Ideally, the sequence decoder can just
output different kinds of answers
depending on what the question
was about that input.
Now when people do multitask learning
in many cases they also just
share lower layers and train those
jointly, but not these higher layers.
So what I mean by this,
in natural language processing mostly
we're sharing just the word vectors.
We don't share other
higher LSTM layers for
instance across a whole
host of different tasks.
And computer vision is actually
a little further ahead in that respect,
in that a pre-trained CNN
on a very large dataset like
ImageNet can actually be used for
a lot of other tasks pretty well.
You just change the top layer of a deep
convolutional neural network in computer
vision and
you can still get pretty good accuracy and
transfer a lot of the learnings
from different visual tasks.
We still can't really do that
very convincingly in NLP, and
in many cases you'll only read about
multitask learning in the cases
where the tasks were somewhat related and
hence helped each other.
So we know, for instance,
part of speech tagging helps parsing.
Cuz the parser makes decisions.
And if it knows a certain
word is a determiner,
then it's almost clear which word
should be the dependent of the other.
However, what you rarely ever read about
is when the tasks aren't perfectly related
and good matches,
they don't help each other.
They actually hurt each other.
And so these kind of negative
results are very hard to publish.
And hence, not talked about very much.
And so, yeah, these are all the issues,
or at least some of the issues,
of why multitask learning is really hard.
And I think that's at the perimeter of
the limits of deep learning for NLP.
And so, this is a paper that's
currently in submission,
that basically tries to tackle that.
The title of the paper is A Joint
Many-Task Model: Growing a Neural Network
for Multiple NLP Tasks.
And the final model is actually
quite a monster, to be honest.
It has a lot of different components.
Fortunately, we now know pretty
much all of these components.
And hence, we can talk about this very
paper, it's not even published yet,
I mean it's on arXiv.
But you should be able to understand
all the components of this model now.
And be able to implement
something very similar.
So I'll go over it a little
bit in a high level, and
then we'll zoom in to
the different aspects.
And feel free to ask any kind of question.
So the first thing that we'll do is, we have
some kind of word vector representations.
And there are actually some clever things
in this paper about n-gram vectors too,
instead of just word vectors.
When you have these unknown
words, you can go to subword tokens;
Chris mentioned character models,
which are a similar kind of idea.
And then, the word vectors are basically
given to a series of LSTMs.
All of these big blocks here are LSTMs.
And the output of one LSTM is given
as input to the next one, but
not just the output from the softmax,
but also the hidden states of the LSTMs,
as is standard when you stack multiple LSTM
nodes or cells on top of one another.
So you have these short
circuit connections.
So the first LSTM here will just classify
part-of-speech tags at every word.
The next one will classify beginnings and
endings of chunks.
And then this one will
do dependency parsing.
I'll describe how to do that
with a simple LSTM in a second.
And then, when we classify
dependency parses, for instance,
we still take as input these short circuit
connections from part of speech tags,
to each of these higher level tasks.
And then, at some point, new tasks and
higher level tasks will require you
to understand two sentences at once.
So then we have a simple
sort of pooling scheme,
similar to what we described with
convolutional neural networks,
where we pool over time for
classifying relatedness and entailment.
And in the end we can train this entire
beast jointly in one objective function.
All right, before I jump into details,
any questions high level?
Great question,
why do we have two of them?
So this is just, you can think of it,
if you only have tasks that require
one sentence, you can just have one.
It's just if you want to
classify how related is this
sentence to the other sentence,
we just show two.
And because that's sort of
the highest level we get to,
we just showed it in one plot
to have all things in there.
So the question is,
if the relationship is symmetric,
wouldn't you wanna use the same system for
both sentences?
So, we do use the same system for
both sentences.
These two here are completely
identical pieces, and so is this one.
It's just that once you
pipe them through here, you basically
take into consideration where they are.
But you can also pool across these
two different final representations, to
make sure that they're symmetric as well.
All right, so I think it'll become clear,
sort of what's going on,
when we zoom into the model.
So, again, we have this character n-grams,
as well as standard word vectors,
like word2vec that we've learned about.
And this first layer here is a very
standard part of speech tagging LSTM.
At every time step we essentially
just have a single layer LSTM, and
we pipe that into softmax.
And then, what we also do is actually
we'll compute a label embedding,
that essentially will allow us to take
into consideration some of the uncertainty
that the part of speech tagger had.
The main idea:
you can think of this basically as another
layer that takes the softmax output as input.
But you can also write it as this
kind of convex combination here,
where every label that you have
has associated with it a vector.
And you basically sum up all these vectors
in a weighted sum, where the weight
depends on how certain the model was
to have that label at that time step.
So for instance, in reality you have
over 40, but let's say you had 3
different part-of-speech tags, just
adjectives, nouns, and verbs or something.
Then basically each of these 3 will
have a vector associated with it,
say a 50-dimensional vector,
which is something you'll learn as well.
And you have some probabilities: you
think with 0.9 this is a verb, and
with 0.05 it's an adjective or a noun.
Then you multiply these 3 probabilities
with their respective embeddings, and
that weighted sum
is the label embedding for y.
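As a small illustration, here is what that convex combination looks like in code; the number of labels, the 50-dimensional size, and the example probabilities are just the numbers from the example above, not values from the paper.
```python
import torch

# label embedding = probability-weighted sum of learned per-label vectors
num_labels, label_dim = 3, 50
label_vectors = torch.nn.Parameter(torch.randn(num_labels, label_dim))  # learned jointly
probs = torch.tensor([0.9, 0.05, 0.05])          # softmax output at this time step
label_embedding = probs @ label_vectors          # convex combination, shape (50,)
```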
And so now, those are the outputs
of the POS tagging LSTM.
And so to go to the next level,
the chunking model will actually take
as input, again,
the word vectors directly,
the hidden states from the POS LSTM,
and that label embedding.
These are all the inputs, and they just
get concatenated, and
we plug that into another LSTM.
And that will, again, do something
very similar, where it has as output
a hidden state softmax, and then a label
embedding for the chunking labels.
And you could, in theory, do this a lot.
And some previous similar kinds of
architectures had actually thought about
putting all of these into the same layer.
And we compare that, and
we find it works better if you have
these three different tasks, POS,
chunking and dependency parsing,
actually all in their own LSTM layer.
Any questions about that architecture?
Cool, now on dependency parsing, it's a
little more complicated because in the end
we wanna have a tree structure, right?
And so dependency parsing,
turns out, in many cases,
used to require some kind of beam search.
But here this model actually
is incredibly simple.
We, again, have a standard
bidirectional LSTM.
With now four inputs, the word vectors,
the hidden state of the chunker,
and the label embeddings for
POS and chunking.
So these are just four
inputs at every time step.
And now a bidirectional LSTM,
as we defined it in class.
And now basically we'll just run
a quadratic number of classifications,
just asking: is this word the dependent of
that word, or of that word, or that word?
We run through all of them,
take the maximum for
each word, and
we just say that's the tree.
Now if you think about this a little
bit, it might not even be a proper tree.
Maybe no word classified
itself as the root, even though
every word has the potential to
classify itself as the root of the tree.
Or maybe two words
pointed to the same parent or
the same child, or they create loops, or
anything like that.
So in theory, this might not
even create proper trees, but
in practice, surprisingly, it
does in like 99% of the cases.
There's a very small number of
cases where this very simple
feed-forward architecture does
not give you a proper tree, and
you can use basically some very simple
deterministic rule-based systems
to clean up that last less than
1% of non-proper trees, and
just delete certain edges or
add certain edges like the root of the tree.
And then you get a proper tree.
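To make the head-selection idea concrete, here is a hedged sketch: score every (dependent, head) pair from bidirectional LSTM states with a bilinear layer and take the argmax per word. The dimensions and the scoring layer are illustrative, not the paper's exact parser.
```python
import torch
import torch.nn as nn

seq_len, input_dim, hidden_dim = 10, 150, 100
bilstm = nn.LSTM(input_dim, hidden_dim, bidirectional=True, batch_first=True)
scorer = nn.Bilinear(2 * hidden_dim, 2 * hidden_dim, 1)   # score(dependent, head)

x = torch.randn(1, seq_len, input_dim)            # word vectors plus lower-layer features
h, _ = bilstm(x)                                  # (1, seq_len, 2 * hidden_dim)
dep = h.unsqueeze(2).expand(-1, -1, seq_len, -1)  # dependent state repeated over candidate heads
head = h.unsqueeze(1).expand(-1, seq_len, -1, -1)
scores = scorer(dep.contiguous(), head.contiguous()).squeeze(-1)  # (1, seq_len, seq_len)
predicted_heads = scores.argmax(dim=-1)  # greedy head per word; a rule-based clean-up
                                         # can then repair the rare non-tree outputs
```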
And this actually resulted in the state
of the art dependency parser,
submitted it but
since then I think one of Chris's papers.
Just outperformed it
a little bit already again.
It's never ending fun race
that we all work on together.
To work on pushing state
of the art on these tasks.
But yeah, somewhat surprising,
no beam search required,
just feed-forward computation.
And you get pretty good
trees most of the time.
All right, any questions around
the dependency parsing module?
Yeah?
You could do a lot more things to improve
it, and actually add a proper beam search and
go through several of the scenarios or
something like that.
A proper exhaustive search like CKY you can't
really do, because you have
these continuous vectors usually, and
CKY is also mostly for constituency
parsing, not dependency parsing, and so on.
But you could do a lot more
clever things and slow it down.
Surprisingly you don't have to.
We just, all of this computation is
parallelizable, it's super fast,
there's no extra infrastructure needed for
any kind of tree search.
All right, now the last level is
basically to train on pairs of sentences for
different tasks such as
semantic relatedness.
And what we do here is basically have
a simple temporal max pooling.
So that last LSTM
will produce a feature
vector at every time step,
and you will now just look, across all
the time steps, at each of the hidden
dimensions, find the largest
value, and just pick that one.
So that's why we call
it temporal max pooling, and
you can then look again at these
simple things like inner products
between those features, and
vector distances and so on.
Extract some features and
pipe that into another softmax to classify
both relatedness and
entailment kinds of relationships.
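Here is a minimal sketch of that last level, assuming we already have the hidden states of the two sentences; the particular comparison features (element-wise product and absolute difference) are common choices and only illustrative.
```python
import torch

h1 = torch.randn(1, 12, 100)   # hidden states of sentence 1: (batch, time, features)
h2 = torch.randn(1, 9, 100)    # hidden states of sentence 2

# temporal max pooling: per-dimension max over all time steps
s1 = h1.max(dim=1).values       # (batch, 100)
s2 = h2.max(dim=1).values

# simple symmetric comparison features, then fed to a softmax classifier
pair_features = torch.cat([s1 * s2, (s1 - s2).abs()], dim=-1)
```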
So it looks kind of complicated, but
really it uses all the components
that we've carefully went through
in class, just in a clever new way.
Now sadly, when you just say,
all right, this is my whole model,
now backpropagate (every
time you had a softmax,
we use our standard cross-entropy error),
and you just throw that at it,
it doesn't quite work right away.
There's one extra idea
that you have to use, and we
call it sort of
successive regularization,
where basically inside each mini-batch
you allow the model to first focus
on different tasks, and
then as you go higher,
you regularize the weights of the
lower levels to not change too much, and
that "too much" is defined by this
regularization term delta here.
So this is basically,
then one of the novelties of how to make
the training more robust, and actually
result in the end, with a final system
that gets the state of the art on four
out of the five tasks that we looked at.
And so again, intuitively here you have
at the end of the first mini-batch
where you focused on just part of speech
tagging, you have a set of weights theta
that define your label embeddings,
your LSTM weights, and so on.
And you now say, when you train the next
higher-level task, chunking,
to not move too far away
from those weights that were really
well tuned for part-of-speech tagging.
And then, as you go higher and higher,
you basically try to keep
more things the same.
But if the higher level task really
wants to change a certain weight,
it can still do it.
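A minimal sketch of that extra loss term, assuming prev_params holds a frozen copy of the lower-level weights from the previous task's training pass and delta is the regularization strength; the function name and the exact form of the penalty are illustrative.
```python
import torch

def successive_regularization(lower_params, prev_params, delta):
    # penalize lower-level weights for drifting away from the values
    # they had after training the previous, lower-level task
    penalty = sum(((p - p_prev) ** 2).sum()
                  for p, p_prev in zip(lower_params, prev_params))
    return delta * penalty

# total loss for a mini-batch of the current task (illustrative):
# loss = task_cross_entropy + successive_regularization(lower_params, prev_params, delta)
```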
That's right, so the question is as you
train inside each mini-batch or really
almost like the whole epoch, you can focus
first on each of the different tasks,
and you do that in a way that you
start with the lower level tasks and
then you move up through the network,
that's right.
So each mini-batch is actually
focused on a single task.
So each mini-batch focuses on one task,
but as you go and
you finish on that,
you go to the next task.
That's exactly right.
When you go to the next task,
you have a soft sort of
regularization or
clamp on those previous weights.
So that's something that could actually
work for various projects; some folks
had the idea of using SNLI or entailment
classification as a pre-training step
for question answering, and those are all
kinds of ideas that you could try as well.
So for most of those tasks,
joint training actually helps.
There's a lot of complexity in
getting all these numbers, and
this whole paper actually has
like over 12 or 15 tables for
the various ablation studies of using
the successive regularization, yes or no?
Using character n-grams, yes or no?
Versus just word vectors.
Training various combinations
of tasks together.
There are a lot of experiments
that went into this paper.
Basically overall, these two are sort
of the summary of the tables.
When you look at all the tasks trained
separately versus
jointly trained,
they basically all improve.
And these are all tasks that have
been worked on quite a lot so
you don't see huge improvements.
Relatedness here is actually
the lower the better.
So this is also good.
For some of the higher level tasks
on smaller data sets you also get
larger improvement.
So in general joint training with
different tasks often helps.
When you have less data.
It helps more when you have
less data per task, right?
Because then you can transfer more.
If you have a simple task that is
only a binary classification problem,
for instance, or something like SNLI,
where you have to decide whether this is
entailment, contradiction, or neutral, and
you have hundreds of thousands of
examples for each of the three labels,
then you can probably just get away with
training everything on just that dataset.
The more complex your output space is, as in
machine translation for instance, or
the smaller your dataset is,
the more you benefit from trying to
jointly train with different
kinds of objective functions,
having some unsupervised
pre-training of word vectors,
then maybe some semi-supervised things,
where you continue to train
unsupervised word vectors together
with some supervised tasks and so on.
This result here is in parentheses
because the part-of-speech tagging and
chunking subsets are actually,
sorry, the dependency and
chunking datasets actually
overlap on the dev and test sets.
And so obviously, chunking will
help a lot with the dependency part.
You know that inside this chunk
everything should point to one another
inside the dependency tree.
And so,
this result is a little too optimistic, so
we have one that just trains
on these two jointly.
You still get an improvement,
but it's less strong.
So for these numbers in parentheses,
when you carefully look at your training,
your dev and your test sets,
you realize there's some overlap.
And you need to mention that,
which we do in a footnote.
Any questions around the experiments
over the set of this model?
Now, these are just some
more results of the various,
just a subset of the many people who have
worked on all these different tasks, and
sort of the comparison.
And this is generally something that
I've encouraged you to do in all your
proper projects, but also something
that you'll see in most good papers.
You usually have two sets of tables.
One set of tables is about
you comparing your best model
to all other people's best models.
And then the other subset of tables is
about understanding your model better with
ablations and
modifications to your model and
decisions that you made about your model.
And so this set of tables here
is basically the comparison to
all the other folks,
who have worked on those tasks.
And in many cases,
it's basically the state of the art model.
And this is just one of the many tables
of this paper that tries to understand
all of the various combinations, and
which tasks help which other tasks.
All right, any questions about
joint many-task learning?
So the question is: what's the key insight
that made this model work? To be honest,
there are a bunch of them, and
they all matter a little bit.
So having better word
representations with character n-grams
helps just a little bit.
In the paper,
you'll see how much they all help.
And then, having the short
circuit connections helped and
we have a table that shows all the deltas
for having the short circuit connections.
From all of the lower level tasks,
outputs, and
label embeddings directly
to the higher level tasks.
And then, the successive regularization
helped a little bit also, so yeah.
It's actually a sequence of things.
There's no single insight other than
of course having this main model.
So we also have a table that shows how
much it helps to have three layers for
all these tasks versus a three layer LSTM,
where all the tasks are output at the same
height or same depth of the network.
And we show that it works better
if they're actually sort of,
each task has its own LSTM.
So yeah, it's a combination of those and
because there's so
many moving pieces, there's so many,
over a dozen tables in the paper
that show how much each helps for
each of the five different tasks.
No, many of them we
invented in this paper.
So they weren't available back then.
So the question is, and this
is something I brought up myself, right:
Can you actually add some of
these things to other models?
And so the word vectors, for
instance, are an idea that you
could add to all the other models.
The successive regularization
doesn't really make
sense unless you have successive layers,
which no one had really done for
more than two tasks before.
Some of the model architectures and
the differences are just very novel and
then you have to think of what
models would actually do this.
The majority of papers published
on these different tasks
aren't extendable in that kind of way,
right?
They're, for instance, graphical models,
where it wouldn't be obvious to just plug
this vector into this other thing and
something would happen.
So it's hard to use these insights
on a lot of these previous models.
Or they have convolutional operators
instead of LSTMs, and so you don't have
a nice sort of, at this time step I have
this representation and things like that.
But yeah, at least the word vectors and
character n-grams,
that's a pretty general insight
that a lot of people could use.
All right, awesome.
Now, another obstacle that we also
discussed already briefly before
is that we don't have zero
shot word predictions.
And what do I mean by this?
In almost all the cases,
the various models that we described,
like the machine translation models,
have softmax at the end.
And you can only predict the words
that you've seen at training time.
And we've also already covered
how to fix this with pointers.
And you'll see now in PA4 already
that we also have there not just
a pointer to a single word but
pointers to spans of words, so
beginning and end token pointers.
And actually an interesting side note here.
Again, we've covered this already.
But you can also, in this PA4, and
for general question answering,
try to predict a sequence of single words,
with a set of pointers like this.
It actually turns out to not work as
well as pointing to the beginning,
learning to point to the beginning token,
and then the end token.
That works better by 2 to 5% or
so, depending on how you do it,
than pointing to a sequence
of different words.
Basically you make two decisions versus
having to make five decisions if you point
to a span of five words.
All right, now let's have our research
highlight on Neural Turing Machines.
Take it away Nish.
>> Thanks Richard.
Hi everyone, today I'll be presenting
on Neural Turing Machines.
So we'll be covering two papers.
One on the Neural Turing Machine itself.
And then a second paper on
differentiable neural computers;
both of these
papers were from DeepMind.
And we'll be seeing the architecture
proposed in the first paper and
then the results from the second paper.
The architecture modification in
the second paper is only slight, and we
really just wanna take away the high-level idea
that these architectures have introduced.
So all the neural networks
that we have seen in class so
far excel at pattern matching.
So you might have heard
of DeepMind's agent
that played Atari games such as
Breakout with superhuman performance.
And these tasks are relatively easy for
the network because it has
to make very reactive decisions.
However, when it comes to
reasoning from knowledge,
neural networks still struggle at that.
Consider the problem of
finding the shortest path.
Now in our introductory algorithm classes,
any algorithm, such as DFS, or
breadth-first search,
usually requires us to store which
nodes we have visited before.
In the current architectures that we
have seen so far, it's really hard for
networks to store that information.
So the solution to this
is having more memory.
But you might be wondering:
didn't LSTMs already have memory cells?
It is a valid question,
but this is not the right kind
of memory we are looking for.
So to put it in systems speak, you can
consider the LSTM's memory cell as a cache.
What we really need is random
access memory, or RAM.
And this is where
Neural Turing Machines come in.
So, okay, yeah,
in this architecture
diagram, the controller is an RNN.
And it decides whether to read from and
write to the memory cells.
And we'll see how both of these
operations are implemented.
How does reading and writing work?
If you have taken previous
systems classes at Stanford,
you might have realized that memory is
inherently very fundamentally discrete.
So how do we make it differentiable,
because we need to optimize
it using back propagation.
And the answer to that is our friendly
attention mechanism:
read and write everywhere,
but to different extents.
And you'll see about how
we go about doing that.
So, how does reading from memory work?
So we have this memory vector and
we have been provided with an attention
vector corresponding to it.
So, in this case, the first element,
I'm zero indexing here.
The first element of the attention vector
is blue, which means it has high value.
And so we read the first element
from the memory vector itself.
And it's a weighted sum, so
given the attention vector,
the reading would be different.
Similarly, in terms of writing, we have
our old memory and we have a write value.
We want to write everywhere, but how much
do we write each value in the memory by?
And, again,we use
the attention mechanism here.
So you can see that the second
element is blue here.
And although the write value and
the memory at that location point in
opposite directions, you can
see that in the new memory
the vector has shrunk, just
because they have similar magnitudes
in opposite directions.
The new value is a convex combination of the
old memory and the write value,
weighted by the attention.
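As a hedged illustration of those two operations, here is a simplified erase-free read and write with an attention vector; this is a sketch of the idea, not the full NTM update with separate erase and add vectors.
```python
import torch

memory = torch.randn(8, 20)                        # 8 slots, 20-dimensional values
attention = torch.softmax(torch.randn(8), dim=0)   # how much to focus on each slot

# read: attention-weighted sum over all memory slots
read_vector = attention @ memory                   # shape (20,)

# write: blend each slot's old content with the write value,
# in proportion to how much attention that slot receives
write_value = torch.randn(20)
new_memory = (1 - attention).unsqueeze(1) * memory + attention.unsqueeze(1) * write_value
```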
So, in both of these cases of read and
write,
we assumed that we had
a correct attention vector.
And how do we go about
actually getting that?
The controller has a query vector, and
it looks at each slot in the memory and
performs a dot product,
in order to see which one
it is most similar to.
So in this diagram,
blue indicates high similarity and
pink indicates very high dissimilarity.
We perform a softmax to get the memory
that has the most attention.
Now we also have the attention
from the previous step.
We interpolate with that
attention to get
what part of the memory vector
we should be focusing on now.
Then, we can apply the shift vector.
Now this is what enables us to read
at different locations around that
focused attention.
And we then sharpen it to get our
final attention distribution.
This final attention distribution is then
fed into the read and write operations.
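A minimal sketch of that addressing pipeline, assuming scalar gate and sharpening parameters and a three-way shift distribution; the function and argument names are illustrative, and this glosses over how the controller produces them.
```python
import torch
import torch.nn.functional as F

def address(memory, key, prev_attention, gate, shift_weights, gamma):
    # content-based addressing: similarity of the query key with each memory slot
    similarity = F.cosine_similarity(memory, key.unsqueeze(0), dim=1)
    content_attention = torch.softmax(similarity, dim=0)
    # interpolate with the attention from the previous time step
    attention = gate * content_attention + (1 - gate) * prev_attention
    # circular shift: mix attention shifted by -1, 0, +1 positions
    shifted = sum(w * torch.roll(attention, offset)
                  for offset, w in zip((-1, 0, 1), shift_weights))
    # sharpen so the distribution does not blur out over time
    sharpened = shifted ** gamma
    return sharpened / sharpened.sum()
```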
We can now see the result; I'm not
sure if the video has been incorporated.
Okay, let's see.
Just know that this video is from
the differentiable neural computer, which is
a slightly newer architecture compared
to the Neural Turing Machine,
but uses the same underlying principle
of having an external memory bank.
So, in this case, our task is to
infer relations from a family tree.
In most cases it's a graph traversal
problem, as well as a storage problem.
And a standard LSTM would struggle
really hard with this problem,
which is where
Neural Turing Machines really shine.
Keep in mind that the memory vector
is being updated as we see for
each one, right.
I'd just like to acknowledge the papers and
the resources I used, and
back to Richard, thanks a lot.
>> Yeah.
>> [APPLAUSE]
>> All right, now to another obstacle.
And that is that we actually
have multiple superfluous,
if you will, word representations.
So I mentioned that we share Word2Vec and
GloVe kinds of pre-trained vectors.
And now if we train an output layer, such as for
machine translation or language modeling,
we'll actually have another
set of weights in the softmax:
one vector for every single word that we
have in the output, the softmax output.
Now what that means is, at the top
here we have this large softmax.
It is the size of our vocabulary times
the hidden dimension of the LSTM.
And at the input, we also have a word
vector for every word.
So, again, the same size, V times
the size of our word vectors.
Now, a really cool paper and result and
idea, which actually came from two students
who took this class, or 224D, last year,
was to tie these two sets of vectors together,
to just say they have to be the exact same
vectors.
So your Softmax weights for every word are
the exact same as your input word vectors.
And you train them both jointly,
you just back propagate.
Take the same derivatives, but
now, they are actually the same.
It's very easy to implement if you don't
have to take the derivatives yourself,
in TensorFlow and so on;
it would be a little harder otherwise.
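Here is a minimal sketch of that weight tying in a language model; the module names and sizes are illustrative, and the only real requirement is that the embedding and hidden dimensions match so the two weight matrices have the same shape.
```python
import torch.nn as nn

class TiedLM(nn.Module):
    def __init__(self, vocab_size=10000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)     # input word vectors, shape (V, d)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.Linear(dim, vocab_size, bias=False)
        self.decoder.weight = self.embed.weight        # tie: softmax weights == input embeddings

    def forward(self, tokens):
        h, _ = self.lstm(self.embed(tokens))
        return self.decoder(h)                         # logits over the whole vocabulary
```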
They also have some really nice
theory about the Softmax and
various sort of temperatures
when you do this.
But we're not gonna go
into all those details.
But basically, it's a very simple idea and
it turns out to quite significantly help.
So, here we basically have, again,
this language modeling task
over the Penn Treebank.
We mentioned that this
Pointer Sentinel idea got 70.9.
And then these very,
very large models, sort of 38 different LSTMs,
for instance,
with 2.5 billion parameters, get 68.
But this simple idea,
where we basically tie
the word vectors together, with
just 51 million parameters,
got the lowest test perplexity
when that paper came out.
Which is kind of incredible;
again, the speed at which
this perplexity
has been reduced more and more,
and how much better and
better we are able to predict the next
word, is kind of incredible.
So it's a very simple idea that
you can actually use every time
you have an output space that includes
all the words in your vocabulary,
as well as input word vectors.
You can use this idea, and one,
you're reducing one of the largest
sets of parameters in your model.
So you use less RAM,
you can have larger mini-batches,
you can train faster,
use less GPU RAM, and everything.
And it's more statistically efficient:
whenever you see a word
in the output,
it also benefits its input representation.
So very neat idea, very simple,
gives you a nice improvement.
Any questions about this idea?
It's one of those nice examples
where everybody kind of assumes,
you just have a Softmax,
and you have word vectors.
So nobody really thinks about it, and
then sometimes people think about it.
And question some of the basic
assumptions of the field, and
find a way to do a better job, so.
It's a really cool result and
one of the best projects from that class.
Now, obstacle 5 is something
that's very relevant to PA4,
so we'll spend a little bit more time on
it. But basically, it tackles the problem
that in many cases, questions that we might
ask a system have representations that
are independent of the current context or
the input that we might have.
So, a kind of fun example is the question,
may I cut you, which should be interpreted
very differently depending on whether I am
holding a knife or whether you're standing in line, right?
And so you might want to have your
question be reinterpreted given
the context and the input, and the reason I
brought up the dynamic memory network is
that this is in some ways a further
refinement of that kind of idea.
You will still have some kind of
document encoder, you'll have some kind
of question encoder,
you'll have an answer module, but
this answer module actually
predicts indices of the answer.
And then you have this coattention encoder,
instead of the episodic memory module you
have seen before, and now this coattention
encoder looks kind of complicated.
And it is a little bit complicated in
real life, but not too badly so, so
let's walk a little bit through it, and
the paper gives you all the equations.
And this is a reasonable model to
try to implement once you
have your baselines implemented and
bug free.
You can really actually in many ways
start from just this first step here,
similar to the pseudocode I gave you.
And then several of these modules you
can actually add one by one and see for
each one how much it improves.
And in fact, Caiming Xiong is
the first author of this paper,
that's exactly what he did.
He looked at it, looked at errors, and
then tried to add more coattention.
And then tried to add LSTM to
incorporate all the facts again for
multiple time steps and things like that.
So there's kind of a hill climbing on
the architecture kind of approach.
So, on a very high level,
let's say you have a question Q here,
and you have the hidden states of an LSTM.
And you have some document input D here,
and you have m + 1 steps here;
you actually have this sentinel
too, that's why it's +1.
Now what you can do is essentially take
the inner products between all these
hidden states.
And that's how you get these
sort of context matrices and
then you can multiply these again with
the hidden states in these products.
And you can concatenate
various combinations of
these products between
these two sets of vectors.
So you have these outer products
compute these context vectors and
then you concatenate them in multiple
ways until you have the final state here.
And now that one,
you'll pipe each input here,
you pipe it through
a bidirectional LSTM again.
And that will now be the question
dependent interpretation
of every word in your input document.
So it's basically just a lot
of inner products and
outer products between
the hidden states of two LSTMs,
such that you understand how related
this time step, this
word of the question, is to that time
step, that word, of the input document.
Lots of inner and outer products, and
then you try to agglomerate
all these different facts again
in the bidirectional LSTM.
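Here is a hedged, simplified sketch of such a coattention step: an affinity matrix of inner products between question and document states, attention in both directions, and a bidirectional LSTM that fuses the result. The dimensions are illustrative, and this omits details of the published model, such as the sentinel vectors and projection layers.
```python
import torch
import torch.nn as nn

hidden = 200
Q = torch.randn(1, 15, hidden)    # question hidden states: (batch, m, hidden)
D = torch.randn(1, 60, hidden)    # document hidden states: (batch, n, hidden)

L = D @ Q.transpose(1, 2)                  # affinity matrix of inner products, (batch, n, m)
A_q = torch.softmax(L, dim=1)              # attention over document words, per question word
A_d = torch.softmax(L, dim=2)              # attention over question words, per document word
C_q = A_q.transpose(1, 2) @ D              # document summaries aligned to the question
C_d = A_d @ torch.cat([Q, C_q], dim=2)     # coattention context for each document word

fusion = nn.LSTM(3 * hidden, hidden, bidirectional=True, batch_first=True)
U, _ = fusion(torch.cat([D, C_d], dim=2))  # question-dependent representation of every document word
```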
And now once you have an output,
a hidden state of that LSTM, that will
be given as input to a classifier,
these highway networks,
basically just neural networks
with short circuit connections.
At each location of this now
question-dependent input representation,
you classify which of
these is the start token.
And that start token is then given as
input to yet another neural network
that will now take the previous start
token that we classified, together
with a potential end token, across
all these different vectors
from the question-dependent input
representation, to classify the output.
And you can do that multiple times, and
once the start and the end tokens are
the same as in the previous time step,
you'll basically stop.
So the reason we call this dynamic, here,
is that you do this multiple times, and
your first iteration might be wrong.
But you take the argmax,
the hidden state at the highest-scoring position.
This could be the 51st time step, for
instance, the word turbine.
You give that as input to this LSTM,
and its output is then given
as input to another iteration of this attempt
at predicting the start and end token.
Now, in a simpler world, let's say
you eventually want to get to this model,
and you might be very optimistic,
just implement the whole thing, and
then it doesn't work.
What do you do to debug?
Well, you just take out all
the different things and
you try to do the simplest thing,
which starts exactly at that pseudo code
I had in the very beginning of the class.
You just have an LSTM for the
input, an LSTM for the question.
And then you pipe each state of
the input into a neural network and
you try to classify start and end token.
And you might have some outer products
between them, and you plug those
into a straight up neural network and
you classify start and end token.
Then you might concatenate
these two outer products and
just classify those start end token.
If you eventually have that whole
coattention encoder, you could then say,
all right, now I just classify
independently the start and
the end tokens from that question-dependent
representation of the encoder.
Just one independent classifier for
the start token,
one classifier for the end token.
And then you can go on.
And each time it will take some time and
you run some experiment, but as long as you
sort of incrementally improve each step,
you know that you didn't introduce a bug.
And so whenever there is sort of general
bug fixing, you wanna try
to identify where your bugs might be as
you build a larger and larger system.
And so if you start from something simple
that you know works reasonably well and
is bug free, then each time
you add something to it
and it improves the accuracy, you can be
fairly certain that there's no new bug,
not always, but for the most part.
And so this is, you know, in the end
a very complex system that puts a lot of
these simpler steps together.
We actually again have sort of
introduced all of the basic components,
basically of this but there again,
sort of put together in a very novel way.
And you already know
the Stanford Question Answering Dataset,
unless of course you're doing a project
that has nothing to do with PA 4.
So I'll just describe it
a little bit briefly, sorry for
the folks who are doing PA 4 and are
intricately familiar with this already.
So the Stanford Question Answering Dataset
is a really great dataset of 100,000 plus
question answer input triplets.
And the way it's constructed is that for
each question the answer
has to be a particular span in
the input paragraph for the most part.
Sort of short documents but
really mostly paragraphs.
So when you ask, what is Donald Davies
credited with, what's also great is that
they actually have multiple people
answering the same question,
cuz sometimes it's ambiguous.
So one ground truth answer might be, Davies
is credited with coining the modern name
packet switching and inspiring numerous
packet switching networks in Europe.
Another person might just say he's
credited with just coining the modern name
Packet switching and inspiring
numerous packet switching networks, or
even shorter, just coining
the modern name Packet Switching.
And we would assume that all of
them are reasonably correct and
close enough, and if your model
predicts one, that it's good enough.
Great data set. Now again, whenever
you put a results table in, it's already
deprecated. Actually, one thing that was
really great to see, I just noticed today,
let's see if I can find this,
is the model now;
this is the SQuAD website.
Again, sorry to bore the folks who are
working on PA4 and not on their own project.
It's a really great new
phenomenon that I think we'll see
also as we push the limits of not
just deep learning for NLP, but
I think of machine learning
and AI in general.
So you have proper train/dev/test splits,
and nobody sees the test set.
You have to submit your code, so
that makes it more reproducible
in the future too,
if people are willing to open source their
code; of course, you don't have to here.
And it's I think in general
a great way to improve the science
of what is mostly an engineering
discipline, we're creating new systems and
so you see here different systems and now
you also can see when they were submitted.
So some groups were super active.
Now there's kind of,
this is my group's submission.
>> [LAUGH]
>> Four months ago, and
that is when that paper came out,
and when this table happened.
And in the last four months
we worked on other things,
and now this is not the state
of the art anymore.
And there are lots of people who
just this week submitted more, but
at the time of submission, this
dynamic coattention network
was the best model on SQuAD,
the first one to push it above 80.
What's also great is to actually
have a human baseline, and
that is something that will make sense for
you too sometimes.
And I have had several
student groups also,
in their problem sets, work on a task, and
then they say, I looked at my errors now,
which is great;
always do careful error analysis,
something we would definitely want to
see in your report and in the posters.
When does your model fail,
what can it not capture yet?
And sometimes, you look at your errors and
you actually say,
I actually agree more with my
model than with the data set,
the official ground truth label
is actually kind of wrong.
There's also just people, they were busy,
they had to make money on AMT or
something, Crowd workers, right?
Maybe they weren't properly filtered and
so on.
And eventually you might hit
an upper limit of just what that
data set can ever give you.
And so it's good to have
this kind of human baseline.
Here the human baseline is
sort of 91 in terms of F1,
or the exact match of 82.
And you know once you push above that,
really you're just fitting to the noise
of that data set in some sense.
And so
that is good if you're at that level, and
it also helps to feel less bad if you have
a new data set, you created it yourself.
It's good to know that
it's okay to be at 85,
because if I ask two people they
would only agree in 85% of the cases.
So this inter-annotator
agreement is pretty important
to consider as your pushing your
numbers sort of higher and higher.
Any questions on SQuAD,
the dynamic coattention network, yeah?
I don't actually know all
the details of who they asked,
it may have been just the first author.
It's the Turkers and
their inter-annotator agreement.
So maybe, okay, so
if that's the case, then basically
you can look at how often these agree;
here, the training set, Explore.
So how often do people actually agree
when they write their answers?
So here, there's perfect agreement between
the humans, but here, there might not be.
So one might say, what did the church
claim could be avoided with money?
God's punishment for sin,
versus just God's punishment.
Or the late medieval Catholic church,
versus just the Catholic church.
So they're different, sometimes
different people agree differently.
And it doesn't make sense for your model
to try to agree more with any single
human, than humans between one another.
How do you say its performance
exceeds human performance?
So you can try to do that by
basically saying, all right,
humans agree this often with other humans.
You can create an output that other
humans would be more likely to agree with
than with one another; that's one way.
Or you say I will take five or ten experts
in the world about a certain thing.
This actually becomes more important for
like medical diagnosis, when you wanna
also make those kinds of claims or
just train really accurate algorithms.
You could basically take a group
of experts, and you only select
those where the majority of the experts
agree on what the output should be.
And if you then agree more often with
the majority than any single doctor
would agree with that majority then
you can claim super human accuracy.
So what are the principles behind
sort of claiming a novel algorithm?
So I guess in some ways it's kind of
out of the scope of the question,
cuz it's a legal question.
I think in general, novelty of algorithms
is something that is also in the eyes
of the reader so, that's not really
a good scientific answer to the question.
No, I guess in general, a lot of these
papers are submitted to conferences and so
the question, whether they're novel
enough, kind of often is subjective and
in the eyes of the reviewer.
Which can also not always
be the right thing,
because two or
three reviewers can also be wrong.
So then here's a nice visualization,
again,
something I would encourage you all to
do for your projects and problem sets.
In this case we're basically trying
to understand whether this dynamic decoder,
having an extra LSTM layer on top of
just predicting a single start and
end token once, will actually help.
So as you go through this, it's kind
of hard to read, but basically this is
an input, and then you see the outputs of
the classifier of this highway network.
And how certain it is that
a certain word is a start token, so
66, end token 66 with just
a single word as a start token,
versus having the start token be 84 and
the end token be 94.
And actually it switches from the first
attempt at classifying the right span
to the second, and
in this case more correct.
All right,
now the second-to-last obstacle. One thing
you've noticed in
a lot of these more complex models is
that we actually use recurrent neural
networks as the basic building block for
a lot of the different deep
learning NLP systems that we have.
And sadly, those recurrent neural
network blocks are usually quite slow.
And unlike convolutional neural networks,
they can't be parallelized as easily.
And so the idea here is to basically
take the best and parallelizable
parts from RNNs and convolutional
neural networks, respectively.
And try to combine them in one model,
and this resulted in
the Quasi-Recurrent Neural Network by
James, Stephen and Caiming and me.
And this is essentially the description
of this quasi-recurrent neural network.
So, in general,
the very first layer of an LSTM,
where you just pipe each
single word vector through it,
you might be able to parallelize.
But then as soon as you actually
take into consideration
the previous time step
h_{t-1} in your LSTM cell,
you have to wait until that's computed
before you can compute your new one.
And so you can't parallelize that.
On the other hand, in the convolutional
neural network, you can parallelize
the convolution really well because it
only depends on two consecutive inputs.
But then,
with the max pooling you don't actually
get a hidden state at every time step.
But for many things like sequence
classification or identifying spans and
things like that,
we would actually like to have such a
hidden representation at every time step.
And so the idea of the QRNN, on a high
level, is to have
a parallelizable convolutional layer,
and then have a parallelizable
element-wise pooling layer
that just looks independently
at each feature dimension and
computes these gates that we already know.
So in some ways it
combines the CNN that we
looked at with the gates
of an LSTM.
And so we can write this as
a very simple description, right?
This is something that
should look familiar to you.
But instead of having x_t and h_{t-1} here,
you just have x_{t-1} and x_t.
So you don't have to wait until you've
computed the previous hidden time step.
You're just making these gating decisions
based on two consecutive word vectors.
And you have multiple layers of these so
this is just the first layer here.
And you basically just have
a standard neural network.
It's not recurrent, just concatenating
two input vectors at a time.
And you sum up after that and
you have tanh or sigmoids,
depending on what kind of gates you have.
So now, this you can rewrite as
a convolutional operator, where
you have a set of weights
W_z over your input X.
You just basically multiply;
it's a pretty direct computation.
Once you write it like this, you can also
think of larger filter sizes, or windows.
You could actually have x_{t-2}, x_{t-1}, and
x_t at each time step, for instance.
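For reference, here is a reconstruction of those equations, roughly as they appear in the QRNN paper, for a filter of width 2; larger windows just add more x_{t-k} terms:

```latex
% Gates and candidate vectors from a width-2 window, per time step:
\begin{aligned}
z_t &= \tanh\bigl(W_z^{1} x_{t-1} + W_z^{2} x_t\bigr) \\
f_t &= \sigma\bigl(W_f^{1} x_{t-1} + W_f^{2} x_t\bigr) \\
o_t &= \sigma\bigl(W_o^{1} x_{t-1} + W_o^{2} x_t\bigr)
\end{aligned}
% Or, written as convolutions over the whole input sequence X:
Z = \tanh(W_z * X), \qquad F = \sigma(W_f * X), \qquad O = \sigma(W_o * X)
```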
Does this make sense as an operator?
Can't you just
compute the gates at each time step?
So the question is, you're
splitting a cell and
then you're parallelizing
across each dimension.
>> I don't see what's parallel about this.
>> So good question.
So why's this parallel and
why can we parallelize this?
Let's say you have these
five word vectors here,
x1, x2, x3, x4, and x5.
Now, at each time step what you do
is you basically take two as input.
Take these two as input and
you compute a vector such as z, all right.
And now, you do this basically
for all these pairs.
Basically, you just move one over each time.
Now, the reason we can parallelize this
is basically because we can build
a large matrix that just has x1 x2 concatenated,
and then x2 x3, and x3 x4, and so on.
We can basically preprocess
our input into this format and
then just multiply that same
weight matrix with all of those.
And hence we can parallelize all of them.
None of these computations depend
on the previous hidden state.
So that's why this can be parallelized
across the time dimension by
basically just smartly
preprocessing the input.
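Here's a minimal numpy sketch of that preprocessing trick, just my own illustration with made-up sizes, not the actual QRNN code: stack each pair [x_{t-1}; x_t] into one matrix, and then a single matrix multiply computes the candidate vector z_t for every time step at once.

```python
import numpy as np

T, d, h = 5, 4, 3                # sequence length, input dim, hidden dim
X = np.random.randn(T, d)        # word vectors x_1 ... x_T
W_z = np.random.randn(2 * d, h)  # weights for a filter of width 2

# Prepend a zero vector so x_0 is defined, then build rows [x_{t-1}; x_t].
X_pad = np.vstack([np.zeros((1, d)), X])
pairs = np.hstack([X_pad[:-1], X_pad[1:]])   # shape (T, 2d)

# One big matrix multiply gives z_t for all time steps in parallel:
Z = np.tanh(pairs @ W_z)                     # shape (T, h)
# F and O are computed the same way, with a sigmoid and their own weights.
```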
And then, the element-wise gate here can
also be parallelized across channels.
So all of these are just element-wise
multiplications of these gates and
of the hidden states.
And so you just multiply;
let's say you have 100 features, then
100 of these computations at each time step can
be done independently of one another.
So the first dimension of h_t here
can be computed independently of the
second dimension and the third, and so on.
The ith feature channel is
independent of all
the other feature channels.
Yeah?
That's right, so here you can
parallelize this part across time,
but this one only across feature channels.
So here,
the ith element of h_t depends
only on the ith elements of f, h, and z.
But now you parallelize differently: you
parallelize across the feature channels,
not across time.
So you basically parallelize here,
parallelize this.
Then once you have all of these,
you parallelize this again and
you can parallelize this again.
But you have to wait,
you can't compute the third layer before
you compute the first and the second.
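To make that concrete, here's a small self-contained numpy sketch of the pooling step, again my own illustration following what I read as the paper's f-pooling: the loop over time is sequential, but everything inside it is element-wise over the feature channels, so those all happen in parallel on the hardware.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

T, d, h = 5, 4, 3                                # sequence length, input dim, hidden dim
X = np.random.randn(T, d)                        # word vectors x_1 ... x_T
X_pad = np.vstack([np.zeros((1, d)), X])
pairs = np.hstack([X_pad[:-1], X_pad[1:]])       # rows [x_{t-1}; x_t], as before

Z = np.tanh(pairs @ np.random.randn(2 * d, h))   # candidate vectors, parallel over time
F = sigmoid(pairs @ np.random.randn(2 * d, h))   # forget gates, parallel over time

# f-pooling: sequential over t, but each step is purely element-wise over
# the h feature channels (and over the batch), so those run in parallel.
H = np.zeros((T, h))
h_prev = np.zeros(h)
for t in range(T):
    h_prev = F[t] * h_prev + (1.0 - F[t]) * Z[t]
    H[t] = h_prev
```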
So what's great about this is it turns
out to sometimes actually be better
than an LSTM for a couple of parameter
settings and tasks that we ran.
And it's certainly a lot faster,
especially once it's implemented
properly with cuDNN kernels and
CUDA kernels and you really dig in.
If you just kind of multiply it in Python,
you might not be able to optimize
the architecture as well.
And you won't get these
kinds of speed ups.
So depending on your batch size,
each of your mini-batches, and
depending on the sequence lengths,
you can get up to sort of 16x speed ups.
But if you have very large batch sizes and
very short sequences, then of course,
that parallelization will buy you less and
you'll only get a 1.4x speedup or so.
When you look at how much
of the computation for
this kind of model is now spent
on what kind of operation,
what's amazing is that for the QRNN,
the recurrent types
of multiplications and computation
are actually a very small fraction now.
This is language modeling, so
we have a large vocabulary in our softmax.
The majority of the time here is spent
on the softmax classification only.
And then there's a little bit of just
optimization overhead, getting things onto
the GPU, reading and getting the word
vectors, and all of that stuff.
Sometimes, they're also sort of easier to
interpret cuz we now have these independent
feature dimensions.
And so I can actually go into this demo,
okay, can't see it,
where we visualize this.
And this is also
kind of nice to have.
You don't have to do this,
but if you have extra time and
you already have good
performance on your models,
it's always nice to have some
interactive plots to play with.
If you have your question
answering system figured out,
you can write a little JavaScript or
maybe even just a command line thing.
Type in a question and
see what the answer is, give it a new
kind of input, a new kind of document.
See if you can break it, how it breaks or
what the activations look like and
things like that.
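For instance, something as minimal as this little command-line loop is enough (purely illustrative; answer_question here is a hypothetical stand-in for whatever interface your own trained model exposes):

```python
# Minimal command-line demo loop for a QA model.
# `answer_question(document, question)` is a hypothetical wrapper around
# your own trained model; swap in whatever interface you actually have.

def answer_question(document, question):
    # Placeholder: run your model here and return (answer_span, score).
    return "answer", 0.0

if __name__ == "__main__":
    document = input("Paste a document: ")
    while True:
        question = input("Question (blank to quit): ").strip()
        if not question:
            break
        answer, score = answer_question(document, question)
        print(f"  -> {answer}  (confidence {score:.2f})")
```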
So here we basically had a QRNN
trained on sentiment analysis.
And then looked at the activations
as it goes over a very large and
very long document.
So here you can see what is
there to say about this movie,
this movie is simply gorgeous.
So once you hit gorgeous, you see that
several of the neurons really strongly
change in their activation,
right at this location here.
Simply gorgeous,
a true feast for the eyes.
And now, some of these hidden activations
stay the same no matter what
other sort of content there is.
So, the game set the standard for
3D role playing games seven years ago,
this movie sets the standard for
future CG movies.
And then you've got these trailers,
blah blah blah.
So this is kind of this idea that I
mentioned in the very beginning about
having these different gates, and
now this is for sentiment,
and nothing changes much while it
just kind of talks about content. But
then when the movie does not disappoint,
some of the neurons will switch again.
And then there's another sort of seemingly
pretty important change in this review,
at "it's not exactly a bad story."
That doesn't sound super positive, so a lot
of these neurons again will switch around.
And then, at the end here, "I
recommend this movie to everyone," and
you see, okay, several of the neurons
again turning on very strongly and
it eventually classifies this
as a positive review.
So those are super nice to have if you
can try to visualize your model that way.
And now in the last five
minutes I want to talk about a
very recent paper from Quoc Le,
also a graduate
from Stanford, now in the Google Brain
team, actually a founding member of it.
Working with Barret Zoph here, where,
basically, they realized, and this is
a very good insight, that more and more
of our time as researchers is spent on
creating complex neural network architectures.
And in some ways, ideally,
it would be better if we could
actually have an AI select the right
architectures for what we do.
And again, putting more A back into AI and
having less human ingenuity and
human architecture design.
So in some ways, and this is kind of an
introspective sort of end thought also for
the class.
We've moved from something in
the beginning, where we said we do all this
feature engineering back
in the day in the field.
And now look, everything is much better
cuz now we have these architectures that are
end-to-end trainable and
they learn all these features.
But as we're now trying to improve numbers
and performance more and more, and
create new kinds of capabilities for
our deep learning NLP models,
we catch ourselves designing
architectures more and
more, just like we used to
design features.
We're humans, we want to use our
intelligence in some positive way.
And so basically, the idea here is
to use artificial intelligence to
find the right architecture for
a whole host of different kinds of
problems or for a very specific problem.
And without going into too many details,
the main basic
controller that we'll have is also
going to be a recurrent neural network.
But the outputs of that recurrent
neural network are actually
architecture hyperparameters,
if you will.
So how many hidden layers should I have,
or how big should my hidden layer be?
At each time step of
the recurrent neural network,
I will output those kinds of features.
And then whenever it outputs that,
it will try to train a child network
with that kind of architecture to get
a certain accuracy on a single task.
And then feed that back.
Now it's very hard to actually,
it's not differentiable, because
you make these various discrete kinds
of decisions about the whole architecture,
so they use reinforcement learning,
which we haven't really covered in class.
And in the last two minutes of class we're
not going to be able to really do it any
justice; it's just a different learning
regime than standard backpropagation.
This is kind of what these outputs would
look like at one time step if we trained
it to generate a CNN.
It might predict a number of filters, and
then the filter height and
filter width and the stride size, how far
you skip to the next words,
or in computer vision, how many
pixels you jump over, and so on.
So basically, this kind of model
selects its own architecture for
the specific problem whose
performance you set as its reward.
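In spirit, the controller loop looks something like this: a toy sketch of policy-gradient architecture search, not their actual code, and the reward function here is only a stand-in for "train the child network and measure validation accuracy."

```python
import numpy as np

# Discrete choices the "controller" can make for the child architecture.
hidden_sizes = [64, 128, 256, 512]
num_layers = [1, 2, 3]

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def child_reward(hidden, layers):
    # Stand-in for: build the child network, train it, return validation accuracy.
    return np.random.rand()

# Controller parameters: one logit vector per decision (the real controller is an RNN).
logits = {"hidden": np.zeros(len(hidden_sizes)), "layers": np.zeros(len(num_layers))}
lr, baseline = 0.1, 0.0

for step in range(200):
    # Sample an architecture from the controller's current policy.
    p_h, p_l = softmax(logits["hidden"]), softmax(logits["layers"])
    i_h = np.random.choice(len(hidden_sizes), p=p_h)
    i_l = np.random.choice(len(num_layers), p=p_l)

    # Train the child and treat its accuracy as the reward.
    R = child_reward(hidden_sizes[i_h], num_layers[i_l])
    baseline = 0.9 * baseline + 0.1 * R

    # REINFORCE update: push probability toward choices that beat the baseline.
    for key, idx, p in [("hidden", i_h, p_h), ("layers", i_l, p_l)]:
        grad = -p
        grad[idx] += 1.0
        logits[key] += lr * (R - baseline) * grad
```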
And, remember when I told you the numbers
are getting better and better,
this is again that data set on language
modeling where just a couple months ago,
we were super excited to have these
pointers, when we got to 70.
And then we were super excited cuz
we tied the word vectors, and
we got to 66, or a 68.
Now, with this incredible
new idea, though actually
human ingenuity is still important,
they also share the embeddings.
So it's sharing word vectors and
softmax outputs.
It's not something that the model could
have ever predicted, and it's something
that helps a lot; but in general,
they choose different kinds of cells.
Instead of LSTMs, they learn what kinds of
cells should be used at every recurrent
time step and with that,
get to an amazing 62 perplexity.
It's really incredible how quickly
these numbers have plummeted.
And just when you thought you had a good
intuition of why LSTMs work well,
this is basically the kind of architecture
that this model comes up with.
And there is no more, like, this gate does
this, and this is what happens when that gate fires.
The model kind of figured
out how to do it well and
in the end did it incredibly well.
All right, so basically there
are still a lot of limits that we need
to tackle as a field; we still can't do
general-purpose question answering.
We still can't really do
complex multitask learning,
where we might have the same architecture,
do machine translation and
question answering, and
sentiment analysis, and so on.
They're all still very
specialized architectures.
We don't have systems that
can do multi-modal reasoning.
So over images or
speech together with logical reasoning,
together with memory based retrieval,
there's still a lot of work to be done.
And really, all the systems right
now require us to have tons of data,
but I can introduce a new word to you
like Jane hit John with an uftcha.
I just made up the word uftcha, but now
you can make lots of assumptions
about how heavy it could be, that it's a
physical object and not a mental concept,
how big it would be, and draw
all these different logical kinds of
conclusions from a single example.
All the systems we've looked at in this
class require a ton of different kinds of
examples and lots of statistical patterns
that we essentially need to match.
All right, with that congratulations,
you've made it.
Good luck on your last couple
of days of your projects.
Thanks to all the TAs, thanks to Chris.
Good time, good luck.
>> [APPLAUSE]
