The plan for today is to talk about convolutional neural networks. There's actually quite a lot of content in this lecture that's good to know about, since essentially this is convolutional neural networks for NLP in one large bite.
So: a bit on announcements, then the general idea of convolutional neural networks, and then for quite a bit of it I want to go through, in some detail, two particular papers that made use of convolutional neural networks for text classification, or sentence classification, tasks. The first is a pretty simple CNN from 2014, and the second is a way more complex CNN that was done much more recently, in 2017.
Okay. But first, a couple of announcements.
Firstly, a last reminder on the mid-quarter feedback survey. Tons of you have done this already; thank you very much. But if you're still putting it off till the very last minute, tonight at midnight is your last chance to fill in the mid-quarter survey, to give us feedback and to get your half-a-point.
Okay. And then the other thing that you should be thinking about, and I know lots of you are thinking about since I spent three hours talking to people yesterday, is final projects. So make sure you've got some plans in place for 4:30 p.m. Thursday.
In particular, as we've discussed, part of what you're meant to do this year is to have found some research paper, have read it, and have a summary and thoughts as to how it can inform your work. And then just make sure you have in your calendars the final project poster session for CS224n, which is going to be in the evening of Wednesday, March 20th, at the Alumni Center.
Okay. One more announcement, or just general stuff to cogitate on. We're now officially in the second half of the class. Congratulations. There are still a few things that we want to teach you that are basic, and convolutional neural networks is actually one of them. But nevertheless, in the second half of the class, things start to change, and we're hoping to do much more to prepare you for being real deep learning NLP researchers or practitioners.
And what does that mean concretely? Well, the lectures start to be less about giving every detail of how to build a very basic thing, and more about giving you some exposure to the work that's been done in different areas. To the extent that something is of interest or relevant to a project, the hope is that you can take some initiative to find out more about the things being talked about. I'd also really welcome any questions about things that people would want to know more about.
The other thing that you should know about deep learning is that, once we get past the fundamentals, a lot of the stuff we teach just isn't really settled science. Most of what I'm teaching in the second half of the course is pretty much what people think is good practice in 2019. But the fact of the matter is that what people think is good practice in deep learning has been changing really rapidly. If you go back even two years, or definitely if you go back four years, there are a lot of things that people used to believe, and now people have different ideas as to what works best. And it's perfectly clear that come 2021 or 2023, there will be different ideas again as to what people think is best. So you just have to accept that this is a nascent, rapidly emerging field: it's good to understand the fundamentals and how things fit together, but after that, quite a bit of the knowledge is "this is what people think is good at the moment," and it keeps evolving over time. If you want to stay in the field, or keep doing things with deep learning, you have to keep up with how it changes.
It's called lifelong learning these days.
It's a very trendy concept.
As well as the lectures, this is also true for the assignments. We've been trying to make the assignments so that they started off very introductory and gradually used less scaffolding, and we're hoping to continue that, with less hand-holding, in assignment five. What we're hoping to do is prepare you both for the final project and for real life.
I was making an analogy this morning, comparing this to the intro CS sequence: CS106A and B have tons of scaffolding, and then in CS107 you're meant to learn how to diagnose and solve problems for yourself in a debugger. It's kind of the same for neural networks. For the early assignments, we've given you every bit of hand-holding: here are all these tests to make sure every little bit is okay, and here's exactly how to structure things. But in the real world, you're only going to be able to build and use neural networks if you can figure out why they're not working and what you have to change to make them work. And the truth is, as I talked a bit about last week, that's often well more than half of the job. It seems easy enough to stick down "here's my neural net" and the pieces that make sense to you, and then you can spend the remaining 80 percent of the time scratching your head wondering why it doesn't actually work well, and how you could change it to make it work well. So, I confess that debugging neural nets can often be hard, but the goal is that you should actually learn something about doing it, and that's one of the learning goals of the course when it comes down to it.
A final little advertisement: if you feel like you'd like to read a book, just out this week there's a new book on natural language processing with PyTorch by Delip Rao and Brian McMahan. Delip actually lives in San Francisco. If you want to, you can buy a copy, of course. But if you don't want to buy it and you feel like having a bit of a look through it, the Stanford library actually has a license to the O'Reilly Safari Books collection, so you can start off at library.stanford.edu and read it for free. There's one catch, which is that the library only has 16 simultaneous licenses to Safari Books. So if you'd also like your classmates to be able to read it for free, it really helps if you remember to log out of Safari Books Online when you're done looking at it.
In some sense, I hope that if you look at this book you will feel, "Boy, I already know most of that stuff. It's not a super advanced book, but it's a good, well-written tutorial on how to do things with PyTorch and NLP." If you don't feel like you know most of the stuff in this book, you can let me know, but I will be a little sad.
Okay, so, starting into today. We spent a lot of time on recurrent neural networks, and they are great for many things. But there are some things that they're not so good at. We might like to know about a phrase like "my birth," or a bigger phrase like "of my birth," and there's no independent representation of those spans in a recurrent neural network; we only get prefixes of a whole sentence. And while we did bidirectional recurrent neural networks, and you could say, "Well, wait a minute, you could use it in both directions," and to some extent that's true, we can get stuff from one direction and stuff from the other, but we still only have whole sequences that go to one end of the sentence or the other. We don't just have pieces of sentences.
And often we'd like to work out the meanings of pieces of sentences, so we have two problems here. We only have initial and final sub-sequences. And also, if you look at these representations, say you take the last state as the representation of the meaning of the text: what you find is that it's very dominated by the meaning of the most recent words and what they are trying to predict as to what comes after them. That's part of the reason why I mentioned, last time in the question answering lecture, the idea that you can do better by having a sentinel and training something that has attention over the whole LSTM structure.
Okay. But today we're going to look at a different alternative, which is convolutional neural nets, often abbreviated as either CNNs or ConvNets. And the idea of these is: well, maybe we could just take every sub-sequence of a certain length and calculate a representation for it. So if we have some piece of text like "tentative deal reached to keep government open," we could just take all three-word sequences (tentative deal reached, deal reached to, reached to keep, et cetera) and calculate some kind of representation for each of those sequences. This isn't a strongly linguistic idea, right? We're not worrying about whether it's a coherent phrase that's grammatically valid or cognitively plausible; we're just taking every sub-sequence of a certain length. And then, once we've calculated representations of those, we're going to look at how to group them.
Okay. So let's get into more detail as to what CNNs are and how they work. There's the general idea of a convolution, which you may or may not have seen in some math or electrical engineering class, and then there's a particular version, the discrete convolution, which means that you can use the friendly summation symbol rather than an integral. I find that notation completely unhelpful, so I won't even try to explain it. But I've got lots of examples, and convolutions are really easy to understand from examples of what they do in neural nets.
All right, so the classic case where convolutional neural networks are used is in vision applications. If you do CS231N next quarter, essentially the first four weeks are all about doing convolutional neural networks in all their variants and glory. The essential idea of convolutions for vision is that you want to recognize things no matter where they appear in an image. So you have a property of translation invariance, and the idea of a convolution is a way of finding something in different places in the image, regardless of where it appears.
This is the vision example, which I stole from Andrew Ng's UFLDL website. What a convolution is, is here a patch, but you can think of it as just a vector, and the patch has weights, which are these little numbers in red. What you're going to do is slide that patch over the image, as this animation does. At each position, you multiply each of the red numbers by the black number in that position, and then you sum them up. That's what a discrete convolution does, which is what that notation at the top is saying, right? You're multiplying things together and then summing them up, and you're filling in the pink matrix with the sum-products. So it's like you're taking these patch dot products and putting them into the pink matrix, and that's then your convolved feature.
So, that's a 2D convolution,
which for the rest of today,
we're not going to look at anymore.
So, this is all you're learning about vision.
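To make the sliding-patch arithmetic concrete, here is a minimal pure-Python sketch of a 2D discrete convolution. The image and kernel values below are made up for illustration, not taken from the slide.

```python
# Slide a small patch (the "kernel") over an image, take the elementwise
# product at each position, and sum. No padding, stride 1, so a 5x5 image
# with a 3x3 kernel yields a 3x3 convolved feature.

def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # patch dot product: multiply overlapping entries, then sum
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out

image = [[1, 1, 1, 0, 0],
         [0, 1, 1, 1, 0],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 0],
         [0, 1, 1, 0, 0]]
kernel = [[1, 0, 1],
          [0, 1, 0],
          [1, 0, 1]]

feature = conv2d(image, kernel)
print(feature)  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

Each entry of `feature` is one "patch dot product" of the kernel against the image at that position.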
And so we're now going to go back and look at 1D convolutions, which is what people use when applying convolutional neural networks to text. The starting point of a convolutional neural network for text is that we have an input. So here's my sentence, and for each word in the sentence I have a dense word vector. I made it 4-dimensional to keep the example small, but usually, as you know, it's more. So our starting point is that we have some input; the input could just be a one-hot encoding, that's not forbidden here, but normally we'll have these kinds of dense word vectors. And then it's the same as the 2D case, apart from the fact that we've only got one dimension. So we have a filter, and our filter is going to cover three steps in time, that is, three words.
And it's going to work across the dimensions. These different dimensions in the convolutional neural network often get referred to as channels, so we're working across the input channels, and we have a patch like this. We're going to take this patch and put it on top of the first three words. I don't have as good an animation as the previous slide, sorry. And we're going to work out the dot product between those; I did that at home by putting this into Excel. And the answer [LAUGHTER] is that the dot product is minus 1.0. Then we slide this matrix, which gets referred to as a kernel or a filter (the patch that we're using for our convolutional neural network), down one and take the dot product of those terms again. That comes out as minus a half, and we keep on sliding it down and get what's shown on the right as our output.
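The sliding computation just described can be sketched in a few lines of pure Python: a kernel-size-3 filter slides down a sentence of 4-dimensional word vectors, taking a dot product across all three words and all four channels at each position. The word vectors and filter weights below are made up for illustration.

```python
# One filter slid down a sentence: each output is the dot product of the
# filter with a window of k consecutive word vectors.

def conv1d_single_filter(words, filt):
    k = len(filt)  # kernel size, here 3
    out = []
    for i in range(len(words) - k + 1):
        s = sum(filt[j][c] * words[i + j][c]
                for j in range(k) for c in range(len(words[0])))
        out.append(s)
    return out

# 7 words, each a 4-dimensional vector (4 input channels)
sentence = [[0.2, 0.1, -0.3, 0.4],
            [0.5, 0.2, -0.3, -0.1],
            [-0.1, -0.3, -0.2, 0.4],
            [0.3, -0.3, 0.1, 0.1],
            [0.2, -0.3, 0.4, 0.2],
            [0.1, 0.2, -0.1, -0.1],
            [-0.4, -0.4, 0.2, 0.3]]

# one filter of kernel size 3 over 4 channels
filt = [[0.3, 0.2, 0.1, -0.1],
        [0.0, 0.2, 0.0, 0.3],
        [0.1, -0.1, 0.2, 0.0]]

out = conv1d_single_filter(sentence, filt)
print(len(out))  # 5: the 7-word sentence shrinks to 5 positions
```

Note how the output has only 5 entries, which is exactly the shrinkage discussed next.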
So at this point, we've just reduced the sentence to a single vector. It seems like we might want to do more than that. But the other thing you will have noticed is that our sentence has shrunk: before, we had a seven-word sentence, but because I've slid this three-word kernel down it, I ended up with only five positions to put it in, so it's become a five-position thing.
To address that problem, commonly when people do convolutional neural networks, they add padding. So what I can do is add zero padding at both ends, then do the same trick and run a convolution on that. Now I'll be able to put my size-three filter in seven different places as I slide it down, so I'm getting out a vector that's the same length as my input. There are different ways of doing things, but this is the most common way, and it seems logical because it maintains size. There's always more than one way to do it, though: if you really wanted to, you could have two steps of padding on both ends, so that your first convolution would be looking at zero, zero, tentative, and then the convolution would actually grow the size of your input.
Yeah, but yes. So what we've done so far: we've started with these word vectors which, in convolutional neural network terms, were of length four, so our input had four channels. But back here, we were producing just one column of output from this kernel, so our output has only a single channel. We've shrunk things in the columns direction from four to one. That might seem bad, and for many purposes it is bad. So a lot of the time, what you want to do is say: well, rather than have only one filter, why don't I have several filters?
So here I've got three different filters, and each of these filters is a matrix of the same size: the kernel size (three) times the number of input channels. I have three different filters, and I'm going to run each one down the text and get a column out. So now I'm ending up with three columns of output, and I have a three-channel output. The way to think of this intuitively is that, as with everything in neural networks, we're going to learn these filters by backpropagation, but our hope is that the filters can somehow specialize in different things. So maybe this filter could specialize in "is this language polite?" and produce a high value whenever it sees polite words. And maybe this filter could specialize in, I don't know, eating, and have a high value whenever it sees words about food, and this filter will do a third thing. That's the sense in which people sometimes talk about the output as different features: your hope is that you'll get different latent features coming out of the text.
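The three-filter picture can be sketched as follows, with each filter producing one output column; the vectors and weights here are made up for illustration, and the feature names in the comments are just the hypothetical "politeness"/"food" story from above.

```python
# With three filters instead of one, each filter is slid down the text
# independently, giving three output columns (three output channels).

def conv1d(words, filters):
    k = len(filters[0])
    cols = []
    for filt in filters:
        col = [sum(filt[j][c] * words[i + j][c]
                   for j in range(k) for c in range(len(words[0])))
               for i in range(len(words) - k + 1)]
        cols.append(col)
    # transpose so each row corresponds to one window position
    return [list(row) for row in zip(*cols)]

sentence = [[0.2, 0.1], [0.5, 0.2], [-0.1, -0.3],
            [0.3, -0.3], [0.2, -0.3]]  # 5 words, 2 input channels
filters = [[[0.3, 0.2], [0.0, 0.2], [0.1, -0.1]],   # hoped: "politeness"
           [[-0.1, 0.0], [0.2, 0.1], [0.0, 0.3]],   # hoped: "food"
           [[0.2, -0.2], [0.1, 0.0], [-0.3, 0.1]]]  # a third feature

out = conv1d(sentence, filters)
print(len(out), len(out[0]))  # 3 window positions x 3 output channels
```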
Okay. So that gives us a representation, which is useful as a set of learned features over our text. Quite often, though, what we'll want to do is just summarize the text with respect to those features. You might just have the question: in this piece of text, is it polite, and does it talk about food? So another operation we'll quite often want to do is summarize the output of a convolutional network, and the simplest way to do that, for 1D convolutions, is called max pooling over time. If we max pool over time, then for each of the channels, otherwise known as features, we simply look down its column and see what its maximum value is: 0.3, 1.6, 1.4.
So if I use my story about the first two filters, it's saying: well, it's not very polite text, but it's really about food. We're summarizing what we've detected there. The concept of max pooling in some sense captures: is this thing being activated anywhere? So for things like politeness and being about food, the output of max pooling will have a high value if somewhere in the sentence there was a clear marker of politeness, or something clearly about food. That's often a useful notion, because often what you want to know is: is there some discussion of food in this sentence or is there not?
There are other things that you could do. Instead of max pooling, you can do average pooling: you just take these numbers and find their average. That has different semantics, which is something like: what's the average amount of politeness of this text, or what fraction of the sentence is about food? For some purposes this is better, because it takes every value into account in the average. But a lot of the time, people have found that max pooling actually works better, because a lot of signals in natural language are sparse. No matter how polite you are trying to be, you're not going to be polite in every word. You're going to use nouns and articles and prepositions and conjunctions, none of which are inherently polite. So if some politeness shows up prominently, the sentence becomes polite, and max pooling is better for capturing that. Of course, the one other thing you can do is min pooling, and find the least [LAUGHTER] active thing. It doesn't get used much, but you could do that as well.
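The three pooling variants can be sketched directly; the conv output below is made up, but its column maxima are chosen to match the 0.3, 1.6, 1.4 values mentioned above.

```python
# Pooling over time: one column per channel/feature, one row per window
# position. Max pooling keeps each channel's largest value, average
# pooling its mean, min pooling (rarely used) its smallest.

conv_out = [[-0.2, 1.6, 0.1],
            [0.3, 0.6, 1.4],
            [-1.0, 0.4, -0.5],
            [0.1, -0.3, 0.2]]

channels = list(zip(*conv_out))  # one tuple of values per channel

max_pooled = [max(ch) for ch in channels]            # "detected anywhere?"
avg_pooled = [sum(ch) / len(ch) for ch in channels]  # "how much on average?"
min_pooled = [min(ch) for ch in channels]            # least active value

print(max_pooled)  # [0.3, 1.6, 1.4]
```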
Okay. So if you're in PyTorch, this is all pretty easy stuff to do. There's a handy-dandy Conv1d; there's also a Conv2d, as you might guess, for vision. With Conv1d, you specify how many input channels there are (that was our word embedding size), how many output channels there are (we had three), and what the size of the convolutional kernel is (the ones we were showing were also three); then there are various other parameters you can set, like saying that you want a padding of one. Once you've got one of those, you can just run your convolutional filter on the input to get a new hidden state. And then if you want to max pool, you can just take the max over the output of that, and you've got a max-pooled output.
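Putting those PyTorch pieces together, here is a minimal sketch with the sizes of the running example (4-dimensional embeddings, three filters of kernel size three, padding of one); the weights are randomly initialized, so only the shapes are meaningful.

```python
# nn.Conv1d expects input of shape (batch, channels, sequence_length).
import torch
import torch.nn as nn

conv = nn.Conv1d(in_channels=4, out_channels=3, kernel_size=3, padding=1)

batch = torch.randn(2, 4, 7)   # 2 sentences, 4-dim embeddings, 7 words
hidden = conv(batch)           # (2, 3, 7): padding=1 preserves the length
pooled = torch.max(hidden, dim=2).values  # max pool over time: (2, 3)

print(hidden.shape, pooled.shape)
```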
Okay, so that gives us the basics of building a convolutional neural network for NLP. Does that make sense so far? Yeah. Okay. So the next bit is to show you three or four other things that you can do. I started off titling these slides "other less useful notions," because I thought they don't really come up much in NLP. But actually, it turned out that when I got to that second paper, the complex convolutional neural network, they try out just about all of these things that I said no one uses. So it's good to know what they are for looking at various papers.
So far, when we were calculating these convolutions, we were trying them out at every position: we had one for zero, tentative, deal; then tentative, deal, reached; then deal, reached, to. We were just walking down one step at a time, which is referred to as a stride of one, and that's by far the most common thing to do. But you could observe: wait a minute, since the first convolution covers zero, tentative, deal, I've got all those three words in there. Even if I skip down to deal, reached, to, and then to, keep, government, I'd still have every word of the sentence in one or another of the convolutions, so I can do half as much computation and still have everything in there in some sense. That's referred to as using a stride of two, and then I get something with half as many rows out. So it's one way to compactify your representation and produce something shorter from a longer sentence, and we'll see that use of it coming up later.
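A stride-2 convolution can be sketched as the same dot-product computation with the window jumping two positions at a time; the vectors and weights below are made up for illustration.

```python
# The same 1D convolution, but the window advances `stride` positions
# each step, so stride 2 roughly halves the number of output rows.

def conv1d_strided(words, filt, stride=2):
    k = len(filt)
    out = []
    i = 0
    while i + k <= len(words):
        out.append(sum(filt[j][c] * words[i + j][c]
                       for j in range(k) for c in range(len(words[0]))))
        i += stride
    return out

words = [[0.1 * n, 0.2] for n in range(8)]  # 8 positions, 2 channels
filt = [[0.5, 0.1], [0.2, 0.0], [-0.3, 0.4]]

print(len(conv1d_strided(words, filt, stride=1)))  # 6 outputs with stride 1
print(len(conv1d_strided(words, filt, stride=2)))  # 3 outputs with stride 2
```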
There are other ways to compactify the representation that comes out of your sentence, and one is a different notion of pooling, which is local pooling. If you've seen any of the vision world, when people talk about max pooling in vision, they normally mean local pooling, as opposed to the max pooling through time that I showed you first. So here we're back to where we started, and we've done our size-three, stride-one convolution, which produces output as before. But now what I'm going to do is local pool with a stride of two, which means I'm going to take each two rows and pool them together into one row, and I could do that again by either maxing or averaging or whatever appeals to me. So I take the first two rows, I max pool them, and I get this. I take the next two rows, I max pool them, and I get this. Next two, next two, and I pad at the bottom so the last pool also has two rows. That then gives me a local max pooling with a stride of two. And that has exactly the same effect, in a sense, but with a different result, as using a stride of two in my convolution, because I have again reduced what used to be eight rows to four rows.
You can picture that. Okay, so that's that one.
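Local max pooling with a stride of two can be sketched like this; the conv output rows are made up for illustration.

```python
# Local max pooling, stride 2: take each consecutive pair of rows of the
# conv output and keep the elementwise max, halving the number of rows
# (padding the bottom with -inf if the length is odd).

def local_max_pool(rows, pool_size=2):
    if len(rows) % pool_size:
        rows = rows + [[float("-inf")] * len(rows[0])]
    out = []
    for i in range(0, len(rows), pool_size):
        group = rows[i:i + pool_size]
        out.append([max(col) for col in zip(*group)])
    return out

conv_out = [[0.3, -0.2], [1.6, 0.1], [-1.0, 1.4],
            [0.5, -0.5], [0.2, 0.9], [-0.4, 0.0],
            [1.1, -0.7], [0.6, 0.8]]  # 8 rows, 2 channels

pooled = local_max_pool(conv_out)
print(len(pooled))  # 8 rows reduced to 4
```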
What else can you do? There are more things you can do to make it complex. Another thing that people have sometimes done is k-max pooling. This is a more complex thing, and it's saying: well, rather than just keeping the max over time, if a feature is being activated two or three times in the sentence, maybe it'd be good to record all the times it's activated while throwing away the rest. So in k-max pooling, and I'm doing 2-max here, you look down a column and find the two highest values for that column. But then you put the two highest values not in the order of highest to lowest, but in the order in which they appear in the column. So it's minus 0.2, 0.3 for this one and 1.6, 0.6 for this one, because that reflects their order in the columns up above.
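K-max pooling can be sketched as follows; the conv output is made up, but its columns are chosen so the 2-max results match the minus 0.2, 0.3 and 1.6, 0.6 values mentioned above.

```python
# k-max pooling: for each channel, keep the k largest values, but in
# their original order down the column rather than sorted by size.

def k_max_pool(rows, k=2):
    channels = list(zip(*rows))
    out = []
    for col in channels:
        # indices of the k largest values, then restore column order
        top = sorted(range(len(col)), key=lambda i: col[i], reverse=True)[:k]
        out.append([col[i] for i in sorted(top)])
    return out

conv_out = [[-0.2, 1.6],
            [0.3, 0.6],
            [-1.0, 0.4],
            [-0.5, -0.3]]

print(k_max_pool(conv_out, k=2))  # [[-0.2, 0.3], [1.6, 0.6]]
```

Note that the first channel's result is [-0.2, 0.3], not [0.3, -0.2]: the larger value came later in the column, and k-max pooling preserves that order.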
Okay, almost done; one more concept. This is another way of compressing data, which is a dilated convolution. Doing a dilated convolution right here doesn't really make sense, but where you can use one is if I take this output and put it through another convolutional layer: we can have deep convolutional networks with multiple convolutional layers. The idea of a dilated convolution is that you're going to skip some of the rows. So if you use a dilation of two starting at the top, you take the first, third, and fifth rows, multiply them by your filters, and get the values that appear here. Then, with a stride of one, you'd go on and do the next spread-out set of rows. This allows you to have convolutions that see a bigger spread of the sentence without having many parameters. You don't have to do things this way: you could instead have convolutions with a kernel size of five, and then they'd see five words in a row, but then you'd have bigger matrices to specify your features. Whereas this way, you can keep the matrices small but still see a bigger range of the sentence in one operation.
And that concept of how much of a sentence you see is an important notion in convolutional neural networks. If you start at the beginning of a sentence and you're just running size-three convolutions, you're seeing three-word patches of the sentence. It turns out that in natural language, that's already actually quite a useful representation, because having those kinds of n-grams as features is just good for many purposes, including text classification. But if you want to understand more of the semantics of a sentence, somehow you want to see more of it at once, and you've got several tools you can use for that. You can use bigger filters: a kernel size five, seven, or nine convolution. You can do something like dilated convolutions, so you see spread-out pictures. And the third thing you can do is have depth in your convolutional neural network, because as you have greater depth, you see more. At this first layer, the rows have info about three words in them. If you stuck a second convolutional layer of the same general nature on top of it, and took the first three rows and convolved them again, then those outputs would know about five words of your original input sentence. So as you have a deeper ConvNet stack, you start to know about bigger and bigger patches of the sentence.
Okay. All good? Any questions? No? That's good, okay. So the next piece essentially shows you this stuff again, in the context of a particular paper. This was a paper by Yoon Kim, who was a Harvard student, maybe still is, in 2014. So this was a fairly early paper. He wanted to show that you could use convolutional neural networks to do a good job of text classification when what you want to classify is a single sentence. The kind of thing you might want to do is look at the kind of snippets of movie reviews that you see on the Rotten Tomatoes site and say, "Is this a positive or a negative sentence description?" The model he built is actually kind of similar to the convolutional neural networks that Collobert and Weston introduced in their 2011 paper, which we mentioned before when we were talking about window-based classifiers. In their paper, they actually used both window-based classifiers and a convolutional classifier.
Okay, so yeah, I sort of already said this. The tasks are sentence classification: it could be sentiment, or it could be other things like, is this sentence subjective or objective? (Objective is what the main news articles are meant to be, and subjective is what the opinion pieces are meant to be.) And then other things like question classification: is this a question asking about a person, location, number, or whatever?
Okay, so here is what he did. These slides use the notation of his paper, which writes the math down a little differently from what I just showed you, but it's really doing exactly the same thing. We start with word vectors of length k. The sentence is made by just concatenating all of those word vectors together, and then a range of words is a subpart of that sentence vector. The convolutional filter is represented as a vector, because here he's flattened everything out into one long vector for the entire sentence, whereas I'd stepped down a matrix. So a size-three convolution is just a real vector of length hk: the size of the convolutional filter times the dimensionality of the words.
So what he's going to do to build his text classifier is use convolutions of different sizes: you can have size-two convolutions, size-three convolutions as shown here, and bigger convolutions. To compute a feature, one channel of our CNN, we take a dot product between the weight vector of the feature and the corresponding sub-sequence of the sentence; he also put in a bias, which I had omitted, and then put it through a non-linearity, which I wasn't doing either, but which we've seen a ton of. So that's our feature, and for a feature of kernel size three, we're going to compute it all the way through the sentence. One slightly funny thing he did, though, is that his windows are lopsided in the notation: there's a word and the h minus 1 words to the right of it. So he has padding just at the right end, whereas most people do their convolutions symmetrically in both directions. Okay. And so we're going to do that for a bunch of features, or channels, ci, and thereby compute our convolved representations, just as we've talked about.
Okay. Then he does just what we talked about: there's max-over-time pooling in the pooling layer to capture the most relevant things, giving us a single number for each channel, and we have features with different kernel sizes. Here's one other idea he used, which is possibly a neat idea; it's one of the things you could think about, in various ways, for, say, a question answering system, among other things. He used pre-trained word vectors, but what he did was actually kind of double the word vectors: for each word he had two copies of the word vector, so you have two channel sets, one set that he froze and the other that he fine-tuned as he trained. So he tried to get the best of both worlds of fine-tuning and not fine-tuning, and all of that went into the max pooling operation.
Okay. So after the max pooling, we get out one number for each channel. He has three convolution sizes (three, four, and five), with 100 features for each size, so we're getting out a vector of size 300 at that point. Then you take that final vector and just stick it through a softmax, and that gives you your classification over the classes.
Um, so all of that can be summarized in this picture if it's big enough to sort of read.
So, here's our sentence.
I like this movie very much,
which has you know, our word embedding dimension is five,
and so then doing it in this example,
we are having two channels for each kernel size and
we consider kernels of size two, three, and four.
Um, and so then we are getting two of each,
um, so we're getting, um, six.
This is showing six of our filters.
Um, so we apply those.
When we- when we apply those filters without any padding,
we are then getting out these outputs of the filters which are of sizes six,
five, and four respectively.
Um, and so then once we've got these
for each of these sets of numbers we're doing one max pooling.
So, we're just taking the max of each of these,
um, output features which gives us these six numbers.
Um, we can concatenate them all together into one vector which we feed into,
um, a softmax over two classes as to whether sentiment is positive or negative.
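The whole pipeline in that picture can be sketched end to end in a few lines of NumPy. This is a hedged toy version, with random embeddings and filter weights standing in for learned parameters: three kernel sizes with two filters each, max-over-time pooling to one number per filter, and a softmax over two classes.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv_feature(X, W):
    """Slide one filter of kernel size k = W.shape[0] over the sentence (no padding)."""
    k = W.shape[0]
    return relu(np.array([np.sum(X[i:i + k] * W)
                          for i in range(len(X) - k + 1)]))

rng = np.random.default_rng(1)
X = rng.normal(size=(7, 5))   # a 7-token sentence with 5-dimensional embeddings

pooled = []
for k in (2, 3, 4):           # three kernel sizes
    for _ in range(2):        # two filters per size, six filters total
        W = rng.normal(size=(k, 5))
        out = conv_feature(X, W)      # output lengths 6, 5, 4 (no padding)
        pooled.append(out.max())      # max-over-time pooling: one number per filter

z = np.array(pooled)                  # concatenated feature vector of size 6
W_soft = rng.normal(size=(2, 6))      # softmax weights for two sentiment classes
logits = W_soft @ z
probs = np.exp(logits) / np.exp(logits).sum()
```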
Um, so that's basically the model.
So this is sort of really a very simple,
very computationally efficient, uh,
model for how to build a text classifier.
[NOISE] Um, yeah, just a couple more things to get through,
um, so in one of the assignments,
we talked about Dropout [NOISE] and you used it.
So, um, you know,
hopefully you're all masters of Dropout at this point.
Um, so he was using Dropout, um,
and this being 2014 and the,
um, Dropout paper only coming out in 2014.
I guess, there'd been an earlier version that came out a couple of years earlier.
This was sort of still fairly early,
um, to be taking advantage of Dropout.
So while training,
you've got this sort of Dropout vector, um,
where you sample Bernoulli random variables which are, sort of,
designed to drop out some of the features each time you are doing things.
At testing time, you don't do the dropout,
but because before you were sort of dropping out a lot of stuff,
you're scaling your weight matrix by the same probability that you use for dropping out,
so that you get, sort of,
vectors of the same scale as before.
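A small sketch of that train-versus-test asymmetry, assuming a keep probability of one half (as in the paper); the variable names are made up:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 0.5                                    # keep probability, one half as in the paper
h = rng.normal(size=100_000)               # stand-in for a layer's feature vector

# Training: sample a Bernoulli mask and zero out the dropped features.
mask = rng.random(h.shape) < p
h_train = h * mask

# Testing: no mask, but scale by the same probability p so the activations
# have the same expected scale as the masked ones seen during training.
h_test = h * p
```

In expectation the masked training activations and the scaled test activations have the same magnitude, which is the point of the rescaling.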
Um, so as we sort of discussed in the assignment,
Dropout is a really effective form of regularization,
widely used in neural networks.
Um, he didn't only do that; he actually did
another, kind of funky, form of regularization.
So for the softmax weight vectors,
he constrained the L2 norms
of the weight vectors in the softmax, [NOISE] um,
matrix, um, to a fixed number s,
which was one of the hyper-parameters,
actually set to the value three.
Um, and if your weights were getting too large,
they were being rescaled,
um, so they didn't blow up.
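That max-norm rescaling is easy to write down. Here's a sketch, with s = 3 as in the paper; the function name is mine, not his:

```python
import numpy as np

def renorm(w, s=3.0):
    """Max-norm constraint: if the L2 norm of a weight vector exceeds s,
    rescale it back down to norm s; otherwise leave it alone.
    s = 3 matches the hyper-parameter value mentioned above."""
    norm = np.linalg.norm(w)
    return w * (s / norm) if norm > s else w

w_big = np.array([3.0, 4.0])   # norm 5, too large: gets rescaled to norm 3
w_ok = np.array([1.0, 0.5])    # norm well under 3: left untouched
```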
Um, this isn't a very common thing to do.
I'm not sure it's very necessary, um, but, um,
I guess it gives you some- I mean,
I guess by showing you a few of the details of this one,
my hope is, sort of,
gives you some ideas about how there are lots of things you can play
around with and muck with if you wanna try different things,
um, for your final projects.
Um, okay.
So here are some of his final hyperparameters.
So he's using ReLU nonlinearities,
um, window sizes of three, four, and five for
the convolutions, a hundred features or channels for each size,
um, Dropout of a half as usual.
Um, you get several percentage improvements from dropout,
which is quite common actually.
Um, the sort of L2 constraint, s equals three,
mini batch of 50,
300 dimensional word vectors,
train to maximize dev set performance.
Okay. And here is the big table, which,
you know, I was too lazy, um,
to redo, of performance on these different text classification datasets.
Um, there are lots of different ones.
So these two are both Stanford Sentiment Treebank.
This is the Subjective Objective Language.
This is the Question Classification, of is it asking for a person name and location,
a company or whatever.
Um, this is, um,
talking about, sort of, a perspective,
which is another classification thing.
Consumer Reports is another sentiment one.
Um, so lots of data sets and then here are lots of models.
So the model- some of the models down here or here,
are traditional feature-based, um, classifiers.
Um, so in particular,
um, sort of Wang and me back in 2012,
had sort of pointed out that by taking certain steps
with n-gram features and other forms of normalization,
that you could actually get quite good results with
just the traditional feature, um, based classifiers.
So many people use that as a baseline for showing that you can do better things.
Um, the ones up here,
were tree structured neural networks that my group was very fond
of in the early 2010s and then up at the very top,
uh, his CNN models.
And as you can see,
it's sort of a mix.
Sometimes the CNN model wins,
like in this column and this column,
sometimes it doesn't win like in these columns.
Um, but in general, um,
what you do see from this is that, you know,
this is an extremely simple, um,
convolutional neural network model and it actually does,
um, kind of well across these tasks.
Um, you can quibble with this results table,
and again in terms of like writing your propos- project proposal, um,
one thing that you should do is kind of think about what you're reading, um,
because, you know, a lot of papers aren't perfect
and there are reasons to quibble with what they claim.
And sometimes if you think about what they're claiming and whether it's reasonable, um,
there are reasons why it's not or there are ideas
of how you could do things differently or show something different.
I mean, the main reason why you could quibble with,
um, Yoon Kim's results table is, well,
he already made the statement, as I had a couple of slides back, um,
that Dropout gives you
a two to four percent accuracy improvement in these neural nets.
[NOISE] Um, but most of these systems because they
are older and were done before Dropout was invented,
um, didn't make use of Dropout.
But, you know, any of these sort of neural net systems up here
could have used Dropout and presumably it would have given them a couple of,
um, percent gain as well.
So arguably, this is sort of a biased, unfair comparison.
And the right thing would have been to be comparing all the systems, um, using Dropout.
Um, but, you know,
despite that, you know,
a lot of people still noticed
this paper because it showed that using this sort of very simple,
very fast convolutional architecture
could give you strong results for text classification.
Um, that's that.
Yes. So in summary,
you know, something that you should be thinking about for projects and otherwise,
we're effectively building up a bigger toolkit of different tools you could be using,
um, for projects or future work or whatever it is.
So starting off with,
we had word vectors and then we could build bag of
vector models by just taking the word vectors and averaging them.
And, you know, that's actually a surprisingly good baseline to start with.
We suggest to you in many cases for things like projects,
you should use that.
See how well it does,
make sure you're working better.
I mean particularly, you can do even better with that,
if you sort of add some extra ReLU layers on top,
which is an idea that's been explored in deep averaging networks.
Um, then we looked at window models which were very simple.
You're just taking these sort of
five word windows and computing a feed-forward network on them,
and they work very well for word classification problems that only need local context.
Things like, part of speech tagging or NER.
But then we've gone ahead and looked at some other models.
And so, um, CNN's are very good for text classification, um,
and they're very good because they parallelize really well on GPUs,
which is something I'll come back to again later.
So for the general task of representing sentence meaning,
they're actually an efficient,
versatile, good method, which has been used quite a bit.
And then they sort of contrast with recurrent neural networks.
Recurrent neural networks have some advantages.
They're sort of more cognitively plausible,
because you're sort of reading through the text and,
um, getting its meaning.
Um, recurrent neural networks are good for
things like sequence tagging and classification,
building language models to predict what's coming next.
Um, they can do really well when combined with attention.
Um, but they also have some disadvantages.
They're way slower than convolutional neural networks and if what you wanna
do is get out some kind of overall meaning representation of a sentence,
you know, "What does this mean?
Are these two, um,
phrases paraphrases with each other?"
There are now many results that show that people
don't get better results with recurrent neural networks.
They can get better results using techniques like convolutional neural networks.
Okay. [NOISE] So the next step then [NOISE] is to,
sort of, head towards our complex,
um, convolutional architecture example.
So before getting to that,
I just wanna sort of introduce a few concepts that we haven't seen,
all of which, um, start to turn up when we do this.
So we spent a lot of time in the sequence models part,
talking about gated models or the gated recurrent units and the LSTM units.
But the idea of a gate is general that we can
sort of have this idea that we can calculate something,
put it through, um,
a sigmoid nonlinearity and gets a value between zero and one,
um, or a vector of values between zero and one.
And then do a Hadamard product with a vector
and sort of gate it between its value and zero.
So that suggests the idea that you could also apply
gates vertically when you're building multilayer networks.
And after the success of LSTMs had been proven,
that was, um, an idea that really took off:
people started exploring
how we can use these ideas of skip connections and gating in a,
in a vertical direction.
And here are two versions of it.
This one is a very simple one,
but a very successful one that's basically just about a skip connection.
So and this is referred to as a residual block and- which is used in residual networks,
otherwise known as ResNets.
Um, so in a residual block, for each block,
you allow a value just to skip ahead to the next, um, layer.
Or you can stick it through a conv block,
and the typical conv block is you go through a convolutional layer,
you then go through a ReLU nonlinearity,
another convolutional layer, and then when you come out,
you just sum these two values.
So this is the same idea that sort of
summing values is magical in the same way as an LSTM.
And then you put the output of that through another ReLU,
and this thing here is called a residual block
and then commonly you'll stack residual blocks on top of each other.
Um, there's one little trick here,
um, which is you need to use padding, right?
Um, because at the end of the day since you want to sum these two pathways,
you want them to be the same size.
And if you, sort of,
have them shrinking in the conv blocks you wouldn't be able to sum them.
So you want to, sort of, have a padding at each stage so they stay the same size here,
and so that you can add them together.
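Here's a small NumPy sketch of such a residual block, with "same" padding so that the skip path and the conv path stay the same size and can be summed. The shapes and weights are arbitrary; this illustrates the structure, not any particular trained network.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1d_same(X, W):
    """Kernel-size-3 convolution with one position of padding on each side,
    so the output length equals the input length and can be summed with the
    skip path. W has shape (k, d_in, d_out): it mixes a k-position window of
    d_in channels into d_out output channels."""
    k = W.shape[0]
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))
    return np.array([np.tensordot(Xp[i:i + k], W, axes=([0, 1], [0, 1]))
                     for i in range(X.shape[0])])

def residual_block(X, W1, W2):
    """conv -> ReLU -> conv on one path, identity on the other, sum the two
    paths, then a final ReLU."""
    return relu(X + conv1d_same(relu(conv1d_same(X, W1)), W2))

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 8))             # 10 positions, 8 channels
W1 = 0.1 * rng.normal(size=(3, 8, 8))
W2 = 0.1 * rng.normal(size=(3, 8, 8))
Y = residual_block(X, W1, W2)            # same shape as X: (10, 8)
```

Note that with all-zero conv weights the block reduces to ReLU of the identity, which is the "default is do nothing" semantics discussed below.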
Um, here's, um, a different version of a block which is
sort of more LSTM-ish and indeed
this block was developed by Jürgen Schmidhuber and students,
who's the same guy who's behind LSTMs and you can see the same thinking.
It's called a highway block.
So in a way it's sort of similar.
You've got, you know, an identity x that skips
a nonlinear block, or you can have it go through exactly the same stuff: conv, ReLU, conv.
The difference is that unlike this one,
this time there's explicit gates so there's,
um, there's this T-gate and the C-gate.
And so you're multiplying both the path through here and the path through here
by a gate, just kind of like the sorts of
input gates that we saw before, and then summing them together.
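A sketch of a highway layer, using the common coupled form where the carry gate is one minus the transform gate (the formulation allows an independent C-gate as well); here H stands in for the non-linear (e.g. conv-ReLU-conv) block, and all the names and shapes are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway(x, H, W_t):
    """Highway layer: y = T * H(x) + (1 - T) * x, where T = sigmoid(W_t x) is
    the transform gate and the carry gate C is coupled as 1 - T. The products
    are element-wise Hadamard products, as in the gates described above."""
    T = sigmoid(W_t @ x)
    return T * H(x) + (1.0 - T) * x

rng = np.random.default_rng(4)
d = 6
x = rng.normal(size=d)
W_t = rng.normal(size=(d, d))
W_h = rng.normal(size=(d, d))
y = highway(x, lambda v: np.tanh(W_h @ v), W_t)
```

One sanity check on the semantics: if the non-linear block happens to compute the identity, the gates cancel out and the layer passes x through unchanged.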
So that sort of feels more
powerful but it's not actually clear that it is more powerful.
I mean, this one actually has a very simple
semantics, because if you think about the semantics of this one,
the default is just that you walk
this way and you just sort of carry forward your value and do nothing.
Um, so what this block's job is to do,
is to learn a delta, that is, it's meant to learn
what kind of deviation you have from doing nothing.
Um, so that's a nice simple semantic which, um,
seems to work well in neural networks to learn things.
Um, this sort of has
more complicated apparent semantics because you're taking, you know,
some parts of the identity multiplying by this sort of gate in a Hadamard product
and some parts of this conv block multiplied by this other gate T in a Hadamard product.
So that sort of feels more powerful as that
gives me a lot more control because I can take pieces of the different ones and so on.
If you think about it for a bit longer, I mean,
mathematically it's actually not any more powerful: you
can represent anything you can do with this one with that one.
And the way to think about that is well, um,
you know, here you're kind of keeping only part of the identity,
um, but what you could do is keep the whole of the identity and see it as your job
to subtract off the bits that this one isn't keeping
over here in the conv block which you can do theoretically.
Um, and so, you can sort of anything you can compute with this as a function,
you can actually compute with a, um, ResNet block.
Um, and so then, as quite often in neural network land,
the question isn't sort of, um,
some kind of proof of what can be computed or not.
It sort of comes down to learning and regularization questions as to
whether one or the other of these actually proves
better as something to use in a learning architecture.
Okay. Second concept.
Um, batch normalization.
So when people are building deep convolutional neural networks,
um, from 2015 onwards,
um, they almost always use batch normalization layers because
this makes your life a lot better and if they're not using batch normalization layers,
they're normally using one of the other variant ideas that people have suggested
such as layer normalization which is sort of meant to do about the same thing.
Um, so what does batch normalization do?
I mean, I think many of you will have seen, in stats or
otherwise, the idea of doing a Z-transform, which means you take your data,
you work out its mean and you work out its
standard deviation and then you rescale by subtraction and
multiplication so that you have a set of data which
has a mean of zero and a standard deviation of one.
Most people see that, right?
Yeah? Um, so batch normalization is effectively doing exactly that but in a weird way.
So what you're doing is that you're taking each mini batch.
So whatever just random 32 examples you've stuck in a mini batch,
you're running them through a layer of
your neural network like a ConvBlock that we saw before
and you take the output of that mini batch and then you do a Z-transform on it.
Um, and then it goes forward into the next ConvBlock or whatever,
and the next time you have a different mini batch,
you just Z-transform it.
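That per-mini-batch Z-transform is tiny to write down. Here is a sketch; note that the full technique also learns a per-feature scale gamma and shift beta, which I've left out:

```python
import numpy as np

def batch_norm(H, eps=1e-5):
    """Z-transform the output of a layer across the mini-batch dimension.

    H: (batch_size, n_features). Each feature is shifted and scaled to have
    mean 0 and standard deviation 1 within this mini-batch. (Batch norm as
    actually used also learns a per-feature scale gamma and shift beta,
    omitted in this sketch.)"""
    return (H - H.mean(axis=0)) / (H.std(axis=0) + eps)

rng = np.random.default_rng(5)
H = rng.normal(loc=3.0, scale=10.0, size=(32, 16))   # a mini-batch of 32 examples
H_bn = batch_norm(H)    # every feature now has roughly mean 0, std 1
```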
So it seems a little bit weird.
You're just doing it on the output of these mini batches.
Um, but that's proven to be a very effective thing to do.
So that it sort of means that what comes out of
a ConvBlock sort of always has the same kind of scale.
So it doesn't sort of fluctuate a lot and mess things up and it tends to
make the models just much more reliably trainable because,
you know, you just have to be much less fussy about a lot of things.
Because, you know, a lot of the things we've talked about,
about initializing your parameters and
setting your learning rates is sort of about, well,
you have to keep the scale of things about right so they don't get
too big or too small and things like that.
Whereas, if you're doing this batch normalization,
you're sort of forcing scale,
um, to being the same size each time.
And so therefore, you kind of don't have to do
the other stuff as well and it still tends to,
um, work pretty well.
So that's a good technique to know about.
Okay. Um, one last thing to learn about.
Um, there's a concept of,
um, size one convolutions.
Um, and actually, I guess I, um,
named this wrong because I wrote down
one by one convolutions because that's the term you normally see.
But that's, um, the vision world where you have 2D convolutions.
So I guess I should have just called this one convolutions.
So you can have convolutions, um,
with a kernel size of one and when you first see that,
it seems like that makes no sense whatsoever because the whole idea
of a convolution was I was taking this patch and calculating something from it.
If I'm not looking at any other words,
surely I'm calculating nothing.
But what actually happens in the size one convolution
is that you have a number of channels; in a previous layer,
you'd calculated, whatever it was,
32 channels or something like that.
What the one by one convolution is doing is acting as
a tiny little embedded fully-connected network over those channels.
And so you're sort of doing a
position specific fully-connected network,
um, in- for each row of your data.
And so you can do that,
um, for various reasons.
You can do it because you want to map down from having
a lot of channels to having fewer channels or
you can do it just because you think another non-linearity
will help and this is a really cheap way to do it.
Because the crucial thing to notice is that if you sort
of put fully-connected layers over everything,
they involve a lot of parameters whereas putting in these size
one convolutions involve very few parameters
because you're just doing it at the level of a single word.
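Concretely, a size-one convolution is just one small weight matrix applied at every position, which is why it's so cheap in parameters. A sketch, with the shapes from this lecture (1024 positions, mapping 32 channels down to 8):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(1024, 32))   # 1024 positions, each with 32 channels
W = rng.normal(size=(32, 8))      # map 32 channels down to 8

# A kernel-size-1 convolution is the same small fully-connected map applied
# independently at every position, i.e. just a matrix product over channels.
Y = X @ W                          # shape (1024, 8)

# It is cheap: only 32 * 8 = 256 parameters, whereas a fully-connected layer
# over the whole flattened input would couple every position to every other.
n_params = W.size
```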
Um, okay.
Um, two random things and then I'll go onto my complex model.
Um, this is just sort of
almost an aside, but it just shows
something different that you could do and it's something that you could play with.
I mean, when we talked about machine translation,
we talked about the seq2seq architecture that was introduced in
2014 and has been very successful for machine translation.
But actually, the year before that came out,
um, there was a paper, um,
doing neural machine translation by Nal Kalchbrenner and Phil Blunsom in the UK.
And this sort of was actually essentially
the first neural machine translation paper of the modern era.
If you dig back far enough,
there are actually a couple of people that tried to use
neural networks for machine translation
in the '80s and '90s but this was sort of the first one that restarted it,
and they didn't actually use a seq2seq architecture.
So what they used was for the encoder,
they used the convolutional neural networks.
And so that they had a stack of convolutional neural networks that progressively shrunk
down the input and then finally pulled it to get a sentence representation,
and then they used a sequence model as the decoder.
Um, so, um, that's sort of something that you could
try in some other applications that for encoders,
it's really easy to use convolutional neural networks.
There has been work on using convolutional neural networks as decoders as well,
though that's a little bit harder to get your brain around and isn't used nearly as much.
Then the second thing I want to mention because we'll turn to it in just a minute is so,
so far we've done Convolutional models over words so that
our kernels are effectively picking up
these word n-gram units of two-word or three word sub-sequences.
And the idea that then developed fairly soon was well maybe
it would also be useful to use convolutions over characters.
So, you could run a convolutional neural network
over the characters of the word to try and,
um, generate a word embedding, um,
and this idea has been explored quite a lot, um,
it's part of what you guys are gonna do for assignment
five is build a character level ConvNet,
um, for your improved machine translation system.
I'm not going to say sort of a huge amount about the foundations of this today, um,
because Thursday's lecture is then talking about subword models
and we'll go through all the details of different subword models.
But, I wanted to show you a con- a complex
convolutional neural network which is also used for text classification.
So, essentially, the same task as Yoon Kim's model
and this model actually is built on characters,
it's not built on words.
So we are, at the foundation of it,
um, having a character-level model.
Um, so, this is a paper from 2017,
um, by, um, the four authors shown here, um,
people working at Facebook AI Research,
um, in France, um, and so,
they kind of had an interesting hypothesis for
this paper which was essentially to say, that, you know,
by 2017 people who are using deep learning for vision were building really,
really deep networks and fi- finding that they work much,
much better for vision tasks.
So, essentially, to some extent,
the breakthrough was that once these ideas had emerged,
it then proved that it wasn't just that you could build a six layer or an eight layer,
um, Convolutional Neural Network for vision tasks.
You could start building really,
really deep networks for vision tasks which had tens or even hundreds of
layers and that those models when trained on a lot of data proved to work even better.
So, um, if that's what's in your head and you then looked,
look at what was and indeed is happening in natural language processing,
the observation is, you know,
these NLP people are kind of pathetic,
they claim they're doing deep learning but they're still working with three layer LSTMs.
Surely, we can make some progress, um,
by building really deep networks that kinda look like vision networks and using them,
um, for natural language processing goals.
And so, that is precisely what they set about doing.
So, they designed and built a really deep network which sort of looks like a vision stack,
um, a convolutional neural network that is built over characters.
Um, so, I've got the picture of it here but sufficiently deep that it's fitting it on
the slide and making it readable [LAUGHTER] is a little bit
of a challenge but we can try and look at this.
So, at the bottom,
we have the text, um,
which is a sequence of characters and so, um,
for the text, um, so,
when people do vision object recognition on
pictures normally all the pictures are made the same size.
Right. You make every picture 300 pixels by 300 pixels or something like that.
So, they do exactly the same for NLP, um,
they have a size, um,
for their document which is 1024 characters.
If it's longer than that they truncate it and keep the first part.
If it's shorter than that they pad it until it's of
size 1024 and then they're gonna stick it into their stack.
So, the first part is that for each character,
they're going to learn a character embedding now and
their character embeddings are of dimensionality 16.
So, that the piece of text is now 16 by 1024, um, so,
they're going to stick that through a convolutional layer where
you've got kernel size of three and 64 output channels.
So you now have something that's 64 by 1024 in size.
You now stick this through a convolutional block.
I'll explain the details of that convolutional block on the next slide but,
you should be thinking of that ResNet picture I showed earlier where you
can either be going through some convolutions or taking this optional shortcut.
Then another residual block
where you can be going through convolutions or an optional shortcut,
um, and they're then doing local pooling in the same way people typically do in vision.
So, commonly what people do in vision systems
is you are sort of shrinking the size of the images, um,
by doing pooling that halves the dimensions in each direction.
But, at the same time,
you do that in your neural network,
you expand the number of channels,
and so you make it deeper in terms of the number of
channels at the same time as you make it smaller in the x,
y size of the image.
So, they do exactly the same, just with these one-dimensional convolutions.
So, before we had 64 channels in our 1024 character,
um, embedding, um, document.
So, now we pool it, um, so,
we're going to have 512 positions which are sort of like pairs of characters,
um, but we now have 128 channels
and then they kind of repeat that over and over again, right?
So, there are two more convolutional blocks which I'll
explain more but they're sort of residual blocks.
They pool it again and they do exactly the same thing.
So, now there are 256, um,
positions which are like four character blocks and they have 256 channels,
um, I can't point high enough but they repeat that again and they pool again.
So, now they've got, um,
128 positions which are about eight characters
each and they have 512 channels representing that.
They pool again, they have convolutional blocks again, um,
then lo and behold because I said that even the
weird ideas are going to turn up, right up there,
they're doing k max pooling and they're keeping the eight strongest values,
um, in each channel.
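A sketch of that k-max pooling step, keeping the eight strongest values per channel (and, as is usually done, keeping them in their original order):

```python
import numpy as np

def kmax_pool(x, k=8):
    """Keep the k largest values in each channel, preserving their original
    position order. x: (n_positions, n_channels) -> (k, n_channels).
    k = 8 matches the top layer of the model described above."""
    n = x.shape[0]
    # indices of the top-k values per channel (unordered), then sorted back
    # into their original position order
    idx = np.sort(np.argpartition(x, n - k, axis=0)[n - k:], axis=0)
    return np.take_along_axis(x, idx, axis=0)

rng = np.random.default_rng(7)
x = rng.normal(size=(128, 512))   # 128 positions, 512 channels
pooled = kmax_pool(x, k=8)        # -> (8, 512)
```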
Um, and so at that point,
they've got something of size 512 by eight, um, so,
sort of like eight of the eight-character sequences
have been deemed important to the classification and are
kept, but that's per channel and there are 512 of them.
You're then putting that through three fully connected layers.
So, typically vision systems at the top
have a couple of fully connected layers at the end,
um, and the very last one of those,
is effectively sort of feeding into your Softmax.
So, it's size 2,048 times the number of
classes, which might just be positive-negative, two classes, or might be topical classes.
Um, so, yeah, so it's essentially like
a vision stack but they're going to use it for language.
Um, okay.
So, the bit that I hadn't quite explained was
these convolutional blocks, but it sort of looks like the picture that we had before,
um, though it's slightly more complicated.
So you're doing, um,
a convolutional block of kernel size three
convolutions with some number of channels, depending on where you are in the stack.
You're then putting it through a batch norm, as we just
talked about, putting it through a ReLU non-linearity,
and repeating all those three things again. And remember there
was this sort of skip connection that went right around the outside of this block.
And so this is sort of a residual style block, um, so,
that's the kind of complex architecture you can put together and
try in your final projects if you dare in PyTorch.
Um, yeah, um, so,
for the experiments, one of
the things that they were interested in and wanted to make a point of is, well, some
of these traditional sentence and
text classification datasets that have been used in other papers,
like Yoon Kim's paper, are effectively quite small.
So, something like that Rotten Tomatoes dataset is actually only 10,000 examples, 5,000
positive and 5,000 negative, and they sort of have
the idea that, just like ImageNet was needed for
deep learning models to really show their worth in vision,
to show the value of a huge model like this,
um, you need to have really big datasets.
So, they get some much bigger,
um, text classification datasets.
So, here's an Amazon review positive-negative dataset, um,
with which they have sort of 3.6 million documents,
um, Yelp reviews 650,000 documents.
So much bigger datasets,
um, and here are their experiments.
Okay. So, the numbers at the top, uh,
for the different datasets are the best previous results reported in the literature,
and then if you read the, um,
footnotes, um, there are a few things that they want to sort of star.
So, the ones that have a star next to them use
an external thesaurus which they don't use. [NOISE]
And the Yang method, um,
use some special techniques as well that I cut off.
Um, and the other thing to mention is these numbers,
they're error rates, so low is good.
Um, so the lower you get them, the better.
And so then these are all of their results.
Um, and so what can you get out of these results?
Um, well, the first thing that you can notice is basically with these results,
the deeper networks are working better, right?
So, the one I showed you,
uh, well, no, I think the one that I have the picture of isn't the full thing.
Um, but they have ones with depth 9, 17,
and 29 in terms of the number of convolutional layers,
and the deepest one is always the one that's working best.
So, that's a proof of deep networks.
Um, that didn't keep on working, um,
so an interesting footnote here is,
um, I guess they thought,
oh, this is cool.
Why don't we try an even deeper one that has 47 layers and see how well that works?
And, I mean, the results were sort of interesting for that.
So, for the 47-layer one,
it worked a fraction worse than this one.
Um, so in one sense,
they showed the result that residual layers work really well.
So, they did an experiment of trying to train
a 47-layer network without using residual connections.
And, well, it was a lot worse.
The numbers went down about two percent.
And they trained one with residual connections,
and the fact of the matter is the numbers were just a teeny-weeny bit worse.
They were sort of 0.1 of a percent worse.
So, you know, it works just about as well.
But, nevertheless, that's kind of different to the situation in vision,
because for the sort of residual networks that people are using in vision,
this is sort of like the very minimum depth that people use.
So, if you're using residual networks in vision, typically
you might use ResNet-34
if you're really short on memory and want to have a small model,
but you just know you'd get better results if you used ResNet-50,
and in fact, if you used ResNet-101 it'd work even better again.
Um, and so that somehow, you know,
whether it's got to do with the different nature of
language or the amounts of data or something,
you haven't yet gone to the same depth that you can in vision.
Um, but other results, um,
so the other thing they're comparing here is
three different ways of sort of shrinking things down.
So, you could be using, um,
the stride in the Convolution,
you can be using local MaxPooling,
and you could be using KMaxPooling.
Um, and in general,
they're slightly different numbers, as you can see.
Each one, um, wins on, uh,
at least one of these datasets, or actually at least two of these datasets.
But not only does MaxPooling win for four of the datasets,
if you sort of look at the numbers,
MaxPooling always does pretty well.
Because MaxPooling does pretty well here,
whereas the convolutional stride works badly,
and over here MaxPooling works pretty well,
and the, um, KMaxPooling works kind of badly.
So, their recommendation at the end of the day is you should always use, um,
just MaxPooling of a simple kind;
that seems to be fine,
um, and nothing else
is actually worth the trouble of thinking about using.
Okay. Um, were there any other conclusions I wanted to mention?
Okay. Um, I think that was most of that.
I guess their overall message is you can build super good, um,
text classification systems using ConvNets,
and you should take away that message.
Okay. So, there are just a couple of minutes left.
There was sort of one other thing that I wanted to mention,
but I think I'll just sort of mention it very quickly,
and you can look in more detail if you want to.
So, we sort of have this situation
that re- recurrent neural networks are a very standard building block for NLP,
but they have this big problem that they just don't parallelize well.
And the way we get fast computation in deep learning is we find
things that parallelize well so that we can stick them on GPUs.
GPUs only are fast if they can be simultaneously doing the same computation many times,
which is sort of trivial for a convolutional neural network,
because precisely, you're doing the same comput- computation every position.
But that's not what's happening in the recurrent neural network because you have to
work out the value of position one
before you can start to calculate the value of position two,
which is used for the value of position three.
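That contrast can be seen in a toy sketch (illustrative made-up scalar weights, not a real model): each convolution output depends only on its own input window, so every position could be computed independently, whereas each recurrent state needs the previous one first.

```python
def conv_outputs(xs, w=(0.5, 0.5)):
    # Same filter at every position; each output depends only on its
    # own input window, so all positions could run in parallel.
    return [w[0] * xs[t] + w[1] * xs[t + 1] for t in range(len(xs) - 1)]

def rnn_states(xs, w_x=0.5, w_h=0.5):
    # Each state needs the previous one first: inherently sequential.
    h, hs = 0.0, []
    for x in xs:
        h = w_x * x + w_h * h
        hs.append(h)
    return hs

print(conv_outputs([1.0, 2.0, 3.0]))  # [1.5, 2.5]
print(rnn_states([1.0, 2.0, 3.0]))    # [0.5, 1.25, 2.125]
```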
Um, so this was a piece of work, um,
done by sometimes CS224N co-instructor
Richard Socher and some of his people at Salesforce Research,
on quasi-recurrent neural networks,
on saying, how can we get the best of both worlds?
How can we get something that's kind of like a
recurrent neural network, but doesn't have the bad computational properties?
And so the idea that they had was, well,
rather than doing the standard LSTM style thing where you're calculating, you know,
an updated candidate value and your gates in terms of the preceding time slice,
maybe what instead we could do is we could stick a relation between time
t minus 1 and time t into the MaxPooling layer of a convolutional neural network.
So, we're sort of calculating a candidate and a forget gate and an output gate,
but these candidate and gate values
are computed inside the pooling layer
via, um, a convolutional operation.
So, you know,
if there's no free lunch,
you can't get true recurrence and not pay the penalty.
This is giving you sort of a pseudo-recurrence, because you are
modeling an association between adjacent elements at each time slice,
but it's sort of just worked out locally rather than being carried forward,
um, within one layer.
But sort of what they found is,
if you made your networks deeper using this idea,
well then, you sort of start to, again,
expand your window of influence.
So, you got a certain amount of information being carried forward.
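Concretely, the quasi-recurrent idea can be sketched like this. This is a toy scalar version with a made-up window size and made-up weights, not the paper's actual architecture: the candidate and the gates are computed from input windows alone, which parallelizes like a convolution, and only a cheap element-wise recurrence remains.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def qrnn_fo_pool(xs, wz=(0.5, 0.5), wf=(0.3, 0.3), wo=(0.4, 0.4)):
    """Toy quasi-recurrent layer over scalar inputs with window size 2."""
    padded = [0.0] + list(xs)          # pad so each step sees x_{t-1}, x_t
    hs, c = [], 0.0
    for t in range(1, len(padded)):
        window = (padded[t - 1], padded[t])
        # Candidate and gates depend only on the input window --
        # these can all be computed in parallel, like a convolution.
        z = math.tanh(sum(w * x for w, x in zip(wz, window)))
        f = sigmoid(sum(w * x for w, x in zip(wf, window)))
        o = sigmoid(sum(w * x for w, x in zip(wo, window)))
        # The only sequential part: a cheap element-wise recurrence
        # inside the pooling step.
        c = f * c + (1.0 - f) * z
        hs.append(o * c)
    return hs
```

Compare with an LSTM, where z, f, and o would all depend on the previous hidden state, so nothing before the matrix multiplies could be parallelized across time.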
Um, so, their conclusion was that you could sort of
build these kinds of models and get them to work,
you know, not necessarily better actually, though on this slide,
um, it says often better.
Um, you can get them to work kind of as well as an LSTM does,
but you could get them to work much faster because you're avoiding
the standard recurrent operation and keeping it as something that you can parallelize,
um, in the MaxPooling operations.
Um, yes, so that was a kind of
an interesting alternative way of sort of trying to get some of the benefits.
I think long-term this isn't the idea that's going to end up winning out.
And so next week we're going to talk about transformer networks,
which actually seems to be the idea that's gained the most steam at the moment.
Okay. I'll stop there for today. Thanks a lot.
