Okay, so let's begin.
First of all, I want to say congratulations: you all survived the exam.
Well, you don't have your grades back yet, but you completed it.
We're going to have the grades back as soon as we can; the CAs are all busy grading.
We're actually going to cancel office hours today so we can
focus on getting those grades back to you quickly.
After this, the course definitely goes downhill in difficulty,
so you can [LAUGHTER] kind of take a breath.
So after the exam there are pretty much just two things left.
There's the project: the final presentation,
the poster session for the project, is going to be, I believe, the Tuesday after vacation.
It's in a big auditorium hall,
and there are going to be a lot of people from industry and academia.
It's really exciting to have so many smart people showing off their hard work.
And then you have the last p-set, which is logic. Yeah?
It's on Monday.
Oh, Monday, okay, yeah.
So right after you're back from vacation is that poster session,
and then on Thursday the last piece, logic, is due.
So, about logic: this is not my official opinion, but I think
logic, when I took the class, was easier than the other p-sets,
and it doesn't take as much time, so
you are definitely past the hardest point in this class.
Yeah, I think that's the general opinion.
But that being said,
I wouldn't wait until the last minute,
so I'd still start early.
Personally, I didn't get the [inaudible].
[LAUGHTER] Then, yeah, Piazza and the office hours will be your best friend.
Okay.
But today we're talking about
this fun, advanced-ish topic, which is deep learning.
I say "ish" because I think a lot of you are probably already working
on deep learning or have heard of it;
a lot of you have it in your projects. Today
we'll do a very high-level, broad pass over
a lot of different subjects within deep learning.
Hopefully we'll get you excited and
give you a shallow understanding of a lot of
different topics, so that if you want to take
follow-up classes like 224N or
even 229, you'll be armed with some background knowledge.
Okay, so first we're going to talk about the history.
Deep learning, you've probably heard of it;
it's really big, especially in the last five
to ten years, but it's actually been around for a long time.
Even back in the '40s,
there was an era when people were trying to build computational neuroscience models.
They knew back then that
there are neurons in the brain, that they're arranged in networks,
and that intelligence arises from these small parts,
and they really wanted to model that.
The first people to really do this were McCulloch and Pitts.
Pitts was actually a logician,
and they were concerned with making
logical circuits out of a network-like topology:
what kinds of logical expressions can we implement with a network?
Back then this was all just a mathematical model.
There was no backpropagation,
there were no learned parameters,
there was no inference.
It was just theorems and proofs about what kinds of problems these structures can solve.
Then Hebb came along about 10 years later and started
moving things in the direction of training these networks.
He proposed that if two cells fire together a lot,
then the connection between them should be strong.
This was inspired by observation;
there was no formal mathematical theory backing it.
A lot of it was just very smart people making conjectures.
Neural networks were, you could say, in the mainstream,
with a lot of people thinking about them and excited about them,
until 1969, when Minsky and Papert released their
very famous book Perceptrons,
a big fat book of proofs.
They proved a bunch of theorems about the limits of
very shallow neural networks.
For example, [NOISE] very early in this class we talked about the XOR example,
where you have two classes arranged in a
configuration such that there's no
linear classification boundary
you can use to separate them and classify them correctly.
Minsky and Papert, in their book Perceptrons, came up with a lot of these
counterexamples, theorems that really proved
that these thin neural networks couldn't do very much.
At the time, it was a bit of a killing blow to neural network research.
Mainstream AI became much more
logical, and neural networks were pushed very much into the minority.
There were still all these people thinking about them and working on them,
but mainstream AI went decidedly toward the symbolic, logic-based
methods that Percy has been talking about the last couple of weeks.
But like I said,
there were still people in the background working on it.
For example, in 1974,
Werbos came up with the idea of backpropagation that we learned about: using
the chain rule to automatically update weights in order to improve predictions.
Later on, Hinton, Rumelhart, and Williams
popularized this; they essentially
rediscovered Werbos's findings and said,
"Hey everybody, you can use backpropagation,"
and it's a mathematically well-founded
way of training these deep neural networks.
Today we're going to talk about two types of neural networks:
convolutional neural networks and recurrent neural networks.
The convolutional networks trace back to the '80s,
to the neocognitron, invented in Japan by Fukushima.
[NOISE] It laid out the architecture for a CNN,
but there was no way of training it.
In the actual paper, they used hand-tuned weights.
The message was: here's an architecture you can use, and basically,
by trial and error, we came up with these numbers to plug in;
look at how it works.
Now that seems insane,
but back then there was no way of training these things.
Then LeCun came along about 10 years later,
and he applied those ideas of backpropagation to CNNs.
LeCun came up with LeNet,
a very famous check-reading system,
and it was one of the first
industrial, large-scale applications of deep learning.
Whenever you write a check and have your bank read it,
almost all the time there's a machine-learning model that
reads that check for you, and
those check-reading systems are some of
the oldest machine-learning models that have been used at scale.
Recurrent neural networks came later, in the '90s.
Elman proposed them, and
then there was a problem with training them that we'll talk about later,
called exploding or vanishing gradients.
About 10 years later, Hochreiter and Schmidhuber
solved those issues to some extent with the long short-term memory network, the LSTM.
We'll talk about that later.
But you could still say that
neural networks were in the minority.
In the '80s,
people used a lot of rule-based AI;
in the '90s, people were all about
support vector machines and inventing new kernels.
If you remember, a support vector machine is
basically a linear classifier with the hinge loss,
and a kernel is a way of implicitly projecting
data into a higher-dimensional, non-linear feature space.
But in the 2000s,
people finally started making progress.
Hinton had this idea: hey,
we can train these deep networks one layer at a time.
You pre-train one layer,
then pre-train a second layer and stack it on,
then a third layer,
and you build up these successive representations.
And then deep learning became a thing.
That was maybe three or four years ago, when it really started taking off,
and ever since then it's been in the mainstream.
As evidence of its mainstream status,
you can look at all of these applications.
Take speech recognition.
For almost a decade,
state-of-the-art recognizers were built around hidden Markov models;
that was the heart of those algorithms.
For 10 years, performance stagnated, and then
all of a sudden neural networks came around and dropped the error.
What's new and surprising is that all of the big companies (IBM, Google, Microsoft)
switched over from
these classical speech recognizers
to fully end-to-end neural-network-based recognizers
very quickly, in a matter of years.
When these large companies are operating at scale,
and dozens, maybe hundreds, of people have tuned
these systems very intricately, for them to so
quickly and so radically shift
the core technology behind the product really speaks to its power.
Same thing with object recognition.
There's the ImageNet competition, which
runs every year and basically asks:
how well can you say what's in a picture?
For years people used handcrafted features,
and then all of a sudden AlexNet was proposed,
and it got almost half the error of the next best submission in the competition.
Ever since then, people have been using neural networks.
Now, if you want to do computer vision,
you pretty much have to use CNNs; they're just the default.
If you walk into a conference,
every single poster is going to have a CNN in it.
Same thing with Go.
Google DeepMind had a CNN-based algorithm,
trained with reinforcement learning, and it beat
the world champion at this very difficult game.
Then in 2017 it did even better:
it didn't even need real data, it just did self-play.
And machine translation:
Google Translate had spent almost a decade building a very
advanced, very well-performing classical machine translation system,
and then all of a sudden
the first neural machine translation systems were proposed around 2014-2015.
About a year later, they threw away
almost a decade of work on the old system and
switched entirely to a completely new algorithm,
which again speaks to its power.
So what is deep learning?
Why is it so powerful, and why is it so good?
Broadly speaking, it's a way of taking data
(any kind of data: a sequence, a picture, vectors,
even a game like Go) and turning it into a vector,
and this vector is going to be a dense representation
of whatever information is captured by that data.
This is very powerful because
these vectors are compositional: you can use these components,
these modules of your deep learning system, like Lego blocks.
You can concatenate vectors, add them together, and so on,
and that compositionality makes it very flexible.
Okay. So today we're going to talk about feedforward neural networks;
convolutional networks, which work on images,
or really anything with repeated spatial structure in it;
recurrent neural networks, which operate over sequences;
and then, if we have time, we'll get to some
unsupervised learning topics.
Okay, so first, feedforward networks.
At the very beginning of this class we talked about linear predictors.
A linear predictor, if you remember, is basically this: you define
a weight vector w, you take some input,
you dot them together, and that gives you your output.
Neural networks are defined very similarly.
You can think of each hidden unit
as the result of a linear predictor:
you define a weight vector, apply it to some inputs,
and dot them together to get a number.
You use your inputs to compute
hidden values, and then you use your hidden values to compute your final output.
So in a way, you're stacking linear predictors:
h1, h2, and the final output are each
the result of a little mini linear predictor,
and they're all roped together.
To visualize this: if you want to go deeper, you just rinse and repeat.
A one-layer neural network is
what we were talking about before with the linear predictor:
you have your weight vector and you apply it to your inputs.
For a two-layer network,
instead of applying a vector to your inputs,
you apply a matrix, which gives you a new vector,
and then you dot this intermediate hidden vector
with another set of weights, and that gives you your final output.
And then you can just rinse and repeat:
you pass your vector through a matrix to get a new vector,
pass that through another matrix to get another new vector,
and finally, at the very end, dot it with a vector to get a single number.
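The stacking described above can be sketched in a few lines of numpy. This is just an illustration of the shapes involved; the weights, sizes, and the ReLU non-linearity here are all made up for the example, not taken from the lecture slides.

```python
import numpy as np

def two_layer_forward(x, W1, w2):
    """Two-layer feedforward net: matrix, non-linearity, then a final dot product."""
    h = np.maximum(0, W1 @ x)   # hidden vector: matrix times input, ReLU non-linearity
    return w2 @ h               # dot with a final weight vector -> single number

# Hypothetical weights and input, just to show the shapes.
W1 = np.array([[1.0, -1.0], [0.5, 2.0], [0.0, 1.0]])  # 3 hidden units, 2 inputs
w2 = np.array([1.0, -2.0, 0.5])
x = np.array([2.0, 1.0])

print(two_layer_forward(x, W1, w2))  # -4.5
```

Going deeper is exactly "rinse and repeat": insert more matrix-plus-non-linearity steps before the final dot product.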
A word about depth, because that's one of the reasons these things are really powerful.
There are a lot of interpretations for why depth
is helpful and why stacking these matrices works well.
One way to think about it is that the network learns
hierarchical representations of the input.
h is going to be some representation of x,
and h' is going to be a slightly higher-level representation of x.
For example, in a lot of image processing systems,
h might represent the edges in a picture,
h' might represent corners,
h'' might represent small parts like fingers,
and h''' might be the whole hand.
So you get successively
higher-level representations of what's in the data you're giving it.
Another way to think about it is that each layer is
a step in processing, kind of like a for-loop:
the more iterations you have,
the more steps, the more depth,
the more processing you're able to perform on the input.
And lastly, the deeper the network is,
the more kinds of functions it can represent,
so there's flexibility in that as well.
But in general,
there isn't really a good formal understanding of why
depth is helpful; in a lot of
deep learning there's definitely a gap between the theory and the practice.
So this just goes to show why depth is helpful:
if you input pixels,
maybe your first layer gives you edge detection, your second layer gives you
little eyes or noses or ears,
and your third layer and above gives you whole objects.
So, to summarize:
we have these deep neural networks, and
they learn hierarchical representations of the data successively,
gaining altitude in perspective, you could say.
You can train them the same way we learned to
train our linear classifiers, with gradient descent:
you have your loss function,
you take the derivative of the loss with respect to your parameters, and then you
propagate the gradients to step in a direction you think will be helpful.
This optimization problem is difficult
(it's non-linear and non-convex),
but in general we've found that if you throw a lot of data at it,
and a lot of compute at it, then somehow you manage.
Okay. It seems like the slides are a little out of order,
but basically, to review how you train these things:
in general, it's the same as for a linear predictor.
You define a loss function.
For example, with squared loss, you say:
I'm going to take the difference between my true output and my predicted output,
and square it, and the idea is to minimize this.
The way you do that
is you sample data points from your training data,
you take the derivative of the loss function
with respect to your parameters,
and then you move in the opposite direction of that gradient,
which should hopefully move you downhill on the error surface.
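A minimal sketch of that loop, for a linear predictor with squared loss. The data and learning rate here are made up; the point is just the shape of the update: gradient of the loss, then a step in the opposite direction.

```python
import numpy as np

def sgd_step(w, x, y, lr=0.1):
    """One SGD step on squared loss L = (y - w.x)^2 for a linear predictor."""
    pred = w @ x
    grad = 2 * (pred - y) * x   # dL/dw
    return w - lr * grad        # move opposite the gradient

w = np.zeros(2)
# Made-up training points whose true relationship is y = x1 + x2.
for x, y in [(np.array([1.0, 0.0]), 1.0), (np.array([0.0, 1.0]), 1.0)] * 50:
    w = sgd_step(w, x, y)
print(w)  # approaches [1, 1]
```

For a deep network, the only change is that `grad` comes from backpropagation through the layers instead of a one-line formula.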
The problem is a non-convex optimization problem.
A linear classifier, because it's linear,
gives you an objective that just looks like a bowl,
whereas with these networks
you have non-linear activation functions,
and you end up with a very messy-looking error surface.
Before the 2000s,
that was the number one thing holding back neural networks:
they were difficult to get working, hard to train.
So what's changed? Basically,
one, way faster computers.
We have GPUs, which can parallelize operations,
especially those big matrix multiplications.
And there's a lot more data.
But that's not the whole story;
there are also a lot of other tricks we've found recently.
For example, having lots of hidden units
can be helpful, because it gives more flexibility,
you could say, in the optimization.
If you over-provision,
if your model has more capacity than it needs,
then you can be more flexible with the kinds of functions you can learn.
We have better optimizers.
Whereas plain SGD steps by the same rule and the same amount every time,
newer optimizers like AdaGrad and Adam
decide how far to move in a direction
once the direction is decided.
We have dropout, where you add noise to the outputs of each hidden unit;
that makes the model more robust to its own errors,
and it guards against overfitting.
There are better initialization strategies,
like Xavier initialization,
and there's pre-training the model on
a related dataset before moving on to the data you actually care about.
And there are tricks like batch norm,
where you ensure that the inputs to your neural network units
are normally distributed,
with mean zero and standard deviation one,
and what that does is allow you to take bigger step sizes.
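The core normalization step of batch norm is small enough to sketch directly. This is a simplified illustration: the learnable scale and shift parameters, and the running statistics used at test time, are omitted.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize a batch of activations to mean 0, std 1, per feature.
    Simplified: real batch norm also learns a scale and shift per feature."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    return (x - mean) / (std + eps)

# A made-up batch of 3 examples with 2 features on very different scales.
batch = np.array([[1.0, 200.0], [3.0, 400.0], [5.0, 600.0]])
normed = batch_norm(batch)
print(normed.mean(axis=0), normed.std(axis=0))  # ~[0, 0] and ~[1, 1]
```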
The takeaway here is that, in general,
the optimization problem and the model architecture you define are very tightly coupled,
and it's kind of a black magic to get that balance right;
we're still not very good at it.
Okay. So we're going to talk about convolutional neural networks now,
and these operate over images.
The motivation is this:
we have a picture here, right?
And we want to do some kind of machine learning processing on it.
We already have the tools to do that.
You could say, okay,
each pixel is an element in a big long vector,
and I'm just going to pass that through a matrix.
But the thing is, that
doesn't take advantage of the fact that there's spatial structure in the picture.
This pixel is going to be more similar to the pixel next to it
than to a pixel way down here.
But if you pass the entire thing through a dense matrix,
then every pixel gets treated uniquely and differently,
and so we want to leverage that spatial structure.
The core idea is the convolution.
With convolutions, you have this thing called a filter,
which is a small collection of parameters,
and what you do is run your filter over the input
to produce each output element.
So for example, this filter, applied to this upper-left corner of the input,
produces this upper-left element of the output.
An application of a filter works like a dot product:
you multiply the corresponding numbers
and then add them all up.
To produce the outputs,
you take your filter
and basically just slide it around the input,
in order to get your output at the next layer.
This example is a little more concrete.
Whereas before we had a two-dimensional convolution
(a two-dimensional filter
sliding around in both dimensions),
this one is one-dimensional:
we have a one-dimensional filter,
and we slide it horizontally across.
For example, at the very left, we apply it:
1 times 0 is 0,
0 times 1 is 0,
and negative 1 times 2 is negative 2,
so the sum, negative 2, goes in the output.
Then we do the same thing at the next position:
we dot the filter with
the next three numbers in order to arrive at 2.
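The sliding dot product can be written in a couple of lines. The filter [1, 0, -1] matches the worked example above; the input values beyond the first window are made up, since the slide isn't fully visible in the transcript.

```python
import numpy as np

def conv1d(signal, filt):
    """Slide a 1-D filter across the input; each output is a dot product (no padding)."""
    k = len(filt)
    return np.array([signal[i:i + k] @ filt
                     for i in range(len(signal) - k + 1)])

filt = np.array([1.0, 0.0, -1.0])
signal = np.array([0.0, 1.0, 2.0, 0.0, 1.0])  # first window is [0, 1, 2]
print(conv1d(signal, filt))  # first output: 1*0 + 0*1 + (-1)*2 = -2
```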
One of the advantages of this: say you have four inputs,
X_1, X_2, X_3, and X_4,
and a hidden layer of four units,
h_1, h_2, h_3, and h_4.
If you use a regular fully-connected matrix layer,
then every hidden unit is connected to every input,
and for parameters you end up with a four-by-four matrix,
W_11, W_12, W_13, W_14, and so on,
because you need a weight for every one of those connections.
Whereas if you're doing convolutions,
it's much more efficient,
because of this idea of local connectivity.
Each hidden unit is only connected to
what's called its receptive field,
which is the inputs the filter is applied to,
and in this case
we only have three weights,
because we just have one sliding window,
and you apply it at each step.
So A, it gives you local connectivity.
And B, it's much more efficient in terms of parameters:
you're sharing the same parameters at different places in the input.
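The parameter savings in the 4-input example above is just arithmetic, but it's worth spelling out, since the gap grows enormously for real image-sized inputs:

```python
n_inputs, n_hidden, filter_size = 4, 4, 3

# Fully connected: one weight for every (input, hidden unit) pair.
fc_params = n_inputs * n_hidden   # 4 x 4 = 16 weights

# Convolution: one shared filter, reused at every position it slides to.
conv_params = filter_size         # 3 weights total

print(fc_params, conv_params)     # 16 vs 3
```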
And it gives you this nice intuition of sliding around the input.
Say my filter has three weights,
and at one position the output is 100
while at the other positions it's something small,
like negative 1, 1, or 3.
Then you can interpret this as:
my filter really likes whatever pattern is going on in those three inputs,
and doesn't like the other patterns it's picking up on so much.
So you have this nice interpretation of the filters as pattern detectors.
In general, in practice,
instead of one- or two-dimensional inputs,
you have very high-dimensional volumes.
Your filter is going to be a small cube in the input space,
and you slide it around,
applying it at every place it fits in the input.
The reason the output is also a volume
is that you have multiple filters.
So over here, for example,
this blue filter, when you slide it around the input,
gives you one plane of outputs.
But then you have a second filter, this green filter,
that you can also slide around the input,
and that gives you a second plane, a second channel, of your hidden states.
Andrej Karpathy has a nice demo
where you have a three-dimensional input
and two filters, which you can think of as little cubes,
and it slides these cubes around the input;
every application gives you one element of the
three-dimensional output volume.
So it's the same picture as before: you're sliding cubes around
in order to fill in the layers, the channels, of the output.
Another thing people do is max pooling.
Remember that interpretation of a filter as
a pattern detector.
With max pooling, you run your filters over the input
to get your preliminary output,
then you look at regions in that output, take
the maximum activation in each region, and carry that on to successive layers.
The intuition is that you're searching for a pattern
anywhere in a region of the input.
It's also helpful because, remember, at the end
of the day we want to do classification or regression or something;
we want to get this thing down to a very small number of outputs,
and if we have a huge high-dimensional volume,
then any way we can reduce its size is good.
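A 1-D version of max pooling is enough to see both effects at once: the strongest activation in each region survives, and the output is half the size. The activation values here are made up.

```python
import numpy as np

def max_pool_1d(activations, pool_size=2):
    """Keep only the strongest activation in each non-overlapping region."""
    n = len(activations) // pool_size * pool_size
    return activations[:n].reshape(-1, pool_size).max(axis=1)

acts = np.array([0.1, 5.0, -2.0, 3.0, 0.0, 0.2])
print(max_pool_1d(acts))  # [5.0, 3.0, 0.2]: half the size, strongest responses kept
```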
So this is an example of how these networks are put together,
and it's pretty straightforward:
you have your convolutional layers,
you stack them up,
every once in a while you do some pooling, and you go
down and down in dimensionality until you eventually get to
a distribution over possible labels.
This ties into what I was saying before about the Lego block analogy:
this entire network is built up of one, two, three,
four different kinds of Lego blocks,
and you basically just stack them on top of each other and
compose them in order to get an image classifier.
So I'm going to talk about three case studies of CNN architectures.
The first one is AlexNet.
This was the one that did really well in
the ImageNet competition and really brought CNNs into the mainstream for computer vision.
Basically, it was just a really big neural network.
One trick they used was ReLUs instead of sigmoids.
The sigmoid we've learned about
is an activation function that looks like an S-shaped curve.
What they did instead was use the ReLU,
which looks more like a hinge,
and in practice it turns out to be a little easier to train and use.
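The two activation functions compared above are each one line. The sample inputs are arbitrary; the point is the shape: sigmoid squashes everything into (0, 1), while ReLU zeroes out negatives and passes positives through unchanged.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # S-shaped curve, squashes into (0, 1)

def relu(z):
    return np.maximum(0.0, z)        # hinge: 0 for negatives, identity for positives

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))  # ~[0.12, 0.5, 0.88]
print(relu(z))     # [0.0, 0.0, 2.0]
```

One reason ReLU trains more easily: its gradient is exactly 1 on the positive side, so it doesn't saturate the way sigmoid does for large inputs.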
The next one is VGGNet,
which did well on ImageNet a couple of years later.
Basically, it's very similar,
just a CNN.
The thing to note about this one is that it's very uniform:
it was 16 layers, with nothing fancy in it,
just a bunch of these Lego blocks stacked up.
The entire network is simple enough that
just by looking at this picture
you could probably re-implement it.
Something else to note is that it started this trend of
tall and skinny networks:
there are a lot of layers,
but each layer is quite thin.
Residual networks, or ResNets,
take that to the nth degree.
The idea with a ResNet is this:
most of the time, you take your input,
pass it through a matrix, and get an output.
If you also add your input back in,
that is very helpful, because it makes
it easy for the model to learn the identity function,
and so you can give the model the capacity of, say, 100 layers.
These residual connections
(which is what you call it when you basically just add x back in)
allow the model to skip a layer if it decides that's what's best for itself:
it can just set W to 0.
It also helps with training.
In backpropagation, if you take the derivative of a layer's output
with respect to its input,
that derivative is just 1 for the identity part of the sum.
So you can think of it as
giving the error signal a highway through the network,
allowing the gradients to propagate
much deeper into these large neural networks.
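The residual idea is small enough to sketch directly. This is a simplified block (no batch norm, a single made-up weight matrix, ReLU assumed as the non-linearity); the key line is the `+ x`.

```python
import numpy as np

def residual_block(x, W):
    """Residual layer: output = f(W x) + x. Adding x back in lets the layer
    fall back to the identity (just set W to 0) and gives gradients a direct path."""
    return np.maximum(0, W @ x) + x

x = np.array([1.0, -2.0])
W_zero = np.zeros((2, 2))
print(residual_block(x, W_zero))  # with W = 0 the block is exactly the identity: [1, -2]
```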
ResNet got a 3.6% error on ImageNet.
If you remember, AlexNet
blew everyone out of the water, and it got around 15%.
I think 3.6% is much better than human performance on this benchmark,
and this idea of residual connections
will come up again later when we talk about recurrent neural networks.
So just to summarize:
convolutional neural networks are often applied to image classification.
The key idea is that you have
these filters which you slide around the input, and that gives you, one,
this idea of local connectivity,
so each position in the output only
depends on a small patch of the input instead of the entire input,
and two, shared parameters.
Depth has turned out to really matter
for these networks; to this day
it seems like every day a deeper network comes out, and
people haven't really found a limit to the benefits of depth.
What's the best way to design one of these networks?
Is it trial and error,
where effectively you're just trying to get some result,
or is there any intuition as to how many layers,
what the layers should be, and so forth?
Yeah. So the question was how to design
these things, since they seem so arbitrary.
[LAUGHTER] And yeah, it is really arbitrary.
I think there are a few different approaches.
First, you start with something that sounds reasonable,
and then you do some kind of grid search
(there's now also a literature on
meta-learning, where you have a model decide what
your model looks like), but in most cases you just hand-tune it:
if I add a layer, does performance go up or down?
Second, you look at the literature and say, okay,
someone else solved a similar problem using networks X,
Y, and Z, so I'm going to start with that and fiddle from there.
And third, you literally take
a network that's been pre-trained on one task and apply it to yours.
We'll talk about it later, but pre-training networks and
applying them to your task has been shown to be very helpful.
Okay. So now we're going to talk about recurrent neural networks.
The idea here is that you're modeling sequences as input.
This could be things like text or sentences;
it could also be things like time series or financial data.
A recurrent neural network
feeds its past state back into itself,
so it has time dependencies.
For example, we have this very simple recurrent neural network here.
It is a function with one matrix that
takes as arguments the past hidden state and the current input,
and then it predicts the next hidden state.
This is what it looks like if you were to write it in code,
and this is what the actual network looks like:
there's an input, and you feed that into your function along with your current state,
and it just loops back on itself.
Most of the time, people talk about a third perspective, which is
taking this network and unrolling it,
unfolding it across time:
at every time step you have an input and a state,
and then you have your function, which carries you to the next state. Yeah?
I'm just curious, how does this differ from having your original weights and updating those weights? Because that sounds like a similar analogy.
[inaudible] we have the previous state.
Yeah.
But how does it compare to our machine [inaudible] classifier, where we had the previous weights and we were just updating the previous weights?
Oh, I see. So the question was, what's the difference between this and the setting before, with stochastic gradient descent, where we were updating our weights sequentially. Yeah, that is an interesting question. The difference is that SGD is sequential in the training, whereas this is sequential in the inference. So you feed in, let's say, ten time steps as inputs, and then after all that time, you back-propagate once for all those time steps.
So to make that more clear: for SGD, you have x_1, y_1, x_2, y_2, and you use each of these to update w, right? So you update w, and then you update w again. And yeah, it's an interesting observation that there is this kind of time dependency, but there's no time within the data itself. For the recurrent setting, it's more like you have x_1_1, x_1_2, x_1_3, and y_1, and then x_2_1, x_2_2, and x_2_3, and y_2, and you use each of these to update w. So in this setting, when we talk about time or a temporal sequence, we're talking about a sequence here in the data, not necessarily in the learning.
Yeah. Okay. So to make this more concrete, we're going to talk about a neural network language model. This is a model that is in charge of sucking in a sentence and predicting the most likely word to come next. Each input we call x, and our hidden states we call h's. The way this works is we have some function that takes x_1 and encodes it into our hidden state. Then we have a second function that takes the hidden state and decodes it into the next input. We continue by taking both x_2, our next input, and h_1, our previous hidden state, and using them to create a new hidden encoding. Then we take that new hidden encoding and decode it into our next input. And we just rinse and repeat: each time we take the current input and the previous hidden state to first create an encoding and then predict the next input. So there are those two steps.
The cool thing to note is that we're now building up vectors, these h_i's, and that's exactly what we're looking for: a vector that, in some way, captures the meaning, or a summary, of all the inputs we fed in up until that time step. So now we have a vector which compresses all those inputs into one vector.
So to make this very concrete, one way you could build this thing is by sticking a matrix into each of these arrows. Our encode function would take the input x_t and multiply it by a matrix to get a vector, then take the previous hidden state h_t-1 and multiply it by a different matrix to get another vector, and add those vectors, which gives you your new hidden state. And decode is the same thing: you take your hidden state, pass it through a matrix to get a vector, and then send that through a softmax to turn the vector of logits into a distribution of probabilities.
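Here's a sketch of those two steps. The matrix names and sizes are invented; a real language model also learns biases and word embeddings, and here a tanh squashing is added to the sum, as is common, though the description above just adds the vectors.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, hidden_dim = 10, 8  # hypothetical sizes

W_x = rng.normal(scale=0.1, size=(hidden_dim, vocab_size))  # input -> hidden
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden -> hidden
W_o = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))  # hidden -> logits

def encode(h_prev, x_t):
    # Multiply the input and the previous state by their matrices and add.
    return np.tanh(W_x @ x_t + W_h @ h_prev)

def decode(h_t):
    # A matrix gives a vector of logits; softmax turns it into probabilities.
    logits = W_o @ h_t
    e = np.exp(logits - logits.max())
    return e / e.sum()

h = encode(np.zeros(hidden_dim), np.eye(vocab_size)[3])  # one-hot word id 3
p = decode(h)  # a distribution over the next word
```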
In general, though, there's this problem with recurrent neural networks. If there's a short dependency, meaning an output depends on a recent input, then the path through the network is very short, right? So it's easy for the gradients to reach where they need to reach in order to train the network properly. But if there's a very long dependency, then the gradients have difficulty getting all the way through. If you remember, we talked about gradient descent as a credit-assignment problem, where the gradient is in some ways saying: if I perturb this input by a small amount, how much will the output change? That's, in some sense, what the gradient is saying. But if the input and output are super far away from each other, then it's very difficult to compute how small perturbations of the input would affect your output.
And the reason for that, we won't get into it so much, but basically if you want to compute the gradient, you have to trace the entire path of that dependency, look at all the partial derivatives along that path, and multiply them all up. The problem is that if the path is very long, you're multiplying a lot of numbers. If your numbers are less than 1, the overall product is going to get really small really fast, right? And if the numbers are bigger than 1, the product is going to blow up really quickly. That is a problem because it means your gradients are either going to be tiny, with no learning signal, or way too big, and you'll shoot off in some crazy direction that, in practice, will blow up your experiments and nothing will work.
[LAUGHTER]. So it's a problem.
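You can see the effect with a toy product of fifty per-step derivatives (the path length and values are made up for illustration):

```python
import numpy as np

# Fifty per-step partial derivatives multiplied along a long path:
# values below 1 vanish, values above 1 explode.
vanishing = np.prod(np.full(50, 0.9))   # 0.9 ** 50, roughly 0.005
exploding = np.prod(np.full(50, 1.1))   # 1.1 ** 50, roughly 117
```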
The good thing is that the exploding gradient problem is not so bad; there's a quick fix. What people do is what's called clipping gradients. You specify some norm, and say: any gradient with a norm bigger than 2, I'm going to clamp to 2. So if your gradients explode and go to 10 million, you say, "Okay, that's bigger than 2, so it wasn't 10 million, it was actually 2," and you go from there. [LAUGHTER]
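A minimal sketch of norm clipping. One note: rather than hard-setting values, the usual trick rescales the whole gradient vector so its direction is preserved; the threshold of 2 is just the example from above.

```python
import numpy as np

def clip_by_norm(grad, max_norm=2.0):
    """If the gradient's norm exceeds max_norm, rescale it down to max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

g = np.array([3.0, 4.0])                     # norm 5: gets rescaled
clipped = clip_by_norm(g)                    # norm is now exactly 2.0
```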
But for the vanishing gradient problem, here's this cool idea: the long short-term memory cell, which is similar to a recurrent neural network but has two hidden states. This is kind of a wall of equations, but the important thing to note is that this is basically your input, in a way, and this is kind of like your previous hidden state. What's going on here is an additive combination: you're taking your input and adding in your previous hidden state, very similarly to those residual connections in the ResNet. Because you're adding in your previous state, it's kind of like adding in your previous input, and that gives the gradients a kind of highway to very easily go back in time.
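Here's the wall of equations as a sketch of one common LSTM formulation (gate matrix names are made up, biases are omitted, and the input is assumed to have the same size as the hidden state). The line to notice is the additive update of the internal state c:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4  # hidden size (hypothetical)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One matrix per gate, each acting on the concatenated [h_prev; x_t].
W_f, W_i, W_o, W_c = (rng.normal(scale=0.1, size=(d, 2 * d)) for _ in range(4))

def lstm_step(h_prev, c_prev, x_t):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z)             # forget gate
    i = sigmoid(W_i @ z)             # input gate
    o = sigmoid(W_o @ z)             # output gate
    c_tilde = np.tanh(W_c @ z)       # candidate update from the input
    c = f * c_prev + i * c_tilde     # additive combination: the gradient highway
    h = o * np.tanh(c)               # the state exposed to the world
    return h, c

h, c = lstm_step(np.zeros(d), np.zeros(d), rng.normal(size=d))
```

Note the two returned states: `h` is the one you'd expose through an RNN-like API, and `c` is the internal one discussed next.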
There's another perspective on this. In this picture the notation is different, but the thing to note is that those are, you could say, the hidden states in this network. In an LSTM, you could say there are two hidden states; that's what people say. You have one hidden state, h_t, which is the state you expose to the world. If you say, "My LSTM is going to have the same API as my RNN," then this would be the equivalent of the hidden state we have for the RNN. But then you also have this internal hidden state, the c state, that you never expose to the world. In this picture, the notation is a little confusing, but the o here corresponds to the h in the previous picture; this is the hidden state you expose to the world. And the s corresponds to c, your internal hidden state. The thing to note about this picture is that s is just zipping around on what's called the constant error carousel: it's always internal, looping around this thing.
What it ends up doing in practice is learning a vector that contains very long-term information that's useful to the network over many, many time steps. So if you poke around at individual dimensions of that state, you can find these long-term things being learned. For example, Andrej Karpathy has a great blog post: you find units that track the length of the sentence, and units that track syntactic cues like quotes or brackets. But in general, you find a lot of things that are just not easily interpretable.
So one last cool idea that people have used with these recurrent neural networks is sequence-to-sequence models, like machine translation, where you have two sequences: an input sequence and an output sequence. You want to suck in your input sequence and then spit out your output sequence. You do this with what's called the encoder-decoder paradigm. You encode your sequence by giving it to your RNN, and that gives you one vector, which is an encoding or compression of that input. Then you decode your sequence by spitting out your outputs, just like we were talking about before with the language model.
More recently, there are these attention-based models, which are very helpful in the case of long sequences. If you look back here, x_1, x_2, and x_3 are all getting compressed into a single vector. Well, if you have a really long paragraph, maybe it's hard to shove that into your, you know, 200-dimensional vector. It's hard to capture the depth of all that language with just a bunch of numbers. So the idea behind attention is to look back.
The way attention works, at a very high level, is this. You have your inputs, x_1, x_2, and x_3, and we've run our RNN over these inputs, so we have three hidden state vectors, h_1, h_2, and h_3. And now we're decoding, so we have our RNN decoder, which has some hidden state; we'll call it s_1. What happens is you compare your current hidden state with all of the states in your encoder, and for each one you compute a number that says, how much do I like this state? So maybe it really, really likes this vector, it's not too happy about this one, and it doesn't like this one. Then it uses these scores to turn them into a probability distribution, which again says: how much do I, as s_1, like each of these vectors? And then you compute a weighted average of these hidden vectors, where the weights come from this distribution.
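A sketch of that scoring-and-averaging step. Dot-product scores are one common choice of comparison, and all the sizes here are made up:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5  # hidden size (hypothetical)

H = rng.normal(size=(3, d))  # encoder states h_1, h_2, h_3
s1 = rng.normal(size=d)      # current decoder state

scores = H @ s1              # "how much do I like each encoder state?"

# Softmax turns the scores into a probability distribution...
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# ...and the result is a weighted average of the encoder states.
context = weights @ H
```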
This serves two purposes. There's another way of writing this down on the slides, but I think it serves two purposes. First, it gives you some interpretation: at every time step you can see which parts of the input the model is focusing on, which parts have a lot of probability mass on them. And second, it releases the model from the pressure of having to put the entire input sequence into a single vector. Now it can dynamically go back and retrieve the information it needs.
Then more recently, there are what are called transformer models, which do away entirely with the RNN aspect; it's just attention. With a transformer, you have your hidden states, and instead of having some decoder hidden state that you compare to the others, you select each hidden state and compare h_1 to all the other h's, including itself, to get a number for how much h_1 likes those other h's. Then you compute your weighted average of all of these hidden states, and that becomes your next layer.
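A stripped-down sketch of that all-pairs comparison. Real transformers add learned query/key/value projections and a scaling factor; this keeps only the compare-and-average core:

```python
import numpy as np

def self_attention(H):
    """Each state attends over all states, including itself."""
    scores = H @ H.T                               # how much h_i likes h_j
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # one distribution per row
    return weights @ H                             # weighted averages -> next layer

rng = np.random.default_rng(4)
H = rng.normal(size=(3, 5))  # three positions, hidden size 5 (hypothetical)
out = self_attention(H)      # one new vector per position
```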
I would recommend taking 224 if you're interested in this topic; transformers are very cool, and they've recently become, I guess you could say, the new LSTM. From these attention distributions you get cool interpretable maps, like in translation. So this is an attention distribution, and it points at words that correspond: the word "economic," for example, corresponds to its translation, and you can see that in the distribution. They also do this in computer vision, where you can highlight areas of a picture.
Yeah, so just to summarize. Recurrent neural networks: you can throw a sequence at them, and they'll give you a vector. There's this intuition that they process inputs sequentially, kind of like a for loop. But they have a problem with training, where the gradients either blow up or shrink to something very small. LSTMs are one way of mitigating this problem, but they're not perfect; they still have to shove all the information into one vector. And so the way people get around that is with attention-based models, where you dynamically go back into your input and retrieve the information you need as you need it.
So now we're going to talk about unsupervised learning. Like I said before, we got neural networks to work well recently, and a lot of that is just because they need a lot of data. But if you're a smaller lab, or if you don't have enough money to pay for a dataset, or if it's a hard problem that there just isn't a lot of data for, there are a lot of cases where there isn't enough data to train these very, very large models with millions or billions of parameters. On the other hand, there are tons of unlabeled data lying around; you can download the whole Internet if you want. And there's this real inspiration from us as human beings. We are never given labeled datasets of what foods are edible and what foods are not edible. Right? You just absorb experiences from the world, use them to inform your future experiences, and you're able to reason about it and make decisions.
So the first thing we're going to get into is auto-encoders. The idea behind auto-encoders is that if you have some information and you try to learn a compressed representation of that information that allows you to reconstruct it, then presumably you've done something useful.
In neural network speak, the way that works is you give it some kind of vector, and you pass that through an encoder, which gives you a hidden vector. Then you pass that hidden vector through a decoder, which you use to reconstruct your input. And the loss, in most cases, is basically the difference: you want your reconstructed input and the original input to be very similar.
So just to motivate this, this isn't deep learning, but principal component analysis could be viewed as one of these encoder-decoders. The idea behind principal component analysis is that you want to come up with a matrix U which can be used to both encode and decode a vector. You multiply x by U to get a hidden vector, a new representation of your data. But then if you multiply your hidden vector by U transpose, that should give you something as close as possible to your original data.
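As a sketch of PCA-as-autoencoder on toy data (the data and the number of components are made up): the top-k right singular vectors of the centered data give a U whose encode-then-decode reconstruction is as close as possible to the original.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4))  # toy data (hypothetical)
X = X - X.mean(axis=0)         # PCA assumes centered data

# The top-k principal directions form the encoding matrix U.
k = 2
_, _, Vt = np.linalg.svd(X, full_matrices=False)
U = Vt[:k]                     # shape (k, 4)

H = X @ U.T                    # encode: multiply x by U
X_hat = H @ U                  # decode: multiply h by U transpose

error = np.mean((X - X_hat) ** 2)  # small reconstruction error
```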
But there's a problem. If the hidden vector has as many units as the input, the network isn't going to learn anything, right? It's just going to learn how to copy inputs into outputs. So a lot of research on auto-encoding, and this kind of unsupervised learning, is about how to control the complexity and make the model robust enough to generate useful representations instead of just copying. A first pass at that can be using nonlinear transformations, so you would use something like the logistic or sigmoid function. That means the problem can't be solved anymore by just copying the input into the output, so you're going to have to actually learn something useful.
Another way of doing it is by corrupting the input. So you take your input and you noise it: maybe you drop out some numbers from it, maybe you perturb some of its numbers, maybe you draw from a Gaussian and add that to your input, and then you pass that through. For example, if your vector is 1, 2, 3, 4, you could drop out the 1 and the 4 and just set them to 0, or you could slightly perturb the numbers so that they're close to the original but not exactly the same. The idea is that after you pass this corrupted input through both your encoder and decoder, the output, the eventual x-hat, should be very close to your original uncorrupted input.
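A sketch of just the corruption step on the 1, 2, 3, 4 example (the drop probability and noise scale are made-up values):

```python
import numpy as np

rng = np.random.default_rng(6)

def corrupt(x, drop_prob=0.3, noise_std=0.05):
    """Corrupt an input: zero out some entries, slightly perturb the rest."""
    keep = rng.random(x.shape) >= drop_prob          # which entries survive
    noise = rng.normal(scale=noise_std, size=x.shape)
    return (x + noise) * keep

x = np.array([1.0, 2.0, 3.0, 4.0])
x_tilde = corrupt(x)
# The denoising autoencoder is trained so that decoding the encoding of
# x_tilde reconstructs the clean x, not the corrupted x_tilde.
```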
Yeah.
Another one is the variational autoencoder, which has a cool probabilistic interpretation; you could think of it as kind of a Bayesian network. I think this is maybe more useful to look at: you have an encoder and a decoder, and they are both modeling probability distributions. What this is saying is, I want to encode x into a distribution over h's, and you learn a function which is in charge of doing that. Then you specify some conditions. First you say, "Okay, I want x to be recoverable from my h distribution." And second, there's a term that prevents h from being degenerate.
Maybe a good way of thinking about this: a traditional autoencoder would take my input, send it through some kind of encoder, and map it into a hidden vector, then send that through a decoder to reconstruct my input. Whereas a variational autoencoder is going to take my input and map it into a distribution over possible h's. Then what I'm going to do is sample from this distribution, pass that through my decoder, and produce my reconstructed input. The nice thing about this is that since this is a distribution instead of a single vector, you've imposed some structure on the space: points that are close together in this space should map to similar x-hats, and as you move through the space, you should be able to gradually transition from one reconstructed input to another.
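A sketch of just the sampling step (the encoder outputs here are hard-coded stand-ins; a real VAE learns mu and sigma as functions of x):

```python
import numpy as np

rng = np.random.default_rng(7)

# Pretend the encoder mapped an input x to a *distribution* over h:
# a mean and a standard deviation per dimension (hard-coded stand-ins).
mu = np.array([0.5, -1.0])
sigma = np.array([0.1, 0.2])

# Sample h from that distribution (the "reparameterization" form);
# a decoder would then turn the sample into a reconstruction x-hat.
eps = rng.normal(size=mu.shape)
h = mu + sigma * eps
```

Sampling, rather than passing a single vector through, is what imposes the smooth structure on the latent space described above.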
So for example, there are these cool experiments in computer vision where they'll say, up here, this point gives me a chair, and down here, this one gives me a table. And if I move from one to the other, constantly decoding, it'll gradually morph the chair into the table. It's really cool.
Okay, so the last method of unsupervised learning that we're going to talk about is motivated by this task. There's this dataset called SQuAD, which has about 100,000 examples, where each example consists of a paragraph and then a bunch of questions based on that paragraph. The problem here is that there are only 100,000 examples, and really the intelligence this task is trying to get at is just: can you read a text and understand it? That's more general and is captured by more data than just these 100,000 examples; in particular, it's captured by all the text you could possibly read. There are billions of words on Wikipedia, on Google; you can just crawl the web and download them. If somehow you could leverage that, maybe it would be helpful for your reading comprehension. And that is a perfect case of this setting where we have tons of unlabeled data and a very small amount of labeled data.
So recently the NLP community has come up with this idea called BERT. Well, it's actually not just BERT; a lot of people are doing similar things, but BERT is the example we're talking about. With BERT, what you do is take a sentence and mask out some of the tokens in the input, and then you train a model to fill in those tokens. They actually train the model on a couple of things: they trained it on token filling, and they also would glue two sentences together and ask the model, would these sentences be adjacent in the text or not? Do they make sense together or not? But the idea is basically to give a bunch of unlabeled text to a model which is just going to manipulate that data in order to learn structure from it, without any explicit purpose other than learning the structure.
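A sketch of the token-masking setup. The masking probability and the [MASK] convention roughly follow BERT's recipe, but the details are simplified (real BERT, for instance, also sometimes swaps in random tokens instead of [MASK]):

```python
import random

random.seed(8)

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Hide some tokens; the model is trained to fill them back in."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            targets.append(tok)    # the model must predict this token
        else:
            masked.append(tok)
            targets.append(None)   # no prediction loss at this position
    return masked, targets

sentence = "the cat sat on the mat".split()
masked, targets = mask_tokens(sentence)
```

Note that the labels come for free from the text itself, which is why this scales to as much unlabeled data as you can download.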
And they trained it on a bunch of data for a long time. So we talked about transformers before, and BERT is actually a big transformer: they trained this thing on a ton of unstructured text, on just this word-filling task, for a long time. Then what they did was take their pre-trained BERT and start feeding it questions from SQuAD. They took a question, glued on the paragraph, the context used to answer the question, took whatever vectors came out of BERT, and passed them through a single matrix which basically predicts the answer for that SQuAD question.
And it did really well. This picture is from right when BERT was released; these are all the state-of-the-art models for SQuAD, and BERT not only beat all the other models by a large margin, but also beat human performance. I guess the intuition behind BERT was that by doing these seemingly trivial tasks, like word filling and next-sentence prediction, what you end up learning is that the vectors coming out of the model are vectors that say: what is the meaning of a word? What is the meaning of a word in this context? What is the meaning of this sentence? And that meaning isn't operationalized toward solving any task in particular; it's just, in a very general sense, I'm going to imbue this model with an understanding of language, and once I have an understanding of language, I'm going to apply it to my very targeted downstream task.
And that is kind of the principle behind unsupervised learning: you make up these almost trivial prediction tasks just to manipulate data and learn structure from it, to understand language, or understand what a picture is. Then you fine-tune on the very small amount of labeled data that you have. That's kind of what the current state of the art is in a lot of fields: basically doing more and more unsupervised pre-training with bigger and bigger models and bigger and bigger data. The field really hasn't found a limit to this yet, and it'll be interesting to see how far it goes.
So I'm going to skip those slides, but just to wrap things up: recently, I guess the biggest things that have gotten neural networks working are, one, better optimization algorithms. We have these adaptive algorithms that are not as, I would say, obtuse as SGD; they don't have to move by the same amount every time. You have a lot of tricks, like fine-tuning, unsupervised learning, clipping the gradients, batch norm. We have better hardware, and we have better data, and that allows us to experiment more and train larger models faster.
Yeah, we're waiting a long time. But I think maybe one of the problems with the field is that the theory is, in a lot of ways, lacking. We don't know exactly why neural networks work well and why they're able to learn good functions despite having a very difficult optimization surface.
Yeah. So just to summarize: we talked about a lot of different building blocks. We talked about how to leverage spatial structure with convolutional neural networks. We talked about how to feed sequences into recurrent neural networks, transformers, and LSTMs. We talked about the sequence-to-sequence paradigm for machine translation, and unsupervised learning methods that help you jumpstart your downstream applications.
And I think the big takeaway here is that, in some ways, the big advantage of neural networks is that they are compositional. It's like Legos: they take an input and turn it into a vector, and once you have a vector, you can start combining these things in very flexible ways. So in a lot of ways, designing these things is a lot like putting together a Lego set. You have your building blocks, LSTMs, attention, encoding, and you can decide how to assemble them: "Oh, I want to run this LSTM here and this LSTM here, and then I want this one to attend over that one, and then I'm going to concatenate the result with the output of this CNN." And because of, I guess you could say, the magic of backpropagation, you can combine these things.
Even more generally, it allows you as a programmer, instead of writing a program to solve a problem, to build scaffolding that allows a computer to teach itself how to solve the problem. So instead of defining the function you want the software to learn, you define a very broad family of functions that the software is allowed to learn, and then you let it run off and find the best match within that family.
Uh, yeah.
So those are all the things we're talking about today.
Uh, but, um, I hope you all have a good Thanksgiving break.
