>> It's always fun introducing people who
need no introduction.
But for those of you who don't know Geoff
and his work, he pretty much created--he helped
create the field of machine learning as it
now exists and was on the cutting edge back
when it was the bleeding edge of statistical
machine learning and neural nets when they
first made their resurgence for the first
time in our lifetime, and has been a constant
force pushing the field away from the
touchy-feely "let's tweak something until it
thinks" approach and towards building systems
that we can understand and that actually do
useful things that make our lives better.
So if you read the talk announcement, you've
seen all of his many accomplishments and
memberships of various royal societies,
etcetera, so I won't list those.
I think instead of taking up more of his time,
I'm just going to hand the microphone over
to Geoff.
>> HINTON: Thank you.
I've got--I got it.
So the main aim of neural network research
is to make computers recognize patterns better
by emulating the way the brain does it.
We know the brain learns to extract many layers
of features from the sensory data.
We don't know how it does it.
So it's a sort of joint enterprise of science
and engineering.
The first generation of neural networks--I
can give you a two minute history of neural
networks.
The first generation were things like Perceptrons,
where you had hand-coded features that didn't
adapt: you might put the pixels of an image
here, have some hand-coded features, and you'd
learn the weights to decision units--and if
you wanted funding, you'd make decision units
like that.
These were fundamentally limited in what they
could do, as Minsky and Papert pointed out
in 1969, and so people stopped doing them.
Then sometime later, people figured out how
to change the weights of the feature detectors
as well as the weights of the decision units.
So what you would do is take an image here,
go forwards through a feed-forward neural
network, compare the answer the network gave
with the correct answer, take some measure
of that discrepancy, and send it backwards
through the net. As you go backwards through
the net, you compute the derivatives, for all
of the connection strengths here--both these
ones and those ones and those ones--of the
discrepancy between the correct answer and
what you got, and you change all these weights
to get closer to the correct answer.
That's backpropagation, and it's just the
chain rule.
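As a minimal sketch of that forward-pass/backward-pass recipe, for a tiny two-layer net with logistic units and a squared-error measure of the discrepancy (illustrative only, not the talk's code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Tiny two-layer net: pixels -> hidden features -> decision units.
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, (784, 50))   # pixel-to-feature weights
W2 = rng.normal(0, 0.1, (50, 10))    # feature-to-decision weights

def backprop_step(x, target, lr=0.1):
    global W1, W2
    # Forward pass.
    h = sigmoid(x @ W1)                      # feature activities
    y = sigmoid(h @ W2)                      # the network's answer
    # Discrepancy between the answer and the correct answer.
    err = y - target                         # d(squared error)/dy
    # Backward pass: the chain rule gives a derivative for every weight.
    dy = err * y * (1 - y)
    grad_W2 = np.outer(h, dy)
    dh = (dy @ W2.T) * h * (1 - h)
    grad_W1 = np.outer(x, dh)
    # Change all the weights to get closer to the correct answer.
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1
```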
It works for non-linear units so potentially,
these can learn very powerful things and it
was a huge disappointment.
I can say that now because I got something
better.
Basically, we thought when we got this that
we could now learn anything: we'd get lots
and lots of features, object recognition,
speech recognition--it'll be easy.
There were some problems; it worked for some
things, and [INDISTINCT] can make it work for
more or less anything.
But in the hands of other people, it has its
limitations and something else came along
so there was a temporary digression called
kernel methods where what you do is you do
Perceptrons in a cleverer way.
You take each training example and you turn
the training example into a feature.
Basically the feature is: how similar are you
to this training example?
And then, you have a clever optimization algorithm
that decides to throw away some of those features
and also decides how to weight the ones it
keeps.
But when you're finished, you just got these
fixed features produced according to a fixed
recipe that didn't learn and some weights
on these features to make your decision.
So it's just a Perceptron.
There's a lot of clever math to how you optimize
it, but it's just a Perceptron.
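A minimal sketch of that fixed-recipe feature construction, assuming a Gaussian (RBF) similarity measure (names are illustrative):

```python
import numpy as np

def rbf_feature(x, example, gamma=0.5):
    # Fixed recipe: "how similar are you to this training example?"
    return np.exp(-gamma * np.sum((x - example) ** 2))

def kernel_features(x, train_X, gamma=0.5):
    # One fixed feature per training example; nothing here adapts.
    return np.array([rbf_feature(x, xe, gamma) for xe in train_X])

# A clever optimizer (e.g. an SVM solver) then picks weights over
# these fixed features, zeroing most of them out -- but the features
# themselves never learn, which is what makes it "just a Perceptron".
```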
And what happened was people forgot all of
Minsky and Papert's criticisms about Perceptrons
not being able to do much.
Also it worked better than backpropagation
on quite a few things, which was deeply embarrassing,
but that says a lot more about how bad backpropagation
was than about how good support vector
machines are.
So if you ask what's wrong with backpropagation:
it requires labeled data, and some of you here
may know it's easier to get data than labels.
If you think of it as a model of the brain,
you have about that many parameters
and you live for about that many seconds.
Actually, twice as many, which is important
to some of us.
There's not enough information in labels to
constrain that many parameters.
You need ten to the five bits or bytes per
second.
There's only one place you're going to get
that and that's the sensory input.
So the brain must be building a model of the
sensory input, not of these labels.
The labels don't have enough information.
Also the learning time didn't scale well.
You couldn't learn lots of layers.
The whole point of backpropagation was to
learn lots of layers and if you gave it like
ten layers to learn, it would just take forever.
And then there's some neural things I won't
talk about.
So to overcome these limitations, we want
to keep the efficiency of a gradient method
for updating the parameters, but instead
of trying to learn the probability of a label
given an image, where you need the labels,
we're just going to try and learn the probability
of an image.
That is, we're going to try and build a generative
model that if you run it will produce stuff
that looks like the sensory data.
Another way to say it is: we're going to try
and learn to do computer graphics, and once
we can do that, then computer vision is just
going to be inferring how the computer graphics
produced this image.
So what kind of a model could the brain be
using for that?
The building blocks I'm going to use are a
bit like neurons--they're intended to be a
bit like neurons.
They're binary stochastic neurons.
They get some input, and their output is
either a one or a zero, so it's easy
to communicate, and it's probabilistic.
So this is the probability of giving a one
as a function of the total input you get, which
is your external input plus what you get from
other neurons times the weights on the connections.
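In symbols, a sketch of the standard form of that rule (the bias term $b_i$ is an assumption; the talk just calls it external input):

$$ p(s_i = 1) = \frac{1}{1 + \exp\big(-b_i - \sum_j s_j w_{ij}\big)} $$

where the $s_j$ are the binary states of the other neurons and the $w_{ij}$ are the weights on the connections.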
And we're going to hook those up into a little
module that I call a restricted Boltzmann
Machine.
This is the module here: it has a layer of
pixels and a layer of feature detectors.
So it looks like we're never going to learn
lots and lots of layers of feature detectors.
It looks like we've thrown out the baby with
the bath water and we're now just restricted
to learning one layer of features, but we'll
fix that later.
We're going to have very restricted connectivity,
hence the name, where this is going to be
a bipartite graph.
The visible units for now don't connect to
each other and the hidden units don't connect
to each other.
The advantage of that is if I tell you the
state of the pixels, these become independent
and so you can update them independently and
in parallel.
So given some pixels and given that you know
the weights on the connections, you can update
all these units in parallel, and so you've
got your feature activations very simply,
there's no lateral interactions there.
These networks are governed by an energy function
and the energy function determines the probability
of the network adopting particular states
just like in a physical system.
These stochastic units will kind of rattle
around and they'll tend to enter low energy
states and avoid high energy states.
The weights determine the energies linearly.
The probabilities are an exponential function
of the energies, so the log probabilities
are a linear function of the weights, and
that makes learning easy.
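The standard form of that energy function for a restricted Boltzmann Machine, with bias terms $a_i$, $b_j$ assumed (the talk doesn't mention biases), is

$$ E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i h_j w_{ij}, \qquad p(v, h) = \frac{e^{-E(v,h)}}{Z}, $$

where $Z$ is the normalizing constant (the partition function that comes up again near the end of the talk).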
There's a very simple algorithm that Terry
Sejnowski and I invented back in 1982.
In a general network, you can run it but it's
very, very slow.
In this restricted Boltzmann Machine, it's
much more efficient.
And I'm just going to show you what the Maximum
Likelihood Learning Algorithm looks like.
That is, suppose you take one of your
parameters, a weight on a connection: how do
I change that parameter so that when I run
this machine in generative mode, in computer
graphics mode, it's more likely to generate
stuff like the stuff I've observed?
And so here's what you should do, you should
take a data vector, an image, and you should
put it here on the visible units and then
you should let the visible units via their
current weights activate the feature detectors.
So you provide input to each feature detector
and you now make a stochastic decision about
what the feature detector should turn on.
Lots of positive input, it almost certainly
turns on, lots of negative input it almost
certainly turns off.
Then, given the binary state of the feature
detectors, we now reconstruct the pixels from
the feature detectors and we just keep going
like that.
And if we run this chain for a long time--this
is called a Markov chain, and this process
is called alternating Gibbs sampling--if we
go backwards and forwards for a long time,
we'll get fantasies from the model.
This is the kind of stuff the model would
like to produce.
These are the things that the model shows
you when it's in its low energy states given
its current parameters.
So that's the sort of stuff it believes in,
this is the data and obviously you want to
say to it, believe in the data, not your own
fantasies.
And so we'd like to change the parameters,
the weights on the connections, so as to
make this more likely and that less likely.
And the way to do that is: measure how often
a pixel i and a feature detector j are on
together when I'm showing you the data vector
v. Then measure how often they're on together
when the model is just fantasizing, and raise
the weights by how often they're on together
when it's seeing data, and lower the weights
by how often they're on together when it's
fantasizing.
And what that will do is it'll make it happy
with the data, low energy, and less happy
with its fantasies.
And so it will--its fantasies will gradually
move towards the data.
If its fantasies are just like the data, then
these correlations, the probability of pixel
i and feature detector j being on together
in the fantasies will be just the same as
in the data, and so it'll stop learning.
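In symbols, the rule just described, with $\varepsilon$ a learning rate:

$$ \Delta w_{ij} = \varepsilon\big(\langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}\big), $$

which is zero exactly when the fantasy correlations match the data correlations.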
So it's a very simple local learning rule
that a neuron could implement, because it
just involves the activity of a neuron and
of the other neuron it connects to.
And that will do Maximum Likelihood Learning,
but it's slow.
You have to settle for like a hundred steps.
So, I thought about how to make this algorithm
go a hundred thousand times faster.
The way you do it is: instead of running for
a hundred steps, you just run for one step.
So now you go up, you come down and you go
up again.
And you take this difference in statistics
and that's quite efficient to do.
It took me 17 years to figure this out and
in that time computers got a thousand times
faster.
So, the change in the weight now is the difference--is
a learning rate times the difference between
statistics measured with data and statistics
measured with reconstructions of the data.
That's not doing Maximum Likelihood Learning
but it works well anyway.
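A minimal numpy sketch of that one-step shortcut (contrastive divergence, CD-1), with the model statistics replaced by reconstruction statistics; binary units are assumed and biases omitted, and all names are illustrative rather than Hinton's actual code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, lr=0.1, rng=np.random.default_rng()):
    """One contrastive-divergence update of an RBM weight matrix W.

    v0: batch of data vectors, shape (n_batch, n_visible), values in {0,1}.
    W:  weights, shape (n_visible, n_hidden).  Biases omitted for brevity.
    """
    # Go up: stochastically activate the feature detectors from the data.
    h0_prob = sigmoid(v0 @ W)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # Come down: reconstruct the pixels from the binary features.
    v1_prob = sigmoid(h0 @ W.T)
    v1 = (rng.random(v1_prob.shape) < v1_prob).astype(float)
    # Go up again: feature probabilities for the reconstruction.
    h1_prob = sigmoid(v1 @ W)
    # <v_i h_j> measured on data minus the same measured on reconstructions.
    pos = v0.T @ h0_prob
    neg = v1.T @ h1_prob
    return W + lr * (pos - neg) / v0.shape[0]
```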
So I'm going to show you a little example:
we're going to take a little image of
handwritten digits--this is just a toy example.
We're going to put random weights on the connections,
then we're going to activate the binary feature
detectors given the input they're getting
from the pixels, then we're going to reconstruct
the image. Initially we'll get a lousy
reconstruction--it will be very different
from the data because the weights are random.
And then we're going to activate the feature
detectors again, and we're going to increment
the connections on the data and decrement
the connections on the reconstructions,
and that is eventually going to learn nice
weights for us, as I'll show you--nice connection
strengths that will make this be a very good
model of [INDISTINCT].
It's important to run the algorithm where
you take the data, and on the data you increment
connection strengths, and on the reconstructions--which
are really a sort of screwed-up version of
the data, infected by the prejudices of the
model--you decrement them.
So the model kind of interprets the data in
terms of its features, and then it reconstructs
something it would rather see than the data.
Now you could try running a learning algorithm
where you take the data, you interpret it,
you imagine the data is what you would like
to see and then you learn on that.
That's the algorithm George Bush runs and
it doesn't work very well.
So, after you've been doing some learning
on this for not very long, I'm now showing
you 25,000 connection strengths.
Each of these is one of the features, like
this one.
That's a feature, and the intensity here shows
you the strength of the connections to the
pixels.
So this feature really wants to have these
pixels off and it really wants to have these
pixels on, and it doesn't care much about the
other ones; mid-gray means zero.
And you can see the features are fairly local
and these features are now very good at reconstructing
twos.
It was trained on twos.
So if I show you--show it some twos it never
saw before, and get it to reconstruct them,
you can see it reconstructs them pretty well.
The funny pixels here which aren't quite right
is because I'm using Vista.
So you can see the reconstruction is very
like the data and the--it's not quite identical
but it's a very good reconstruction for a
wide variety of twos and these are ones it
didn't see during training, okay.
Now what I'm going to do--that's not that
surprising; if you just copied the pixels
and copied them back, you'd get the same thing,
right? So that would work very well.
But now I'm going to show it something it
didn't train on.
And what you have to imagine is that Iraq
is made of threes but George Bush thinks it's
made of twos, okay?
So here's the real data and this is what George
Bush sees.
That's actually inconsistent with my previous
joke because [INDISTINCT] this learning algorithm.
Sorry about that.
Okay, so you see that it perverts the data
into what it would like to believe, which is
like what it was trained on.
Okay, that was just a toy example.
Now what we're going to do is train a layer
of features like that, in the way I just showed
you.
We get these features that are good at reconstructing
the data, at least for the kind of data it's
trained on.
And then we're going to take the activations
of those features, we're going to make those
the data, and train another layer, okay.
And then we're going to keep doing that, and
for reasons that are slightly complicated
and that I will partially explain, this works
extremely well.
You get more and more abstract features as
you go up and once you've gone up through
about three layers, you got very nice abstract
features that are very good then for doing
things like classification.
But all these features were learned without
ever knowing the labels.
It can be proved that every time we add another
layer, we get a better model of the training
data--or, to be more precise, we improve a lower
bound on how good a model of the training
data we've got.
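A sketch of that greedy, layer-by-layer procedure, reusing the hypothetical cd1_step from the earlier sketch (the layer sizes come from later in the talk; everything else is an assumption):

```python
import numpy as np

def train_rbm(data, n_hidden, n_epochs=10, rng=np.random.default_rng()):
    # Hypothetical wrapper around cd1_step from the earlier sketch.
    W = rng.normal(0, 0.01, (data.shape[1], n_hidden))
    for _ in range(n_epochs):
        W = cd1_step(data, W, rng=rng)
    return W

def greedy_pretrain(images, layer_sizes=(500, 500, 2000)):
    """Learn a stack of RBMs: each layer's feature activations
    become the 'data' for the next layer -- no labels needed."""
    data, weights = images, []
    for n_hidden in layer_sizes:
        W = train_rbm(data, n_hidden)
        weights.append(W)
        # Treat the feature probabilities as the next layer's data.
        data = 1.0 / (1.0 + np.exp(-(data @ W)))
    return weights
```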
So here's a quick explanation of what's going
on.
When we learn the weights in this little restricted
Boltzmann Machine, those weights define the
probability, given a vector here, of reconstructing
a particular vector there.
So that's the probability of a visible vector
given a hidden vector.
They also define this whole Markov chain,
if you went backwards and forwards many times.
And so if you went backwards and forwards
many times, and then looked to see what you
got here, you'd get some probability distribution
over the hidden vectors, and the weights define
that.
And so you can think of the weights as defining
both a mapping from these vectors of activity
over the hidden units to the pixels, to images--that's
this term--and the same weights define
a prior over these patterns of hidden activities.
When you learn the next-level Boltzmann
Machine up, you're going to say, "Let's keep
this mapping, and let's learn a better model
of the posterior that we've got here when
we use this mapping," and you keep replacing
the implicit prior defined by these weights
with a better one, which is the p of v defined
by the next Boltzmann Machine.
And so what you're really doing is dividing
this task into two tasks.
One is, find me a distribution that's a little
bit simpler than the data distribution.
Don't go the whole way to try and find a full
model, just find me something a bit simpler
than the data distribution.
This is going to be easy for a [INDISTINCT]
Boltzmann Machine to model--that's very nonparametric.
And then find me a parametric mapping from
that slightly simpler distribution to the
data distribution.
So I call this creeping parameterization.
What you're really doing is--it's like taking
the shell off an onion, you got this distribution
you want to model.
Let's take off one shell which is this and
get a very similar distribution that's a bit
easier to model and some parameters that tell
us how to turn this one to this one and then
that's going to solve the problem of modeling
this distribution.
So that's what's going on when you learn these
multiple layers.
After you've learned say three layers, you
have a model that's a bit surprising.
This is the last restricted Boltzmann Machine
we learned.
So here we have this sort of model that says,
"To generate from the model, go backwards
and forwards."
But because we just kept the p of v given
h from the previous models, below this it's
a directed model, where you just go chunk,
chunk to generate.
So the right way to generate from this combined
model when you've learned three layers of
features, is to take the top two layers and
go backwards and forwards for a long time.
It's fortunate you don't actually need to
generate from it; I'm just telling you how
you would if you did.
We want this for perception, so really you
just need to do perceptual inference, which is
chunk, chunk, chunk--it's very fast.
But to generate, you'd have to go backwards
and forwards for a long time, and then once
you've decided on a pattern here, you just
go chunk, chunk--that's very directed and
easy.
So I'm now going to learn a particular model
of some handwritten digits, but with all the
digit classes now.
So we're going to put in slightly bigger images
of handwritten digits from a very standard
data set where we know how well other methods
do.
In fact it's a data set on which support vector
machines beat backpropagation, which was
bad news for backpropagation, but we're going
to reverse that in a minute.
We're going to learn 500 features now instead
of 50.
Once we've learned those, we're going to take
the data, map it through these weights which
are just these weights in the opposite direction,
and get some feature vectors.
We're going to treat those as data and learn
this guy, then we're going to take these feature
vectors, we're going to tack on ten label
units.
So now we need the labels, but I'll get rid
of that later.
And so we've got a 510-dimensional vector
here and we're going to learn a joint density
model of the labels and the features.
We're not trying to get from the features
to the labels, we're trying to say why do
these two things go together?
So we're learning a joint model of both, not
a discriminative model.
When we've completed this learning, what we're
going to end up with is, the top level here
is a Boltzmann Machine and so it has an energy
function, and you can think of that as a landscape.
When the weights are all small here or close
to zero, then the energy landscape is very
flat.
All the different configurations here are
more or less equally good.
As it learns, it's going to carve ravines
in this energy landscape.
If you think of it as a 510-dimensional energy
landscape, these ravines are going to have
the property that in the floor of the ravine,
there are about ten degrees of freedom, and
those are the ways in which a digit can vary
and still be a good instance of that digit,
like a two with a bigger loop or a longer
tail.
Up the sides of the ravine, there are like 490
directions, and those are the ways in which,
if you vary the image, it wouldn't be such
a good two anymore.
But the nice thing is, it's going to learn
long narrow ravines, so that one two can be
very different from another two and yet connected
by this ravine--the ravine's captured the manifold--so
it can wander from one two to another in
a way that it won't wander from a two to a
three, even though the three might be more
similar in pixels to the two.
Okay.
I want to show you this generative model actually
generating.
Before I do that, I want to own up: we did
a little bit of fine tuning, which actually
took longer than the original learning. After
you've done that greedy layer-by-layer
learning, you do a bit of fine tuning
where you put in images, you do a forward
pass, bottom-up, with binary states, and as
you do this forward pass, you adjust the connections
slightly so that what you get in one layer
would be better at reconstructing what caused
it in the layer below.
Then you do a few iterations at the top level
Boltzmann Machine, you go backwards and forwards
a few times to get the learning signal there.
And then you do a down pass.
And during the down pass, you adjust the connections
going upwards so they're better at reconstructing
what caused the activity in that layer.
So during the down pass, you know what caused
activity because you caused it and you're
trying to recover those causes.
That fine tuning helps but it will work without
it.
So now I'm going to attempt to show you a
movie.
That's not very nice.
Okay, there's that network.
Here's where we're going to put images.
Here are 500 features, 500 features, 2,000
features, and the ten labels.
First of all, we're going to do some perception.
So I'm going to give it an image and tell
it to run forwards.
Oops, sorry?
I didn't mean that.
I meant that.
And you'll see, these are stochastic, they
keep changing, but it's very sure that it's
a four.
See, those are the identities of these neurons.
It knows that's a four and it has no doubt
about it, even though its feature detectors
are fluctuating a bit.
If I give it a five, hopefully it'll think
it's a five.
Yeah, it doesn't have any doubt.
So now let's be mean to it because that's
a lot more fun.
I'm going to give it that.
So, it says, so, four, six, eight, four, eight,
eight, eight, eight, eight, eight, four.
It can't make up its mind whether it's a four
or an eight, and that's pretty reasonable
in those circumstances.
It will actually, for that one, say eight
a bit more often than anything else.
So, we've classed it as getting that right
but it's very unsure whether it's an eight
or a four.
And just occasionally, it thinks it can be
other things like a two, but it basically
thinks four or eight.
I can make it run faster so you can--okay.
It's basically four or eight, an occasional
six.
I could give it something like this and it
thinks basically one or seven and occasionally
a four.
Because I programmed this myself, I want to
point out--this is my baby--that it's very
reasonable for it to think that it might be
a four, because, look, you can see the four
in there, okay.
Okay.
Now, that was just doing perception but the
very same model does generation.
So, what I can do is I can fix the top level
unit--and all I've done is I've fixed the state
of one neuron.
There's a million connections there because
that's 2,000 by 500.
I just fixed this one neuron, but when I fix
that state, then the weights, the 2,000 weights
coming out of there to these neurons here--what
they'll do is they'll lower the energy
of the ravine for twos and they'll raise the
energy of the ravines for all of the other
guys.
So, now we've got this landscape in which
you got all these ravines but the two ravine
has been lowered.
And if you put it at a random point, it will
eventually stumble into the two ravine and
then it will stay there and wander around.
So, let's see if we can do that.
So, what's really going on here is I'm just
going backwards and forwards up here.
Ignore that for now.
I'm going backwards and forwards here and
letting it gradually settle until it's into
a state that this network's happy with.
So, that's its brain state, and that doesn't
mean much to you.
If you look at that, you don't really know
what it means.
So, what we're going to do is, as it's settling,
we're going to play out the generative model
here.
We're going to do computer graphics to see
what that would have generated.
And so, what you got here is that's what's
going on in its brain and this is what's going
on in its mind.
So, you can see what this is thinking and
I'm serious about that.
That is--I know it sounds crazy--when I say
to you I'm seeing a pink elephant, what I
mean is: I've got a brain state such that,
if there were a pink elephant out there, this
would be perception.
That's how mental states work.
They're funny because they're hypothetical,
not because they're made of spooky stuff.
So, I use this language where the terms refer
to things in the world because I was saying,
"What would have to be in the world for this
brain state to be perception?"
Now, if I got a generative model, I can take
the--take the brain state and say, "Well,
what would have to be in the world for that
to be perception?"
Well, that.
So, that's what it's thinking, that's its
mental state right there.
So, you got brain states and mental states
and most psychologists won't show you both.
Let's go a bit faster.
And it still hasn't settled into the two ravine.
And now it's about in the two ravine.
And now it's just wandering around in that
two ravine and this is what it's thinking.
It knows about all sorts of different twos
and it's very good that it does because that
means it can recognize weird twos.
Let's give it another one.
It hasn't got into the eight ravine properly
yet.
It will jump [INDISTINCT] the ravines; it's
not really there yet.
But by now, it will be in the eight ravine
and it will show you all the sorts of different
eights it believes in, if you run it long
enough.
If you run it for an hour now, it would probably
just stay in the eight ravine showing you
all sorts of different eights, okay.
Let's do one more because I liked it so much.
Again, it's not really in the five ravine
properly yet.
No, that was a six.
By now it's in the five ravine and it will
show you all sorts of weird fives, ones without
tops, some occasional sixes.
And it ends up with a pretty weird one, but
that's definitely a five, and it's very good
that it knows that that's definitely a five
because it has to recognize things like that.
Okay.
That's it for the demo.
I have to get rid of that.
Okay.
So, here are some examples of things it can
recognize.
These are all ones it got right, and you
can see it recognizes a wide variety of twos.
It recognizes that this is a one despite that,
and it recognizes that this is a seven because
of that.
If you try writing a program by hand that
will do that, you'll find it's kind of tricky
if you'd never thought of these examples in
advance.
If you compare it with support vector machines,
now what we're doing here is we're taking
a pure machine learning task.
We're not giving it any prior knowledge about
pixels being next to other pixels.
We're not giving it extra transformations
of the data.
So, this is without--it's a pure machine learning
task without any extra help.
If you get extra help, you could make all
the methods a lot better.
But a support vector machine done by DeCoste
and Scholkopf was very good: it got 1.4%.
The best you can do with standard backpropagation
is about 1.6%.
This gets 1.25% and significance here is about
a difference of 0.1.
So, this is significantly better than that.
[INDISTINCT] maybe gets 3.3%.
Now, I fine-tuned that to be good at generation,
so I could show you it generating using this
sort of up-down algorithm, but we can also
use backpropagation for fine-tuning.
And now that I've got this way of finding
features from the sensory data, I can say
things like: nobody in their right mind would
ever suggest that you should use a local search
technique like backpropagation to search some
huge non-linear space by starting with small
random weights.
It will get stuck in local optima.
And that is indeed true.
What we're going to do is we're going to search
this huge non-linear space of possible features
by finding features in the sensory data and
then finding features in the combinations
of features we find in the sensory data and
keep doing that.
And we'll design our features like that.
So, we didn't need labels, we just needed sensory
data.
Once we've designed all our features, we can
then use backpropagation to slightly fine-tune
them, to make the category boundaries be in
the right place.
So, a pure version of that would be to say:
let's learn the same net but without any labels.
Okay?
So, we do all the pre-training like this.
After we've pre-trained, what we're going
to do is attach ten label units to the top,
and we're going to use backpropagation
to fine-tune these, and the fine-tuning is
hardly going to change the weights at all,
but it is going to make the discrimination
performance a lot better.
So, this is going to be discriminative fine-tuning,
and it gets 1.15% errors, and all the
code for doing the pre-training and the fine-tuning
is on my webpage, if you want to try it.
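A minimal numpy sketch of that discriminative fine-tuning, assuming a stack of pretrained weight matrices like the ones from the earlier sketches (illustrative names, not the code from the webpage):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def finetune_step(x, labels, weights, W_out, lr=0.01):
    """One backprop step over pretrained feature layers plus ten
    label units.  weights: list of pretrained matrices; W_out:
    (top_size, 10), started small and random; labels: one-hot."""
    # Forward pass through the pretrained stack (logistic layers).
    acts = [x]
    for W in weights:
        acts.append(1.0 / (1.0 + np.exp(-(acts[-1] @ W))))
    probs = softmax(acts[-1] @ W_out)
    # Cross-entropy gradient at the label units.
    delta = (probs - labels) / x.shape[0]
    grad_out = acts[-1].T @ delta
    # Push the error back down; the pretrained weights barely move,
    # but the category boundaries end up in the right place.
    delta = (delta @ W_out.T) * acts[-1] * (1 - acts[-1])
    for i in range(len(weights) - 1, -1, -1):
        grad = acts[i].T @ delta
        if i > 0:
            delta = (delta @ weights[i].T) * acts[i] * (1 - acts[i])
        weights[i] -= lr * grad
    W_out -= lr * grad_out
    return weights, W_out
```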
Now, given that we know how to get features
from data, we can now train things we never
used to be able to train with backpropagation.
If you take a net like this, where we're going
to put in the digit and try to get out the
same digit, but with like eight layers of
non-linearities in between: if you start with
small random weights and you backpropagate,
you get small times small times small, and
by the time you get back here, you get small
to the power eight and you don't get any gradient.
If you start with big random weights, you'll
get a gradient, but you'll have decided in
advance where you're going to be in the search
space.
What we're going to do is learn this Boltzmann
Machine here.
After we've learned that, we're going to map
the data through it to get activity patterns,
and then learn this Boltzmann Machine.
Then we're going to learn this Boltzmann Machine,
but with linear hidden units.
And then what we're going to do is put the
transposed weights here, because this is good
at reconstructing that, so this should be
good, and so on.
And we're going to use that as a starting
point, and then we do backpropagation from
there, and it will slightly change all of these
weights and make this work really well.
And so now what it's done is it's communicated
this 28 by 28 image via this bottleneck of
30 units, using a highly non-linear transformation
to compress it.
If you make everything linear here--you leave
out all these layers and make everything linear--this
is PCA, Principal Components, which is
a standard way to compress things.
If you put in all these non-linear layers,
it's much better than PCA.
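A sketch of the unrolling just described, assuming a list of pretrained RBM weight matrices with a linear code layer at the bottleneck (hypothetical names; biases omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unroll_autoencoder(rbm_weights):
    """Unroll a stack of pretrained RBMs into a deep autoencoder:
    the encoder uses the learned weights and the decoder uses their
    transposes, since each RBM is already good at reconstructing
    its own input.  Backprop then only has to fine-tune."""
    encoder = list(rbm_weights)
    decoder = [W.T.copy() for W in reversed(rbm_weights)]
    return encoder, decoder

def reconstruct(x, encoder, decoder):
    # Encode; the last encoder layer gives the low-dimensional code
    # (linear units at the bottleneck, as in the talk).
    h = x
    for W in encoder[:-1]:
        h = sigmoid(h @ W)
    code = h @ encoder[-1]          # e.g. 30 linear code units
    # Decode with the transposed weights.
    h = code
    for W in decoder:
        h = sigmoid(h @ W)
    return code, h
```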
So, this is all done without labels, now.
You just give it the digits, you don't tell
it which is which.
These are examples of the real digits, just
one example of each class.
These are the reconstructions from those 30
activities in the hidden layer and you can
see they're actually better than the data.
This is a dangerous line of thought.
PCA does this and you can see it's kind of
hopeless compared to this method.
At least that's what you're meant to see.
Now, we can apply this to document vectors.
I don't find documents as interesting as digits
but I know some people are interested in them.
You could take a document vector--the counts
of the 2,000 most common words--and there's
a big database like this of 800,000
documents.
And so we took 400,000--sorry.
Yeah, I know.
I see people smiling.
[INDISTINCT] 100,000, I'm an academic, okay.
We then train up a neural net like this, where
these are now [INDISTINCT] units.
For those of you who know machine learning,
we can use any units in the exponential family,
where the log probability is linear in the
parameters.
So, we train up this to get some features,
we train up this to get some features, and
then we train up this until you get just two
linear features.
That seems a little excessive and obviously
when we reconstruct, we're not going to get
quite the right counts.
But you'll get counts that are much closer
to the right counts than the base rates.
So if, down here, you have a high count
for Iraq and Cheney and torture, up here
you'll get high counts for similar things.
So, we can turn a document into a point in
the two dimensional space.
And of course once we got a point in two dimensional
space, we can plot it in 2D.
And for this database, someone had gone through,
more or less by hand, and labeled
all the documents.
We didn't use the labels, okay.
But now when we plot the point in 2D, we can
color the point by the class of the document.
So, if you do the standard technique, which
is Latent Semantic Analysis--which is just
a version of PCA--and you lay out these documents
in 2D, that's what you get.
And you can see the green ones are in a slightly
different place from these blue ones, but it's
a bit of a mess.
If you use our method, it does a little bit
better.
You get that.
And so now, if you look at these documents--these
are business documents, right?
If you look at these documents here, you can
see there's lots of different kinds of documents
about accounts and earnings.
Presumably, there's an Enron cluster in here
somewhere and it would be very nice to know
which are the companies that are in this Enron
cluster.
Okay.
But there's something more interesting you
can do.
That's just for visualization.
But now I'm going to show you how to solve
the following problem.
Suppose I give you a document.
So, this isn't like what I call Google Search
where you use a few key words and you find
what you want.
This is--I give you a document and I ask you
to find similar documents to the one I gave
you.
Okay?
Documents with similar semantic content.
So, I'm using a document as a query.
What we're going to do is we're going to take
our big database of documents, a whole million
of them, and we're going to train up this
network and it's going to convert these documents
into 30 numbers.
I'm going to use logistic units here, that
is numbers that range between 1 and 0 and
we're going to train it as Boltzmann Machines.
Then we're going to back propagate and we'll
get intermediate values here that convey lots
of information.
And then we're going to start adding noise
here and we're going to add lots and lots
of noise.
Now, if I add lots and lots of noise to something
that has an output between 0 and 1, there's
only one way it can transmit a lot of information:
it's got to make the total input that comes
from below be either very big and positive,
in which case it'll give a one, or very big
and negative, in which case it'll give a zero.
And in both those cases, it will resist the
noise.
If it uses any intermediate value, the output
will be determined by the noise, so it won't
transmit information and it won't be very
good at getting the right answers.
>> So the noise is something like Gaussian,
it's not binary flipping.
>> HINTON: It's Gaussian noise.
And we gradually increase the standard deviation
and it's noise in the input to the unit.
And we gradually increase this, and we use
a funny kind of noise that I don't want to
get into, that makes it easier to use conjugate
gradient descent.
And what will happen is, these will turn into
binary units.
So, we now have a way of converting the word
count vector of a document into a 30-bit binary
vector.
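A sketch of that noise trick, under the assumption of plain Gaussian noise added to each code unit's total input (the talk's "funny kind of noise" for conjugate gradients is not reproduced here):

```python
import numpy as np

def noisy_code_layer(total_input, noise_std, rng=np.random.default_rng()):
    """Logistic code units with additive Gaussian noise on their inputs.
    As noise_std grows during training, the only way a unit can still
    transmit information is to make its input very big and positive
    (output ~1) or very big and negative (output ~0) -- so the real-
    valued codes are driven to become effectively binary."""
    noisy = total_input + rng.normal(0.0, noise_std, total_input.shape)
    return 1.0 / (1.0 + np.exp(-noisy))

# Training-schedule sketch: gradually ramp the noise up.
# for epoch in range(n_epochs):
#     noise_std = 0.5 * epoch          # illustrative, not from the talk
#     codes = noisy_code_layer(features @ W_code, noise_std)
```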
And now we can do what I call supermarket
search.
So, suppose you want to find things that are
like a can of sardines.
What you do is you go to your local supermarket
and you say to the cashier, "Where do you
keep the sardines?"
And you go to where the sardines are, and then
you just look around, and there are all the things
similar to sardines, because the supermarket
arranges things sensibly.
Now, it doesn't quite work because you don't
find the anchovies, as I discovered when I
came to North America, I couldn't find the
anchovies.
They weren't anywhere near the sardines and
the tuna.
That's because they're near the pizza toppings.
But that's just because it's a three dimensional
supermarket.
If there was a 30 dimensional supermarket,
they could be close to the pizza toppings
and close to the sardines.
So, what we're going to do is we're going
to take a document and using our learned network,
we're going to hash it to this 30-bit code.
But this is a hash code that was learned.
It's not some random little thing.
It was learned with lots of machine learning.
So, it has the property that similar documents
map to similar codes.
So, now we can use hashing for doing approximate
matches.
Everybody knows hashing is nice and fast, and
usually you can't do approximate matches.
But with machine learning, you can have both.
So, we take our document, we hash it to a
code and in this memory space, at each point
in the memory space, we put a pointer to the
document that has that code and your [INDISTINCT]
so if two documents have the same code, you
can figure out what to do.
So now, with the query document, we just go
there and now we just look around like in
the supermarket.
And the nearby similar documents will have
nearby codes.
And so, all you need to do to find a similar
document is flip a bit and do a memory access.
Okay.
That's two machine instructions.
So, if you were to have a database of, let's
say, 10 billion documents, and I give you one
and say, "Give me 100,000 documents similar
to this one"--for the other search technique
I'm going to use, which can only cope with
100,000--you're going to have to, 100,000
times, flip a bit and do a memory access.
So, that's only 200,000 machine instructions:
I need two machine instructions per document.
It's completely independent of the size of
your database.
Okay.
Okay.
Because you've laid things out like in a supermarket,
you've got a document supermarket now [INDISTINCT]
So, if you compare it with--well, we've actually
only tried it, because we're academics, on 20-bit
codes and a million documents, and it works
just fine, but nothing could possibly go wrong
when you scale it up.
It's actually quite accurate.
That is, if you compare it with a sort of
gold standard method, it's about the same
accuracy and when you now take your shortlist
that you find in this very fast way and you
give those guys in the shortlist to the gold
standard method, it works better than the
gold standard method alone.
It's much better than locality sensitive hashing:
in terms of speed, we used the code
that's on the web for that, and ours is about
50 times faster.
And in terms of accuracy, locality sensitive
hashing will always be less good than this
because it's just a hack for doing this.
And locality sensitive hashing works on the
count vector.
If you work on the count vector, you will
never understand the similarity between the
document that says "Gonzales quits" and
the document that says "Wolfowitz resigns."
They're very similar, but not in the word count
vector.
But if you compress it down to some semantic
features, they're very similar documents.
So, the summary is that I showed you how to
use this simple little Boltzmann Machine with
the bipartite connections to learn a layer
of features.
Then I showed you that if you take those features,
you can learn more features.
And as you go up this hierarchy, you get more
and more complicated features that are going
to be better and better for doing classification.
This produces good generative models--so
they're good at reconstructing data, or
producing data like the data you saw--if
you fine-tune with this [INDISTINCT] algorithm,
which has this funny name.
If you want good discriminative models, what
you do is then fine-tune with backpropagation.
But the good news is you don't need labels
for all of your training data.
You can learn all these features on very
big data sets, and then with just a few million
labels, or even a few hundred labels, you can
backpropagate to fine-tune it for discrimination.
And that will work much better than, for example,
any machine learning method that just
uses the labeled data.
It's a huge win: you can use the unlabeled
data very effectively.
And I've shown you that it can also be used
for explicit dimensionality reduction, where
you get a [INDISTINCT] bottleneck, and that
you can do search for similar things very fast.
And of course we'd like to apply it to images,
but for images you have a problem, which is:
in documents, a word is very indicative of
what the document is about.
In an image, what's indicative of what the
image is about is a recognized object, and
so what we are trying to do now is make it
recognize objects, so that [INDISTINCT] then
we can get the objects in the image and then
apply the semantic hashing technique.
But we haven't done that yet.
I see I've managed to talk very fast, so I can
show you a little bit about how we're going
to do the image recognition.
Suppose you want a generative model--a graphics
model--which would allow you to take the type
of an object and produce an image of
that object.
So, I say "square" and I say what its pose
is, its position and orientation.
Then we might have a top-down model that,
from this and this, predicts where the parts
might be.
And if it's a kind of sloppy model, it'll
say this [INDISTINCT] ought to be round about
there, and this [INDISTINCT] ought to be round
about there.
And if we pick randomly from these distributions,
we'll get a square where the edges don't meet
up.
Now, one way we can solve that is to generate
very accurately here.
We could say, I'm going to generate each piece
just right.
But that requires high bandwidth and lots
of work.
We're going to generate sloppily.
We're going to generate a redundant set of
pieces and then we're going to know how the
pieces fit together.
We're going to know a corner must be co-linear
with an edge, and the edges here must be co-linear
with corners.
And now, by lateral interactions here, using
something called a Markov Random Field, we
can get it to settle into that.
And so now, the [INDISTINCT] process is: at each
level, the level above says where the major
pieces should be, roughly, and a level that
knows about how these pieces go together,
like how eyes and noses and mouths go together,
says, "Okay, the nose should be exactly above
the middle of the mouth and the eyes should
be at exactly the same height."
The level above doesn't need to specify that,
that's known locally.
So, how are we going to learn that?
Well, we're going to introduce lateral interactions
between the visible units.
That's fine.
The really crucial thing in these nets is you
don't have lateral interactions between the
hidden units.
So, we can learn that, and the way we learn
it is: we put an image in here, we activate
the features, then with the features fixed,
providing constant top-down input, we run
these lateral interactions to let this network
settle down, and we replace the binary variables
by real-valued variables.
So, we're doing something called mean-field.
We let this settle down to something it
is happier with--a reconstruction.
It doesn't need to get all the way to equilibrium,
it just needs to get a bit better than this.
And then, we apply the normal learning algorithm
to these correlations and these correlations,
like this.
But we can also learn the lateral interactions
by saying, "Take the correlations in the data
minus the correlations in the reconstructions,"
and that'll learn all these lateral interactions.
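A rough numpy sketch of those two pieces--mean-field settling of the visible units under fixed top-down input, and the correlation-difference rule for the lateral weights; shapes and names are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def settle_visibles(v, top_down, L, n_steps=10):
    """Mean-field settling: replace binary pixels by real values and
    repeatedly update them given the fixed top-down input plus lateral
    input from the other visible units (L is symmetric, zero diagonal).
    It needn't reach equilibrium -- just get a bit happier."""
    for _ in range(n_steps):
        v = sigmoid(top_down + v @ L)
    return v

def update_laterals(data_v, recon_v, L, lr=0.01):
    """Correlations in the data minus correlations in the
    reconstructions, applied to the lateral weights."""
    pos = data_v.T @ data_v
    neg = recon_v.T @ recon_v
    L = L + lr * (pos - neg) / data_v.shape[0]
    np.fill_diagonal(L, 0.0)     # no self-connections
    return L
```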
So now what we're going to do is learn a
network with 400 input units, for a 20 by 20
patch of an image.
This is just preliminary work.
When we learn the first network, these aren't
connected.
Then when we use these feature activities
to learn the second level Boltzmann Machine,
we connect these together and we learn these
and these.
Then when we learned the top Boltzmann Machine,
we connect these together and we learn these
weights and these weights.
When we're finished, we can generate from
the model.
And so as a control, what we're going to do
is learn this model on patches of natural
images, which are notoriously [INDISTINCT]
things to model, because anything could happen
in a patch of a natural image.
So, it's a very hard thing to build a density
model of.
We're going to learn it without lateral connections,
and we get a model that's very like many other
models.
When you generate from it, what you get is
clouds.
So, here are natural image patches, and they
have the property that there's not much going
on, and then there's a sudden [INDISTINCT]
of structure, like here.
So, if you apply a linear filter to these
things, the linear filter will usually produce
a zero and occasionally produce a huge output.
If you apply a linear filter to these things,
it will produce some kind of Gaussian distribution.
These have exactly the same [INDISTINCT]
spectrum as these.
What they don't have is this sort of heavy
tailed distribution where there's not much
happening and then a lot happening, and long
range structure.
So, now what happens if we put in the lateral
interactions and do the learning again?
If you put the lateral interactions in, they
can say things like: if you have a piece [INDISTINCT]
and you'd like a piece of that somewhere around
here, put it here, where it lines up.
So, that will make much longer range interactions.
And so now when we generate from the model
with lateral interactions, we get that and
you can see that these are much more like
real image patches.
They pass many of the statistical tests for
being real image patches.
They've got this kind of much longer range
structure.
They've got sort of co-linear things and things
at right angles and all sorts of nice structure
in them, which we didn't have before.
And so we're getting--this is probably the
best model there is of natural image patches.
If you ask anybody else who models them, "Show
me samples from your generative model,"
they say, "Oh, well, we tried that and it
looked terrible, so we never published those."
This is, I think, the first model that generates
nice samples.
[INDISTINCT] has models that are maybe comparable.
What we'd like to do now is make more layers
and we'd also like to have attention.
So, as you go up, you focus on parts of the
image.
And what I want to do is get something where
you're given an image, you go up, it's focusing
on parts, and it gives you a figure at the top.
It gives you what you see--which is, you look
at an image and you see a face.
And then you look again, you see the eye.
Then you look again, you see a group of four
people.
And those are the things that come out, and
those are going to be like the words that
need to go into an image retrieval system.
This is going to run for a long time learning,
and then it's going to run for quite a long
time on each image, but that's all [INDISTINCT]
okay.
I'm done.
>> So, it looks like we've got time for questions.
If you have questions, can you please use
the mic in the middle so that the folks in
their offices can hear.
>> Okay.
Hi.
So, you were saying that this method doesn't
require labels.
I was just wondering if it would actually
help if you have labels for at least some
of your training data?
>> HINTON: Oh, yes.
Labels help.
The main thing is to show that you can do
a lot without them, and therefore you can
get much more leverage from a few labels.
Yeah.
>> Okay.
Thanks.
>> HINTON: So, for example in the semantic
hashing idea, you could, as you're learning
those 30 dimensional codes, you could say
if two things are from the same class and
the codes are far apart, introduce a small
force pulling them together.
And we've got a paper on that in [INDISTINCT]
last year.
And that will improve the sort of clustering
of things of the same class.
But the point is you can do it without knowing
the classes as well.
>> Hi.
Now, so, people have built autoencoders
for a long time before, and they use regular
sigmoid units and use backprop to train them.
>> HINTON: But they never work very well.
>> Correct.
Would--if we actually have multiple layers
of these, over these sigmoid units and use--and
train them the same function as you're doing,
one layer at a time, would it work as well,
as RBMs or not?
>> HINTON: Okay.
That's a very good question.
So, it's a bit confusing: this deep thing
with multiple layers trained with RBMs is
called a multi-layer autoencoder.
But you could also have a very small autoencoder
with one hidden layer that's non-linear,
and train that up.
And the RBM is just like that.
So, you could train these little autoencoders
and stack them together and then train the
whole thing with backprop--that's what the
question was.
And that will work much better than the old
way of training autoencoders, but not quite
as well as this.
So, Yoshua Bengio has a paper where he compared
doing autoencoders with doing restricted
Boltzmann Machines, and the restricted
Boltzmann Machines worked better, especially
for things like [INDISTINCT] backgrounds.
>> I've got a--I've got a question which--if
I could ask...
>> HINTON: Okay.
>> ...because I'm holding a microphone.
So, this morning we were talking about news,
where the problem with news is that everything
changes from day to day.
Do you have any intuition--this is one of
those unfair "what do you think would happen"
questions--on how hard it would be to adapt
a deep network like this once your input
distribution changes, or as it continues
to change?
>> HINTON: Okay.
So one good thing about this learning is everything
scales linearly with the [INDISTINCT] training
data.
There's no quadratic optimization anywhere
that's going to screw you for big databases.
The other thing is, because it's basically
stochastic online learning, if your distribution
changes slightly, you can track that very
easily.
You don't have to start again.
So, if it's the case that the news tomorrow
has quite a lot in common with the news over
the last few months and few years, and you
just need to change your model a bit rather
than start again, then this is going to be
good for tracking, and it's not going to be
as much work as learning it all in the
first place.
And in fact, once you got all of these layers
of features, basically changing the interactions
in high level features will get you lots of
mileage without much work.
>> So, I have another question about the--about
the supermarket search.
You were saying you just flip a bit in your
hash code.
So, what I'm wondering is--one thing that
I'm not sure about is--if you flip one of
these bits, you might not necessarily get
something there?
>> HINTON: That's fine.
>> I mean, how do you know that you're going
to find something there?
And then also, maybe, is there some way of
finding better bits to flip and like how do
you decided which ones?
>> HINTON: So, of course.
If you make the number of addresses be about
the same as the number of documents, the average
occupancy of an address is one.
>> Right.
>> HINTON: Okay.
And if there's nothing there, you
can flip some more bits.
>> Sure.
>> HINTON: So, yes.
You'll get some misses, but that's just a sort
of constant.
>> Right.
>> HINTON: We can look, actually, at how evenly
spread over addresses it is, and typically,
most of the addresses won't be used, and a
typical address that is used will be used
like three or four times.
So, it's not as uniform as we'd like, but that
could all be improved.
And we've only done this once.
We've just trained this network once on one
data set and that's all the research we've
done so far, really.
If we could get a tiny bit of money from someone,
we could make this whole thing work much better.
>> So, one thing that is special about digits
is that they evolved in a way that makes
them discriminative.
>> HINTON: Yes.
>> So it's not that surprising that an
unsupervised method can extract features
that are discriminative.
I was wondering what happens with [INDISTINCT]
the other applications, where--so clearly,
when you do unsupervised learning, you might
throw away some very indicative features
right there.
>> HINTON: Yes.
So, basically, there's two kinds of learning,
there's discriminative learning where you
take your input and your whole aim in life
is to predict the label.
And then there's generative learning where
you take your input and your whole aim in
life is to understand what's going on in this
input.
You want to build a model that explains why
you got these inputs and not other inputs.
Now, if you do that generative approach, you
need a big computer, and you're going to explain
all sorts of stuff that's completely irrelevant
to the task you're interested in.
So, you're going to waste lots of computation.
On the other hand, you're not going to need
as much training data because each image is
going to contain lots of stuff and you can
start building your features without yet using
information in the labels.
So, if you've got a very small computer, what
you should do is discriminative learning so
you don't waste any effort.
If you got a big computer, do generative learning,
you'll waste lots of the cycles but you'll
make better use of the limited [INDISTINCT]
label data.
That's my claim.
>> Hi Geoff.
I have a question.
What happened to regularization?
What kind of regularization is implicit in
all of your stages?
>> HINTON: Okay.
So, we're using a little bit of weight decay,
and the way we set the weight decay was we
just fiddled about with it for a bit to see
what worked on a validation set--the usual
method.
And if you don't use any weight decay, it
works.
If you use weight decay, it works a bit better.
And it's not crucial how much you use.
So, we are using some weight decay here, but
it's not a big deal.
And like I say, all of the code is [INDISTINCT]
on my web page--there's a pointer on my web
page--so you can go and look at all those
things and all the little fudges we use.
>> Right.
But the Boltzmann Machine is fundamentally
a sort of entropic regularization, and then
your little pieces of tuning with weight decay
are from the other family.
So, you're blending both [INDISTINCT]
>> HINTON: No.
The Boltzmann Machine--it's true, there's
a lot of regularization that comes from
the fact that the hidden units are binary
and stochastic.
So, they can't transmit much information.
>> Yes.
>> HINTON: That does lots of regularization
for you, compared with the normal autoencoder.
But in addition, we say don't make the weights
too big.
And one reason for that is not just regularization:
it makes the Markov chain mix faster
if you don't make the weights too big.
>> Thanks.
>> Hi.
So, in your example of digits, you actually
tell the algorithm that there are ten
classes.
>> HINTON: Yes.
>> So, I wonder, what is the impact if we
do not give this number correctly?
So, yeah.
>> HINTON: Okay.
So, what you can do is you can take this auto
encoder that goes down to 30 real numbers
and not tell it how many classes there are,
just give it the images, get these 30 real
numbers.
Then you can take those 30 real numbers and
apply dimensionality reduction technique that
Sam Roweis and I have developed, and the latest
version of that, you can lay them out in 2D
and you will get 11 classes.
And it did that without ever knowing any labels.
You'll get just these 11 clusters which is
close to 10.
It often thinks that the continental sevens
are a separate clusters.
>> So you are saying this is [INDISTINCT]
you have tried and that's what happened, or?
>> HINTON: I might even have it in this talk
somewhere.
I might not, though.
It's on my--it's--oh, there you go.
That's pure unsupervised on the digits.
Now in this case, these are twos and these
are twos.
In 30D, it's got the clusters.
When you force it down to 2D, it wants to
keep the twos next to each other, but it also
wants these--these are the spiky twos and
these are the sevens--and it wants those
close.
And these are the loopy twos and these are
the threes, and it wants those close.
But it also wants the threes close to the
eights.
And so in 2D, there just isn't enough space
to make ten clusters.
But look, it made 11 there, and if I don't
cheat and do this in black and white, you
can still see there are sort of roughly 11
clusters.
So, this was pure unsupervised learning, and
it found that structure in the data.
So, when psychologists tell you that you impose
categories on this data and they aren't really
there in the world--that's rubbish.
I mean, they're really there.
>> So the magic number is 30.
If I choose another number, will it be
fine with it?
>> HINTON: If you choose a smaller number,
you might not preserve enough information
to be able to keep the classes.
And if you choose a bigger number, then PCA
will do it better.
So your comparison with PCA won't be as good.
>> Thank you.
>> How does the performance of the digit classification
vary with the number of layers you
use?
>> HINTON: Okay.
Obviously, using the number of layers I showed
you is one of the best numbers to use.
If you use fewer layers, it works a bit worse.
If you use more layers, it works about the
same.
I've now got a very good Dutch student who
has the [INDISTINCT]--he doesn't believe
a word I say--and we will know: he's using
like 40 cluster machines and he's going to
get the answer to this.
But so far, I'm right that using fewer layers
isn't as good, and he hasn't got to more layers
yet.
He has actually, with the same number of layers,
made it work better, and we'll see if he makes
it work better with more layers.
>> Just [INDISTINCT] I guess a related question.
So, it's clear how to evaluate these models
if you have some labeled data: [INDISTINCT]
you can try to see if you predict it similarly.
But if you train generatively, these Boltzmann
Machines with, especially, [INDISTINCT]
interactions in the same levels and so on--if
I gave you another data set, can you say how
good it is generatively, and is that easy?
>> HINTON: Okay.
>> How do you evaluate...
>> HINTON: Yeah.
>> ...that kind of part of it?
>> HINTON: So, the problem with these Boltzmann
Machines is there's a partition function, and
what you'd love to do is take your data set,
hold out some examples, train your generative
model on the training set, and then ask: what
is the log probability of these held-out examples?
>> Exactly.
>> HINTON: And that would be the sort of gold
standard.
And that's very hard to do.
You know the log probability up to a constant,
but you don't know the constant.
So, people in my group and I are working very
hard on a method for interpolating between
Boltzmann Machines that allows you to start
with a Boltzmann Machine with zero weights--which
is a pretty dumb model--and then gradually
change the weights towards the Boltzmann Machine
that you eventually learned, and you can get
the ratio of the partition functions of all
these Boltzmann Machines, so in the end, you
can get the partition function.
You can get a pretty good estimate.
This is called--it's a version of [INDISTINCT]
importance sampling, something called bridging.
And we think we're going to be able to get
pretty accurate estimates of the partition
function now, by running for, like, you know,
a 100 hours.
>> Yes.
Yes.
>> HINTON: You do this after you've learned,
just to show how good you are.
But the other thing you can do is generate
from the model and see that the stuff it
generates looks good, and you can then take
the stuff you generated from the model and
apply statistical tests to that, and statistical
tests to the real data, and statistical tests
to the other guy's data--the other guy's
generated data.
And if you choose the right statistical test,
you can make the other guy's data look terrible.
>> Okay.
Okay.
I think we're out of time now.
I'd like to thank Geoff again and...
