YANN LECUN: Yeah, I've spent about half my career in industry and half in academia, so I have one foot in each.
I guess it's more like one half
of a foot in academia
now because I'm spending most
of my time at Facebook
but, you know, things change.
I basically started the first research lab at Facebook, FAIR. There are a bunch of research labs at Facebook now, but this was the first.
And this is a bit
of a cultural shock
for Facebook, which was
a very sort of engineering
short term oriented company that
had to kind of invent itself
a new culture for research.
And that's what I liked
about it, the fact
that it could start from scratch
and basically establish
this culture.
So what we're doing at Facebook is open research. Facebook research is really outward focused: all the research we do is published, and almost all the code we write is open sourced. With that we hope to foster interest in problems that we think are interesting and to steer the community towards working on problems that we think are important.
The thing here is not
whether you know Facebook
technology is ahead of Google's
or Microsoft's
or whatever but more the fact
that the products that we want
to build are not possible today.
We don't have the science even
the basic principles
to build the stuff we wanted--
we want to build.
And so since we don't have
a monopoly on good ideas,
the best we can do is basically
accelerate progress
of the entire community
and that's one of our goals.
Of course we have a big impact
on the company,
in fact, a much bigger impact
than Mark Zuckerberg thought
we would have five years ago
when FAIR was created.
And today Facebook is really
kind of entirely
built around deep learning.
If you take deep learning out of Facebook today, you get dust, essentially.
Not entirely, but you know what
I mean.
OK, so machine learning has had a huge impact in various areas of business and society and science, certainly. But most of the applications of machine learning today basically use supervised learning, one of the three main paradigms of learning, right. So there is supervised learning,
there is reinforcement learning,
which people have been talking
about a lot
in the last few years,
and then there is another thing
that's not very well defined
called unsupervised learning
or self supervised learning
which I'll talk about later.
So supervised learning
is this idea by which if you
want to train a machine,
for example, to classify images
of cars from airplanes,
you show an image of a car,
you run this through the machine
and the machine has
adjustable knobs on it.
And if it says car, you don't do
anything to the knobs.
If it doesn't say car,
you adjust the knobs
so that the answer the machine
produces
gets closer to the answer you
want, OK.
There is a desired answer
that you give to the machine
and you can measure
the discrepancy
between the answer you want
and the answer the machine produces.
And then you show the image
of the airplane
and you do the same.
And by kind of tweaking
the parameters,
you know, with thousands
of examples, eventually perhaps
the knobs will converge
to a configuration
where all the cars are correctly
classified and all the airplanes
are correctly classified.
And the magic of this is that it may even work for airplanes and cars it has never seen before. That's the whole purpose of learning: you learn the concept without having to memorize every example.
So this type of learning,
supervised learning,
works really well if you want
to do speech recognition, that being speech to words, images to categories, face recognition, generating captions for photos, figuring out the topic of a text, translating from one language to another; that's all supervised learning.
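To make the knob-tweaking idea concrete, here is a minimal sketch of that kind of supervised update in plain NumPy, on a made-up "car vs. airplane" toy problem. The data, labels, and update rule here are purely illustrative, in the spirit of the Perceptron-era models that come up next:

```python
# A toy supervised learner: compare the machine's answer to the desired answer
# and nudge the "knobs" (weights) only when it is wrong. Hypothetical data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 50))            # 200 examples, 50 features each
y = (x[:, 0] > 0).astype(float)           # pretend label: 1 = car, 0 = airplane
w = np.zeros(50)                          # the adjustable knobs

for epoch in range(20):
    for xi, yi in zip(x, y):
        answer = float(w @ xi > 0)        # the machine's current answer
        if answer != yi:                  # if it already says "car", do nothing
            w += (yi - answer) * xi       # otherwise nudge toward the desired answer
```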
And the basic idea of this, or the tradition, starts with models from the late '50s and early '60s, the Perceptron and the Adaline, which interestingly at the time were really hardware devices. They were not programs on a computer, they were actually analog computers that were built. At the bottom here is a Perceptron, and what you see here is Bernie Widrow reviving one of his old Adaline systems at Stanford.
So that created
the standard model of pattern
recognition
that really was
prevalent until fairly recently.
By which you take a raw signal and you feed it to what's called a feature extractor, which is hand engineered, OK.
It's designed by people
to basically extract
relevant information
from the raw signal.
And then you feed the result, the feature vector, to a classifier, something like a linear classifier or nearest neighbor or a tree or whatever. There's a lot of techniques that people have come up with over the last 50 years, if not 60, more like 60 years actually, that you can use to do this.
And that's kind
of the standard way.
And what deep learning changed
is the idea that you can learn
this feature extractor.
So instead of having
to engineer, you know, spend
a lot of time and expertise
and money
on building those things
for every new problem,
you can basically train a feature extractor as part of the entire process.
So basically you build the machine as a cascade of trainable modules, which you can call layers, and essentially you train all those modules that transform [INAUDIBLE] using supervised learning.
And you hope that the machine
will learn not just to classify
but also to figure out what
the relevant features are
that need to be extracted
at every layer for the system
to do a good job.
So the next question you can ask
is, what do we put
in those boxes?
And the answer to this
is not recent, it's the idea
of artificial neural networks.
So in an artificial neural net, essentially, the layers are of two types. One type is just a linear operator. So imagine that the signal is represented as a vector, [INAUDIBLE] numbers: the pixel values or signal values, you know, whether it's audio or financial data or whatever.
Represent this as a vector and multiply it by a matrix, OK. When you compute the product of this matrix with this vector, you are computing the dot product of the vector with every row in the matrix, and that's like computing a weighted sum of input features. It's actually represented here. So here's a vector, and with the components of this vector you're computing a weighted sum where the weights are the coefficients in the matrix, and that gives you an output.
And then there is another type
of function here
which
is the pointwise non-linearity.
So you take this vector and then
apply the nonlinear function
to every component
of this vector independently.
In this case, it's what's called a ReLU, which is really just a half-wave rectifier, OK. So it's a function that is the identity for positive arguments and equal to zero for negative arguments, a very simple non-linearity.
And there are theorems that show that with only two layers of this, linear, nonlinear, linear, you can approximate any function you want as closely as you want, as long as the dimension of this middle vector is sufficiently large, possibly infinite. There are no real theoretical results of that kind for deeper networks, but what we know empirically and intuitively is that by stacking lots of those layers, you can represent many functions very efficiently.
So that's the whole motivation
to deep learning
which is that by stacking
multiple layers of linear--
alternating linear and nonlinear
operators, you can approximate
a lot of useful functions
very efficiently.
AUDIENCE: [INAUDIBLE] theory
proved that the deep learning
net can
approximate the simple function.
So if we can go
to the simple function
[INAUDIBLE].
YANN LECUN: What do you mean
by simple?
AUDIENCE: A simple function
is a special term saying
basically [INAUDIBLE].
YANN LECUN: Yeah, yeah.
So the theorems from the late '80s show that with just two layers, with just one layer of non-linearity, you can approximate any function you want.
OK, but there is no limit
on the dimension
of the middle layer.
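As a concrete illustration of that linear, nonlinear, linear construction, here is a minimal PyTorch sketch of such a two-layer network. The input and hidden sizes here are made up; the hidden width plays the role of the middle-layer dimension that the theorems allow to grow arbitrarily large:

```python
# A two-layer network: weighted sums, pointwise ReLU, weighted sums.
import torch.nn as nn

hidden = 4096                              # make this wider to approximate better
mlp = nn.Sequential(
    nn.Linear(100, hidden),                # weighted sums of the input components
    nn.ReLU(),                             # half-wave rectifier applied pointwise
    nn.Linear(hidden, 1),                  # weighted sum of the hidden units
)
```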
So for the longest time, because of the limitations of computer power and because the data sets were small, the number of applications that we could apply these approaches to was very limited.
I mean, the basic techniques
for this are from the late '80s.
But the amount of problems
for which we could have
enough data to train
those systems was very small.
You know, we could use it
for, maybe, handwriting
recognition and speech
recognition and maybe
a few other applications,
but it was kind of limited.
People, at the time, actually,
in the '80s,
were really interested
in hardware implementations.
And this is coming back.
So there is a whole industry now
that has been restarted
over the last three to five
years
on building special purpose
chips to run those neural nets
efficiently,
particularly
for embedded devices.
So probably within the next two
years, every smartphone
will have
a neural net accelerator
embedded in it.
And, you know, within five
years, it'll be in every car.
And, you know, shortly
after that in, basically,
every electronic device that you
buy
will have
a neural net accelerator in it.
And so your vacuum cleaner
will have smart computer vision
in it.
Because you will have a $3 chip
that does
neural net acceleration.
So, you know, how do we train
those things?
So we train them by, you know,
it's basically large-scale
optimization.
So in supervised learning,
you measure the discrepancy
between the answer the machine
produces and the answer
you want, through some sort of objective function that basically measures some kind of distance.
Then you can average this
over a training set of samples
of pairs of input and output.
And the process by which you
tune the parameters
of the system
is just gradient descent.
So figure out in which direction
to change all the knobs so
that the objective function goes
down,
and then take a step
in that direction.
And then sort of keep doing this
until you reach some sort
of minimum.
So what people use is something
called stochastic gradient
descent where you estimate
the gradient on the basis
of a single sample
or maybe a small batch
of samples, right?
So you show a single example,
figure out the error
between those two things,
then compute the gradient
of that function with respect
to all the parameters,
tweak the parameters,
then go to the next sample.
Stochastic because you get
a noisy estimate of the gradient
on the basis of a single sample
or small batch.
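Here is a minimal sketch of that stochastic gradient descent loop in PyTorch, on toy data. The model and numbers are placeholders; the point is that each step estimates the gradient from one small batch only:

```python
# Stochastic gradient descent: noisy gradient estimates from small batches.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.randn(10_000, 20), torch.randn(10_000, 1))
loader = DataLoader(data, batch_size=32, shuffle=True)   # one small batch at a time
model = nn.Linear(20, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for x, y in loader:
    loss = ((model(x) - y) ** 2).mean()    # objective averaged over this batch
    opt.zero_grad()
    loss.backward()                        # gradient estimated from the batch
    opt.step()                             # take a step down that estimated gradient
```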
So the next question you might
ask is, how do we compute this gradient?
And that's where
back propagation comes in.
And I'm sure many of you
are familiar with this.
Don't attempt to understand
the formula.
You don't need to.
But it's basically the idea that
to compute the gradient, which is really the sensitivity of the cost function with respect to all the coefficients in the system, all the matrices in the weighted sums,
you can compute all of those
by doing a backward pass, which
is basically just
a practical application of chain
rule, OK?
So you know that by tweaking
a parameter in this block here,
it will affect the output
in a particular way.
And you know how tweaking
this output affects
the overall cost.
And so it's very easy by using
this sort of back propagation
method
to compute all the terms
in this gradient.
So now for every parameter
in the system,
you have a quantity that
indicates by how much the cost
will increase or decrease if you
tweak the parameter
by some given delta.
So that gives you the gradient.
Take a step
in the negative gradient
direction.
There's various tricks to make
this fast.
So what made deep learning
possible and what makes it
easy to use is you don't have
to figure this out at all.
Like, the modern deep learning
frameworks, you basically build
a network either by writing
a program in Python
or your favorite language
or by assembling blocks that
have been predefined
into a graph.
And the system automatically
knows how to compute
the gradient.
So you know, you tell it how
to compute the output.
And it kind of keeps track
of all the operations that are
done during this computation.
And then can sort of trace back
those operations so that,
automatically, the gradient
of whatever it is you're
computing with respect to all
the parameters you have will be
computed.
So that's a very simple
concept--
automatic differentiation.
But that's really what makes
deep learning so easy to deploy
and use.
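Here is a small sketch of that automatic differentiation idea in PyTorch: you only write the forward computation, and the framework traces the operations and produces the gradient for you. The particular function below is arbitrary:

```python
# Automatic differentiation: write the forward pass, get gradients for free.
import torch

w = torch.randn(3, requires_grad=True)     # parameters ("knobs")
x = torch.tensor([1.0, 2.0, 3.0])

y = torch.tanh(w @ x)                      # any program computing an output
loss = (y - 0.5) ** 2                      # scalar objective
loss.backward()                            # traced operations replayed backward

print(w.grad)                              # d(loss)/d(w), no hand-derived formulas
```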
OK, now, here is a problem.
If you want to apply
deep learning or neural nets
in the way I describe them
to images, it's not really
practical to view an image
as a vector where the values,
the components of the vectors
are the pixels.
Because, you know, if you have
an image that's, say,
you know, 200 by 200 pixels,
that's 40,000 pixels.
And, you know, if you multiply
this by matrix, that matrix
will be 40,000 by something.
It's gonna be large, OK?
Too large.
So you have to figure out how
to kind of specialize
the connection
between the neurons
or, basically, how to build
sparse matrices in such a way
that the computation becomes
practical.
That's the idea
of convolutional networks, which
is something that my name is
associated with.
The inspiration for this
goes back to classic work
in neuroscience from the '60s.
Actually, it's Nobel Prize
winning work by Hubel and Wiesel
on the architecture
of the visual cortex.
And there was sort
of various people
who tried to make computer
models of this,
but didn't have things
like back prop.
So what's the idea
behind a convolutional net?
You take an image
and the operation you're going
to do, the linear operation,
it's not gonna be a full matrix,
but it's gonna be what's called
a discrete convolution which
consists in taking
a little patch of the image--
5 by 5, in this case--
and then computing
the weighted sum of those pixels
with a set of 25 weights
and then putting the result
in a corresponding pixel
on the output, OK?
And then you take that window,
you shift it over a little bit
by one pixel, and do the same,
compute the dot product
or the weighted sum
of the pixels
by those coefficients
and record the result
next to it.
OK, so by swiping this window
over, you get kind of an image
at the output which is a result
of convolving this image
by this so-called convolution
kernel.
So the number of free parameters
here in your matrix
is very small.
It's only 25 in this case, OK?
And the amount of computation
is relatively small.
And the advantage of using
this kind of computation
is in situations where
the signal comes to you
in the form of an array,
you know,
a single or multidimensional
array in such a way
that the statistics are more
or less stationary
and in such a way
also that neighboring values
tend to be highly correlated
whereas far away values are not
or less, OK?
It's the case for like,
you know, financial time series,
for example, that belongs
to this, right?
And people have been using
convolutional nets
on financial time series, yes?
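To make the sliding-window picture concrete, here is a minimal NumPy sketch of a discrete convolution with a 5 by 5 kernel, written as explicit loops. A real implementation such as torch.nn.Conv2d computes the same thing, just far more efficiently, and the 25 coefficients would be learned rather than random:

```python
# Slide a 5x5 window over the image; at each position, take a weighted sum
# of the pixels under it and write the result to the output feature map.
import numpy as np

rng = np.random.default_rng(0)
image = rng.normal(size=(200, 200))        # toy input image
kernel = rng.normal(size=(5, 5))           # the 25 coefficients (to be learned)

out = np.zeros((196, 196))                 # output "feature map"
for i in range(196):
    for j in range(196):
        patch = image[i:i + 5, j:j + 5]    # 5x5 window at this position
        out[i, j] = np.sum(patch * kernel) # weighted sum -> one output value
```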
AUDIENCE: Can it also do edge
detection?
YANN LECUN: Say again?
AUDIENCE: [INAUDIBLE] spanning
tree and things like that.
YANN LECUN: So there are certain configurations of those coefficients that will produce edge detection, yes.
But we're not gonna hard wire
those coefficients.
They're gonna be the result
of learning, right?
We're not gonna build them
by hand.
We're just gonna initialize them
randomly and then train
the entire thing end-to-end,
supervised to produce
a right answer at the end
on millions of examples
or thousands of examples.
And then look at the result.
OK, so that's the first layer.
And we're gonna have
multiple filters of this type.
In this case, you have four.
So you have four filters here producing four so-called feature maps.
And then there's a second type
of operation here called pooling
which consists in taking the results of those filters over a small neighborhood and pooling them, which means computing an average or a max of the values, and then sub-sampling the image.
So sub-sampling means
that this image is half
the resolution of that image.
The pixels are twice as big,
if you want.
The reason for this
is to eliminate a little bit
of position information
about the location of features
in the image.
And that's important if you want
to have a system that
is robust to small deformations of the input.
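Here is a sketch of that alternation of convolution, non-linearity, and pooling as a small PyTorch stack, loosely in the spirit of those early digit-recognition networks. The filter counts and the 28 by 28 input size are made-up choices for illustration:

```python
# Convolution / ReLU / pooling, repeated, then a classifier on top.
# Sized for 28x28 single-channel images (e.g., handwritten digits).
import torch.nn as nn

convnet = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=5),        # 4 filters -> 4 feature maps (24x24)
    nn.ReLU(),
    nn.MaxPool2d(2),                       # pool and sub-sample: half resolution (12x12)
    nn.Conv2d(4, 8, kernel_size=5),        # (8x8)
    nn.ReLU(),
    nn.MaxPool2d(2),                       # (4x4)
    nn.Flatten(),
    nn.Linear(8 * 4 * 4, 10),              # classify into 10 digit categories
)
```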
So this is a convolutional net
in action.
It has been trained to recognize
hand-written digits.
I'm not showing the output here.
And this is the input
first layer
after pooling, third layer,
another layer of pooling,
then yet another layer.
By the time you get here,
the representation is very
distributed and kind of
abstract.
But every unit here
is essentially influenced
by the entire input,
OK, so [AUDIO OUT] is
the entire input.
And the representation
of the input is this list
of those values.
You can get those things
to recognize not just
single characters,
but multiple characters
and do
simultaneous segmentation.
This is very important
because, eventually, you want
to use those systems
with natural images.
So this is sort of vintage,
you know,
early '90s convolutional net
which was built when I was
at Bell Labs.
Eventually, at Bell Labs,
we built a check reading system
based
on those convolutional nets
and various sorts of tricks.
And it was deployed
in the mid '90s.
And by the end of the '90s,
it was reading somewhere
between 10% and 20%
of all the checks in the US,
so a big success.
But by that time the machine
learning community had lost
interest in neural nets.
Nobody was working
on neural nets, essentially,
in the late '90s
until the mid 2000s roughly.
I left the industry in 2003
and joined NYU, as I was
mentioning.
And I wanted to kind of reignite
the interest of the community
for those methods,
because I knew they were
working.
And they had the reputation
of being very finicky, right?
It was, you know,
we had our own framework
for deep learning,
but nobody was interested
in this, so nobody was using
our code.
And neural nets
had the reputation of being so
hard to train that, you know,
only I and a few people
working with me were
able to train them,
which, of course, was not true.
It's just that people are lazy.
So I'm being facetious here.
So around 2003, 2004, just when
I joined NYU, I got together
with my friends Yoshua Bengio at the University of Montreal and Geoff Hinton at the University of Toronto, where I had done my postdoc many years before.
And we decided to basically
start a conspiracy to renew
the interest of the community
in neural nets.
And we started
with various algorithms
that we thought would enable
backprop, perhaps, to train
very, very deep networks.
So not networks with just three
or four or five layers,
but networks perhaps with 20
layers or something like that.
And we started working
with unsupervised learning
algorithms which were only
partially successful.
But they were successful enough
to get enough interest
from people that a community
started kind of building itself.
Around 2007, there was
enough of a community
that our papers started to get
actually accepted at NIPS.
Before that, we could never
publish a paper
in any conference
on neural nets, essentially.
And then we started getting
really good results
on standard benchmarks.
But they were still kind of
dismissed to some extent.
That changed around 2009, 2010,
in speech recognition,
where the results were so much
better that people started
really kind of switching
to using neural nets
and then around 2013 in computer
vision.
And the history of this
is well known.
But, you know, in the meantime, starting in the mid 2000s-- I'm hearing myself now-- I started working in robotics, something that [INAUDIBLE] is very familiar with, a project that Tucker Balch was involved in as well. He was at Georgia Tech at the time and is now at JP Morgan.
And this was a project to kind
of use machine learning to get
robots to drive themselves
in nature.
This took place roughly between 2004 and 2009.
So the idea here was to use
a neural net basically
to do what's called
semantic segmentation which
means to label
every pixel in an image
with the category of the object
it belongs to.
So it's using
a convolutional net which sees
kind of a band
around the horizon of the image.
And it's trained to produce
another image, which
is this image that has
essentially three categories.
Here is something I can drive
over.
I'm going to label it green.
Or is it something that
is an obstacle?
And my video is not working
for some reason.
Oh, that's interesting.
OK.
All right, here we go.
Oh, that's fun.
All right.
Oh.
Wow, OK.
All right, this one is working.
So this is another example
of semantic segmentation
that took place a couple years
later around 2010 or so,
2009, where there were, you
know,
data sets with a few thousand
images
where people had painfully
labeled every pixel
with the category the object
belongs to, so things like road
and sidewalk and cars
and, you know, pedestrians,
trees, et cetera.
So we trained this convolutional
net to kind of be applied
to the entire image.
And it basically labels
every pixel with a category.
It makes mistakes.
It labels this as desert.
This is the middle of Washington
Square Park.
[LAUGHING]
There is no beach I'm aware of.
But, you know, at the time, that
was state of the art.
It was, in fact, quite a bit
better than the state of the art
at the time.
This was 2010.
And it was also 50 times faster
than the best runner
up competing technique.
So we submitted a paper to CVPR,
the big computer vision
conference, at the end of 2010.
Pretty sure that paper was going
to be accepted because it was
faster and better
than everything else
people had done before.
And it was rejected by all three
reviewers who said, basically,
what the hell is
a convolutional net?
And we don't believe
that a method we never heard of
could do so well.
So this has to be wrong.
I mean, essentially, that's what
the reviewers said.
And it's funny because now you
can't actually have a paper
accepted at CVPR unless you use
convolutional nets.
Oops, that's not what I wanted
to do.
Sorry about that.
Bear with me for just a second.
OK.
So ConvNets are really useful for a lot of things today. Every self-driving car project has a convolutional net in it, and they're used for all kinds of other things.
I gave a talk in 2013 that gave
some ideas to people
at Mobileye, which now belongs
to Intel.
And also to people at Nvidia. They're using convolutional nets for all their self-driving car projects.
In fact, there
is a self-driving car project
taking place in the Holmdel
building
which is the building where I
used to work at Bell Labs
by a group from Nvidia.
And the guy running this project
at Nvidia is actually
a former colleague from Bell
Labs who worked with us
on this robotics project
that Tucker was involved in.
OK, so deep learning today--
there was a revolution in 2012
in computer vision
because our friends
at University of Toronto
in Geoff Hinton's group
figured out how to implement
convolutional nets on the GPU
in a very efficient manner.
They were not the first ones
to do this.
It was done at Microsoft
in the mid 2000s.
But they applied this
to ImageNet and managed to get
results that were so much better
than what people were doing
before that.
That really created a bit
of a revolution.
So this was kind of the error
rate on ImageNet that people
were getting in 2011.
And in 2012, with the so-called AlexNet project from Toronto, the error rate went down by a huge amount.
And then
over the last few years,
you know, it went down to levels
that are so low that now
this benchmark is actually not
interesting anymore.
It's, you know, better
than human performance
on this particular data set.
What we've seen simultaneously
is an inflation in the number
of layers in those networks.
So the video of the convolutional net I showed you from the '90s earlier
had seven layers.
The one from 2013, one
of the ones that worked best
had 20 layers.
And, you know, now the best ones
have anywhere between 50 and 150
layers.
And, you know, Facebook uses
those convolutional nets very
widely for a lot
of different things.
And one of the most popular ones that's used in production is something called ResNet-50.
So ResNet
is this particular architecture
that is here where there are
layers of convolutions
and pooling and non-linearities.
But there are also skip connections that allow the system to sort of fail gracefully.
If some layers don't learn
appropriately, then they become
transparent, essentially.
And so that is what enables us
to kind of train a very, very
deep network.
This is an idea that came from Kaiming He, who was at Microsoft Research Asia at the time and is now at Facebook.
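A minimal sketch of that skip-connection idea: the block computes a correction and adds the input back, so a layer that learns nothing useful becomes roughly transparent instead of destroying the signal. Real ResNet-50 blocks also use batch normalization and bottleneck layers, omitted here:

```python
# A simplified residual block: output = input + learned correction.
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        y = self.conv2(self.relu(self.conv1(x)))
        return self.relu(x + y)            # skip connection: add the input back
```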
And so that's a graph that was put together by Alfredo Canziani, who is a postdoc with me at NYU.
But he did this before he came.
On the y-axis,
you have accuracy.
On the x-axis, the number
of operations, which is billions
of operations
that are necessary to compute
one output.
And what people have been trying
to do in the industry is kind
of, you know, bring everything
down to this corner,
essentially, where
for the minimum amount
of computation,
you have the best accuracy
on ImageNet or similar things.
And so ResNet-50 is right here.
There are better results now.
And then the size of the bubble,
if you want, is the sort
of memory footprint, the number
of parameters that are
necessary.
There's a lot of work
on optimizing, running
those networks
on regular processors
or specialized processors
to save power.
And the reason this is important
is, to give you an idea,
Facebook users upload somewhere
between two
and three billion photos
on Facebook every day.
And this is just
on the Blue site.
I'm not talking about Instagram
or anything like that, just
Facebook.
Every single one of those three
billion photos goes through six
convolutional nets,
roughly half a dozen,
within two seconds of being
uploaded.
And those do things like
essentially represent the image
into a feature vector that can
be used for all kinds of things,
retrieval, search, indexing,
feature vector
for other purpose,
generic feature vectors,
if you want.
And one that does face recognition and face detection.
Another one that generates
captions that describes
the images for the visually
impaired.
And there is a couple that
basically detect
objectionable content, you know,
nudity, violence, you know,
things like that.
So the advantage
of deep learning
is that the system basically
spontaneously learns
to represent images
in a hierarchical way
from sort of low level features
like edges
to kind of parts of objects
and motifs and things like that.
One trend
over the last few years
is the use of weakly supervised
learning
or semi-supervised learning.
This is weakly supervised
learning.
So this is an experiment that was done at Facebook by one of the applied computer vision groups, which consisted in taking 3 and 1/2 billion images from Instagram and then training a convolutional net to predict the hashtags that people tag images with.
So they decided on about 17,000 different hashtags that correspond to kind of physical concepts, if you want. And then you run the 3.5 billion images through the convolutional net, asking it to predict which of those 17,000 hashtags are present.
Then you take this network, chop off the last layer that predicts the hashtags, and just use the second-to-last layer as kind of a feature vector, which
is an input to a classifier
that you train on other tasks,
like, say ImageNet.
And you can actually beat
the record on ImageNet this way.
OK, so until fairly recently, the record on ImageNet was held by this system; it has actually since been beaten by another team at Facebook. And this system is trained on a different task than the one you actually finally train it on.
So that points towards something
that is going to become more
and more important
in the future, which
is this idea that you kind
of pre-train
with lots of data
in a relatively task-independent
way.
And then you use a relatively
small amount of data
to actually train your system
to solve the task you want
to solve.
And I'll come back to this
afterwards.
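Here is a rough sketch of that pre-train-then-reuse recipe in PyTorch, using an ImageNet-pretrained torchvision ResNet-50 as a stand-in for the hashtag-pretrained network. The 10-class head and the frozen backbone are illustrative assumptions:

```python
# Pre-trained backbone as a generic feature extractor, small head for the new task.
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(pretrained=True)
backbone.fc = nn.Identity()                # chop off the original classifier layer

for p in backbone.parameters():            # keep the pre-trained features frozen
    p.requires_grad = False

classifier = nn.Linear(2048, 10)           # small head trained on the target task
# features = backbone(images); logits = classifier(features)
```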
So a lot of progress
over the last few years
in computer vision
using convolutional nets with--
I'm not gonna
go into the details of how this
is built, but you can get
results like this where
every object in an image
is outlined and identified.
That's called instance segmentation. And, you know, you can detect wine glasses and backpacks and count sheep.
And it's optimized-- my videos
aren't running for some reason.
It's optimized to the point
that you can run those things
in real time on smartphones.
So this is, unfortunately,
a video that you can't see.
And I'm not sure why you can't
see it.
Oh, wow, it disappears, which
is a person essentially being
tracked, people being tracked
on a smartphone in real time
at something like 10 frames
per second.
So, you know, a lot of work
has gone
into those optimizations
to run on small platforms.
And on iPhone, also, you have
acceleration libraries.
This is all open source.
So if you want to play
with computer vision, the latest
systems, you can just download
this.
This is using
the PyTorch framework, which
also was developed at Facebook.
And there are similar things for tracking body poses
[INAUDIBLE].
ConvNets are used for all kinds
of stuff in medical imaging.
It's actually one of the hottest
topics now in radiology, which
is how you use deep learning
for analyzing medical images.
This is a project which I'm not involved in, but colleagues at NYU, at the medical school and in the computer science department, have been developing those architectures for analyzing MRI images of hips and getting really good results with this.
So this is a really hot topic
now.
And it's probably going to have
a big effect on, you know,
radiology in the future.
OK, you know, but I don't want
to do a laundry list
of applications
of convolutional nets.
This is one that was also
developed at Facebook
for translation.
It's a little complicated
to explain here.
It's
a so-called gated convolutional
net.
But basically, the input
is a sequence of words.
And the output is also
a sequence of words
in a different language,
OK, a translation.
And that goes
through convolutions that
include something called
attention circuits.
And there is some sort of module
in the middle that sort of tries
to kind of match, kind of warp
the sequence so
that words appear
in the right place in the output
sequence.
This had the record on some data
set for a short time.
It has since been overtaken.
And you know, you can use them
for sound generation
or for sequence generation here.
So this is kind of, you know,
generating synthetic sounds
by specifying what type of sound
you want.
This is a project that was done
at Facebook in Paris.
And interesting projects in
unsupervised learning
for translation.
So this is a project that was done mostly in Paris and partly in New York, where you use a so-called unsupervised embedding system: you can learn vector representations for words in a language by figuring out in which contexts they appear. This is a very classic technique called [INAUDIBLE]. This project doesn't use that exact technique, they use something different, but it's very similar.
So with this technique,
in a completely unsupervised
manner, you give a big corpus
of text in one language
to a system.
And it figures out a vector
representation for each word
in such a way
that similar vectors correspond
to similar words,
essentially, depending on which
context they appear in.
You do this
for several languages.
And then you ask the question,
is there a simple mapping that
will take the cloud of points
corresponding to all the vectors
in one language
and transform it into the cloud
of points of another language.
And if you can find such
a mapping, there is some chance
that, you know,
you'll find a mapping
between the two languages.
And this actually works.
And so what this allows you
to do is basically build
a translation system from one
language to another
without having any parallel text
of those two languages,
which is, you know,
dumbfounding to me.
But it works, OK?
You know, it doesn't give you
kind of record breaking results
if you had data,
but it's amazing.
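The geometric core of that idea can be sketched in a few lines of NumPy: find a linear map that carries the cloud of word vectors of one language onto the cloud of the other. The actual system learns the map adversarially, without any dictionary; here, just to show the alignment step, the map is fit by orthogonal Procrustes from a hypothetical seed dictionary on toy vectors:

```python
# Align two word-embedding clouds with an orthogonal linear map (toy data).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 300))                 # word vectors, language A (toy)
R = np.linalg.qr(rng.normal(size=(300, 300)))[0] # hidden "true" rotation
Y = X @ R                                        # word vectors, language B (toy)

M = X[:500].T @ Y[:500]                          # 500 hypothetical seed word pairs
U, _, Vt = np.linalg.svd(M)
W = U @ Vt                                       # best orthogonal map A -> B

print(((X[:5] @ W) @ Y.T).argmax(axis=1))        # translate by nearest neighbor: [0 1 2 3 4]
```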
And it's very
important for Facebook
because people use thousands
of different languages
on Facebook.
In fact, we just open
sourced something which
is actually not
this particular project--
I mean, this project
is open source, too--
where we provide
embeddings for words
and sentences
in various languages,
in 92 different languages.
Actually, that's open source.
Oh, that's nice.
OK.
All right.
OK.
Question answering, I'm going
to skip this.
OK, so lots of applications
of deep learning
and convolutional nets.
A whole new set of potential applications is starting to pop up, enabled by sort of a new type of neural net
which instead of being applied
to, basically,
multidimensional array data,
you know, things like images
or things like that, you can now
apply neural nets to graph data,
so data that comes to you
in the form of a graph
with values on it,
a function on a graph, if you
want.
And the graph doesn't need to be
static.
I want to point you to a review
paper
that I'm a distant co-author on.
"Geometric Deep Learning-- Going
Beyond Euclidean Data."
So this is the idea of how can
you define things
like convolutional nets
and things
like this on data that is not
an array, but basically
a function on a graph.
And the cool thing about this
is that you can apply this
to social networks,
regulatory networks, networks
of--
I don't know--
financial instruments,
let's say, 3D shapes,
functional networks in biology,
things like that.
There are essentially three types. You know, there are sort of classical ConvNets where the input domain is fixed and known: it's a grid.
You know, it's a function
on a grid,
if you want, like an image,
for example, you can think
of as a function on a grid.
Things where the graph is fixed.
For example, the graph
of interactions
between different areas
of the brain.
But the function on the graph
is not fixed.
And so you'd like to apply
convolutional nets to domains
of this type.
You know, how do you define a convolution on such a funny graph?
And then there are applications where the graph changes for every new data point. For example, the data point could be a molecule.
A molecule is best represented
as a graph.
Can we run a neural net
on a graph?
And the answer is yes.
And I think this whole area
opens an entire Pandora's box
of new applications
of neural nets
that are heretofore unforeseen.
And so I think it's really cool.
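As a concrete illustration, here is one generic way to define a "convolution" layer on a graph: each node updates its feature vector from a normalized weighted sum over its neighbors. This is a minimal sketch, not the specific models surveyed in that paper:

```python
# A simple graph-convolution layer: aggregate neighbor features, then a shared linear map.
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x:   (n, in_dim) one feature vector per node
        # adj: (n, n) adjacency matrix with self-loops, row-normalized
        return torch.relu(self.linear(adj @ x))

n = 6
adj = torch.eye(n) + torch.rand(n, n).round()      # toy graph with self-loops
adj = adj / adj.sum(dim=1, keepdim=True)           # row-normalize the weights
x = torch.randn(n, 16)
out = GraphConvLayer(16, 32)(x, adj)               # new node features, shape (6, 32)
```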
Last year I co-organized
a workshop at IPAM, the Institute for Pure & Applied Mathematics at UCLA,
on new techniques
in deep learning.
And there was a lot of talks
about this.
So if you want to kind of learn
about this, that's a good way
to get started.
OK, now there's been a lot
of excitement
about reinforcement learning,
particularly deep reinforcement
learning, in the last few years.
Everybody has heard of AlphaGo and things like that. And reinforcement learning really only works well for things like games.
So if you want to train
a machine to play Doom or play
Go or play chess, StarCraft not
so much yet, reinforcement
learning works really well.
So reinforcement learning is
the idea that you don't tell
the machine the correct answer,
you only tell the machine
whether it did good or bad,
right?
So you let the machine produce
an answer.
In this case, an action
or an action sequence, and then
you tell it, you know, you won
or you lost or, you know, you
did good, you gained points
or you didn't.
So it works amazingly well,
except that it requires
many, many, many interactions
with the environment.
So it works really well for Go, for example. This is a Go system that was actually produced at Facebook, which is similar to AlphaGo and AlphaZero, and which works at a superhuman level and everything.
We're working also
on a similar project
with StarCraft where we train,
you know, a StarCraft agent
to kind of win the battle.
The big problem that I was just
mentioning is that reinforcement
learning is very
inefficient in terms of samples.
So this is a figure from a recent paper from DeepMind where they measure performance as a function of the number of millions of frames that the system sees while playing an Atari game, the classic Atari games from the 1980s.
Using the best algorithms,
it takes roughly seven million
frames to reach a performance
that humans will reach
in a few minutes.
And that corresponds to,
you know, something like 100
hours of play, if you kind of
translate this into real time.
So these systems are much, much
slower than humans or animals,
for that matter, at kind
of learning new skills.
And that's why they are not
really practical for real-world applications for which a gigantic number of interactions is not accessible.
So if you want to use
reinforcement learning to train
a car to drive itself,
it's basically not gonna work
in its current form.
You know, the machine would have to drive off a cliff several thousand times before it figures out how not to do it.
Now, how is it that humans are
able to learn to drive
a car in about 20 hours
of training without crashing?
It's kind of amazing.
I mean, this would require,
you know, hundreds of thousands,
if not millions of hours
of training to get a car
to drive itself.
You could do this in simulation,
but, you know, simulations are
not accurate.
And people are working on how
you can transfer from simulation
environment to the real world.
Yeah, this is just in passing,
a list of sort
of major open source projects
that Facebook research has put
out.
So PyTorch is the environment
we use for deep learning.
Faiss is a very fast similarity
search library for nearest
neighbor.
It's very useful.
This is used everywhere
within Facebook.
This stands for dialogue.
There's this reinforcement
learning framework for Go.
OpenGo is the system I just
mentioned.
FastText for natural language
understanding.
Fairseq for sequence processing,
things like translation
and things like that.
There's a whole bunch
of projects coming up.
You can get them
all from this GitHub here--
github.com/facebookresearch.
OK, so obviously, you know, we can't get our machines to learn as fast as humans,
so we're missing something
really essential
here to get to real AI.
And in my opinion, we're missing
three things.
One thing we're missing
is the ability of learning
machines to reason.
So right now all the applications I've shown you are for perception.
And for perception,
deep learning works amazingly
well.
It can learn to represent
the perceptual world really
well.
But, you know, kind of learning
to reason,
that's more difficult.
There are a lot of ideas,
some work on it,
but I don't think we have
the answer to that.
The second problem is learning
models of the world.
So the reason, perhaps, that we
are able to learn to drive a car
with 20 hours of training
without crashing
is that we can predict
the effect of our actions.
We can predict what's going
to happen in the world
before it happens.
The whole front part
of our brain, basically,
is a prediction engine.
And our machines don't really
have the ability of, basically,
predicting.
Not that they don't have it,
we can train them to predict
in certain ways, but there are
technical difficulties which I
will come to in a minute.
And the last thing, which I'm
not going to talk about,
is the ability to learn not just
hierarchical representations
of the perceptual world,
but hierarchical representations
of the sort of action world.
When we decide to go from here
to Atlanta, OK, we have to sort
of decompose that task
into sub-tasks all the way down
to kind of millisecond
by millisecond control
of our muscles.
And so we have this sort
of hierarchical representation
of action sequences.
And we don't really know how
to do this automatically
with machine learning today.
But I'm not gonna talk
about this.
So it's a big problem because, you know, we can have all those cool things that we can build with deep learning today, but we can't have these other things, which are what we really want. We'd like to have machines with common sense,
you know, a dialogue system
that we can talk to
and doesn't have like a very
narrow set of things
it can do for us, right?
Like just, you know, playing music and giving us the weather and the traffic.
You know, we'd like machines
to help us in our daily lives
the way a human assistant would.
So we want to build things
like intelligent personal
assistants and we won't have
that until we have machines that
have some level of common sense.
We'd like to have household
robots that are
agile and dexterous, you know.
We don't have that.
We don't have robots that are
nearly as agile
and have nearly as much
common sense as a house cat.
You know, with all
their superhuman performance
in Go and everything.
So that's, you know, that's what
we need to think of, like,
what's the next step?
OK, so about reasoning,
there is sort of an avenue which
is interesting because it might
lead to sort
of a new way of doing computer
science, essentially, which
is the idea
of differentiable programming.
And it's the idea that when you
build a deep learning system,
you don't actually build a graph
of modules
anymore in frameworks
like PyTorch, you just write
a program.
And the purpose of this program
is just to compute the output
of your neural net.
And every call of functions
in this program is like a module
that you can differentiate.
And so essentially, it's
a new way of writing software
where, you know, when you write
the program, the function
of each instruction
is not entirely specified
until you train the program
to actually do the right thing
from examples.
OK, so it's
like a weakly specified program,
essentially.
So it's called
differentiable programming
because it's the idea that, you
know, you write programs.
So essentially,
a neural net architecture
is really a program
like an algorithm whose function
is not completely finalized
until you train it.
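Here is a tiny sketch of that idea in PyTorch: the "network" is just a Python function, its depth depends on the data it sees, and the framework still gives you gradients through whatever computation was actually executed. The function itself is arbitrary:

```python
# Differentiable programming: data-dependent control flow, still trainable.
import torch

w = torch.randn(8, 8, requires_grad=True)

def program(x):
    steps = int(x.abs().sum().item()) % 5 + 1   # data-dependent number of "layers"
    for _ in range(steps):
        x = torch.tanh(w @ x)                   # same learned transformation, reused
    return x.sum()

loss = program(torch.randn(8))
loss.backward()                                 # gradient w.r.t. w through the dynamic graph
print(w.grad.shape)
```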
And there's lots of really interesting stuff that can be viewed in this context.
For example, the idea of memory
augmented neural nets.
So the idea that you have
a neural net and you
attach to it something that
works like a memory,
like an associative memory
that this thing can use
as a working memory
to, you know,
do things like reasoning,
long chains of reasoning.
Or maybe it can use the memory
to store factual knowledge.
You know, like relationships in knowledge bases, objects and relations, things like that.
There's quite a bit of work
on this.
Again, I don't think we have
the complete answer.
But it's interesting.
Here's another example.
This is an interesting project where you'd like a system to be able to answer questions like this one, so you show it an image of this type.
And you tell it there
is a shiny object that is right
of the gray metallic cylinder.
Does it have the same size
as the large rubber sphere?
And for us to answer
that question, we kind of
have to configure
a visual system to basically
detect the shiny objects
and the gray metallic cylinder.
You know, we have this strategy,
we detect
the gray metallic cylinder.
And then we look for objects
nearby that are shiny.
And then we compare sizes,
right?
And so the idea
behind this project,
which is at Facebook in Menlo
Park, is you have a neural net
that reads the sentence.
And what it does
is that it generates
another neural net whose only
purpose is to answer
that particular question
from an image.
So the modules here are kind
of dynamically wired, if you
want, depending on the input.
So it's one of those examples
of a dynamic neural net whose
structure is [INAUDIBLE].
That's the essence of differentiable programming.
Software 2.0, some people have called it.
So PyTorch was really kind of
designed from the start
with this idea
that you could have
dynamic neural nets, not quite
the case with TensorFlow,
which is the Google framework.
But TensorFlow is kind
of catching up.
They're trying to do
the same thing.
So how do humans and animals
learn?
You look at babies
in the first few days
and weeks of life, months
of life, and they learn
an amazing amount of background
knowledge about the world
just by observation.
Babies are kind of helpless,
you know, their actions are very
limited.
But they can observe.
And they learn a lot
by observing.
So if, you know,
you play a trick on a baby
before the age of six months
or so, you show the baby,
you know, put a toy
on a platform and push the toy
off.
And there's a trick that,
you know,
makes it such that the toy
doesn't fall.
Before six months, the baby
doesn't pay attention to this.
They are sure that's
how the world works, no problem.
After eight or nine months,
you show this scenario to a baby
and she goes like this.
Because, you know,
in the meantime, she has learned
that an object is not supposed
to float in the air.
It's supposed to fall if it's not supported.
So she has learned the concept of gravity in the meantime, and intuitive physics, inertia, things like that.
In fact, there was this chart that was put together by Emmanuel Dupoux, who is a cognitive neuroscientist in Paris who spends part of his time at Facebook, showing when babies learn basic concepts of this type. Gravity and inertia, you know, happen around seven or eight months.
And, you know, object permanence
is an important one that pops up
very early, the difference
between animate and inanimate
objects also appears quite
early.
So, you know,
we learn those things just
by observation.
It's not
in a task-dependent way.
And this is what allows us
to predict what's gonna happen
in the world.
We have a very good model
of the world
that we learn since we're born
just by observation.
And we're not the only ones.
Animals also have good models
of the world.
Here is a baby orangutan being shown a magic trick.
There was an object in this cup.
The object was removed
but he didn't see it.
Now the cup is empty.
And he's rolling on the floor
laughing.
You know, his model of the world
was violated.
And when your model of the world is violated, it causes you to do one of two or three things: you laugh, or you get scared because maybe something dangerous is gonna happen that you didn't predict; in any case, you pay attention.
All right, so I think the way
to push that problem is
through what I call
self-supervised learning.
And it's basically the idea
that for the system
to be able to learn
from raw data
just by observation, what you're
gonna do
is you're going to feed
a piece of data to the system,
let's say a video clip, OK?
And you're gonna
tell the system,
pretend you know a piece
of this input
and pretend you don't know this
and try to predict this piece
that you pretend you don't know.
And then I'm going to show you this piece, and you can correct your internal parameters to bring the prediction closer to what actually occurred.
OK, so, for example, I show you a piece of a video clip and I ask you to predict how the clip is going to continue, the next few frames in the video clip.
And then I show you the frames
and, you know, [INAUDIBLE].
But it's not just predicting
the future.
It could be predicting the past.
It could be predicting the top
from the bottom, you know,
whatever, the piece
of the input.
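A minimal sketch of that self-supervised recipe: hide a piece of each input, predict it from the rest, and train on the reconstruction error. Flat random vectors stand in here for frames or image regions, and the sizes are arbitrary:

```python
# Self-supervised learning: the "label" is just the hidden part of the data itself.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(96, 256), nn.ReLU(), nn.Linear(256, 32))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(1000):
    clip = torch.randn(64, 128)                    # a batch of "clips"
    visible, hidden = clip[:, :96], clip[:, 96:]   # pretend we don't know the last piece
    pred = model(visible)                          # predict the hidden piece from the rest
    loss = ((pred - hidden) ** 2).mean()           # compare to what actually occurred
    opt.zero_grad()
    loss.backward()
    opt.step()
```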
So there's really sort
of those three types
of learning, reinforcement
learning where the feedback
to the machine
is very weak informationally.
It's just one scalar value that
tells the machine whether it did
good or bad once in a while.
In supervised learning, you give more information to the machine,
you tell it what
the correct answer is.
But it's still not very strong
because all that data has to be
curated by humans.
And so it's limited
in the amount.
And then there is this kind
of self-supervised predictive
learning idea where the amount
of data that the machine is
asked to predict
and the amount of data it's
given to train
is absolutely enormous.
You know, just an hour video
is like a ridiculously large
amount of data
that you're asking the machine
to predict, you know,
every future frame
from every past frame,
for example.
So, you know, Geoff Hinton
made this argument many years
ago that if you have a very
large learning system like say
a brain that has 10 to the 14
parameters, free parameters
for the synaptic connections,
you need a lot of data
to constrain the system to learn
anything useful.
And that's pretty much
the only way, kind of predicting
everything from everything else,
essentially.
We're not gonna do this
with supervised learning
or reinforcement learning.
That led me to this admittedly obnoxious analogy here: the stuff we can learn, our intelligence, is like a cake.
The bulk of the cake
is self-supervised learning.
Almost everything we learn,
we learn just
in self-supervised fashion.
We learn a little bit
with supervised learning.
And we learn a tiny amount
through reinforcement learning.
So that would be the cherry
on the cake.
People working in reinforcement
learning get a little upset when
I show this.
But it's become a bit of a meme
now in the machine learning
community.
OK, you know, this doesn't mean
reinforcement learning is not
interesting.
It's necessary.
This is a Black Forest cake.
And Black Forest cake
has to have a cherry.
Actually, it has cherries inside
even.
But it's really not where we
learn most of our knowledge.
So, yeah, things like image in-painting, for example, are examples of self-supervised learning.
And people are working on this
in computer vision.
So the next revolution in AI
is not going to be supervised,
that's for sure.
OK, so let's say we want
to build predictive models
of the world.
So it's the very classical thing
in optimal control.
And I'm sure some of you
may have a background
in this kind of stuff.
There's a system you want to control, which optimal control people call the "plant."
And you have an objective you
want to minimize or maximize,
in your case.
And you can run your simulator
forward, and then figure out
an optimal sequence of commands
that will optimize
your objective, given
your predictive model.
OK, and that's a classical thing
in optimal control.
And in fact, that should be
a classical thing
in the architecture
of an intelligent system.
An intelligent system should
have a way of kind of predicting
what's going to happen before it
happens to avoid doing
stupid things like running off
a cliff, right?
And we don't run off cliffs,
even if we don't know how
to drive, mostly, because, you
know, we have this ability
to predict the consequence
of our actions.
So we need this world simulator
in an intelligent agent as well
as other components
that I'm not gonna talk about.
So how do we learn
predictive models of the world?
You know, we can observe
the state of the world,
at least partially
through observation
and we can train a function
to predict what the next state
is going to be.
And then we're going to observe
what the next state is going
to be and we just train
our system
in a supervised manner
to do this.
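Here is a bare-bones sketch of that supervised world-model training: regress the next state from the current state and action over a batch of recorded transitions. The dimensions and the random stand-in data are assumptions for illustration:

```python
# Learn a forward model of the world: (state, action) -> next state.
import torch
import torch.nn as nn

world_model = nn.Sequential(nn.Linear(16 + 4, 128), nn.ReLU(), nn.Linear(128, 16))
opt = torch.optim.Adam(world_model.parameters(), lr=1e-3)

state = torch.randn(256, 16)               # observed states
action = torch.randn(256, 4)               # actions taken
next_state = torch.randn(256, 16)          # what actually happened next

for _ in range(500):
    pred = world_model(torch.cat([state, action], dim=1))
    loss = ((pred - next_state) ** 2).mean()   # plain supervised regression
    opt.zero_grad()
    loss.backward()
    opt.step()
```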
So this is something that some of my colleagues at Facebook tried to do a few years ago, where you have those kinds of scenarios where you put down a stack of cubes and you let the physics operate.
And the cubes fall.
And the predictions you get,
this is predictions produced
by a convolutional net,
the predictions you get
are blurry.
Because the system cannot
exactly predict what's going
to happen.
There is an uncertainty
about what's going to happen
to this tower.
And so you get
those blurry predictions.
If I take a pen
and I put it on the table
and I let it go, you can predict
that it's going to fall.
But you probably can't predict
in which direction
it's going to fall.
And so that's a problem.
Because we have to be
able to get machines to learn
in the presence
of large uncertainties.
So there is the pen example.
And the only way we can do this
is through models that have
latent variables.
So basically, we observe
the past, you know, the clip
where I put the pen
on the table, and we're gonna
make a prediction.
And what we'd like is to make multiple predictions depending on the circumstances. And so we're gonna need a set of extra latent variables that we cannot observe, and when we vary this vector of latent variables, the prediction varies among all the possible predictions that may occur, OK?
Let's call it a latent variable
model.
And a good example of how to do
this is adversarial training.
So adversarial training says I'm
going to sample this latent
variable randomly.
And now what I need to train
this predictor is something that
tells me whether my prediction
is on this set
of plausible futures
or whether it's outside.
OK, but of course,
I don't have
any characterization of the set
of possible futures,
so I'm going to train
a second neural net to tell me
whether I'm
on this manifold or outside.
OK, that's called
a discriminator in the context
of adversarial training.
And you can think of it
as a trainable loss function,
a trainable objective function,
basically.
The objective function is,
it tells you how far you are
from this manifold and there's
a gradient of it that points you
towards the manifold.
So that's the idea
of adversarial training.
Let's say you want to do video prediction:
you show the system a piece
of video.
And, of course, in your data
set, you know what the video is
going to do.
That's the real data.
But then you run this
through your generator, which,
you know, from a source
of random vectors,
tries to predict what's gonna
happen.
And initially, it's not trained
properly.
It's gonna make,
you know, a bad blurry
prediction or something
like that.
So now you train
your discriminator,
your function that tells you
whether you are on the manifold
or not, of data.
You train it to produce
low values for this
and high values for that.
So that's kind
of a representation of what
this discriminator is doing.
And what it's going to try to do is, for the green points that come from the generator and are not on the manifold of data, push up the output here. And for the real ones, the blue spheres, push the output down.
And so this function is going
to take that shape.
And then what you're going to do
is, use the gradient
of that function with respect
to its input to train
this generator to produce images
that this guy can't tell
are fake.
OK, so now you have an objective function, the discriminator, that can tell the generator whether it is on the manifold or outside, and the gradient of that, back-propagated through the generator, trains the generator to do the right thing.
And eventually, it makes
decent predictions.
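Here is a compact sketch of that adversarial training loop on toy two-dimensional data. Note this standard GAN formulation scores real samples high and generated ones low, the mirror image of the energy picture just described; network sizes and data are placeholders:

```python
# Adversarial training: discriminator vs. generator on toy 2-D data.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))   # latent -> sample
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))    # sample -> score
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for _ in range(1000):
    real = torch.randn(64, 2) + 3.0            # stand-in for real data
    fake = G(torch.randn(64, 16))              # sample a latent vector, generate

    # Discriminator: score real samples as 1, generated samples as 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator: the discriminator's gradient flows back through G.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```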
People have used this.
I mean, these kind of techniques
now have taken over the field,
basically.
A lot of people
are working on this
for all kinds of stuff,
generating synthetic images.
This is work from a few years
ago.
These are fake faces.
This is work from Nvidia
in Finland.
And they trained a system
to transform a bunch
of random numbers
into a face image.
They trained it on a database
of photos of celebrities.
And at the end, you feed a bunch
of random numbers, and out comes
an image of face.
And these are synthetic faces
at high resolution.
You can't tell they're fake.
But none of those people exist.
At Facebook we've been working
on similar techniques
to do things like generating
fashion elements.
So it's in France, so you know?
So we got a big data set
from a very famous designer
house and sort of trained one
of those generating networks
on this.
And these are examples
of generations.
And these are not textures that humans would necessarily come up with, essentially.
OK, I'm going to talk
a little bit
about video prediction.
So video prediction
is interesting because, you
know, in the context
of self-driving cars,
for example,
you'd like a self-driving car
to be able to predict what
the car around it are doing,
right?
I realize I'm out of time, so--
AUDIENCE: [INAUDIBLE]
YANN LECUN: And this
is a project that we've done
with people at Nvidia.
We are trying to predict what
cars around us
are going to do and then use
this predictive model to sort
of learn a good driving policy.
So basically, we feed the system
with a few frames of what
the environment of cars
looks like around us.
And we train the system
to predict what the cars around
are gonna do.
And we use data that comes
from an overhead camera
for this.
And so these are examples
of predictions.
So this is if you have
a completely deterministic
system that doesn't have
any latent variable
and basically makes
those blurry predictions.
And these are predictions that
occur if you have a system
with latent variables in it.
And I don't have time
to go into the details of how
it's built. And then you can basically train the system to learn a driving policy.
So you start from a real state.
You run your predictive model
forward.
You can compute the cost, which
is how close you are, how far
you are from other cars,
whether you are in the lane
or not.
And by back propagation,
you can learn a policy network
that learns to produce an action
that will minimize
the probability of collisions
over the thing.
And if you do this,
it doesn't work.
But if you add a term
in the cost that indicates how
certain the system is
of its prediction, then it
works.
And so I'm just going to end
with a cute video here.
So the blue car is driving
itself, basically,
and the white point indicates,
you know, whether the car is
accelerating, decelerating,
turning.
The other cars are real cars
around it that are just
recorded.
And so our own car
is invisible to them.
So it's like we're driving
on a highway, but nobody sees
us, right?
And so we can get squeezed
between two cars, basically,
and there's nothing we can do.
But this thing kind of learns
how to kind of merge
on the highway and, you know,
things like that.
OK, so I'm going to end here.
Just to remind you, there are interesting areas of research in deep learning in things like graph-structured data, reasoning, self-supervised learning, and learning hierarchical representations of action and control spaces.
We need more theory.
And maybe there is a new type
of computer science emerging
through differentiable
programming.
Thank you very much.
