Hi everyone, and welcome back to MIT 6.S191. Today we're going to be talking about one of my favorite topics in this course: how we can give machines a sense of vision. Vision, I think, is one of the most important senses that humans possess. Sighted people rely on vision every single day, for things like navigation, manipulation, how you pick up objects, how you recognize objects, recognizing complex human emotions and behaviors. I think it's very safe to say that vision is a huge part of human life.

Today we're going to be learning about how deep learning can build powerful computer vision systems capable of solving extraordinarily complex tasks that maybe just 15 years ago would not even have been possible to solve. One example of how deep learning is transforming computer vision is facial recognition. On the top left you can see an icon of the human eye, which visually represents vision coming into a deep neural network in the form of images, pixels, or video, and on the bottom you can see the output: a depiction of a detected human face. But this could also be recognizing different human faces, or even emotions on a face, recognizing key facial features, and so on.
Now, deep learning has transformed this field specifically because it means that the creator of this AI does not need to tailor the algorithm specifically towards facial detection. Instead, they can provide lots and lots of data to the algorithm and swap out the end piece: instead of facial detection, they can swap in many other detection or recognition tasks, and the neural network can try to learn to solve that task. So, for example, we can replace the facial detection task with the detection of diseased regions in the retina of the eye, and similar techniques can be applied throughout healthcare, towards the detection and classification of many different types of diseases in biology, and so on.
Another common example is in the context of self-driving cars, where we take an image as input and try to learn an autonomous control system for the car. This is entirely end-to-end: we have vision and pixels coming in as input and the actuation of the car coming out as output. This is radically different from how the vast majority of autonomous car companies operate; if you look at companies like Waymo and Tesla, this end-to-end approach is radically different. We'll talk more about this later on, but this is actually one of the autonomous vehicles that we build here as part of my lab at CSAIL, so that's why I'm bringing it up.
Now that we've gotten a very high-level sense of some of the computer vision tasks that we as humans solve every single day, and that we can also train machines to solve, the next natural question to ask is: how can computers see? Specifically, how does a computer process an image or a video? How does it process the pixels coming from those images? Well, to a computer, images are just numbers. Suppose we have this picture of Abraham Lincoln; it's made up of pixels. Since it's a grayscale image, each of these pixels can be represented by a single number, and we can represent the image as a two-dimensional matrix of numbers, one for each pixel. That's how a computer sees this image: as a two-dimensional matrix of numbers. Now, if we have an RGB color image instead of a grayscale image, we can simply represent it as three of these two-dimensional matrices stacked on top of each other, one for the red channel, one for the green channel, and one for the blue channel; that's RGB.
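To make this concrete, here's a minimal sketch in NumPy of these two representations; the image sizes here are made up for illustration:

```python
# A grayscale image is a 2D matrix of brightness values, one per pixel;
# an RGB image stacks three such matrices, one per color channel.
import numpy as np

gray = np.random.randint(0, 256, size=(64, 64))    # height x width
rgb = np.random.randint(0, 256, size=(64, 64, 3))  # height x width x channels

print(gray.shape)  # (64, 64)
print(rgb.shape)   # (64, 64, 3)
```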
Now that we have a way to represent images to computers, we can think about what types of computer vision tasks this foundation allows us to solve. Two common types of machine learning that we saw in lectures 1 and 2 yesterday are classification and regression: in regression, we have our output take a continuous value, and in classification, we have our output take a discrete class label. Let's first start with classification, and specifically the problem of image classification. We want to predict a single label for each image. For example, we have a bunch of US presidents here, and we want to build a classification pipeline to determine which president is in the image we're looking at, outputting the probability that this image is each of those US presidents. In order to correctly classify this image, our pipeline needs to be able to tell what is unique about a picture of Lincoln, versus what is unique about a picture of Washington, versus a picture of Obama. It needs to understand the unique differences in each of those images, in each of those classes: each of those features.

Another way to think about this image classification pipeline, at a high level, is in terms of features that are characteristic of a particular class. Classification is done by detecting these types of features in an image: if you detect enough of the features specific to a class, then you can probably say with pretty high confidence that you're looking at that class.
Now, one way to solve this problem is to leverage knowledge about your field, your domain knowledge. Let's suppose we're dealing with human faces. We can use our knowledge about human faces to say: if we want to detect faces, we can first detect noses, eyes, ears, and mouths, and once we have a detection pipeline for those, we can start to detect those features and then determine whether we're looking at a human face or not. Now, there's a big problem with that approach, and that's the preliminary detection pipeline: how do we detect those noses, ears, and mouths? This hierarchy is our bottleneck in that sense. And remember that these images are just three-dimensional arrays of numbers (actually, three-dimensional arrays of brightness values), and that images can hold tons of variation. There are variations such as occlusions that we have to deal with, variations in illumination, and even intra-class variation. When we're building our classification pipeline, we need to be invariant to all of these variations, while still being sensitive to inter-class variation: sensitive to the variations between classes, but invariant to the variations within a single class. Now, even though our pipeline could use features that we as humans define, the manual extraction of those features is where this really breaks down. Due to the incredible variability in image data, the detection of these features is super difficult in practice, and manually defined feature extractors can be extremely brittle. So how can we do better than this? That's really the question we want to tackle today.
One way is to extract these visual features and detect their presence in the image simultaneously, and in a hierarchical fashion, and for that we can use neural networks, like we saw in lectures one and two. Our approach here is going to be to learn the visual features directly from data, and to learn a hierarchy of these features, so that we can reconstruct a representation of what makes up our final class label. Now that we have that foundation of how images work, we can move on to asking how we can learn those visual features, specifically with a certain type of operation in neural networks; neural networks will allow us to directly learn those features from visual data if we construct them cleverly and correctly. In lecture one, we learned about fully connected, or dense, neural networks, where you can have multiple hidden layers, and each hidden layer is densely connected to its previous layer. Densely connected, let me just remind you, means that every input is connected to every output in that layer.
Let's say we want to use these densely connected networks for image classification. What that would mean is that we take our two-dimensional image, which has a two-dimensional spatial structure, collapse it down into a one-dimensional vector, and then feed that through our dense network, so every pixel in that one-dimensional vector feeds into the next layer. You should already appreciate that all of the two-dimensional structure in that image is completely gone, because we've collapsed a two-dimensional image into one dimension. We've lost all of that very useful spatial structure in our image, and all of the domain knowledge that we could have used a priori. Additionally, we're going to have a ton of parameters in this network, because it's densely connected: we're connecting every single pixel in our input to every single neuron in our hidden layer. For example, a 100 by 100 pixel image flattened into 10,000 inputs, connected to just 1,000 hidden neurons, already requires 10 million weights. So this is not really feasible in practice, and instead we need to ask how we can build spatial structure into neural networks, so we can be a little more clever in our learning process and tackle this specific type of input in a more reasonable and well-behaved way, using the prior knowledge that we have, specifically that spatial structure is super important in image data.
To do this, let's first represent our two-dimensional image as an array of pixel values, just as it normally was to start with. One way we can keep and maintain that spatial structure is by connecting patches of the input to a single neuron in the hidden layer. So instead of connecting every input pixel from our input image to a single neuron in the hidden layer, like with dense neural networks, we're going to connect just a single, very small patch, and notice here that only a region of that input image influences this single neuron at the hidden layer. To define connections across the entire input, we can apply the same principle, connecting patches in the input layer to single neurons in the subsequent layer. We do this by simply sliding the patch window across the input image, and in this case we're sliding it by two units each time. In this way, we maintain all of the spatial structure, the spatial information inherent to our image input. But remember that the final task we really want to do here, as I told you, is to learn visual features, and we can do this very simply by weighting those connections in the patches. So for each of the patches, instead of connecting them uniformly to our hidden layer, we're going to weight each of those pixels and apply a technique similar to what we saw in lab 1: we basically have a weighted summation of all of the pixels in the patch, and that feeds into the next hidden unit in our hidden layer, to detect a particular feature. Now, in practice this operation is simply called convolution, which gives way to the name convolutional neural network, which we'll get to later on.
We'll think about this at a high level first. Suppose we have a four-by-four filter, which means we have 16 different weights. We are going to apply this same filter to four-by-four patches across the entire input image, and we'll use the result of that operation to define the state of the neurons in the next hidden layer. We basically shift this patch across the image, shifting it, for example, in units of two pixels each time to grab the next patch, and we repeat the convolution operation. That's how we can start to think about extracting features from our input. But you're probably wondering: how does this convolution operation actually relate to feature extraction? So far, we've just defined the sliding operation, where we slide a patch over the input, but we haven't really talked about how that allows us to extract features from the image itself.
Let's make this concrete by walking through an example. Suppose we want to classify X's from a set of black-and-white images, where black is represented by the pixel value -1 and white is represented by the pixel value 1. Now, to classify X's, clearly we're not going to be able to just compare two of these matrices directly, because there's too much variation within the class. We want to be invariant to certain types of deformations of the images: scale, shift, rotation. We want to be able to handle all of that, so we can't just compare these two images as they are right now. Instead, what we're going to do is have our model compare these images of X's piece by piece, or patch by patch, and the important patches, the important pieces that it's looking for, are the features. If our model can find rough feature matches across these two images, then we can say with pretty high confidence that they're probably coming from the same class: if they share a lot of the same visual features, then they're probably representing the same object.

Now, each feature is like a mini image; each of these patches is a small two-dimensional array of numbers, and we'll use these filters, let me call them that now, to pick up on the features common to X's. In the case of X's, filters representing diagonal lines and crosses are probably the most important things to look for, and you can see those on the top row here. So we can probably capture these features in terms of the arms and the main body of the X: the arms, the legs, and the body will capture all of the features that we show here. And note that the smaller matrices are the filters of weights; these are the actual values of the weights that correspond to a patch as we slide it across the image.
Now, all that's left to do is define the convolution operation itself: when you slide that patch over the image, what is the actual computation that takes the patch on top of the image and produces the output at the next hidden layer? Convolution preserves the spatial structure between pixels by learning image features in small squares, small patches, of the input data. The entire computation is as follows. We first place the filter on top of a patch of our input image of the same size; here we're placing it on the top-left part of the image, in green, on the X. Then we perform an element-wise multiplication: for every pixel of the image where the filter overlaps, we multiply it element-wise by the corresponding filter value. The result, which you can see on the right, is a matrix of all ones, because there's perfect overlap between our filter and our image at this patch location. The only thing left to do is sum up all of those numbers, and when you sum them up you get nine, and that's the output at the next layer.
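Here's a minimal NumPy sketch of that single-patch computation, element-wise multiply and then sum; the 3x3 values are illustrative, chosen so the filter overlaps the patch perfectly, as in the example:

```python
import numpy as np

patch = np.array([[ 1, -1, -1],
                  [-1,  1, -1],
                  [-1, -1,  1]])  # region of the image under the filter
filt = np.array([[ 1, -1, -1],
                 [-1,  1, -1],
                 [-1, -1,  1]])   # the filter weights

elementwise = patch * filt        # perfect overlap -> a matrix of all ones
print(elementwise.sum())          # 9, the output at the next layer
```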
Now let's go through one more example, a little bit more slowly, and you might be able to appreciate what this convolution operation is intuitively telling us. That's mathematically how it's done; now let's see, intuitively, what it's showing us. Suppose we want to compute the convolution of this 5x5 image, in green, with this 3x3 filter. To do this, we need to cover the entire image by sliding the filter over it, performing the element-wise multiplication, and adding up the result for each patch. This is what that looks like. First, we place the yellow filter on the top-left corner, element-wise multiply, add all of the outputs, and get four, and we place that four in the first entry of our output matrix; this is called the feature map. We can continue this: slide the 3x3 filter over the image, element-wise multiply, add up all the numbers, and place the next result in the next column, which is three. And we just keep repeating this operation over and over, and that's it. The feature map on the right reflects where in the image there is activation by this particular filter. So let's take a look at this filter really quickly. You can see this filter is an X, or a cross: it has ones on both diagonals. And in the feature map you can see that the activation is strongest along the image's main diagonal, where the fours appear, showing that there is maximum overlap with this filter along that central diagonal.
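Here's a minimal NumPy sketch of that full sliding computation; the pixel values are my guess at the slide's example, chosen so the first two feature-map entries come out to four and three, as in the walkthrough:

```python
import numpy as np

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
filt = np.array([[1, 0, 1],
                 [0, 1, 0],
                 [1, 0, 1]])

# slide the 3x3 filter over the 5x5 image, one pixel at a time
out = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        patch = image[i:i + 3, j:j + 3]
        out[i, j] = (patch * filt).sum()  # element-wise multiply, then add

print(out)  # the 3x3 feature map; out[0, 0] == 4, out[0, 1] == 3
```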
Now let's take a quick look at how different types of filters, how changing the weights in your filter, can impact the feature maps, the outputs. Simply by changing the weights in your filter, you can change what your filter is looking for, what it's going to activate on. Take, for example, this image of the woman Lenna: on the left is the original image. If you slide different filters over this image, you get different output feature maps. For example, you can sharpen the image with the filter shown in the second column, you can detect edges in the image using the third column's filter, and you can detect even stronger edges with the fourth column's filter. These are ways that changing the weights in your filter can really impact the features that you detect. So now I hope you can appreciate how convolution allows us to capitalize on spatial structure and use sets of weights to extract local features within images, and how very easily we can detect different features by simply changing our weights and using different filters.
Okay, now, these concepts of preserving spatial information and spatial structure, while also doing local feature extraction using the convolution operation, are at the core of the neural networks we use for computer vision tasks. Now that we've got convolutions under our belt, we can start to think about how to utilize them to build full convolutional neural networks for solving computer vision tasks. These networks are very appropriately named convolutional neural networks, because their backbone is the convolution operation. We'll first take a look at a CNN, a convolutional neural network architecture, designed for image classification tasks, and we'll see how the convolution operation can feed into spatial downsampling operations so that we can build this full thing end to end.
First, let's consider a very simple CNN for image classification. Here the goal is to learn features directly from data, and to use these learned feature maps for classifying the images. There are three main parts to a CNN that I want to talk about. The first part is the convolutions, which we talked about before; these are for extracting the features in your image, or, more generically, in your previous layer. The second step is applying a nonlinearity; again, as we saw in lectures 1 and 2, nonlinearities allow us to deal with nonlinear data and introduce complexity into our learning pipeline, so that we can solve more complex tasks. And the third step, which is what I was talking about before, is the pooling operation, which allows you to downsample the spatial resolution of your image and deal with multiple scales of that image, or multiple scales of the features within it. Finally, the last point I want to make is that the computation of class scores, if we're dealing with image classification, can be output using a dense layer at the end, after your convolutional layers: a dense layer representing the probabilities of the image belonging to each class can be your final output. Now we'll go through each of these operations and break these ideas down a little further, so we can see the basic architecture of a CNN and how you can implement one as well.
Going through this step by step, the first of those three steps is the convolution operation, and as before, this is the same story we've been going through. Each neuron here in the hidden layer will compute a weighted sum of its inputs from its patch, apply a bias, like in lectures one and two, and activate with a local nonlinearity. What's special here is the local connectivity, which I just want to keep stressing: each neuron in the hidden layer only sees a patch of the original input image. We can define the actual computation for a neuron in the hidden layer: its inputs are the neurons in the patch in the previous layer; we apply a matrix of weights, that's the filter, a 4x4 filter in this case; we do an element-wise multiplication, add the results, apply a bias, and activate with the nonlinearity. That's it; that's our single neuron at the hidden layer, and we just keep repeating this by sliding the patch over the input. Remember that the element-wise multiplication and addition here is simply the convolution operation we talked about earlier; I'm not saying anything new, except for the addition of the bias term before the nonlinearity.
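As a minimal sketch of one neuron's computation (all values here are illustrative, and I'm assuming a ReLU as the nonlinearity):

```python
import numpy as np

patch = np.random.randn(4, 4)  # the 4x4 patch of inputs this neuron sees
W = np.random.randn(4, 4)      # the filter: a 4x4 matrix of weights
b = 0.1                        # the bias term

z = (W * patch).sum() + b      # element-wise multiply, sum, add bias
activation = max(0.0, z)       # apply the nonlinearity (here, a ReLU)
```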
So this defines how neurons in convolutional layers are connected. But within a single convolutional layer we can have multiple different filters, multiple different features that we might want to extract or detect. The output of a convolutional layer is therefore not a single image, but rather a volume of images, one for each of the different filters. Here, the depth d is the number of filters, the number of features you want to detect in that image, and that's set by the human: when you define your network, you define, at every layer, how many features you want to detect at that layer. We can also think about the connections of a neuron in a convolutional neural network in terms of its receptive field: the locations in the input that that specific node is connected to. These parameters define the spatial arrangement of the output of the convolutional layer. So, to summarize, we've seen how the connections of convolutional layers are defined, and how the output of a convolutional layer is a volume, defined by the depth, the number of filters we want to learn. With this information, we've defined our single convolutional layer, and we're well on our way to defining the full convolutional neural network. The remaining steps are kind of just icing on the cake at this point.
It starts with applying the nonlinearity: on that volume, we apply an element-wise nonlinearity, in this case a rectified linear unit activation function. This is very similar in idea to lectures 1 and 2, where we also applied nonlinearities to deal with highly nonlinear data. The rectified linear unit, or ReLU, activation function, which we haven't talked about yet, is just an activation function that takes as input any real number and essentially shifts everything less than zero to zero, while anything greater than zero it keeps the same. Another way to think about this is that it makes sure the minimum of whatever you feed in is zero: if the input is greater than zero it doesn't touch it, and if it's less than zero it caps it at zero.
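As a one-line sketch of that definition:

```python
import numpy as np

def relu(x):
    # everything below zero maps to zero; everything above stays the same
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))  # [0.  0.  0.  1.5]
```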
The next key idea of convolutional neural networks is pooling, and that's how we can deal with different spatial resolutions and become invariant to spatial scale in our image. The pooling operation is used to reduce the dimensionality of our layers, and this can be done after any convolutional layer: you can apply a convolutional layer on your input image, apply a nonlinearity, and then downsample using a pooling layer to get a different spatial resolution before applying your next convolutional layer, repeating this process for many layers in a deep neural network. A common technique here is called max pooling, and the idea is as follows: you slide another window, another patch, over your feature map, and for each patch you simply take the maximum value in that patch. Let's say we're dealing with 2x2 patches; in this case, for the red patch you can see on the top right, we simply take the maximum value in that patch, which is six, and place it in the output on the right-hand side. We repeat this over the entire image, and this allows us to shrink the spatial dimensions of our image while still maintaining the spatial structure. Actually, this is a great point to pause: I encourage all of you to think about what other ways you could perform a pooling operation. How else could you downsample these images? Max pooling is one way, taking the maximum of these 2x2 patches, but there are a lot of other really clever ways as well, and it's interesting to think about other ways we could potentially perform this downsampling operation.
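Here's a minimal NumPy sketch of 2x2 max pooling with a stride of two; the input values are illustrative, chosen so the first window's maximum is six, as in the example:

```python
import numpy as np

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])

pooled = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        # take the maximum over each non-overlapping 2x2 window
        pooled[i, j] = x[2*i:2*i + 2, 2*j:2*j + 2].max()

print(pooled)  # [[6. 8.]
               #  [3. 4.]]
```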
Now, with all of this knowledge, we're ready to put the key ideas of convolutional neural networks together and build these end-to-end networks. We have the three main steps I talked about before: convolution, local nonlinearities, and pooling operations, and with CNNs we can layer these operations to learn a hierarchy of features, a hierarchy of features whose presence or absence in the image data we want to detect. A CNN built for image classification, and I'm showing the first part of that CNN here on the left, can be broken down roughly into two parts. The first part is feature learning: that's where we want to extract and learn the features from our image data. This is simply applying the same idea I showed you before: we're going to stack convolutions and nonlinearities with pooling operations, and repeat this throughout the depth of our network. The next step for our convolutional neural network is to take those extracted, learned features and classify our image. The ultimate goal here is not just to extract features; we want to extract features and then use them to make some classification, some decision, based on our image. So we can feed the output features into a fully connected, or dense, layer, and that dense layer can output a probability distribution over the image's membership in different categories or classes. We do this using a function called softmax, which you actually already got some experience with in lab 1, whose output represents this categorical distribution.
Now let's put this all together and code our first end-to-end convolutional neural network from scratch. We'll start by defining our feature extraction head, which begins with a convolutional layer, here shown with 32 filters; that 32 is the number of filters we want to learn in this first convolutional layer. We downsample the spatial information using a max pooling operation, like I discussed earlier, and next we feed this into the next set of convolutional layers in our network. Now, instead of 32 features, we're going to extract even more: 64 features. Then, finally, we can flatten all of the spatial features that we've learned into a vector and learn our final probability distribution over class membership, and that allows us to classify the image into one of the different classes we've defined.
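Here's a minimal sketch of that architecture in TensorFlow/Keras, along the lines of what's described; the kernel sizes, input shape, and number of classes are assumptions for illustration:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # feature extraction head: 32 filters, then spatial downsampling
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu',
                           input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPool2D(pool_size=(2, 2)),
    # deeper layer extracts more features: 64 filters
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPool2D(pool_size=(2, 2)),
    # flatten the spatial feature maps and classify
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax'),  # 10 classes assumed
])
```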
So far, we've talked only about using CNNs for image classification tasks. In reality, this architecture extends to many, many different types of tasks and applications. When we were considering CNNs for classification, we saw that the pipeline has two main parts: first the feature learning part, shown here, and then a classification part as the second part of the pipeline. What makes a convolutional neural network so powerful is that you can take the feature extraction part of the pipeline and attach whatever kind of output head you want to it. You can treat the convolutional feature extractor simply as that, a feature extractor, and then plug in whatever other type of neural network you want at its output. So you can do detection by changing the output head; you can do semantic segmentation, where you want to predict a semantic class for every pixel in your image; and you can also do end-to-end robotic control, like we saw with the car driving before.
So what's an example of this? We've seen a significant impact of computer vision in medicine and healthcare over the last couple of years. Just a couple of weeks ago, a paper came out in which deep learning models were applied to breast cancer detection in mammogram images. What was shown was that CNNs were able to significantly outperform expert radiologists at detecting breast cancer directly from the mammogram images. That's done by feeding the images through a convolutional feature extractor, passing those learned features to dense layers, and then performing classification based on those dense layers. Instead of predicting a single number, breast cancer or no breast cancer, you could also imagine that for every pixel in the image you want to predict the class of that pixel.
Here we're showing a picture of two cows on the left. It's fed into a convolutional feature extractor, and then upscaled through an inverse-convolutional decoder to predict, for every pixel in the image, the class of that pixel. You can see that the network is able to correctly classify the two cows, in brown, whereas the grass is in green and the sky is in blue. This is basically detection, but not as a single yes-or-no answer over the whole image, is there a cow or not; instead, for every pixel, what is the class of that pixel? This is a much harder problem, and the output is created using upsampling operations: this is no longer just a dense neural network at the end, but what are called transpose convolutions, a kind of inverse of convolution, which scale our feature maps back up and allow us to predict full images as outputs, not just single numbers or single probability distributions. And of course, you can imagine this idea applied very easily to many other applications in healthcare as well, especially for segmenting various types of cancers, such as the brain tumors we're showing on the top, as well as parts of the blood that are infected with malaria, on the bottom.
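Here's a minimal, hypothetical sketch of that encoder-decoder idea in Keras: convolutions downsample while extracting features, then transpose convolutions upsample back to the input resolution to predict a class for every pixel. The layer sizes, input shape, and class count are all illustrative assumptions:

```python
import tensorflow as tf

num_classes = 3  # e.g. cow, grass, sky
model = tf.keras.Sequential([
    # encoder: extract features while shrinking the spatial dimensions
    tf.keras.layers.Conv2D(32, 3, strides=2, padding='same',
                           activation='relu', input_shape=(128, 128, 3)),
    tf.keras.layers.Conv2D(64, 3, strides=2, padding='same',
                           activation='relu'),
    # decoder: transpose convolutions scale the feature maps back up
    tf.keras.layers.Conv2DTranspose(32, 3, strides=2, padding='same',
                                    activation='relu'),
    tf.keras.layers.Conv2DTranspose(num_classes, 3, strides=2,
                                    padding='same', activation='softmax'),
])
# output shape: (128, 128, num_classes), a class distribution per pixel
```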
Let's see one final example before ending this lecture, going back to self-driving cars; the idea here is pretty similar. Let's say we want to learn a neural network to control a self-driving car, to learn autonomous navigation. Specifically, we're using our model to go from images of the road, maybe from a camera attached to the top of the car; you can think of the actual pixels coming from that camera being fed to the neural network. In addition to the pixels coming from the camera, we also have an image from a bird's-eye street view of roughly where the car is in the world. We can feed both of those images in; these are just two two-dimensional arrays of pixels, but they represent different things: one represents your perception of the world around you, and the other represents roughly where you are in the world, globally. What we want to do with this is directly predict, or infer, a full distribution of possible control actions the car could take at that instant. If it doesn't have a goal destination in mind, it could say: I could take any of these three directions, and steer in those directions; that's what we want to predict with this network. One way to do this is to train your neural network to take as input these camera images coming from the car, pass each of them through convolutional encoders, or feature extractors, and, now that you've learned features for each of those images, concatenate them all together, so you have a global set of features across all of your sensor data, and then learn your control outputs from those, on the right-hand side.
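Here's a minimal, hypothetical sketch of that multi-input structure using the Keras functional API; all layer sizes, input shapes, and the three-way control output are illustrative assumptions, not the actual system:

```python
import tensorflow as tf

def conv_encoder(inp):
    # a small convolutional feature extractor for one image stream
    x = tf.keras.layers.Conv2D(32, 5, strides=2, activation='relu')(inp)
    x = tf.keras.layers.Conv2D(64, 3, strides=2, activation='relu')(x)
    return tf.keras.layers.Flatten()(x)

camera = tf.keras.Input(shape=(66, 200, 3))  # camera view of the road
street = tf.keras.Input(shape=(64, 64, 3))   # coarse bird's-eye map view

# concatenate the learned features from both streams
features = tf.keras.layers.Concatenate()([conv_encoder(camera),
                                          conv_encoder(street)])
hidden = tf.keras.layers.Dense(128, activation='relu')(features)
controls = tf.keras.layers.Dense(3)(hidden)  # e.g. three steering options

model = tf.keras.Model(inputs=[camera, street], outputs=controls)
```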
Now again, this is done entirely end to end: we never told the car what a lane marker was, what a road was, how to turn right or left, or what an intersection is. We never gave it any of that information, but it's able to learn all of it, extracting those features from scratch just by watching a lot of human driving data, and learn how to drive on its own. Here's an example of how a human can actually enter the car and input a desired destination, which you can see on the top right; the red line indicates where we want the car to go on the map. Think of this like Google Maps: you plug in where you want to go, and then the convolutional neural network outputs the control commands, given what it sees on the road, to actually actuate the vehicle towards that destination. Note here that the vehicle is able to successfully navigate through those intersections, even though it's never driven in this area before, it's never seen these roads before, and we never even told it what an intersection was; it learned all of this from data, using convolutional neural networks.
The impact of CNNs has been very wide-reaching, beyond the examples I've given you today; it has touched so many different fields of computer vision, ranging across robotics, medicine, and many other fields. I'd like to conclude by taking a look at what we've covered in today's lecture. We first considered the origins of computer vision, how images are represented as brightness values to a computer, and how the convolution operation works in practice. Then we discussed the basic architecture, building up from convolution operations to convolutional layers, and from there to convolutional neural networks. And finally, we talked about the extensions and applications of convolutional neural networks, how we can visualize a little bit of their behavior, and how we can actually actuate on the real world with them, whether by making predictions on medical scans or by enabling robots to interact with humans in the real world. And that's it for the CNN lecture on computer vision. Next up, we'll hear from Ava about deep generative modeling. Thank you.