- So welcome everyone to CS231n.
I'm super excited to
offer this class again
for the third time.
It seems that every
time we offer this class
it's growing exponentially
unlike most things in the world.
This is the third time
we're teaching this class.
The first time we had 150 students.
Last year, we had 350
students, so it doubled.
This year we've doubled
again to about 730 students
when I checked this morning.
So anyone who was not able
to fit into the lecture hall
I apologize.
But, the videos will be
up on the SCPD website
within about two hours.
So if you weren't able to come today,
then you can still check it
out within a couple hours.
So this class CS231n is
really about computer vision.
And, what is computer vision?
Computer vision is really
the study of visual data.
Since there's so many people
enrolled in this class,
I think I probably don't
need to convince you
that this is an important problem,
but I'm still going to
try to do that anyway.
The amount of visual data in our world
has really exploded to a ridiculous degree
in the last couple of years.
And, this is largely a
result of the large number
of sensors in the world.
Probably most of us in this room
are carrying around smartphones,
and each smartphone has one, two,
or maybe even three cameras on it.
So I think on average
there are even more cameras
in the world than there are people.
And, as a result of all of these sensors,
there's just a crazy large, massive amount
of visual data being produced
out there in the world
each day.
So one statistic that I
really like, to put
this in perspective, is a 2015 study
from Cisco that estimated that by 2017,
which is where we are now, roughly 80%
of all traffic on the
internet would be video.
This is not even counting all the images
and other types of visual data on the web.
But, just from a pure
number of bits perspective,
the majority of bits
flying around the internet
are actually visual data.
So it's really critical
that we develop algorithms
that can utilize and understand this data.
However, there's a
problem with visual data,
and that's that it's
really hard to understand.
Sometimes we call visual
data the dark matter
of the internet in analogy
with dark matter in physics.
So for those of you who have
heard of this in physics
before, dark matter accounts
for some astonishingly large
fraction of the mass in the universe,
and we know it exists due to
its gravitational pull on
various celestial bodies
and whatnot, but we
can't directly observe it.
And, visual data on the
internet is much the same
where it comprises the majority of bits
flying around the internet,
but it's very difficult
for algorithms to actually
go in and understand
and see what exactly is
comprising all the visual data
on the web.
Another statistic that I
like is that of YouTube.
So for roughly every second of clock time
that passes in the world,
there's something like five hours
of video being uploaded to YouTube.
So if we just sit here and count,
one, two, three, now there's 15 more hours
of video on YouTube.
Google has a lot of
employees, but there's no way
that they could ever
have an employee sit down
and watch and understand
and annotate every video.
So if they want to catalog and serve you
relevant videos and maybe
monetize by putting ads
on those videos, it's really
crucial that we develop
technologies that can dive in
and automatically understand
the content of visual data.
So this field of computer vision is
truly an interdisciplinary
field, and it touches
on many different areas of science
and engineering and technology.
So obviously, computer vision's
the center of the universe,
but sort of as a constellation of fields
around computer vision, we
touch on areas like physics
because we need to understand
optics and image formation
and how images are
actually physically formed.
We need to understand
biology and psychology
to understand how animal
brains physically see
and process visual information.
We of course draw a lot
on computer science,
mathematics, and engineering
as we actually strive
to build computer systems that implement
our computer vision algorithms.
So a little bit more about
where I'm coming from
and about where the teaching
staff of this course
is coming from.
My co-instructor Serena and I
are both PhD students
in the Stanford Vision Lab which is headed
by professor Fei-Fei Li,
and our lab really focuses
on machine learning and
the computer science side
of things.
I work a little bit more
on language and vision.
I've done some projects in that.
And, other folks in our group have worked
a little bit on the neuroscience
and cognitive science
side of things.
So as a bit of introduction,
you might be curious
about how this course relates
to other courses at Stanford.
So we kind of assume a basic
introductory understanding
of computer vision.
So if you're kind of an undergrad,
and you've never seen
computer vision before,
maybe you should've taken
CS131 which was offered
earlier this year by Fei-Fei
and Juan Carlos Niebles.
There was a course taught last quarter
by Professor Chris
Manning and Richard Socher
about the intersection of deep learning
and natural language processing.
And, I imagine a number of
you may have taken that course
last quarter.
There'll be some overlap
between this course and that,
but we're really focusing
on the computer vision
side of things, and really
grounding all of our motivation
in computer vision.
Also concurrently taught this quarter
is CS231a taught by
Professor Silvio Savarese.
And, CS231a is a more
all-encompassing
computer vision course,
focusing on things
like 3D reconstruction,
on matching, and on robotic vision;
it's a bit broader
with regards to vision than our course.
And, this course, CS231n, really focuses
on a particular class
of algorithms revolving
around neural networks and
especially convolutional
neural networks and their applications
to various visual recognition tasks.
Of course, there's also a number
of seminar courses that are taught,
and you'll have to check the syllabus
and course schedule for
more details on those
'cause they vary a bit each year.
So this lecture is normally given
by Professor Fei-Fei Li.
Unfortunately, she wasn't
able to be here today,
so instead for the majority of the lecture
we're going to tag team a little bit.
She actually recorded a
bit of pre-recorded audio
describing to you the
history of computer vision
because this class is a
computer vision course,
and it's very critical and
important that you understand
the history and the context
of all the existing work
that led us to these developments
of convolutional neural
networks as we know them today.
I'll let virtual Fei-Fei take over
[laughing]
and give you a brief
introduction to the history
of computer vision.
Okay, let's start with today's agenda.
So we have two topics to cover: one is a
brief history of computer vision, and the
other is an overview of our course,
CS231n. So we'll start with a very
brief history of where vision comes
from, when computer vision started, and
where we are today. The history
of vision goes back many, many
years, in fact about 543 million
years ago. What was life like during that
time? Well, the Earth was mostly water,
there were a few species of animals
floating around in the ocean, and life
was very chill. Animals didn't move around
much; they didn't have eyes or
anything. When food swam by, they grabbed
it; if the food didn't swim by, they
just floated around. But something really
remarkable happened around 540 million
years ago. From fossil studies, zoologists
found that within a very short period of
time, ten million years, the number of
animal species just exploded. It went
from a few to hundreds of
thousands, and that was strange. What caused this?
There were many theories, but for many
years it was a mystery; evolutionary
biologists call this evolution's Big Bang.
A few years ago an Australian zoologist
called Andrew Parker proposed one of the
most convincing theories: from the studies
of fossils,
he discovered that around 540 million years
ago the first animals developed eyes, and
the onset of vision started this
explosive speciation phase. Animals could
suddenly see; once you can see, life
becomes much more proactive. Some
predators went after prey, and prey
had to escape from predators, so the
onset of vision started an
evolutionary arms race, and animals had
to evolve quickly in order to survive as
a species. So that was the beginning of
vision in animals. After 540 million
vision in animals after 540 million
years vision has developed into the
biggest sensory system of almost all
animals especially intelligent animals
in humans we have almost 50% of the
neurons in our cortex involved in visual
processing it is the biggest sensory
system that enables us to survive, work,
move around, manipulate things,
communicate, entertain, and many things.
The vision is really important for
animals and especially intelligent
animals. So that was a quick story of
biological vision. What about humans, the
history of humans making mechanical
vision or cameras? Well one of the early
cameras that we know of today is from the
1600s, the Renaissance period: the
camera obscura. This is a camera
based on pinhole camera principles. It's
very similar to
the early eyes that animals developed,
with a hole that collects light
and a plane in the back of the
camera that collects the information and
projects the imagery. So
as cameras evolved, today we have cameras
everywhere; the camera is one of the most
popular sensors people use, from
smartphones to other devices. In the
meantime, biologists started
studying the mechanism of vision. One of
the most influential works on both human
and animal vision, and one that
inspired computer vision, is the
work done by Hubel and Wiesel in the 50s
and 60s using electrophysiology.
The question they were asking was "what is the visual processing mechanism like
in primates, in mammals?" So they chose
to study the cat brain, which is more or less
similar to the human brain from a visual
processing point of view. What they did
was to place electrodes in the back
of the cat brain, which is where the
primary visual cortex is, and then
look at what stimuli make the neurons
in the primary visual
cortex of the cat brain respond excitedly.
What they learned is that there are many
types of cells in the primary
visual cortex of the cat brain,
but among the most important are
the simple cells, which respond to
oriented edges moving in certain
directions. Of course there are also more
complex cells, but by and large what they
discovered is that visual processing starts
with simple structures of the visual world,
oriented edges, and as information
moves along the visual processing
pathway the brain builds up the
complexity of the visual information
until it can recognize the complex
visual world. So the history of
computer vision also starts around the early
60s. Block World is a set of work
published by Larry Roberts, which is
widely known as one of the first,
probably the first, PhD theses of
computer vision, where the visual world
was simplified into simple geometric
shapes and the goal was to
recognize them and reconstruct what
these shapes are. In 1966 there was a now
famous MIT summer project called "The
Summer Vision Project." The goal of this
project, and I quote, "is an
attempt to use our summer workers
effectively in the construction of a
significant part of a visual system."
So the goal was, in one summer, to work
out
the bulk of the visual system. That was
an ambitious goal. Fifty years have
passed; the field of computer vision has
blossomed from one summer project into a
field of thousands of researchers
worldwide still working on some of the
most fundamental problems of vision. We
still have not solved vision, but it
has grown into one of the most important
and fastest-growing areas
of artificial intelligence. Another
person we should pay tribute to is
David Marr. David Marr was an MIT vision
scientist who wrote an
influential book in the late 70s about
what he thought vision is and how we
should go about computer vision,
developing algorithms that can
enable computers to recognize the visual
world. The thought process in
David Marr's book is
that in order to take an image and
arrive at a final, holistic, full 3D
representation of the visual world, we
have to go through several processes. The
first process is what he calls the "primal sketch;"
this is where mostly the edges,
the bars, the ends, the virtual lines, the
curves, and the boundaries are represented,
and this is very much inspired by what
neuroscientists had seen: Hubel and
Wiesel told us the early stages of visual
processing have a lot to do with simple
structures like edges. The next step,
after the edges and curves, is what David Marr calls the
"two-and-a-half-D sketch;" this is where we
start to piece together the surfaces,
the depth information, the layers, and the
discontinuities of the visual scene.
Then eventually we put everything
together and have a 3D model,
hierarchically organized in terms of
surface and volumetric primitives and so on.
So that was a very idealized thought
process of what vision is, and this way
of thinking actually dominated
computer vision for several decades. It
is also a very intuitive way for
students to enter the field of vision
and think about how we can deconstruct
the visual information.
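To make that first stage concrete: detecting an oriented edge, in the spirit of the primal sketch, can be sketched as convolving the image with a small oriented filter. The toy image, kernel, and helper below are purely illustrative, not code from any system discussed:

```python
# Toy sketch: oriented-edge detection via a tiny 2D convolution.

def convolve2d(image, kernel):
    """Valid-mode 2D convolution (technically cross-correlation) on lists of lists."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out

# Toy image: dark on the left, bright on the right, so one vertical edge.
image = [[0, 0, 1, 1]] * 3

# Simple vertical-edge kernel: responds to a left-to-right intensity change.
kernel = [[-1, 1]] * 3

print(convolve2d(image, kernel))  # -> [[0, 3, 0]]: the response peaks at the edge
```

The filter's response is largest exactly where the intensity changes, which is the kind of "simple structure" Hubel and Wiesel's simple cells respond to.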
Another very important, seminal body of
work happened in the 70s, when people
began to ask the question "how can we
move beyond the simple Block World and
start recognizing or representing real-world objects?"
Think about the 70s:
there was very little
data available, computers were extremely
slow, and PCs were not even around,
but computer scientists were starting to
think about how we can recognize and
represent objects. So in Palo Alto,
both at Stanford and at SRI, two
groups of scientists proposed
similar ideas: one called "generalized
cylinder," the other "pictorial structure."
The basic idea is that every
object is composed of simple geometric
primitives; for example, a person can be
pieced together from generalized
cylindrical shapes, or a person can be
pieced together from critical parts and
the elastic distances between
these parts.
Either representation is a way to
reduce the complex structure of the
object to a collection of
simpler shapes and their geometric configuration.
These works were
influential for quite a few years,
and then in the 80s came David Lowe. Here
is another example of thinking about how to
reconstruct or recognize the visual
world from simple structures: this
work is by David Lowe, in which he tries to
recognize razors by constructing
lines and edges, mostly
straight lines, and their combinations.
So there was a lot of effort in trying to
think about what the tasks in computer
vision are in the 60s, 70s, and 80s, and frankly
it was very hard to solve the problem of
object recognition; everything I've shown
you so far represents very audacious, ambitious
attempts, but they remained at the level of
toy examples
or just a few examples. Not a lot of
progress had been made in terms of
delivering something that could work in the
real world. So as people thought about what
the problems in solving vision are, one
important question came up:
if object recognition is too hard,
maybe we should first do object segmentation,
that is, the task of taking
an image and grouping the pixels into meaningful areas.
We might not know that the
pixels grouped together form a person,
but we can extract all the
pixels that belong to the person from the background;
that is called image
segmentation. So here's one very early
seminal work by Jitendra Malik and his
student Jianbo Shi from Berkeley,
using a graph-theory algorithm for the
problem of image segmentation.
Here's another problem that made headway
ahead of many other problems in
computer vision, which is face detection.
Faces are among the most important objects
to humans, probably the most important.
Around
1999 to 2000, machine learning techniques,
especially statistical machine
learning techniques, started to gain
momentum. These are techniques such as
support vector machines, boosting,
and graphical models, including the first
wave of neural networks. One particular
work that made a lot of contribution was
using the AdaBoost algorithm to do
real-time face detection, by Paul Viola
and Michael Jones, and there's a lot to
admire in this work. It was done in 2001,
when computer chips were still very, very
slow, but they were able to do face
detection in
images in near real time, and within five years
of the publication of this paper, in
2006, Fujifilm rolled out the first
digital camera with a real-time
face detector built in. So it
was a very rapid transfer from basic
science research to real-world application.
So as a field we continued to
explore how we can do object recognition
better. One of the very influential
ways of thinking from the late 90s through the
first decade of the 2000s was feature-based
object recognition, and here is a seminal
work by David Lowe called the SIFT feature.
The idea is that matching an entire object,
for example this stop sign, to
another stop sign is very difficult,
because there might be all kinds of
changes due to camera angle, occlusion,
viewpoint, lighting, and just the
intrinsic variation of the object itself,
but it was an inspired observation that there
are some parts of the object,
some features, that tend to remain diagnostic
and invariant to changes. So the task of
object recognition began with identifying
these critical features on the object
and then matching the features to a similar
object, which is an easier task than
matching the entire object. So here is a
figure from his paper which shows
that a handful, several dozen, SIFT
features from one stop sign are
identified and matched to the SIFT
features of another stop sign.
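The matching step is commonly implemented as a nearest-neighbor search over descriptors plus Lowe's ratio test, which keeps a match only when the best candidate is clearly better than the runner-up. Here is a toy Python sketch, with made-up 2-D "descriptors" standing in for real 128-dimensional SIFT features:

```python
import math

def match_features(desc_a, desc_b, ratio=0.8):
    """Match each descriptor in desc_a to its nearest neighbor in desc_b,
    keeping the match only if the nearest distance is clearly smaller than
    the second-nearest (Lowe's ratio test)."""
    def dist(u, v):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
    matches = []
    for i, d in enumerate(desc_a):
        # Sort candidate matches in desc_b by distance, best first.
        dists = sorted((dist(d, e), j) for j, e in enumerate(desc_b))
        best, second = dists[0], dists[1]
        if best[0] < ratio * second[0]:
            matches.append((i, best[1]))
    return matches

# Toy descriptors: the first two in each set correspond;
# the third entry in desc_b is unmatched clutter.
desc_a = [[0.0, 1.0], [5.0, 5.0]]
desc_b = [[0.1, 1.0], [5.0, 5.1], [9.0, 9.0]]
print(match_features(desc_a, desc_b))  # -> [(0, 0), (1, 1)]
```

The ratio test is what makes the matching robust: an ambiguous descriptor whose two nearest neighbors are about equally far away simply produces no match.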
Using the same building block,
diagnostic features in images,
the field made another step
forward and started recognizing
holistic scenes. Here is an example
algorithm called Spatial Pyramid Matching;
the idea is that there are
features in the images that can give us
clues about which type of scene it is,
whether it's a landscape, a kitchen, or
a highway, and so on. This particular
work takes these features from different
parts of the image, at different
resolutions, puts them together in a
feature descriptor, and then runs a
support vector machine on top of that.
Similar work
gained momentum in human recognition,
putting together these features. We
have a number of works that look at
how we can compose human bodies in more
realistic images and recognize them.
One work is called the "histogram of
gradients," another is called
"deformable part models." As you
can see, as we moved from the 60s, 70s, and 80s
toward the first decade of the 21st
century, one thing was changing: the
quality of the pictures. With the
growth of the Internet and of digital
cameras, we had better and better
data to study computer vision. So one of
the outcomes in the early 2000s was that
the field of computer vision had defined
a very important building-block problem to solve.
It's not the only problem to solve, but
in terms of recognition it is a very
important one: object
recognition. I've talked about object
recognition all along, but in the early
2000s we began to have benchmark datasets
that enabled us to measure the
progress of object recognition. One of
the most influential benchmark datasets
is the PASCAL Visual Object Challenge.
It's a dataset composed of 20
object classes, three of them shown
here: train, airplane, person; it
also has cows, bottles, cats, and so on. The
dataset is composed of several
thousand to ten thousand images per
category, and different
groups developed algorithms to test
against the test set and see how we
had made progress. So here is a figure
covering the years 2007 to 2012:
the performance on detecting the
20 object classes in this
benchmark dataset steadily
increased. So there was a lot of progress made.
Around that time a group of us, from
Princeton to Stanford, also began to ask
a harder question of ourselves as well
as our field: are we ready
to recognize every object, or most of the
objects, in the world? It was also motivated
by an observation rooted in
machine learning, which is that most
machine learning algorithms, whether
graphical models, support vector machines, or AdaBoost,
are very likely to overfit during
the training process. Part of the
problem is that visual data is very complex;
because it's complex, our models tend to
have high-dimensional
inputs and many
parameters to fit, and when we don't have
enough training data, overfitting happens
very fast and we cannot generalize
well. So motivated by this dual
reason, one being the desire to recognize the
world of all objects, the other
to overcome the machine learning
bottleneck of overfitting, we began this
project called ImageNet. We wanted to
put together the largest possible dataset
of all the pictures we could find, the
world of objects, and use it for
training as well as for benchmarking. It
was a project that took us about
three years and lots of hard work. It
basically began with downloading
billions of images from the internet,
organized by a dictionary called
WordNet, with tens of thousands of
object classes, and then we had to use
a clever crowd-engineering
method, using the Amazon Mechanical Turk
platform, to sort, clean, and label each of the
images. The end result was ImageNet:
almost 15 million
images organized in twenty-two thousand
categories of objects and scenes. This
was probably the
biggest dataset produced in the field of
AI at that time, and it began to push
the algorithm development of
object recognition into another phase.
Especially important is how to benchmark
the progress,
so starting in 2009 the ImageNet team rolled
out an international challenge called the
ImageNet Large-Scale Visual Recognition
Challenge. For this challenge we put
together a more stringent test set of
1.4 million images across 1,000 object
classes, to test the image
classification results of
computer vision algorithms. Here's
an example picture: if an algorithm
outputs five labels, and the top five
labels include the correct object in
the picture, then we call this a success.
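That top-5 success criterion can be written down directly. A small Python sketch, with toy labels and a hypothetical helper name:

```python
def top5_error(predictions, labels):
    """Fraction of examples whose true label is NOT among the five
    highest-ranked predicted labels (each prediction list is best-first)."""
    wrong = sum(1 for preds, y in zip(predictions, labels)
                if y not in preds[:5])
    return wrong / len(labels)

# Two toy examples: the first is a top-5 hit, the second a miss.
predictions = [["cat", "dog", "fox", "car", "cup"],
               ["dog", "car", "cup", "fox", "tree"]]
labels = ["car", "cat"]
print(top5_error(predictions, labels))  # -> 0.5
```

This top-5 error is the y-axis quantity on the ImageNet results plot discussed next.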
So here is a summary of the
ImageNet Challenge image
classification results from 2010
to 2015. On the x-axis you see the
years, and on the y-axis the error rate.
The good news is that the error rate
steadily decreased, to the point that by
2015 the error rate was so low it was on par
with what humans can do. By a human here
I mean a single Stanford PhD student who
spent weeks doing this task as if
he were a computer participating in the
ImageNet Challenge. So that's a lot of
progress made, even though we have not
solved all the problems of object
recognition, which you'll learn about in
this class.
But to go from an error rate that's
unacceptable for real-world applications
all the way to being on par with
humans in the ImageNet Challenge, the field
took only a few years. And one particular
moment you should notice on this graph
is the year 2012. In the first two
years our error rate hovered around 25
percent, but in 2012 the error rate
dropped by almost 10 percent, to 16
percent; it's better now, but
that drop was very significant. The
winning algorithm of that year was a
convolutional neural network model that
beat all other algorithms
to win the ImageNet Challenge, and
this is the focus of our whole course
this quarter: to take a
deep dive into what convolutional neural
network models are. Another, now more
popular, name for this is deep
learning. We will look at what these
models are, what the principles are, what
the good practices are, and what the
recent progress of these models is. But
here is where the history was made:
around 2012, convolutional
neural network models, or deep learning
models, showed tremendous capacity
and ability to make good progress in
the field of computer vision, along with
several other sister fields like natural
language processing and speech
recognition. So without further ado, I'm
going to hand the rest of the lecture
to Justin to talk about the overview of
CS231n.
Alright, thanks so much Fei-Fei.
I'll take it over from here.
So now I want to shift gears a little bit
and talk a little bit more
about this class CS231n.
So the primary focus of this class
is the image classification problem,
which we previewed a
little bit in the context
of the ImageNet Challenge.
So in image classification, again,
the setup is that your
algorithm looks at an image
and then picks from among
some fixed set of categories
to classify that image.
And, this might seem like
somewhat of a restrictive
or artificial setup, but
it's actually quite general.
And, this problem can be applied
in many different settings
both in industry and academia
and many different places.
So for example, you could
apply this to recognizing food
or recognizing calories
in food or recognizing
different artworks, different
products out in the world.
So this relatively basic
tool of image classification
is super useful on its
own and could be applied
all over the place for many
different applications.
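To make the setup concrete, here is a toy Python sketch of the image classification interface: pixels in, one label from a fixed set out. The nearest-neighbor rule and the tiny flattened "images" are purely illustrative assumptions, not how the real systems described here work:

```python
def classify(image, examples):
    """Toy image classifier: return the label of the training example whose
    pixels are closest (L1 distance) to the input image. Real classifiers are
    far more sophisticated, but the interface is the same: pixels in, one
    label from a fixed set of categories out."""
    def l1(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))
    best_label, _ = min(((label, l1(image, pixels))
                         for label, pixels in examples),
                        key=lambda t: t[1])
    return best_label

# Tiny flattened "images" (4 pixels each) with known labels.
examples = [("cat", [0, 0, 1, 1]), ("dog", [1, 1, 0, 0])]
print(classify([0, 0, 1, 0], examples))  # -> "cat"
```

Everything in the course, up to and including convolutional networks, can be seen as better and better implementations of this same function signature.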
But, in this course,
we're also going to talk
about several other visual
recognition problems
that build upon many of
the tools that we develop
for the purpose of image classification.
We'll talk about other problems
such as object detection
or image captioning.
So the setup in object detection
is a little bit different.
Rather than classifying an entire image
as a cat or a dog or a horse or whatnot,
instead we want to go in
and draw bounding boxes
and say that there is a
dog here, and a cat here,
and a car over in the background,
and draw these boxes describing
where objects are in the image.
We'll also talk about image captioning
where given an image the system
now needs to produce a
natural language sentence
describing the image.
It sounds like a really hard, complicated,
and different problem, but we'll see
that many of the tools that we develop
in service of image classification
will be reused in these
other problems as well.
So we mentioned this before in the context
of the ImageNet Challenge,
but one of the things
that's really driven the
progress of the field
in recent years has been this adoption
of convolutional neural networks or CNNs
or sometimes called convnets.
So if we look at the
algorithms that have won
the ImageNet Challenge for
the last several years,
in 2011 we see this method from Lin et al
which is still hierarchical.
It consists of multiple layers.
So first we compute some features,
next we compute some local invariances,
some pooling, and go
through several layers
of processing, and then finally feed
this resulting descriptor to a linear SVM.
What you'll notice here is that
this is still hierarchical.
We're still detecting edges.
We're still having notions of invariance.
And, many of these
intuitions will carry over
into convnets.
But, the breakthrough
moment was really in 2012
when Jeff Hinton's group in Toronto
together with Alex
Krizhevsky and Ilya Sutskever
who were his PhD students at that time
created this seven layer
convolutional neural network
now known as AlexNet,
then called Supervision
which just did very, very well
in the ImageNet competition
in 2012.
And, since then every year
the winner of ImageNet
has been a neural network.
And, the trend has been
that these networks
are getting deeper and deeper each year.
So AlexNet was a seven or
eight layer neural network
depending on how exactly you count things.
In 2014 we had these much deeper networks:
GoogLeNet from Google
and VGG, the VGG network
from Oxford, which was about
19 layers at that time.
And, then in 2015 it got really crazy
and this paper came out
from Microsoft Research Asia
called Residual Networks which
were 152 layers at that time.
And, since then it turns out you can get
a little bit better if you go up to 200,
but you run out of memory on your GPUs.
We'll get into all of that later,
but the main takeaway here
is that convolutional neural
networks really had
this breakthrough moment
in 2012, and since then there's been
a lot of effort focused
in tuning and tweaking
these algorithms to make them
perform better and better
on this problem of image classification.
And, throughout the rest of the quarter,
we're going to really dive in deep,
and you'll understand exactly
how these different models
work.
But, one point that's really important,
it's true that the breakthrough moment
for convolutional neural
networks was in 2012
when these networks performed very well
on the ImageNet Challenge,
but they certainly weren't
invented in 2012.
These algorithms had actually been around
for quite a long time before that.
So one of the sort of foundational works
in this area of
convolutional neural networks
was actually in the '90s from
Yann LeCun and collaborators
who at that time were at Bell Labs.
So in 1998 they built this
convolutional neural network
for recognizing digits.
They wanted to deploy
this and wanted to be able
to automatically recognize
handwritten checks
or addresses for the post office.
And, they built this
convolutional neural network
which could take in the pixels of an image
and then classify either what digit it was
or what letter it was or whatnot.
And, the structure of this network
actually looks pretty
similar to the AlexNet
architecture that was used in 2012.
Here we see that, you know, we're taking
in these raw pixels.
We have many layers of
convolution and sub-sampling,
together with the so called
fully connected layers.
All of which will be
explained in much more detail
later in the course.
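The layer sizes in a LeNet-style network follow from simple arithmetic. Assuming the commonly quoted LeNet-5 shapes (a 32x32 input, 5x5 convolutions, 2x2 subsampling with stride 2), a quick Python check:

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a conv/pool layer:
    floor((size - kernel + 2*pad) / stride) + 1."""
    return (size - kernel + 2 * pad) // stride + 1

# A LeNet-style stack, using the sizes commonly described for LeNet-5:
size = 32                    # 32x32 input image
size = conv_out(size, 5)     # 5x5 conv            -> 28
size = conv_out(size, 2, 2)  # 2x2 subsample, s=2  -> 14
size = conv_out(size, 5)     # 5x5 conv            -> 10
size = conv_out(size, 2, 2)  # 2x2 subsample, s=2  -> 5
print(size)  # -> 5 (i.e., 5x5 feature maps feed the fully connected layers)
```

The same formula governs the AlexNet-style architectures discussed above; only the input size, kernel sizes, strides, and padding change.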
But, if you just kind of
look at these two pictures,
they look pretty similar.
And, this architecture in 2012 has a lot
of these architectural similarities
that are shared with this
network going back to the '90s.
So then the question you might ask
is if these algorithms
were around since the '90s,
why have they only suddenly become popular
in the last couple of years?
And, there are a couple of
really key innovations
that have
changed since the '90s.
One is computation.
Thanks to Moore's law, we've gotten
faster and faster computers every year.
And, this is kind of a coarse measure,
but if you just look at
the number of transistors
that are on chips, then that has grown
by several orders of magnitude
between the '90s and today.
We've also had this advent
of graphics processing units
or GPUs which are super parallelizable
and ended up being a perfect tool
for really crunching these
computationally intensive
convolutional neural network models.
So just by having more compute available,
it allowed researchers to
explore with larger architectures
and larger models, and in some cases,
just increasing the model
size, but still using
these kind of classical approaches
and classical algorithms
tends to work quite well.
So this idea of increasing computation
is super important in the
history of deep learning.
I think the second key
innovation that changed
between now and the '90s was data.
So these algorithms are
very hungry for data.
You need to feed them
a lot of labeled images
and labeled pixels for them
to eventually work quite well.
And, in the '90s there just wasn't
that much labeled data available.
This was, again, before
tools like Mechanical Turk,
before the internet was
super, super widely used.
And, it was very difficult to collect
large, varied datasets.
But, now in the 2010s
with datasets like PASCAL
and ImageNet, there existed
these relatively large,
high quality labeled
datasets that were, again,
orders of magnitude bigger
than the datasets available in the '90s.
And, these much larger datasets, again,
allowed us to work with
higher capacity models
and train these models to
actually work quite well
on real world problems.
But, the critical takeaway here is
that convolutional neural networks
although they seem like this
sort of fancy, new thing
that's only popped up in
the last couple of years,
that's really not the case.
And, this class of
algorithms has existed
for quite a long time in
its own right as well.
Another thing I'd like to point out
is that in computer vision we're in the business
of trying to build machines
that can see like people.
And, people can actually
do a lot of amazing things
with their visual systems.
When you go around the world,
you do a lot more than just drawing boxes
around the objects and classifying
things as cats or dogs.
Your visual system is much
more powerful than that.
And, as we move forward in the field,
I think there's still a
ton of open challenges
and open problems that we need to address.
And, we need to continue
to develop our algorithms
to do even better and tackle
even more ambitious problems.
Some examples of this are
going back to these older ideas
in fact.
Things like semantic segmentation
or perceptual grouping
where rather than
labeling the entire image,
we want to understand for
every pixel in the image
what is it doing, what does it mean.
And, we'll revisit that
idea a little bit later
in the course.
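To make "a label for every pixel" concrete, here is a tiny hedged sketch: however the per-pixel class scores are produced, a semantic segmentation is just the highest-scoring class at each pixel. The shapes, class names, and scores below are made up purely for illustration.

```python
import numpy as np

# Hypothetical per-pixel class scores: height x width x num_classes.
# In practice a model would produce these; here they are hand-made.
H, W, C = 2, 3, 4  # classes could be e.g. sky, grass, cat, dog
scores = np.zeros((H, W, C))
scores[0, :, 0] = 1.0  # top row scores highest for class 0 ("sky")
scores[1, :, 1] = 1.0  # bottom row scores highest for class 1 ("grass")

# Semantic segmentation: one class label per pixel,
# rather than a single label for the whole image.
segmentation = scores.argmax(axis=-1)  # shape (H, W) of class indices
```

The output is a label map the same size as the image, which is exactly what distinguishes this task from whole-image classification.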
There's definitely work going back
to this idea of 3D understanding,
of reconstructing the entire world,
and that's still an
unsolved problem I think.
There are just tons and tons of other tasks
that you can imagine.
For example activity recognition,
if I'm given a video of some person
doing some activity, what's the best way
to recognize that activity?
That's quite a challenging
problem as well.
And, then as we move forward with things
like augmented reality
and virtual reality,
and as new technologies
and new types of sensors
become available, I think we'll come up
with a lot of new, interesting
hard and challenging
problems to tackle as a field.
So this is an example
from some of my own work
in the vision lab on this
dataset called Visual Genome.
So here the idea is that
we're trying to capture
some of these intricacies
in the real world.
Rather than maybe describing just boxes,
maybe we should be describing images
as these whole large graphs
of semantically related
concepts that encompass
not just object identities
but also object relationships,
object attributes,
actions that are occurring in the scene,
and this type of
representation might allow us
to capture some of this
richness of the visual world
that's left on the table when we're using
simple classification.
This is by no means a standard
approach at this point,
but just kind of giving you this sense
that there's so much more
that your visual system can do
that is maybe not captured in this vanilla
image classification setup.
I think another really interesting work
that kind of points in this direction
actually comes from
Fei-Fei's grad school days
when she was doing her PhD at Caltech
with her advisors there.
In this setup, they sat people down
and showed them this
image for just half a second.
So they flashed this
image in front of them
for just a very short period of time,
and even in this very, very rapid exposure
to an image, people were able to write
these long descriptive paragraphs
giving a whole story of the image.
And, this is quite remarkable
if you think about it
that after just half a second
of looking at this image,
a person was able to say that this is
some kind of a game or
fight, two groups of men.
The man on the left is throwing something.
It's outdoors, because they have
an impression of grass,
and so on and so on.
And, you can imagine that if a person
were to look even longer at this image,
they could probably write a whole novel
about who these people are
and why they're in this
field playing this game.
They could go on and on and on
roping in things from
their external knowledge
and their prior experience.
This is in some sense the
holy grail of computer vision:
to sort of understand
the story of an image
in a very rich and deep way.
And, I think that despite
the massive progress
in the field that we've had
over the past several years,
we're still quite a long way
from achieving this holy grail.
Another image that I
think really exemplifies
this idea comes, again,
from Andrej Karpathy's blog:
this amazing image.
Many of you smiled, many of you laughed.
I think this is a pretty funny image.
But, why is it a funny image?
Well we've got a man standing on a scale,
and we know that people
are kind of self conscious
about their weight sometimes,
and scales measure weight.
Then we've got this other guy behind him
pushing his foot down on the scale,
and we know that because
of the way scales work
that will cause him to
have an inflated reading
on the scale.
But, there's more.
We know that this person
is not just any person.
This is actually Barack
Obama who was at the time
President of the United States,
and we know that Presidents
of the United States
are supposed to be respectable
politicians that are
[laughing]
probably not supposed to be playing jokes
on their compatriots in this way.
We know that there's these people
in the background that
are laughing and smiling,
and we know that that means that they're
understanding something about the scene.
We have some understanding that they know
that President Obama
is this respectable guy
who's looking at this other guy.
Like, this is crazy.
There's so much going on in this image.
And, our computer vision algorithms today
are actually a long way
I think from this true,
deep understanding of images.
So I think that sort of
despite the massive progress
in the field, we really
have a long way to go.
To me, that's really
exciting as a researcher
'cause I think that we'll have
just a lot of really
exciting, cool problems
to tackle moving forward.
So I hope at this point I've
done a relatively good job
to convince you that computer
vision is really interesting.
It's really exciting.
It can be very useful.
It can go out and make
the world a better place
in various ways.
Computer vision could be applied
in places like medical
diagnosis and self-driving cars
and robotics and all
these different places.
In addition, it ties back to this core
idea of understanding human intelligence.
So to me, I think that computer vision
is this fantastically
amazing, interesting field,
and I'm really glad that over the course
of the quarter, we'll
get to really dive in
and dig into all these different details
about how these algorithms
are working these days.
That's sort of my pitch
about computer vision
and about the history of computer vision.
I don't know if there's
any questions about this
at this time.
Okay.
So then I want to talk a little bit more
about the logistics of this class
for the rest of the quarter.
So you might ask who are we?
So this class is taught by Fei-Fei Li
who is a professor of computer
science here at Stanford
who's my advisor and director
of the Stanford Vision Lab
and also the Stanford AI Lab.
The other two instructors
are me, Justin Johnson,
and Serena Yeung who is
up here in the front.
We're both PhD students
working under Fei-Fei
on various computer vision problems.
We have an amazing
teaching staff this year
of 18 TAs so far.
Many of whom are sitting
over here in the front.
These guys are really the unsung heroes
behind the scenes making
the course run smoothly,
making sure everything happens well.
So be nice to them.
[laughing]
I think I also should mention
this is the third time
we've taught this course,
and it's the first time
that Andrej Karpathy has
not been an instructor
in this course.
He was a very close friend of mine.
He's still alive.
He's okay, don't worry.
[laughing]
But, he graduated, so he's actually here
I think hanging around
in the lecture hall.
A lot of the development and
the history of this course
is really due to him working on it
with me over the last couple of years.
So I think you should be aware of that.
Also about logistics,
probably the best way
for keeping in touch with the course staff
is through Piazza.
You should all go and sign up right now.
Piazza is really our preferred
method of communication
with the class with the teaching staff.
If you have questions that you're
embarrassed about asking
in front of your classmates, go ahead
and ask anonymously, or even
post private questions
directly to the teaching staff.
So basically anything that you need
should ideally go through Piazza.
We also have a staff mailing list,
but we ask that it be used mostly
for sort of personal, confidential things
that you don't want going on Piazza.
or if you have something
that's super confidential,
super personal, then feel free
to directly email me or
Fei-Fei or Serena about that.
But, for the most part,
most of your communication
with the staff should be through Piazza.
We also have an optional
textbook this year.
This is by no means required.
You can go through the course
totally fine without it.
Everything will be self contained.
This is sort of exciting
because it's maybe the first
textbook about deep learning,
published earlier this year
by Ian Goodfellow,
Yoshua Bengio, and Aaron Courville.
I put the Amazon link here in the slides.
You can get it if you want to,
but also the whole content of the book
is free online, so you
don't even have to buy it
if you don't want to.
So again, this is totally optional,
but we'll probably be
posting some readings
throughout the quarter
that give you an additional
perspective on some of the material.
So our philosophy about this class
is that you should really
understand the deep mechanics
of all of these algorithms.
You should understand at a very deep level
exactly how these algorithms are working
like what exactly is going on when you're
stitching together these neural networks,
how do these architectural decisions
influence how the network is trained
and tested and whatnot and all that.
And, throughout the course
through the assignments,
you'll be implementing
your own convolutional
neural networks from scratch in Python.
You'll be implementing the
full forward and backward
passes through these
things, and by the end,
you'll have implemented a whole
convolutional neural network
totally on your own.
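To give a flavor of what implementing "forward and backward passes" means, here is a minimal NumPy sketch of a single fully connected (affine) layer; the function names and calling convention are my own illustration, not the assignments' actual API.

```python
import numpy as np

def affine_forward(x, w, b):
    """Forward pass of a fully connected layer.
    x: inputs (N, D), w: weights (D, M), b: biases (M,)."""
    out = x @ w + b
    cache = (x, w)  # stash what the backward pass will need
    return out, cache

def affine_backward(dout, cache):
    """Backward pass: given the upstream gradient dout (N, M),
    return gradients with respect to x, w, and b."""
    x, w = cache
    dx = dout @ w.T        # gradient w.r.t. inputs, (N, D)
    dw = x.T @ dout        # gradient w.r.t. weights, (D, M)
    db = dout.sum(axis=0)  # gradient w.r.t. biases, (M,)
    return dx, dw, db
```

Chaining many layers like this (with nonlinearities in between) and calling the backward functions in reverse order is essentially backpropagation, which the assignments build up from scratch.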
I think that's really cool.
But, we're also kind of
practical, and we know
that in most cases people
are not writing these things
from scratch, so we also want to give you
a good introduction to some
of the state of the art
software tools that are used
in practice for these things.
So we're going to talk about
some of the state of the art
software packages like
TensorFlow, Torch, PyTorch,
all these other things.
And, I think you'll get some exposure
to those on the homeworks
and definitely through
the course project as well.
Another note about this course
is that it's very state of the art.
I think it's super exciting.
This is a very fast moving field.
As you saw, even in those plots
from the ImageNet challenge,
basically there's been a ton of progress
since 2012, and while
I've been in grad school,
the whole field has sort
of been transforming every year.
And, that's super exciting
and super encouraging.
But, what that means is that
there's probably content
that we'll cover this
year that did not exist
the last time that this
course was taught last year.
I think that's super
exciting, and that's one
of my favorite parts
about teaching this course
is just roping in all
these new scientific,
hot off the presses stuff and being able
to present it to you guys.
We're also sort of about fun.
So we're going to talk
about some interesting,
maybe not so serious
topics as well this quarter,
including image captioning,
which is pretty fun, where we can
write descriptions of images.
But, we'll also cover some
of these more artistic things
like DeepDream here on the left
where we can use neural
networks to hallucinate
these crazy, psychedelic images.
And, by the end of the course, you'll know
how that works.
Or on the right, this
idea of style transfer
where we can take an image and render it
in the style of famous artists
like Picasso or Van Gogh
or whatnot.
And again, by the end of the quarter,
you'll see how this stuff works.
So the way the course works
is we're going to have
three problem sets.
The first problem set
will hopefully be out
by the end of the week.
We'll have an in class,
written midterm exam.
And, a large portion of your grade
will be the final course
project where you'll work
in teams of one to three and produce
some amazing project that
will blow everyone's minds.
We have a late policy, so
you have seven late days
that you're free to allocate
among your different homeworks.
These are meant to cover
things like minor illnesses
or traveling or conferences
or anything like that.
If you come to us at
the end of the quarter
and say, "I suddenly
have to give a presentation
at this conference,"
that's not going to be okay.
That's what your late days are for.
That being said, if you have some
very extenuating circumstances,
then do feel free to email
the course staff about that.
Finally, I want to make a note
about the collaboration policy.
As Stanford students,
you should all be aware
of the honor code that governs the way
that you should be collaborating
and working together,
and we take this very seriously.
We encourage you to think very carefully
about how you're
collaborating and making sure
it's within the bounds of the honor code.
So in terms of prerequisites,
I think the most important
is probably a deep familiarity with Python
because all of the programming assignments
will be in Python.
Some familiarity with C
or C++ would be useful.
You will probably not
be writing any C or C++
in this course, but as you're
browsing through the source
code of these various software packages,
being able to read C++ code at least
is very useful for understanding
how these packages work.
We also assume that you
know calculus: you know
how to take derivatives,
all that sort of stuff.
We assume some linear algebra.
That you know what matrices are
and how to multiply them
and stuff like that.
We can't be teaching you how to take
like derivatives and stuff.
We also assume a little bit of knowledge
coming in of computer
vision maybe at the level
of CS131 or CS231A.
If you have taken those courses before,
you'll be fine.
If you haven't, I think
you'll be okay in this class,
but you might have a tiny
bit of catching up to do.
But, I think you'll probably be okay.
Those are not super strict prerequisites.
We also assume a little
bit of background knowledge
about machine learning
maybe at the level of CS229.
But again, I think really
important, key fundamental
machine learning concepts
we'll reintroduce
as they come up and become important.
But, that being said, a
familiarity with these things
will be helpful going forward.
So we have a course website.
Go check it out.
There's a lot of information and links
and syllabus and all that.
I think that's all that I
really want to cover today.
And, then later this week on Thursday,
we'll really dive into our
first learning algorithm
and start diving into the
details of these things.
