- Welcome to Human-Centered
Artificial Intelligence.
The last couple of decades of developments in deep learning have been exciting in the problems we've been able to automate, the problems we've been able to crack with learning-based methods.
One of the things underlying this lecture and the following lectures is the idea that with the purely learning-based approach we have been using, there are certain aspects fundamental to our reality where we're going to hit a wall, and that we have to integrate, incorporate the human being deeply into learning-based systems in order to make the systems learn well and operate in the real world.
The first underlying prediction behind the idea of human-centered AI in this century is that the learning-based approaches that have been successful over the past two decades, like deep learning, machine learning approaches that learn from data, are going to continue to get better and to dominate real-world applications.
So as opposed to fine-tuned
optimization-based models
that do not learn from data,
more and more we're going to
see learning-based methods
dominate real-world applications.
That's the underlying prediction
that we're working with.
Now, if that's the case, the corollary is this: if learning-based methods are the solution to many of these real-world problems, then the way we get smarter AI systems is by improving the machine learning and the machine teaching.
Machine learning is the thing
that we've been talking about quite a bit,
that's the deep learning,
that's the algorithm,
the optimization of
neural network parameters
where you learn from data.
That's the current focus of the community,
current focus in the research
and the thing that's behind the success
of much of the developments
in deep learning.
And then there's the machine teaching,
that's the human-centered part.
It's optimizing not the models,
not the algorithms, but optimizing
how you select the data based
on which the algorithms learn.
It's to make better teachers,
just like when you yourself
are learning as a student
or as a child how to
operate in this world,
the world and the parents
and the teachers around you
are informing you with
very sparse information
but providing the kind of information
that is most useful for
your learning process.
The selection of data on which to learn, I believe, is the critical direction of research that we have to solve in order to create truly intelligent systems, ones that are able to work in the real world, and I'll explain what I mean by that.
Consider the implications of learning-based systems. When you have a learning system, a system that learns from data, neural networks, machine learning, the fundamental reality is that the model is trying to generalize across the entirety of the reality in which it will be tasked with operating, based on a very small subset of samples from that reality.
And that generalization means there's always going to be a degree of uncertainty, always a degree of incomplete information, and so no matter how much we want them to be, these systems will not be provably safe; we can't put down anything concrete that guarantees safety in some specific way unless the system is extremely constrained. Therefore we need human supervision of these systems.
The systems will not be provably fair, from an ethics perspective, from a discrimination perspective, from every degree of fairness; therefore we need human supervision of these systems.
And they will not be explainable: at any step of the pipeline in which they made decisions, AI systems will not be perfectly explainable to the satisfaction of us as human supervisors. So there, again, human supervision will constantly be required.
And the solution to this is a whole set of techniques, a whole set of ideas, that we're putting under the flag of human-centered artificial intelligence, human-centered AI. The core idea there is that we need to integrate the human being deeply into the annotation process and deeply into the human supervision of the real-world operation of the system: both in the training phase and in the testing phase, the execution, the operation of the system.
So this is what deep learning looks like
with the human out of the loop.
The human contributes to a learning model by helping annotate some data, and that data is then used to train a model that hopefully generalizes in the real world and makes decisions. Deep learning is really exciting because, with a greater and greater degree of autonomy, it's able to form high-level representations of the raw data in a way that lets it do quite well on certain kinds of tasks that were very difficult before. But fundamentally the human is out of the loop in both the training and the operation.
First you build the data
set, annotate the data set,
and then the systems run away with it.
They train on the data, and
the real-world operation
does not involve the human
except as the recipient
of the service the system provides.
Now, the in-the-loop version of that, the human-centered version, means that both the annotation and the operation of the system are aided by human beings in a deep way.
What does that mean?
So, we can look at a human
expert, so individuals,
and crowd intelligence,
the wisdom of the crowd
and the wisdom of the individual.
At the training phase, the first part of that is objective annotation. We need to significantly improve objective annotation, meaning annotation where a single human's intelligence is sufficient to look at a sample and annotate it. This is what we think of as ImageNet and all the basic computer vision tasks, where a single human is enough to do a pretty damn good job of determining what's in a particular sample.
And then there's subjective annotation: things that are difficult for any single human being to determine, where the crowd will kind of converge on these difficult questions. At the low level these are questions of emotion, things that are a little bit fuzzy, that require multiple people to annotate. And at the high level there are ethical questions, decisions that an AI system is tasked with making, or that we're tasked with making, that nobody really knows the right answer to, where as a crowd we kind of converge on the right answer. That's where crowd intelligence comes in on the data annotation side.
Now, in the operation phase, once you train the model, the supervision of the system based on the wisdom of the individual, and I'll give more concrete examples of this, is, for example, operating an autonomous vehicle: a single driver is tasked with supervising the decisions of that AI system. That's a critical step for a learning-based system that's not guaranteed to be safe, that's not guaranteed to be explainable.
And the subjective side of that, where crowd intelligence is required, where a single person is not able to make the call, involves, again, ethical questions about the operation of autonomous systems: the supervision of autonomous vehicles, the supervision of systems in medical diagnosis and in medicine in general. This is AI operating in the real world, making ethical decisions that are fundamentally difficult for humans to make, and that's where crowd intelligence needs to come in.
And so we have to transform
the machine learning problem
by integrating the human being.
First up top in the training process,
on the left that's the usual
machine learning formulation
of a human being doing
brute force annotation
with some kind of data set,
cats and dogs in ImageNet,
segmentation data set in Cityscapes,
video action recognition
in the YouTube data set.
Given the data set, humans put
in a lot of expensive labor
to annotate what's going on in that data,
and then the machine learns.
The flip side of that, the machine teaching side, the human-centered side, is that the machine instead, the learning model, the learning algorithm, and we're talking mostly about neural networks here, is tasked with selecting the small, sparse subsets of the data that are most useful for the human to annotate. So instead of the human doing the brute-force annotation task first, the machine queries the human. This is the field called machine teaching.
The machine queries a human with questions, and the task, and this is a wide-open research field, is to reduce by several orders of magnitude the amount of data that needs to be annotated.
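To make that concrete, here is a minimal sketch of what such a query loop might look like; the logistic regression model, the batch size, and the ask_human interface are illustrative assumptions, not the specific system described in the lecture:

```python
# Minimal machine-teaching sketch: a model trained on a small labeled
# seed set repeatedly asks a human to annotate only the unlabeled
# samples it is least certain about.
import numpy as np
from sklearn.linear_model import LogisticRegression

def query_loop(X_seed, y_seed, X_pool, ask_human, n_rounds=10, batch=5):
    """ask_human(indices) is a stand-in for a real annotation interface."""
    X_train, y_train = X_seed.copy(), y_seed.copy()
    pool_idx = np.arange(len(X_pool))
    model = LogisticRegression(max_iter=1000)
    for _ in range(n_rounds):
        model.fit(X_train, y_train)
        probs = model.predict_proba(X_pool[pool_idx])
        # Uncertainty: how far the top class probability is from 1.
        uncertainty = 1.0 - probs.max(axis=1)
        ask = pool_idx[np.argsort(-uncertainty)[:batch]]
        labels = ask_human(ask)  # the expensive human step, now tiny
        X_train = np.vstack([X_train, X_pool[ask]])
        y_train = np.concatenate([y_train, labels])
        pool_idx = np.setdiff1d(pool_idx, ask)
    return model
```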
On the real-world operation side, the integration of the human looks like this.
On the left, the machine, now trained with a learning model, makes decisions, and the human living in this world receives the service provided by the machine, whether that's medical diagnosis, whether that's an autonomous vehicle, whether that's a system that determines whether you get a loan or not, and so on.
With the human-centered version of that, the machine makes a decision, but it's able to provide a degree of uncertainty, which is one of the big requirements: being able to specify the degree of uncertainty of a decision such that when confidence falls below a certain threshold, human supervision is sought. And, again, when that decision is costly, whether financially or in terms of human life, human supervision is sought. And the service is received by the human, by the very same humans that are providing the supervision or another set of humans, but ultimately the decision is overseen by human beings.
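To make the deferral pattern concrete, here is a minimal sketch; the threshold value and the route_to_human function are illustrative assumptions, not part of any deployed system:

```python
# Deferral sketch: the model reports a confidence with every decision,
# and anything below a threshold, or with high stakes, is routed to a
# human supervisor instead of being acted on autonomously.
CONFIDENCE_THRESHOLD = 0.95  # illustrative value

def decide(model_output, cost_is_high, route_to_human):
    label, confidence = model_output  # e.g. ("approve_loan", 0.82)
    if confidence < CONFIDENCE_THRESHOLD or cost_is_high:
        return route_to_human(label, confidence)  # supervision sought
    return label  # machine acts autonomously
```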
This is what I believe is going to be the defining mode of operation for AI systems in the 21st century: as much as we'd like to, we won't be able to create perfect AI systems that escape the need to work together with human beings at every step.
There are five areas of research, grand challenges, that define human-centered AI. I'll focus on a few today, and on one very much so. And even with that degree of heavy pruning, we have 120 slides, so I'll skip around.
On human-centered AI during the learning phase, there are the methods, the research arm of machine teaching: how do we improve supervised learning so that, as opposed to needing 10,000, 100,000, a million examples, the algorithm queries only the essential elements and is able to learn effectively from very little information, from very few samples?
Just like we do when we're students,
when we learn these
fundamental aspects of math,
the language and so on, we
just need a few examples,
but those examples are
critical to understanding.
And the second part of that
is the reward engineering,
that during a learning process,
injecting the human
being into the definition
of the loss function of
what's good, what's bad.
Systems that have to
operate in the real world
have to understand what our
society deems as good and bad,
and we're not always
good at injecting that
in the very beginning, that has to be
a continuous process of
adjusting the rewards,
of reward re-engineering by humans
so that we can encode human values
into the learning process.
Now, for the second part, on human-centered AI during real-world operation, when the system is actually trained, there is the interactive element of robots and humans working together.
The part I'll focus on quite a bit today, because there's been quite a lot of development and progress on the deep learning side, is human sensing: algorithms that understand the human being, algorithms that take raw information, whether that's video, audio, or text, and begin to get a context, a measure of the state of the human being, in the short term and the long term over time, the temporal understanding and the instantaneous understanding.
Then there is the interaction aspect: once you understand the human, which is the perception problem, you have to interact with them, and interact in a way that is continuous, collaborative, and a rich, meaningful experience.
We're in the very early days of creating
anything like rich, meaningful
experiences with AI systems,
especially learning-based AI systems.
And then there's safety in the real-world operation: safety and ethics. The results of the engineered rewards that were put in place during the learning process now come to fruition, and we need to make sure that the trained model does not result in things that are catastrophic to our safety or highly detrimental to what we deem as good and bad in society, in terms of discrimination, ethical considerations, and all those kinds of things: the gray area, the line we all walk as a society, in the crowd intelligence. We have to provide bounds on AI systems, and there's an entire body of work there; I'll mention what we're doing in that area.
So, first, the machine teaching side and efficient supervised learning. I'd like to do one slide on each of these areas to give you an idea, and do two things for each, which we will elaborate on in future lectures, and some of which I'll elaborate on today. First, the near-term directions of research, the things that are within our reach now; and second, a sort of thought experiment, a grand challenge such that if we can do it, that would be damn impressive; that would be a definition of real progress in the area.
So the near-term direction of research for machine teaching, for improved supervised learning, integrating a human into the annotation process, is that instead of annotating brute-force, we annotate by asking the human questions. We have to transform the way we do annotation: the process is not defining the data set and then going through the entire data set; it's a machine teaching system that queries the user with questions to annotate.
And on the algorithm side there's active learning. These are all areas of work where we could be more clever about the way we use data, the way we select data on which to train. Active learning is actively selecting, during the training process, which part of the data to train on and annotate.
Data augmentation is taking things that have been supervised by a human and expanding them, modifying the data, warping the data in interesting ways such that it expands and multiplies the human effort that was injected into helping understand what's in the data.
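Here is a minimal sketch of what that multiplication of human effort can look like; the specific warps and noise level are illustrative assumptions:

```python
# Data augmentation sketch: each human-labeled image is expanded into
# several label-preserving variants, multiplying the annotation effort.
import numpy as np

def augment(image, label, rng=np.random.default_rng(0)):
    """image: HxWxC uint8 array. Returns a list of (image, label) pairs."""
    variants = [image, np.fliplr(image)]            # horizontal mirror
    shift = int(rng.integers(1, 4))
    variants.append(np.roll(image, shift, axis=1))  # small translation
    noisy = np.clip(image + rng.normal(0, 5, image.shape), 0, 255)
    variants.append(noisy.astype(image.dtype))      # sensor-style noise
    return [(v, label) for v in variants]
```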
One-shot learning, zero-shot learning, and transfer learning are all in that category, and self-play is in the reinforcement learning area, where the system constructs a model of the world, goes off alone in a room, and plays with that model to try to figure out its different constraints and how to achieve good outcomes there.
An example grand challenge here that would define serious progress in the field is to take ImageNet or COCO, the ImageNet Challenge or the COCO Object Detection Challenge, and, training only on a totally different kind of data, be able to achieve state-of-the-art results. So, training only on Wikipedia, with the text and images that are there, be able to perform object detection on the state-of-the-art benchmark of COCO. COCO is the data set of different objects with rich annotation of the localization of those objects. That, I believe, is exactly the kind of challenge where all the problems in transfer learning, efficient data annotation, and machine teaching have to be solved to achieve it.
Another challenge you can think of, simplifying even more, is to achieve 0.3% error on MNIST, the handwritten digit recognition task that everybody always provides as an example: achieve very good, state-of-the-art accuracy by training on only a single example of each digit, as opposed to training on thousands. That's something most of us humans can do: given one example of each character in a new language you haven't seen before, after studying them for a little bit, you're able to classify future characters at high accuracy.
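For a sense of the setting, here is a minimal sketch of one-example-per-class classification; classifying by raw pixel distance to a single stored exemplar performs far worse than 0.3% error, which is exactly why this is a grand challenge and would need a learned similarity instead:

```python
# One-shot baseline sketch: store one exemplar per digit and classify
# a new image by nearest Euclidean distance to those exemplars.
import numpy as np

def one_shot_classify(exemplars, x):
    """exemplars: dict mapping digit -> single flattened example image."""
    digits = list(exemplars)
    dists = [np.linalg.norm(x - exemplars[d]) for d in digits]
    return digits[int(np.argmin(dists))]
```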
The second part of the learning process where the human needs to be injected, in the near-term directions of research, is reward engineering and the continuous tuning of those rewards by a human being. OpenAI is doing quite a bit of work here; here's a game played by a human and an AI, really my favorite example of this. On the left, a human is controlling a boat that's finishing a race.
On the right is an RL agent, a reinforcement learning agent, controlling a boat and trying to maximize its reward, defined beforehand, initially, by a human being, and what it finds is that you can get much more reward by collecting the green turbos that appear than by finishing the race.
It realizes that finishing the race actually gets in the way of maximizing reward, and so that's the unintended consequence of a reward function that was specified previously, and most human supervisors seeing this result would be able to adjust, to re-engineer, the reward function to get the robot, the AI system here, to finish the race.
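Here is a minimal sketch of what that reward re-engineering could look like; the reward terms and weight values are illustrative assumptions, not the actual game's reward:

```python
# Reward re-engineering sketch: the human supervisor observes the
# turbo-collecting behavior and adjusts the weights so race progress
# dominates pickup bonuses.
def reward(progress_delta, turbos_collected, weights):
    return (weights["progress"] * progress_delta
            + weights["turbo"] * turbos_collected)

# Initial spec: turbos are worth too much relative to progress.
weights_v1 = {"progress": 1.0, "turbo": 10.0}
# After observing reward hacking, the human re-engineers the weights.
weights_v2 = {"progress": 10.0, "turbo": 0.5}
```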
And that kind of continuous monitoring of the performance of the system during the training process is a near-term direction of research that a few groups, DeepMind, OpenAI, and ourselves, are taking on.
An example grand challenge is allowing an AI system to operate in a context where there's a lot of fuzziness for us humans, a lot of uncertainty, a lot of gray area, a lot of challenging aspects in terms of what is right and what is wrong that we're continuing to improve on.
The example I provide here involves one of the least popular things in the world: the US Congress. So, replacing the US Congress. It's a body of representatives of the people of the United States, and they make bills based on the beliefs of the people. That sounds a lot like what Netflix does in recommending what movie you should watch next, representing what people love to watch; that's just a recommender system. So it makes perfect sense that an AI system should be able to take on this challenge. And I see that as a grand challenge: replacing some of the fundamental representation of large crowds of people who make ethical decisions with a human-centered AI system.
Okay, in real-world operation, the first thing we have to do before we have a robot and a human work together is that the robot has to be able to perceive the human. Question?
- [Audience] So currently there's a Congress. I do want to change the way Congress works, make it better, but do you want to just take the system that currently exists and automate it?
- So the idea is to take the system as it's currently supposed to work and automate that.
An AI system can provide a lot more transparency about the inputs. The idea of Congress is that the only inputs are supposed to be the people and the beliefs of the people. And there's rich information there.
So, for example, for me, not saying anything about politics, there are certain issues that I care a lot about and certain issues that I don't care much about; put that aside. And then there are certain issues that I know a lot about and certain issues I know very little about. And those don't actually intersect that well. I'm very opinionated about things I don't know anything about; it's very common, all of us are. So imagine being able to put that representation of me into a system that would take our entire nation together and be able to make bills that represent the people.
Now, the challenge there is that it can't be just a training set and then the system operates and AI is running the country. No, there has to be that human-centered element where we're constantly supervising, just like we're in theory supposed to be supervising our congressmen and congresswomen.
Human sensing is the first part: in order to have an AI system that works with a human being, the AI system has to perceive, to understand, the state of the human being, both at the very simplest level and at the more complex, temporal, contextual-over-time level.
So the near-term direction of research is purely the perception problem, where deep learning shines: taking data, whether that comes in as visual, audio, text, and so on, and being able to classify the physical, mental, and social state, the social context, of the person. And this is what I'll cover a little bit of today: everything from face detection, face recognition, emotion recognition, natural language processing, body pose estimation, those same recommender systems, speech recognition; all of those are conversions of raw data that captures something about the human being into actually meaningful, actionable information.
A grand challenge there
is emotion recognition.
There have been a lot of companies and claims that we've somehow cracked emotion recognition, that we are able to determine the mood of a person. But really, for those who were here last year with Lisa Feldman Barrett, if you're very honest and you study emotional intelligence, emotion, and the expression of emotion, it's a fascinating area, and we're not even close to being able to build perception systems that detect emotion. What we're doing instead is detecting very simple facial expressions that correspond to our storybook versions of emotions: smiling, crying, frowning in a caricatured way.
So a system that achieves real emotion recognition with high accuracy, you can think of it, as stated here, as an AI system that solves the binary classification problem, with 95% accuracy, of whether you want to be left alone or not, and is able to do that after collecting data for 30 days.
That I see as a really clean formulation of exactly the kind of human understanding we need to be able to build into our learning models, and we're very far away from that, especially the long temporal aspect, being able to integrate data over a long period of time.
Then the second part of human-robot interaction in real-world operation is the experience. This is where we're beginning to consider the interactive experience: how do we have a rich, fulfilling experience?
We have autonomous vehicles, for example, semi-autonomous vehicles, whether that's Tesla, Volvo, or Super Cruise with the Cadillac. There's a bunch of systems now with greater and greater degrees of automation in the car, and we get to have the human interact with that AI system and try to figure out how we create a rich, fulfilling experience. Currently, in the Volvo system that experience is more limited: there's a little icon, and it's more of a traditional driving situation. In the Tesla, you have a much bigger display of what's going on. And in the Cadillac Super Cruise system there's a camera looking at your eyes, determining if you're awake or not, paying attention or not, and there's an experience there that we're trying to create.
And in the Tesla case, the miles are racking up, and we have real data. Here at MIT we're studying this exact interaction; there are now over a billion miles driven in Teslas. And the same on the fully autonomous side, where Waymo has now reached 10-plus million miles driven autonomously.
And there are a lot of people experimenting with this, but that's the collaborative interaction of going back and forth: the AI system expressing its degree of uncertainty about the environment, expressing when it needs help, communicating what its limitations and capabilities are, trading off control, being able to seek human supervision. There's a dance there that takes into consideration everything from neurobiological research to psychology, to deep learning, and to the pure robotics and HRI, human-robot interaction, aspects.
One grand challenge: Tesla has driven one billion miles now under Autopilot, under the semi-autonomous mode. The grand challenge is when we start getting to the kind of mileage that we see in the United States every year, into the hundreds of billions of miles driven semi-autonomously.
We get to see teenagers, 16, 17, 18,
using these systems for the first time.
We get to see older folks,
folks who don't necessarily
drive or use any kind of AI in their lives
get to use these systems.
We start to explore that aspect,
that's the real challenge.
And, of course, there's the old Turing test, now reimagined by Alexa with the Alexa Prize challenge of building a socialbot. Natural language is such a beautiful thing to explore human-robot interaction with, on both the audio side and the text side, and passing the Turing test in the real way, where you want to have a conversation with a robot for prolonged periods of time, maybe more than even with some of your other friends, is the true grand challenge.
And on the other side of friendship is the risk, the catastrophic risk that's possible when you have an AI system that's learning from data. The near-term direction of research there is purely the human supervision of AI decisions in terms of safety and ethics.
There are a lot of systems, like with cars or medical diagnosis and so on, where there's some life-critical, safety-critical aspect whose safety we want to be able to supervise. And there are ethical decisions, in terms of who gets a loan or not, who gets a certain criminal penalty or not; to whatever degree AI systems are incorporated into that, you have to consider ethical questions.
And even for the crude, low-level perception systems like face recognition, you want to make sure that your face recognition systems are not discriminating based on color or gender or age and so on. You want to make sure that at that basic, fundamental level of ethics the systems are trained in a way that maintains our human values, the better angels of our nature, the better sides, some of the brighter aspects, of our values.
And the other thing is that maintaining values is the normal case; that's looking at the mean of the distribution. But we also want to control the outliers, to keep AI systems from doing anything catastrophic. So for the unintended consequences, when something happens that you didn't anticipate, you want to be able to put boundaries on that.
And the grand challenge there really all boils down to the ability of an AI system to say that it's uncertain about something. And that measure of uncertainty has to be good. It has to be able to make a prediction always accompanied by uncertainty, even on things it hasn't seen before; that's the real challenge: to be trained on cats and dogs and then, on seeing a giraffe, say, I'm not sure what that is. We're quite far away from that, 'cause right now it would probably confidently say it's a dog, depending on the giraffe.
But we want extremely high accuracy in the ability of AI systems to determine their own uncertainty, to know what they don't know, because from that comes the supervision, from that comes the ability to stop on things the system is uncertain about, to avoid catastrophic events.
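One common, and known to be imperfect, proxy for this is the entropy of the model's softmax output; here is a minimal sketch, with the abstention threshold as an illustrative assumption:

```python
# Abstention sketch: a high-entropy (near-uniform) softmax output is
# treated as "I don't know". A cat/dog classifier shown a giraffe
# should ideally land here; in practice today's networks are often
# confidently wrong, which is exactly the open problem.
import numpy as np

def predict_or_abstain(logits, max_entropy_fraction=0.5):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    max_entropy = np.log(len(probs))  # entropy of a uniform distribution
    if entropy > max_entropy_fraction * max_entropy:
        return "I'm not sure what that is"
    return int(np.argmax(probs))
```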
The first aspect of real-world operation
is understanding the human.
One of the places where deep
learning has really shined
is the perception problem.
It all begins with the ability to look at raw data and convert it into meaningful information; that's really where the understanding of the human comes in. Not the kind of understanding where, when you're in a relationship with somebody, when you're friends with somebody, over a long period of time you gain an understanding of their quirks, limitations, capabilities, and so on; that's really fascinating.
But the first step is just, when you see them, to be able to recognize who they are, what's on their mind, what their body language is, what they're saying with their mouth. All those basic raw perception tasks, that's where deep learning really shines. I'd like to cover the state-of-the-art in those various perception tasks.
So, first, face recognition.
Now, there's a full slide
presentation with this
and I'm skipping around.
The full slide presentation
has the following structure
for each of these topics.
The first part is the motivation, description, the excitement, the worry, and the future impact, and then there are five papers. Paper one defines the quote-unquote old-school seminal work that opened the field; paper two, the early progress in the field. Paper three is the recent breakthrough, often associated with deep learning. Paper four is the current state-of-the-art. And paper five is the thing that defines the future direction, the possible set of things that define the future direction. Then come the open problems in the field, where future research is very much needed. That's the structure of every topic I'll cover here, as quickly as possible.
Face recognition. So what is it? It's the first thing: the face contains so much rich information about the state of the human being, so understanding the human being really starts at the face. And detecting the face, detecting the body and that there's a head at the top of that body, that's the first step.
And then there is the task of face recognition, which is an exceptionally active area of research because it has a lot of applications, and through that research we're now able to study a lot of aspects of how we perform perception on the face. Recognition, purely stated, is recognizing the identity of a human face: who is this?
Detection is just detecting a face.
Now, recognition means there's a database of identities, what is it, seven billion of them on Earth, and you're trying to determine which of them it is, which of the seven billion, or whatever the database is. The face verification problem is something your phone uses when you unlock it with your face: it's asking, is it you or not, is it Lex or somebody else? It's a database of two, one person versus everybody else.
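Here is a minimal sketch of verification as that database-of-two decision, assuming an embedding model is already available; the distance threshold is an illustrative assumption:

```python
# Verification sketch: compare the embedding of the new face against
# the enrolled owner's embedding; accept if the distance is under a
# tuned threshold.
import numpy as np

def verify(enrolled_embedding, query_embedding, threshold=0.8):
    distance = np.linalg.norm(enrolled_embedding - query_embedding)
    return distance < threshold  # True: it's Lex; False: somebody else
```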
There's a lot of
applications here obviously,
from identification to
all the security aspects
of using the face as a sort of fingerprint
of your identity in all
the interactive elements
of AI systems, software-based
systems in this world.
Okay, so why is it hard? All the usual computer vision problems come in: lighting variation, pose variation. Computer vision is just very hard: you get these raw numbers and you have to infer so many things that we humans take for granted. So there's the basic computer vision stuff.
But there is stuff on top of that. So, faces: it's like cats versus dogs, where there are thousands of breeds of dogs and thousands of breeds of cats; in that same way, faces can look very similar to each other. So the two classes you're trying to separate can be very, very close together and intermingled.
Now, there is a lot of face data available, because of the applications, because of the financial benefits of such data sets, but for any one individual, unless you're Brad Pitt or Angelina Jolie or a celebrity, there are not many samples available, so for the individuals on which the classification is to be made, there's often not very much data.
Then there is a lot of variation. In the face recognition task you have to be invariant to all the hairstyles that you change over time, the weight gain, the weight loss, the beard you decided to grow, the glasses you wear sometimes and not others, the different styles of glasses and so on, makeup or no makeup. Through all of these things, it's still you, still the same identity, and you have to be able to classify that. And the required accuracy, especially for security applications, is extremely high.
The reason it's an exciting area is there's a lot of possibility, and there's also a lot of concern, right? So, the future impact, utopia, dystopia, and the more reasonable middle path: the face provides a very user-friendly way of letting your devices recognize you and say hello. Your voice is certainly another, but one of the most powerful ways to really classify a person at a distance is the face.
So what does that mean?
The utopian view, the
possibility of the future,
the best possible,
brightest possible future,
is you can use your face as a passport,
you replace the license, replace
all the security measures
we put from the passwords in our devices
to the credit card and so on, all of that,
Apple Pay, it will be Face Pay.
You show up, it will automatically connect
to all your devices, all your
banking information and so on.
Obviously the flip side of that, just rephrasing that sentence, could also be dystopian: complete violation of privacy, being watched at any time, being identifiable through your Facebook and social media and all your devices, making it impossible for you to hide from society, undermining the fundamental aspects of privacy, of maintaining privacy, that many of us value greatly.
The middle path is
really just a useful way
to unlock your phone.
The recent breakthroughs here,
it started with DeepFace.
An essential idea there is
applying deep neural networks
to the task of face recognition.
With a lot of these breakthroughs on the perception side, we're not covering the old-school papers and the historical context; the biggest breakthroughs came with deep learning, 2006, '07, '08, the last 10 years.
The same is true with face recognition.
DeepFace was the big first application
that achieved near-human performance
on one of the big benchmarks of the time
on the Labeled Faces in the Wild.
So they're using a very large data set,
being able to form a good representation.
The state-of-the-art, or at least close to the state-of-the-art, is FaceNet. The idea there is using those same deep architectures to now optimize for the representation itself directly.
The notebook we're putting out, which we shared with some of you for the assignment, describes face recognition and the challenge there: that it's not like the traditional classification problem. You have to form an embedding of the face into a small, compressed vector, such that in that embedding, faces that are similar to each other, identities that are close together, are close in the Euclidean sense, and people that are very different are far away.
And so you use that embedding
to then do the classification.
That's really the only
way to deal with data sets
for which you have so little information
on any one individual person.
And so FaceNet optimizes that embedding directly, in terms of the Euclidean distances between matching and non-matching identities.
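The triplet loss at the heart of that idea can be written in a few lines; a minimal sketch, with the margin as a hyperparameter you would tune:

```python
# FaceNet-style triplet loss sketch: pull an anchor toward a positive
# (same identity) and push it away from a negative (different identity)
# by at least a margin, in Euclidean space.
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """anchor/positive/negative: L2-normalized embedding vectors."""
    d_pos = np.sum((anchor - positive) ** 2)  # same identity: small
    d_neg = np.sum((anchor - negative) ** 2)  # different identity: large
    return max(0.0, d_pos - d_neg + margin)
```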
So there's still a lot of excitement about face recognition; there are a lot of benchmark competitions and a lot of people working on this, and bigger, badder networks and more data is really one of the ways to crack this problem.
So a public large data set
with 672,000 identities
and 4.7 million photos, that's in 2017,
and that just keeps scaling
up and up and up and up.
Now, we have to also be honest here
on the possible future directions of work
in that even though the
benchmarks are growing,
that's still a tiny subset
of the people in the world.
We're still not quite there
to be able to have the
general face recognition
applicable to the
entirety of the population
or a large swath of the
population of the world.
In this topic, the coverage is brief: we're not covering all the aspects of the face, especially the temporal ones, that are useful in face recognition or that say a lot about the face. The different kinds of facial expressions, raised eyebrows and all those kinds of things, can be used to infer emotion and so on, and can provide rich information for recognizing and interpreting the face. And we're not covering the other modalities, including 3D face recognition. There are a lot of exciting areas there.
We're just looking at the pure formulation
of the face recognition problem
of looking at a 2D single image.
The open problems here: first, something not often stated and misinterpreted by people is that most of these face recognition methods start by assuming you have a bounding box around the face, and they assume a frontal or near-frontal view of the face. But you can do recognition across all kinds of poses, and it's very interesting to think that recognition, the way we recognize our friends and colleagues, parents and children, often uses a lot of cue and context information beyond just the pure frontal view of the face. You can do pretty well on profile views, on cues taken from body language, and so on. How we incorporate all of that into face recognition is open in the field.
Then the black-box side is problematic, both for bias and for being able to understand why incorrect decisions are made; the open problem is making those face recognition systems more interpretable.
And then, finally, privacy: the ability to collect the kind of data on which face recognition would perform extremely well while not violating the fundamental aspects of privacy that we value.
Activity recognition takes the next step forward, into the richer temporal context of what people do. Again, the same structure, from recent breakthroughs to the future direction of work. What is it? It's classifying human activity from images or from video. And why is it important? It provides, depending on the level of abstraction of the activity, context for understanding the human. What are they doing, are they playing baseball, are they singing, are they sleeping, are they putting on makeup, knitting, mixing batter, and so on?
Why is it hard? Again, all the usual problems in image recognition apply. The kind of data we're dealing with is just much larger; the richness of possibilities that define what an activity is, in video, is much larger, so the complexity is much larger. It's often difficult to quantify motion, because the fundamental aspect of activity is the change in the world, the motion of things, and it's difficult to determine, from the dynamics or the physics of the world, especially from a 2D view, what's background information, what's noise, and what's essential to understanding the activity.
And there are the subjective, ambiguous elements of activity. When does a particular activity begin, and when does it end? What are all the gray areas, when you're partially engaging in that activity and so on? When you start to annotate these things, when you start to try to do the detection, it becomes clear that sometimes the activity is partially undertaken and the beginning and the end are fuzzy.
Future impact, utopia,
dystopia, and middle path.
So the impact here comes from being able
to understand the world in time
and be able to predict.
The utopian possibility is that the contextual perception that can occur here can enrich the experience between the human and the robot.
The dystopian view, the flip side, is that being able to understand human activities can let robots sever the relationship; it can damage the human-robot interaction to where they just do their own thing. The middle path is simply finding useful information in massive amounts of data like YouTube. Now that there's a YouTube video data set, being able to identify what's going on in a video means being able to infer rich, useful semantic information.
And so what do we do with video,
how do we do perception in video?
Now, the recent breakthrough came with deep learning and C3D, 3D convolutional neural networks that take a sequence of images and are able to determine the action and intent, what's going on in the video. That was the recent breakthrough.
The state-of-the-art comes from a different architecture that takes in two streams: one is the RGB image data, the other is optical flow data that really focuses on the motion in the image, and that's what opened the wave of two-stream networks. Here, from that paper, are the different architectures: on the far right is the two-stream architecture, and C3D is shown under B, taking a sequence of images. The first one uses LSTMs. These are all just different architectures for how you allow a network, a learning model, to capture dynamics in the data.
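Here is a minimal PyTorch-style sketch of the two-stream idea; the layer sizes and the ten-frame flow stack are illustrative assumptions, not the architecture from the paper:

```python
# Two-stream sketch: one small CNN over an RGB frame, one over stacked
# optical-flow frames (x and y per frame), with late fusion by
# averaging the class scores.
import torch
import torch.nn as nn

def make_stream(in_channels, n_classes):
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, n_classes),
    )

class TwoStreamNet(nn.Module):
    def __init__(self, n_classes, flow_stack=10):
        super().__init__()
        self.rgb = make_stream(3, n_classes)                 # appearance
        self.flow = make_stream(2 * flow_stack, n_classes)   # motion

    def forward(self, rgb_frame, flow_frames):
        # Late fusion: average the two streams' class scores.
        return (self.rgb(rgb_frame) + self.flow(flow_frames)) / 2
```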
The future possibilities have to do, well, literally with the future: being able to take single images or sequences of images and predict the future. It's very interesting to think about; in our ability to hallucinate the future and generate the future from images, you start to think about what the defining qualities of activities are, and in this way you can augment data and train much more accurate action recognition systems.
Topics not covered include the localization of activity in video. Action recognition, purely defined, is: I give you a clip and you tell me what's going on in this clip. Now, if you take a full YouTube video, you want to be able to localize, to find all the times when a particular activity is going on. It could be multi-label, multiple activities going on at the same time, beginning and ending asynchronously.
And then there is richer three-dimensional, or 2D, classification of activity based on human movement: looking, from a Kinect or other 3D sensors, at skeleton-based action recognition using sensors that provide more than just 2D image data.
The open problem is that activity recognition is more than just the way we move our body, or, if it's baseball, a ball in your hand and hitting it with a baseball bat. It also has to do with context: sitting down, working, looking at something, picking up an item. Those can sometimes change profoundly based on the other objects in the scene and on the activity of other people in the scene. Being able to work with that kind of context is a totally open problem: reducing a very complex real-world context into something where you can clearly identify an activity.
Body pose estimation is the task
of localizing the joints
that form the skeleton
of the human body.
So infer from visual information
the positions of the different joints.
Along the line of complexity, it's important for being able to understand body language, the rich information in the body of the human being. That ranges from reading body language to animation to aiding activity recognition, and it's just a useful representation of the human body if you're analyzing pedestrians or interactive environments. For human-robot interaction, for understanding what the heck the human is trying to do, body pose is really useful.
It's hard because, when you look at a 2D image projection of the body, it's a high-dimensional optimization problem to figure out how the raw pixels map to the actual three-dimensional orientation of the human joints, plus the usual computer vision challenges of pose, lighting, and so on.
The future impact: it's really exciting for interactive environments for a robot to be able to know the position of the human body, whether it's just trying to interact, or whether it's a robot trying to get its favorite human a beer, or whatever your favorite choice of drink is; you have to be able to find where their hand is so you can do the hand-off. Same thing in the car: you have to determine if the person's hands are on the steering wheel, if their head orientation is such that they're able to physically take control of the vehicle.
That's a really exciting
set of possibilities there.
And there's applications in sports and CGI
in video games and all aspects
when the robot and human
have to work together.
The dystopian view, you can imagine,
is, of course, being able
to localize all those joints
means robots that are able to
more effectively hurt humans,
and so that's always a huge concern
and always a dark
dystopian view of the world
with so much AI in it.
Of course, the reality is it's just more
rich, fulfilling HCI that takes advantage
of not just the face,
stuff coming from the face,
but also stuff about the body of the human
that the robot is interacting with.
So it started with deep
learning being applied
to the body pose estimation problem,
2014 with DeepPose.
The key idea there is looking at the holistic human pose estimation problem of detecting all the different joints of a single person in an image. The power of deep learning is that you no longer have to hand-craft expert-engineered features; it automatically determines the set of features, and all the parts are detected for you. So this highly complex problem is all solved with data.
The state-of-the-art, from 2017 and beyond, with a few papers from CMU along this line, is doing real-time multi-person 2D pose estimation in a bottom-up way, where you're detecting individual joints first: all the knees in the picture, all the elbows, all the shoulders, all the wrists, and so on, and then stitching them together using part affinity fields. So if you find 17 elbows in a picture, you then have to figure out which elbow belongs to which person. That turns out to be an extremely powerful way of detecting body pose, especially multi-person pose, and especially for dealing with occlusions.
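Here is a minimal sketch of that stitching step for one limb type; the affinity function is a placeholder for scores that, in the real system, would be integrated from the part affinity fields:

```python
# Bottom-up stitching sketch: given detected elbows and wrists plus a
# pairwise affinity score, assign each elbow to the wrist it most
# plausibly connects to via optimal bipartite matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

def stitch(elbows, wrists, affinity):
    """elbows, wrists: lists of (x, y); affinity(a, b) -> score."""
    scores = np.array([[affinity(e, w) for w in wrists] for e in elbows])
    rows, cols = linear_sum_assignment(-scores)  # maximize total affinity
    return list(zip(rows, cols))  # (elbow index, wrist index) pairs
```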
It's really interesting, and because of that separation of the detections it's also able to run in real time, which is really exciting.
Possible future direction is
using much more information,
using deformable models of the human body,
so not just the skeleton,
rich volumetric information
to do the detection and then optimizing
for what's the most likely
orientation of the body.
The open problem in the field is the fact that pose is not a thing that happens in a single image; pose is part of human behavior and part of movement through time. So here's Monty Python's Ministry of Silly Walks, people walking in funny ways: we collect a lot of data on pedestrians and can tell you that people walk in different ways and position their bodies in different ways. And the temporal aspects of human motion are for the most part not incorporated into the body pose estimation problem, and they should be.
There's a lot of exciting possibilities
of capturing the temporal dynamics.
There's a lot of awesome slides here
that I'm just skipping through.
Speech recognition, 2018 was really big for it. Recommender systems, for Netflix, OkCupid. AI for president.
Each one of the things I
mentioned briefly today
will have a separate mini lecture.
I taught an entire course
on this at CHI last year.
So deep learning for understanding the human is a topic I'm really excited about, because understanding the human is really the first step for a machine to be able to interact with a human being in a rich way. And it's also the area where the most near-term impact can happen: a system being able to effectively detect what a human being is up to, what they're thinking about, how to best serve them, and how to enrich the experience of interacting with that human.
Let me jump to AI safety
and then the interactive
experience with humans and robots
to just give examples of
some work in that direction
and some research in that direction
I'm really excited about.
So, AI safety. At the very basic level, there is an AI system that's making decisions, and we want human beings to supervise those decisions.
We've done quite a bit of work
here at MIT on that aspect
of supervising machines
with arguing machines,
and OpenAI has done work with safety
by having machines debate each other.
So this kind of idea that
you can achieve safety
by not giving ultimate power
to any one decision maker.
And the disagreement that emerges
from two AI systems or
multiple AI systems,
having to make decisions
and agree with each other,
it allows us to then produce
a signal of uncertainty
based on which the human
supervision could be sought.
Without that, when we
have a state-of-the-art
black box AI system that does
something like drive a car,
all we have is a system that just runs
and we're supposed to have faith
that it's always going to be right.
We don't have any uncertainty signal
coming from the system.
So the idea of arguing machines that we've developed and been working on is to have multiple AI systems, an ensemble of AI systems, where, when a disagreement between them is detected, human supervision is sought.
And the idea there is that when you have a system like Tesla Autopilot, and here we have instrumented a Tesla vehicle, it's telling you nothing about how uncertain it is about the decisions it's making; once the system is on, it's just steering the car for you, and in very rare cases it just disengages. But no matter what, it's not showing you the degree of uncertainty it has about the world around it.
And so the way we create that signal of uncertainty is by adding another system, in this case an end-to-end vision system that's looking at the external environment and making steering decisions, and whenever a disagreement between the two is detected, that's when human supervision is sought.
And in this way, as shown in the plot there, we can predict with high accuracy the times when the driver chose to disengage the system because they were uncomfortable. So you're using this mechanism to detect risky, challenging situations. It's an idea about how we supervise AI: by having multiple independent AI systems, from whose disagreement the uncertainty signal emerges.
And just as with the debate work in natural language, we can apply this in computer vision as well, taking two networks trained independently on the same training set, ResNet and VGGNet trained on ImageNet, and having them argue, and in the process significantly improving the accuracy. ResNet is an architecture and VGGNet is an architecture, both trained on the ImageNet training data set. Separately they each have a certain error: ResNet has an error of 8%, VGG-16 has an error of 10%. When we apply the arguing machines framework, where disagreements are brought to the human, that error rate decreases to 2.8%.
Now, this is just the ImageNet challenge, but if that error meant the loss of human life, this kind of framework is really powerful for overseeing the operation of the AI system.
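Here is a minimal sketch of the arguing machines pattern; model_a and model_b stand in for, e.g., ResNet and VGGNet, and ask_human for the supervision interface, all assumptions for illustration:

```python
# Arguing machines sketch: run two independently trained classifiers;
# agreement is accepted, disagreement generates the uncertainty signal
# and is deferred to a human.
import numpy as np

def arguing_machines(model_a, model_b, x, ask_human):
    pred_a = int(np.argmax(model_a(x)))
    pred_b = int(np.argmax(model_b(x)))
    if pred_a == pred_b:
        return pred_a       # the ensemble agrees; accept the prediction
    return ask_human(x)     # disagreement: human supervision is sought
```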
Yeah, these are just examples where they disagree. Take this image from ImageNet: the ground truth is a wine bottle, the ResNet prediction is, with 93% confidence, that it's a paper towel, and VGGNet says, with 25% confidence, that it's a seat belt. These disagreements are then brought in; the fact that they disagree raises the uncertainty, human supervision is brought in, and humans are able to annotate correctly what's going on in the picture.
Same thing here, mailbox,
the ground truth is a mailbox.
Again, the two architectures disagree.
One says traffic light, the
other one says garbage truck.
For an autonomous vehicle you can imagine
this being problematic,
if that's a traffic light
you might stop for this
mailbox, that kind of thing.
That's early research in the field of how, as AI systems become more and more powerful, we can inject human effort to supervise when it's needed. The "when it's needed" part, the uncertainty signal, is the critical thing, so we have to figure out ways to create that uncertainty signal.
Then there's the subarea of just creating a rich human interaction.
- [Black Betty] Lex,
you appear distracted,
would you like me to take over?
- So this is, we're doing a lot of testing
with autonomous vehicles
here, I'm tweeting.
- [Black Betty] Great, I am taking control
of steering and braking.
- So we have a human-centered
autonomous vehicle here
at MIT that's taking control back
and forth from the human
based on the activity.
It's now in control.
That's just me explaining the video.
The point is that the driving experience,
the human-robot interaction experience
should be fun and awesome
and enriching to life,
and that's why you would want
to use these kinds of systems.
We have a bunch of videos online, you can check 'em out, including a ridiculous one of me playing guitar, and there's a paper along with this describing different principles of how we have humans and robots work together in this kind of way.
There are a lot of totally untouched problems in that space. Most of the robotics community and the machine learning community approach AI as a system that we want to make perfect, and once it's perfect, we then put it in the real world where we humans get to interact with it. But just like, what is it, Robin Williams in Good Will Hunting talking about relationships: nobody's perfect. The way I foresee it, AI systems will not be perfect for the next hundred years, and so we have to have humans and AI systems work together, and optimize that problem, solve that problem, that both of us are flawed, but together there's something enriching for both.
As I mentioned, the videos here will be available online, including the lectures underlying all the deep learning for understanding the human and underlying the five principles of human-centered AI. It's an area of active research here at MIT and globally, and it's one that I'm extremely passionate about.
And one of the analogies that I think about when I think about the success of artificial intelligence systems is the analogy of parasitism and symbiosis.
A lot of the way we're training machine learning algorithms now is by injecting a lot of human labor, a lot of really costly human labor, separately, offline, out of the loop, in order to improve the learning models through brute-force annotation.
And what I see is that success in the future requires that the learning be done, that the models improve, in a symbiotic way, as a side effect of interacting with humans.
This is done a lot now in reinforcement learning through game playing and so on, but the human computation, the human effort of annotation, should be something that happens naturally through interaction, not a costly thing you have to pay for.
Because when it happens
naturally in a symbiotic way
we can increase scale,
we can scale learning
to a degree that's required to solve
some of the real-world problems.
And that also requires solving
a lot of aspects of
human-robot interaction,
from understanding our own brain,
from the biological to the
electrical and neuroscience,
to the behavioral aspects captured
by cognitive science,
psychology, sociology,
to the mathematical
formulations of behavior
and game theory, to when you take
that human behavior and
put it in the real world
with engineering systems,
human factors and design.
These are all giant subfields with conferences and papers, and all of them need to work together.
And then on the computer science side, there's natural language processing, understanding language; human-robot interaction; human-computer interaction; and just the interfaces: how and what does the computer, the robot, show to you? Again, entire conferences.
And then the exciting
aspects of learning from data
and deep learning and
learning to act from data
and reinforcement learning,
deep reinforcement learning.
And then the robotics is
actually building these things,
building the hardware,
again, an entire area,
exciting field of research.
All of them have to work together
to create systems here that integrate
the human during the learning process
and integrate the human
during the operation process.
So the videos are on deeplearning.mit.edu; videos, slides, and code are all available there.
With that, I'd like to
thank you very much.
(audience applauds)
