MALE SPEAKER:
Welcome, everybody,
to one more Authors
at Google Talk.
Today, our guest speaker
is Pedro Domingos,
whose new book is called
"The Master Algorithm."
We have it here and you
can buy copies outside.
So one definition
of machine learning
is "the automation
of discovery."
Our guest, Pedro Domingos,
is at the very forefront
of the search for the master
algorithm, a universal learner
capable of deriving all
knowledge, past, present
and future, from data.
Pedro Domingos is a
professor of Computer Science
and Engineering at the
University of Washington.
He's the co-founder of the
International Machine Learning
Society.
Pedro received his MS in
Electrical Engineering
and Computer Science
from IST in Lisbon,
his Master of Science
and PhD in Information
and Computer Science from
the University of California
at Irvine.
He spent two years as an
assistant professor at IST
before joining the faculty of
the University of Washington
in 1999.
Pedro is the author or
co-author of over 200
technical publications in
machine learning, data mining,
and other areas.
He is the winner of the SIGKDD
Innovation Award, the highest
honor in data science.
He's an AAAI Fellow and has
received a Sloan Fellowship, an NSF CAREER Award, a Fulbright Scholarship, an IBM
Faculty Award, several
best paper awards,
and other distinctions.
He's a member of the editorial
board of "The Machine Learning
Journal."
Please join me in welcoming
Pedro, today, to Google.
[APPLAUSE]
PEDRO DOMINGOS: Thank you.
Let me start with a
very simple question--
where does knowledge come from?
Until very recently, it came
from just three sources, number
one, evolution-- that's
the knowledge that's
encoded in your
DNA-- number two,
experience-- that's
the knowledge that's
encoded in your neurons--
and number three, culture,
which is the
knowledge you acquire
by talking with other people,
reading books, and so on.
And everything
that we do, right,
everything that we are basically
comes from these three sources
of knowledge.
Now what's quite extraordinary
is that, just recently,
there's a fourth source of
knowledge on the planet.
And that's computers.
There's more and more knowledge
now that comes from computers,
is discovered by computers.
And this is as big a change as the emergence of each of the other three was.
Like evolution, right,
well, that's life on earth.
It's the product of evolution.
Experience is what distinguishes
us mammals from insects.
And culture is what
makes humans what we are
and as successful as we are.
Notice, also, that each of these
forms of knowledge discovery
is orders of magnitude
faster than the previous one
and discovers orders of
magnitude more knowledge.
And indeed, the same thing
is true of computers.
Computers can discover
knowledge orders of magnitude
faster than any of these
things that went before
and that co-exist with them
and orders of magnitude more
knowledge in the
same amount of time.
In fact, Yann LeCun
says that "most
of the knowledge in
the world in the future
is going to be
extracted by machines
and will reside in machines."
So this is a major change that,
I think, is not just for us
computer scientists
to know about and deal
with, it's actually
something that everybody
needs to understand.
So how do computers
discover new knowledge?
This is, of course, the
province of machine learning.
And in a way, what I'm going
to try to do in this talk
is try to give you a sense
of what machine learning is
and what it does.
If you're already familiar
with machine learning,
this will hopefully give you
a different perspective on it.
If you're not familiar with
machine learning already,
this should be quite
fascinating and interesting.
So there are five main
paradigms in machine learning.
And I will talk about
each one of them in turn
and then try to step back and
see, what is the big picture
and what is this idea
of the master algorithm.
The first way computers
discover knowledge
is by filling gaps in
existing knowledge.
Pretty much the same way
that scientists work, right?
You make observations,
you hypothesize
theories to explain
them, and then
you see where they fall short.
And then you adapt
them, or throw them away
and try new ones, and so on.
So this is one.
Another one is to
emulate the brain.
Right?
The greatest learning
machine on earth
is the one inside your skull,
so let's reverse engineer it.
Third one is to
simulate evolution.
Evolution, by some standards,
is actually an even greater
learning algorithm
than your brain
is, because, first of
all, it made your brain.
It also made your body.
And it also made every
other life form on Earth.
So maybe that's something
worth figuring out how it works
and doing it with computers.
Here's another one.
And this is to realize that all
the knowledge that you learn
is necessarily uncertain.
Right?
When something is
induced from data,
you're never quite
sure about it.
So the way to learn is to
quantify that uncertainty using
probability.
And then as you
see more evidence,
the probability of different
hypotheses evolves.
Right?
And there's an optimal way to
do this using Bayes' theorem.
And that's what
this approach is.
Finally, the last
approach, in some ways,
is actually the simplest and
maybe even the most intuitive.
It's actually to just
reason by analogy.
There's a lot of
evidence in psychology
that humans do
this all the time.
You're faced with
a new situation,
you try to find a matching
situation in your experience,
and then you
transfer the solution
from the situation
that you already
know to the new situation
that you're faced with.
And connected with each of
these approaches to learning,
there is a school of
thought in machine learning.
So the five main ones are the
Symbolists, Connectionists,
Evolutionaries, Bayesians,
and Analogizers.
The Symbolists
are the people who
believe in discovering
new knowledge
by filling in the
gaps in the knowledge
that you already have.
One of the things that's
fascinating about machine
learning is that the
ideas in the algorithms
come from all of these
different fields.
So for example, the Symbolists,
they have their origins
in logic, philosophy.
And they're, in some sense,
the most "computer-sciency"
of the five tribes.
The Connectionists,
their origins
are, of course, in
neuroscience, because they're
trying to take inspiration
from how the brain works.
The Evolutionaries,
well, their origins
are, of course, in
evolutionary biology,
in the algorithm of evolution.
The Bayesians come
from statistics.
The Analogizers
actually have influences
from a lot of different fields,
but probably the single most
important one is psychology.
So in addition to being very
important for our lives,
machine learning is also
a fascinating thing,
I think, to study, because in
the process of studying machine
learning, you can actually study
all of these different things.
Now each of these "tribes" of
machine learning, if you will,
has its own master algorithm,
meaning its own general purpose
learner that, in principle,
can be used to learn anything.
In fact, each of these
master algorithms
has a mathematical
proof that says,
if you give it enough data,
it can learn anything.
OK?
For the Symbolists, the master
algorithm is inverse deduction.
And we'll see, in a
second, what that is.
For the Connectionists,
it's backpropagation.
For the Evolutionaries,
it's genetic programming.
For the Bayesians, it's
probabilistic inference
using Bayes' theorem.
And for the Analogizers, it's
kernel machines, also known
as support vector machines.
OK?
So let's see what just the key
ideas in each one of these are.
So the Symbolists--
here are some
of the most prominent
Symbolists in the world.
There's Tom Mitchell
at Carnegie Mellon,
Steve Muggleton in the UK,
and Ross Quinlan in Australia.
And their idea is actually
a very interesting one.
It's to think of
deduction-- sorry,
it's to think of learning as
being the inverse of deduction.
Learning is induction
of knowledge.
Right?
Deduction is going from general
rules to specific facts.
Induction is the opposite.
It's going from specific
facts to general rules.
So in some sense, one is
the inverse of the other.
And so maybe we
can figure out how
to do induction in the same
way that people in mathematics
figure out how to do
other inverse operations.
Like, for example, subtraction
is the inverse of addition,
or integration is the inverse of
differentiation, and so forth.
So as a very, very
simple example,
addition gives you the
answer to the question,
if I add 2 and 2, what do I get.
And the answer, of course, is 4.
And this is the deepest thing
I'll say in this whole talk.
And subtraction, of course,
gives you the answer
to the inverse question,
which is, what do I
need to add to 2 in order to
get 4, the answer, of course,
being 2.
Now inverse deduction works
in a very similar way.
So here's a simple
example of deduction.
You know that Socrates
is human and you
know that humans are mortal.
And the question is, what
can you infer from that.
Well, of course, the
answer is that, from that,
you can infer that
Socrates, too, is mortal.
Now the inverse of
this-- and that's
when it becomes induction--
is, if I know that Socrates
is human, what else do I
need to know in order to be
able to infer that he's mortal.
And of course, what I need to
know is that humans are mortal.
And so in this way, I have just
introduced a new general rule
that humans are mortal.
Of course, in general,
I wouldn't just
do it from Socrates,
I would do it
from Socrates and a
bunch of other people.
But that's the general
way that this works.
OK?
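To make the mechanics concrete, here is a toy sketch in Python (purely illustrative; the facts, predicates, and helper function are made up, not from the talk) of that inverse-deduction step: given a known fact about Socrates and the conclusion we want, propose the general rule that would make the deduction go through.

```python
# Toy illustration of inverse deduction (hypothetical representation).
# Known fact: Socrates is human.  Desired conclusion: Socrates is mortal.
facts = {("human", "Socrates")}
goal = ("mortal", "Socrates")

def induce_rule(facts, goal):
    # Find a known property of the goal's subject and generalize it into a rule:
    # human(Socrates) + mortal(Socrates)  =>  "for all X: human(X) -> mortal(X)".
    for predicate, subject in facts:
        if subject == goal[1]:
            return f"for all X: {predicate}(X) -> {goal[0]}(X)"

print(induce_rule(facts, goal))  # for all X: human(X) -> mortal(X)
```

In a real system, of course, the rule would be induced from Socrates and many other examples and then checked against the data, as described above.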
And then once I've
induced rules like this,
I can now combine them in
all sorts of different ways
to answer questions that I may
never even have thought of.
And this kind of
flexibility and composition
is actually something that,
of all the five tribes,
only the Symbolists have.
Now of course, these
examples are in English
and computers don't understand
natural language yet.
So what they use is something
like first order logic.
So these things, both
the facts and the rules
that are discovered,
are represented
in first order logic.
And then questions are answered
by chaining those rules,
by reasoning with them.
OK?
But whether it's in logic
or natural language,
the principle is the same.
OK?
And as I said, of all
the five paradigms,
this is the one that is most
like scientists at work.
Right?
They figure out, where are
the gaps in my knowledge.
Let me enunciate a
general principle
that will fill that gap.
And then let me
see what follows.
Let me see if it's
correct given the data.
Let me see what gaps remain, and so on.
And in fact, one of the
most amazing applications
of inverse deduction to date
is actually a robot scientist.
So if you look at this
picture, the biologist
is not the guy in the lab coat.
The guy in the lab
coat is actually
a computer scientist and
machine learning researcher
by the name of Ross King.
The biologist in this
picture is this machine here.
This machine is a complete,
automated biologist.
It starts out with
basic knowledge
of molecular biology,
DNA, proteins, RNA,
and all of that stuff.
And then what it
actually does is
it formulates hypotheses
using inverse deduction.
It designs experiments to test
these hypotheses using things
like DNA sequences and whatnot.
That's what's on there.
It physically carries them out.
So it does the whole
process with no human help.
And then, given the results,
it refines the hypotheses,
or comes up with
new ones, and so on.
OK?
Now there's only one of these
robots in the world today.
Its name is Eve.
There was actually a
previous one called Adam.
And Eve, last year,
discovered a new malaria drug.
And the thing that's
amazing about this
is that, well, once you've made
one robot scientist like this,
there's nothing stopping
you from making a million.
And I have a million
scientists working on a problem
where, before, maybe all
we could afford was a few.
So this can really speed
up the progress of science.
And I would say that in
areas like molecular biology,
there is no hope of really
understanding, for example,
very well, how cells work
without this kind of thing.
Right?
There's just too much
information for human beings
to discover on their own.
OK?
Now this is one very,
I think, attractive way
to do machine learning, but of
course, it's not the only one.
In particular,
the Connectionists
are very skeptical of it.
They say, well, this
is too abstract.
It's too clean.
It's too rigid.
Logic is not how the
real world works.
You have to do things
more like human beings do.
And of course, the
way human beings
do things is with their
brains, so let's figure out
how brains work, at least enough
that we can take inspiration
from them and build
algorithms based on that.
So this is what the
Connectionists do.
The most prominent
Connectionist is Geoff Hinton.
He actually started out as
a psychologist 40 years ago
and now he's really more
of a computer scientist.
And he's been
incredibly determined
in his goal of understanding
how the brain learns.
In fact, he tells the
story that one day he
came home from
work very excited,
saying, yay, I've done it.
I've figured out
how the brain works.
And his daughter replied,
oh dad, not again.
And he's the first one
to say that he's had
some successes and failures.
But I think the bottom line is
that, these days-- for example,
he's one of the
co-inventors, of course,
of backpropagation--
the successes far
outweigh the failures.
So definitely, the
long-term research agenda
has already borne a lot of fruit. And it's going to bear more.
And two other prominent
Connectionists
are Yann LeCun and [INAUDIBLE].
Of course, Connectionism
is also known
as neural networks and,
these days, as deep learning.
But it's really all
the same paradigm.
OK?
So how does it all work?
Well, what we're
going to do is we're
going to build a model, a
mathematical model of how
a single neuron works.
We're going to make it as
simple as we can provided
it's enough to learn and to
do the inferences that we
need it to do.
OK?
And then we're going to
put these models of neurons
together into big networks.
And then we're going to
train those networks.
OK?
And at the end of
the day, what we have
is, in some sense,
a miniature brain,
if you will-- of course, much
simpler than the real one,
but hopefully with some
of the same properties.
OK?
So now a neuron is a very
interesting kind of cell.
All right?
It's a cell that actually
looks like a tree.
There's the cell body, then the trunk of the tree, which is called the axon, and then the branches, which are called dendrites.
But where neurons get
very different from trees
is that the branches
of one neuron
actually connect back with the
roots of another-- or in fact,
with the roots of many others.
And that's how you get a
big network of neurons.
And where the axon of one neuron joins the dendrites of another, that's called a synapse.
And to the best
of our knowledge,
the way humans learn-- everything you know is encoded in the strengths of the synapses between your neurons.
OK?
So if the synapse
is strong, then--
let me backtrack
for just a second.
The way this works
is that the neurons
communicate via electric
discharges down their axons.
It's literally an
electric discharge
called an action potential.
And what happens is that if you
look at all the charge coming
into a neuron through
its various synapses,
if it exceeds a certain
threshold, then that neuron
itself fires.
And of course, then
it sends currents
to the neurons downstream.
OK?
And the synaptic process itself
involves chemistry and whatnot,
but those details are
not important for us.
OK?
So what's going to happen is
that the learning basically
happens when a neuron helps
to make another neuron fire.
And then the strength of
the connection goes up.
This is called Hebb's Rule.
And as far as we
know, this is how
all our knowledge is encoded:
in how strong the synapses are.
If the neurons fire
together a lot,
then the synapses
become stronger
and it becomes easier
for the first neuron
to fire the second one.
So this is the basic idea.
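In symbols, one common textbook formulation of Hebb's rule (the exact symbols are an addition here, not from the talk) increases a synapse's strength in proportion to how active the two neurons it connects are at the same time:

$$\Delta w = \eta \, x_{\text{pre}} \, x_{\text{post}}$$

where $x_{\text{pre}}$ and $x_{\text{post}}$ are the activities of the presynaptic and postsynaptic neurons and $\eta$ is a small learning rate, so neurons that fire together repeatedly end up strongly connected.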
Now what we have to do is, first
of all, turn it into a model.
That's not that hard.
So the model is just this.
What my neuron's
going to do is it's
going to do a weighted
combination of the inputs.
OK?
Let's suppose, for example, that
this was actually the retina.
All right?
So this is the input.
Each one of these
inputs is a pixel.
And then each one of
them gets multiplied
by a weight that corresponds
to the strength of the synapse.
And if that weighted
sum exceeds a threshold,
then I get 1 as the output.
Otherwise, I get 0.
So for example, if this neuron
is trying to recognize a cat,
if this is a cat, then
hopefully what happens
is that this weighted
sum will be high enough
that this will fire and
the neuron will say,
yes, this is a cat.
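As a minimal sketch in Python (the pixel values, weights, and threshold are made up for illustration), that threshold-neuron model looks roughly like this:

```python
# A single model neuron: weighted sum of inputs, compared against a threshold.
def neuron(inputs, weights, threshold):
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum > threshold else 0   # 1 = fire ("it's a cat"), 0 = don't fire

pixels = [0.9, 0.2, 0.7]      # hypothetical pixel intensities
weights = [1.0, -0.5, 0.8]    # hypothetical synapse strengths
print(neuron(pixels, weights, threshold=1.0))   # prints 1: the weighted sum (1.36) exceeds 1.0
```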
Now this is all easy.
This goes back to the 1950s, and
Frank Rosenblatt, and whatnot.
The really hard, and
interesting, and important
question is, how do you train a
whole network of these neurons.
That's when things actually
become very difficult.
Because if you have a
big network of neurons--
so here are my inputs.
They go to one set of neurons.
Right?
These are the functions
that they compute.
Right?
They're in purple here.
And then, those go to another
layer, and many, many layers,
until finally, you
get the output.
Now the question is, well, if
the output is correct-- right?
So let's say this was a cat
and the network said that, yes,
it is a cat, then life is good.
Nothing needs to change.
Right?
Why go fixing what ain't broke?
But what happens if
there was an error?
Right?
This should have been
firing, but wasn't.
The key question
is, what do I change
in that whole big, messy
network to try to make it
give the right answer tomorrow.
There is no obvious
answer to this question.
Because think of
one neuron somewhere
in the middle of the network.
How is it responsible for
the error at the output?
All right?
The error at the
output could have
come from an infinitude
of different places.
This is called the credit
assignment problem.
Or maybe it should be
called the blame assignment
problem, because it's deciding
who's to blame for an error,
and therefore, needs to change.
And this is the problem
that backpropagation solves.
All right?
And when people first came
up with neural networks
in the '60s, they
didn't have this.
It was when, finally,
it was invented
in the '80s by David
Rumelhart and others
that things really took off.
And the basic idea in backprop
is actually quite intuitive,
I think.
It's the following.
Well, let's think
of the difference
between my actual output
and my desired output.
Let me call that delta.
Right?
This is the error.
The output should have
been 1, but it was 0.2.
So it needs to go up.
Another question is, what
can I tweak in these weights
to make it go up.
All right?
Well, think of the weights at this last layer. At that last layer,
the neurons with the highest
weights are the ones most
responsible for the result.
And if this neuron is
saying no but the answer
is yes, well, then
its weight needs
to go down, because it's
preventing it from firing.
And if this one was saying
yes but this is not firing,
then its weight needs to go up.
So what I do is I
compute the derivative
of this error with respect to
the weights in the last layer.
And then with those, I
now have an error signal
coming from these neurons.
It's like, oh, you
should have been higher.
All right?
You should have been higher in
order to make this guy fire,
because you have
a positive weight.
You should have been
lower, because you
have a negative
weight and you're
preventing it from firing.
So now I know an error signal
at the neurons at this layer.
And I can keep
doing the same thing
all the way back to the input.
And this is why it's
called backpropagation.
Because what I'm doing is I'm
propagating back the errors
and then updating the weights,
changing the weights in order
to make that error
as small as possible.
OK?
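Here is a rough numerical sketch of that loop in Python with NumPy (a tiny hypothetical one-hidden-layer network on a toy task, not any network mentioned in the talk): compute the output error, propagate it back through each layer with the chain rule, and nudge every weight to shrink the error.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # toy inputs
y = np.array([[0.], [1.], [1.], [0.]])                   # toy targets (XOR)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
W1, W2 = rng.normal(size=(2, 4)), rng.normal(size=(4, 1))   # random initial weights

for step in range(5000):
    hidden = sigmoid(X @ W1)                     # forward pass
    out = sigmoid(hidden @ W2)
    delta_out = (out - y) * out * (1 - out)      # error signal at the output layer
    delta_hidden = (delta_out @ W2.T) * hidden * (1 - hidden)   # error propagated back
    W2 -= 0.5 * hidden.T @ delta_out             # update weights to reduce the error
    W1 -= 0.5 * X.T @ delta_hidden

print(out.round(2))   # should approach [0, 1, 1, 0]
```

Deep learning frameworks do this same chain-rule bookkeeping automatically for much bigger networks.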
Well, this is the
backpropagation algorithm.
It's what's at the
heart of deep learning.
And these days, this is used for
just about everything on Earth.
Right?
Very early on, people
used it to do things
like predict the stock market.
These days, you
use it for search,
use it for ad placement, use
it for video recognition,
use it for speech
recognition, use it
for things like simultaneous
translation and whatnot.
But I think, at
least for the public,
the best known instance
of this is still
Google's own famous cat network.
It got called "the cat network" by the journalists, right, because it happened to recognize cats very well. And actually, that wasn't an accident, right?
This was all learned
from YouTube videos,
right, as you probably know.
And people really like to
upload videos of their cats.
So there was more data on
cats than on anything else.
But maybe it should be
called "the couch potato
network," because it's based
on watching a lot of video.
And at the time, this was
the biggest neural network
ever built. It had,
I think, on the order
of a billion parameters.
But of course, these days,
there's much bigger ones
and they're continuing to grow.
So we'll see how far we
can take this paradigm.
Now the Evolutionaries,
they say, well sure,
but all you're doing is
adjusting the weights
on this model.
You're just tweaking the
strengths of the synapses.
Where did that brain come from?
That, in some sense,
is the bigger question.
That brain was
produced by evolution.
So maybe what we really should
do if we want to learn really
powerful things is figure
out how evolution works.
And we actually already
have a good sense
of how evolution works.
There's a lot of
details, of course, that
are still to be understood.
But at a high level, we
understand how evolution works.
So let us do the same
thing on the computer.
And the first guy who
really ran with this idea
was John Holland.
He started in the '50s, '60s.
And for a long time, this whole
area of evolutionary computing,
the joke was that it
was John, his students,
and their students.
But then in the '80s, things
took off and a lot of people
started doing it.
John Holland called what he
did "genetic algorithms."
And then John Koza developed
a more powerful version
called genetic programming,
which we'll see in a second.
And Hod Lipson is
one of the people
who are doing very
interesting things
with evolutionary learning these
days, as we'll see in a bit.
So what's the basic idea here?
Well, how does evolution work?
You have a population of
individuals, each of which
is described by its genome.
OK?
But in our case,
in the computer,
the genome, instead
of being base pairs,
instead of being DNA, it's
just going to be bits, right,
because 0 and 1, in some sense,
are the DNA of computers.
And then each of
these individuals
gets to go out in the
world and be evaluated.
It gets evaluated at the task
that it's supposed to be doing.
And the individuals
that do better
will have a higher
fitness and will therefore
have a higher chance
of being the parents
of the next generation.
You get two very fit parents
and you cross over their genomes
just like people do.
And so now you
have a child genome
that is partly the
genome of one parent
and partly the genome of
the other parent, the mother
and father genomes, if you will.
And then you also have
random mutation, right?
Some bits just get
randomly mutated because
of copying errors or whatever.
And then you have
a new population.
And what's remarkable is
that you could actually
start out with a population
that is essentially random.
And after you do some number
of generations of this,
you actually have
things that are doing
a lot of non-trivial functions.
Like, for example, you
can evolve circuits-- radio
receivers, and amplifiers,
and things like that
in just this way.
And they often work
better than the ones
that are designed by humans.
In fact, people
like John Koza, they
have a whole bunch of
patents that were actually--
the patent designs
were actually invented
by the genetic algorithms,
not by the humans.
OK?
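A minimal sketch of that loop in Python (the genome length, population size, and the toy fitness function-- just "count the 1 bits"-- are arbitrary choices for illustration):

```python
import random

def fitness(genome):                 # toy fitness: number of 1 bits in the genome
    return sum(genome)

def crossover(mom, dad):             # cut both genomes at a random point and splice
    point = random.randrange(1, len(mom))
    return mom[:point] + dad[point:]

def mutate(genome, rate=0.01):       # occasionally flip a bit, like a copying error
    return [1 - bit if random.random() < rate else bit for bit in genome]

population = [[random.randint(0, 1) for _ in range(20)] for _ in range(50)]
for generation in range(100):
    parents = sorted(population, key=fitness, reverse=True)[:25]   # fitter half get to breed
    population = [mutate(crossover(random.choice(parents), random.choice(parents)))
                  for _ in range(50)]

print(max(fitness(g) for g in population))   # climbs toward 20 over the generations
```

Real genetic algorithms usually make the chance of being a parent proportional to fitness rather than simply taking the top half, but the shape of the loop is the same.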
Now Koza's idea was actually
to take this one step further.
And that's the following.
Well you know, these
strings are a very low-level
representation.
And just cutting a string
in the middle to cross over
is probably going
to muck things up.
Right?
We are trying to
evolve programs, right?
At the end of the
day, that's all
we're trying to do-- evolve programs.
So why don't we
actually work directly
with the programs themselves?
And a program is really a tree
of subroutine calls all the way
down to simple operations like
additions, multiplications,
and ands, and ors.
So let's represent the programs as trees.
And then in order to cross
over two programs, what we're
going to do is we're
going to randomly pick
a node in each of the
trees and then we're
going to swap the
sub-trees at those nodes.
And now we have
two child programs.
OK?
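As a rough sketch (the nested-list tree representation and the example expressions are hypothetical, not the talk's code), genetic-programming crossover can be written like this in Python:

```python
import copy
import random

def all_paths(tree, path=()):
    """List the paths to every node in a nested-list expression tree."""
    paths = [path]
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            paths += all_paths(child, path + (i,))
    return paths

def get(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def put(tree, path, subtree):
    for i in path[:-1]:
        tree = tree[i]
    tree[path[-1]] = subtree

def crossover(parent_a, parent_b):
    """Swap a randomly chosen subtree of one parent with one from the other."""
    child_a, child_b = copy.deepcopy(parent_a), copy.deepcopy(parent_b)
    path_a = random.choice([p for p in all_paths(child_a) if p])   # skip the root
    path_b = random.choice([p for p in all_paths(child_b) if p])
    sub_a, sub_b = get(child_a, path_a), get(child_b, path_b)
    put(child_a, path_a, sub_b)
    put(child_b, path_b, sub_a)
    return child_a, child_b

# Two made-up parent programs; crossover yields two children with swapped subtrees.
print(crossover(['*', 'k', ['sqrt', ['cube', 'd']]], ['+', 'x', ['*', 'y', 'z']]))
```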
So for example, here, if I do
the crossover at this point,
then one of the
sub-trees that I will get
is the one that's all in white.
And this tree
actually represents
one of Kepler's laws.
This is the law that gives the
duration of a planet's year
as a function of its average
distance from the sun.
It's actually a constant
times the square root
of the cube of the distance.
And that's what this
is representing.
So genetic algorithms
can actually
discover a thing like this and
much more complex ones as well.
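Written out as a formula, the law that subtree encodes is, with T the length of the planet's year, d its average distance from the sun, and k a constant:

$$T = k\,\sqrt{d^{3}} = k\, d^{3/2}$$

For instance, measuring distance in astronomical units and time in Earth years makes k = 1, so Mars, at about 1.52 AU, gets a year of roughly $1.52^{3/2} \approx 1.9$ Earth years.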
In fact, these days,
what the genetic folks
are having a lot of
fun with is something
that's exciting and
scary at the same time.
They're not just doing
this with software anymore.
They're not doing it as a
simulation inside the computer,
but actually doing it out
in the physical, real world
with robots.
You literally start with robots
that are random piles of parts.
And then once those
robots are good enough,
they actually get
printed by a 3-D printer.
And then they start
crawling, and walking,
and doing things in the real
world-- seeing how fast they
can crawl, trying to recover
from injury, and so on and so
forth.
This is actually a real
robot from Hod Lipson's lab.
OK?
And then what happens is
that in each generation,
the fittest robots
get to program
the 3-D printer to produce
the next generation of robots.
So if "Terminator"
comes to pass,
this might be how we get there.
Of course, these
robots are not yet
ready to take over the
world, but they've already
come remarkably far
from the soup of parts
that they began with.
Right?
And these days, if you see a
little spider running around,
take a good look.
Because it could be one
of these instead of one
of the biological ones.
All right.
So the Bayesians-- so here
are some famous Bayesians.
Judea Pearl, within
computer science,
is probably the most famous one.
He actually won the Turing
Award, the Nobel Prize
of computer science,
a few years ago
for inventing something called
Bayesian networks, which
is one very powerful
type of Bayesian model
-- a big graph where
each node is a variable.
And then the edges
between the nodes
represent dependencies
between the variables.
Two other famous Bayesians
are David Heckerman
and Mike Jordan.
Bayesians are known,
in machine learning,
as the most fanatical
of the five tribes.
All right?
They really have a
near-religious devotion
to their paradigm,
and you know, they're
the first ones to say so.
And I think the reason for
this is that Bayesians had
their origins in statistics.
And for 200 years, Bayesians
were a persecuted minority
in statistics.
So they had to become very
hard-core in order to survive.
And it's a good thing
that they did survive,
because they have a
lot to contribute.
And these days, with
computers, and algorithms
like Markov chain Monte Carlo,
and large data, and whatnot,
they're actually on the
ascendant in statistics.
So what is the basic idea
behind Bayesian learning?
Well, the basic idea
is that, as I said,
everything that you
learn is uncertain.
So what you have to do is
compute the probability
of each one of your
hypotheses and then update it
as new evidence comes in.
And the way you do that
is with Bayes' theorem.
OK?
And Bayesians love
Bayes' theorem so much
that there was this
Bayesian machine learning
startup that actually had a neon
sign of Bayes' theorem made.
And then they stuck it
outside their office.
So there's Bayes' theorem
in big, neon letters.
OK?
So how does Bayes' theorem work?
Bayes' theorem is actually
incredibly simple.
It's so simple it would barely be worth calling a theorem if it wasn't so important.
So the idea is this.
So let's suppose that you
have all your hypotheses.
You define your space of
hypotheses in some way.
And it could be, for example,
the set of all Bayesian
networks, or the set
of all neural networks,
or all decision
trees, or whatever.
And now the first thing
that you're going to have
is the prior probability
for each hypothesis.
This is how much you
believe in that hypothesis
before you've even
seen any data.
And this is actually what makes Bayesian learning very controversial: many statisticians say, well, you have no grounds on which to just make up a prior. And the Bayesians' answer to that is, you have to make that up,
whether explicitly
or implicitly.
So let's just be
explicit about it.
So the prior is how much you
believe in each hypothesis
before you see the evidence.
But then what happens is that
as the evidence comes in,
you update the probability
of each hypothesis.
A hypothesis that is
consistent with the data
will see its probability go up.
A hypothesis that is
inconsistent with the data
will see its
probability go down.
OK?
And the consistency of
the hypothesis with the data
is measured by what's called
the likelihood function, which
is the probability of seeing the
data if the hypothesis is true.
OK?
And this part is actually no different from frequentist statistics and the maximum likelihood principle, which basically says
that if your hypothesis makes
what you're seeing likely,
then conversely,
what you're seeing
makes your hypothesis likely.
OK?
And the Bayesians incorporate
that in the likelihood.
And the product of the
likelihood and the prior
is just the posterior,
which is how much you
believe the hypothesis after
you've seen the evidence.
So as you see more evidence,
the probabilities evolve.
And hopefully, at
the end of the day,
one hypothesis will come out as
clearly better than the others.
But that won't
necessarily be the case.
You might still be entertaining
a lot of hypotheses
even after you've
seen a lot of data.
There's also the
marginal, which is
just something that you have
to divide by to make sure
that the probabilities
add up to 1, so let's
not worry about it for now.
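Putting those pieces together in symbols, with h a hypothesis and D the data, Bayes' theorem is just:

$$P(h \mid D) \;=\; \frac{P(D \mid h)\, P(h)}{P(D)}$$

posterior equals likelihood times prior, divided by the marginal-- exactly the quantities described above.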
And a lot of great
things have been
done with Bayesian learning.
Like, for example,
self-driving cars
have Bayesian learning
in their brains.
So in some sense, Bayes' theorem
is helping to drive that car
or helping the car to
learn how to drive.
And without it, it
would be much harder.
But one application
of Bayesian learning
that everybody is
probably familiar with
is spam filtering.
The first spam
filter was actually
designed by David Heckerman
and his coworkers.
And they just used this very
simple Bayesian learner called
the naive Bayes classifier.
And the way it works
there is the following.
The hypothesis, before
I've seen evidence,
is that the email is a spam or
that the email is not a spam.
OK?
And the prior probability is
like your prior probability
of an email being spam-- 90%,
99%, 99.999%-- take your pick.
And then the evidence is the
actual content of the email.
So for example, if the email
contains the word "Viagra,"
that probably makes it
more likely to be spam.
If it contains the word
"free" in capitals,
that makes it even
more likely to be spam.
And if that "FREE" is followed
by four exclamation marks,
that makes it even
more likely to be spam.
OK?
On the other hand, if it
contains the name of your best
friend on the signature
line, that actually
makes it less likely to be spam.
OK?
And so what the naive
Bayes classifier does
is it incorporates
that evidence.
And at the end of the day,
it computes a probability
that the email is spam or
not spam taking all of that
into account.
OK?
And then based on
that probability,
you can decide whether to
filter it out or actually
show it to the user.
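Here is a toy sketch in Python of that naive Bayes computation (the prior and the per-word probabilities are invented numbers, not from any real filter): start from the prior, multiply in each word's likelihood under "spam" and under "not spam," and normalize to get the posterior probability of spam.

```python
import math

prior_spam = 0.9
# (P(word | spam), P(word | not spam)) -- made-up values for illustration.
likelihood = {
    "viagra":  (0.20, 0.001),
    "free":    (0.30, 0.020),
    "meeting": (0.01, 0.100),
}

def spam_probability(words):
    log_spam = math.log(prior_spam)        # start from the prior
    log_ham = math.log(1 - prior_spam)
    for w in words:
        if w in likelihood:
            p_spam, p_ham = likelihood[w]  # each word's evidence pushes the posterior up or down
            log_spam += math.log(p_spam)
            log_ham += math.log(p_ham)
    return 1 / (1 + math.exp(log_ham - log_spam))   # normalized posterior P(spam | words)

print(spam_probability(["free", "viagra"]))   # very close to 1
print(spam_probability(["meeting"]))          # much lower
```

A real filter would estimate those probabilities from counts over many labeled emails, but the update is the same.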
And we're all grateful
that spam filters exist.
Otherwise our mailboxes
would be unmanageable.
These days, all sorts of
different learning algorithms
get used for spam filtering.
But Bayesian learning
was the first one
and it's still used
in many spam filters.
OK.
Finally, the Analogizers--
so as I mentioned,
the basic idea of
the Analogizers
is that everything that we
do, everything that we learn
is reasoning by analogy.
It's looking at similarities
between the new situation
that we need to
make a decision in
and the situations that
we're already familiar with.
OK?
And one of the early pioneers
in this area was Peter Hart.
He proved some things related to
the nearest neighbor algorithm,
which we'll see
shortly, which is
kind of like the first
similarity-based algorithm.
Vladimir Vapnik is the inventor
of support vector machines,
a.k.a.
kernel machines, which is
the most widely used and most
successful type of
similarity-based learner.
But these are both
actually still
fairly primitive forms
of analogical reasoning.
There are some much
more sophisticated ones
that people, like for example,
Douglas Hofstadter, work on.
And Douglas Hofstadter,
of course, is also famous.
He's not just a cognitive scientist and computer scientist, he's also famous as the author of "Gödel, Escher, Bach," which, ironically, is actually, itself, an extended analogy between Gödel's theorem, the music of Bach, and the art of Escher.
And in fact, his
most recent book
is 500 pages arguing that all
of intelligence is just analogy.
So he really does
think that analogy
is the master algorithm.
And in fact, "the
terminologizer"
was coined by him.
So let's see how this works.
And I'm going to
do it by proposing
to you a very small puzzle.
And the puzzle is this.
Let's suppose I give you a
map of two countries, which
I fancifully called "Posistan"
and "Negaland" because
of positive and
negative examples.
And I don't tell you where
the frontier between the two
countries is.
I just tell you where
the major cities are.
So the major cities in
Posistan are the plus signs,
Positiveville is the
capital, and similarly
for the negative cities.
OK?
And now my question to
you is the following.
If I only tell you where
the major cities are,
can you tell me
where the frontier,
where the border between
the two countries is?
Of course, you can't
know for sure, right,
because the cities don't
determine the border.
But that's what the machine
learning problem is, right?
You have to generalize.
Now the nearest
neighbor algorithm
has a very simple
answer to this question.
It just says, I'm going to
assume that a point on the map
is in Posistan if it's
closer to some positive city
than to any negative city.
OK?
And the effect that this
has is to divide the map
into the neighborhood
of each city.
And then Posistan is just the
union of the neighborhoods
of the positive cities.
The neighborhood of
a city is the points
that are closer to it
than to any other city.
OK?
And then as a result, you
get this jagged straight line
frontier.
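In Python, the rule just described is only a few lines (the city coordinates below are invented for illustration): label a query point with the label of the closest city.

```python
# Hypothetical "cities": (coordinates, country), + for Posistan and - for Negaland.
cities = [((1, 1), "+"), ((2, 4), "+"), ((5, 2), "-"), ((6, 5), "-")]

def nearest_neighbor(point):
    def squared_distance(city):
        (x, y), _ = city
        return (x - point[0]) ** 2 + (y - point[1]) ** 2
    return min(cities, key=squared_distance)[1]   # label of the closest city

print(nearest_neighbor((2, 2)))   # "+"  -- falls on Posistan's side of the frontier
print(nearest_neighbor((5, 4)))   # "-"
```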
And what's remarkable about this
is that even though the nearest
neighbor algorithm is
extremely simple-- in fact,
at learning time, it
doesn't do anything, right?
It's an O of 0 algorithm.
It just sits there,
doesn't have to do
anything-- it can
actually form very, very
complicated frontiers.
In fact, if you
give it enough data,
it will converge very closely.
In fact, it will converge to
the best possible hypothesis
that you could ever have if
you use more than one neighbor.
But let's not go
into those details.
Now there's a couple of things
here that are not ideal.
One is that this
line is probably
not quite the right one, right?
Because the real frontier
is probably smoother.
It's not jagged like that.
And another one is that if
you look at this map closely,
you could actually
throw away some
of the cities, like
this one, for example,
and it wouldn't change anything.
If you threw away
this city, its area
gets absorbed by the areas
of these two other cities.
And the frontier
doesn't change at all.
The only cities that
you need to keep
are the ones that
actually take part
in defining the frontier, the
so-called "support vectors."
In general, these are
vectors in hyperspace.
And they're called
support vectors
because they're the ones that
keep the frontier where it is.
So often, you can throw away the
great majority of your examples
and that doesn't
change anything.
And of course, in this
example, this doesn't matter.
But when you have a data set
with millions or billions
of examples, that does matter.
And support vector machines,
or kernel machines, for short,
solve both of these problems.
So they have a
learning procedure
that lets them throw away
all the examples that are not
necessary to define
the frontier,
so they leave only these.
And they also make
the frontier smoother.
And the way they draw the
frontier is by saying, well,
let me try to walk from
south to north while
keeping the positive
cities on my left
and the negative
cities on my right.
But I always want to stay as far away from them as I can.
OK?
Think of the cities as mines and
think of this as a minefield.
Right?
If I told you to walk
all the way across a minefield,
you would try to give each
mine the widest possible berth.
You would try to maximize
your margin of safety.
And that's exactly what
support vector machines
do: they maximize the
margin between the frontier
and the nearest examples.
And in the days before
deep learning took off,
support vector
machines were probably
the most powerful
type of learning
that was commonly used.
OK?
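As a brief sketch of the same idea with an off-the-shelf library (this assumes scikit-learn is available; the points are again invented), a maximum-margin classifier keeps only the support vectors that pin down the frontier:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 4], [2, 2], [5, 2], [6, 5], [5, 4]])   # city coordinates
y = np.array([1, 1, 1, -1, -1, -1])                              # +1 Posistan, -1 Negaland

model = SVC(kernel="linear", C=1.0).fit(X, y)    # maximize the margin around the frontier
print(model.support_vectors_)                    # the few cities that define the frontier
print(model.predict([[3, 3], [6, 2]]))           # classify new points
```

The cities that do not show up in support_vectors_ could be thrown away without changing the frontier at all, which is the point made above.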
All right, let's look at
one application of this.
Again, this type of
analogy-based learning, people
have been doing
it since the '50s,
so it's used for just
about everything on Earth.
There is, however,
one application
that I think everybody
has experienced
even if they don't know
it's an application
of analogy-based learning.
And that is recommender systems.
OK?
So for example,
let's say that I want
to figure out what movies
to recommend to you.
The idea that folks
had almost 20 years
ago now was a very simple one.
It is, let me not
try to do that based
on the properties of the
movie, because that's hard.
Right?
You know, people's tastes
are very complicated things.
What I should do instead is what is called "collaborative filtering": find people who have similar tastes to you, meaning they gave five stars to movies that you gave five stars to, and one star to movies that you gave one star to.
And now if they give
five stars to a movie
that you haven't
seen, then I'm going
to hypothesize that, by analogy,
you will also like that movie.
And so I will
recommend it to you.
OK?
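A minimal sketch of that idea in Python (the users, movies, and ratings are all made up): find the user whose past ratings most resemble yours, and recommend their favorite movie that you haven't seen.

```python
ratings = {
    "you":   {"Alien": 5, "Up": 1, "Heat": 4},
    "alice": {"Alien": 5, "Up": 1, "Heat": 5, "Solaris": 5},
    "bob":   {"Alien": 1, "Up": 5, "Frozen": 5},
}

def similarity(a, b):
    shared = set(ratings[a]) & set(ratings[b])
    # Crude similarity: negative average disagreement on movies both have rated.
    return -sum(abs(ratings[a][m] - ratings[b][m]) for m in shared) / max(len(shared), 1)

def recommend(user):
    neighbor = max((u for u in ratings if u != user), key=lambda u: similarity(user, u))
    unseen = {m: r for m, r in ratings[neighbor].items() if m not in ratings[user]}
    return max(unseen, key=unseen.get)   # the neighbor's top-rated movie you haven't seen

print(recommend("you"))   # "Solaris" -- alice's tastes match yours, and she loved it
```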
And this turns out to
work spectacularly well.
It works so well, in fact,
that 3/4 of Netflix's business
comes from its
recommender system.
And Amazon also has
a recommender system.
About 1/3 of its business comes
from its recommender system.
And every e-commerce
site worth its salt
has something like this.
And of course,
these days, people
use all sorts of different
learning algorithms for it.
But the first one was this
nearest neighbor type method.
And it's still one of the best.
OK?
So stepping back, we've
met the five tribes
of machine learning.
We've seen that each one of them
has a problem that it can solve
better than all the others.
And it has a particular
master algorithm
that solves that problem.
So for example, the
problem that the Symbolists
solve that none of the
others know how to solve
is the problem of
learning knowledge
that you can compose
in many different ways.
And they learn that knowledge
with inverse deduction,
as we saw.
Connectionists solve the
credit assignment problem
using backprop.
Evolutionaries solve the problem
of learning structure, right?
The Connectionists just
start with a fixed structure
and just adjust the weights.
The Evolutionaries
know how to come up
with that structure in
the first place using
genetic programming.
The Bayesians are all
about uncertainty.
They know how to
deal with the fact
that all the knowledge that
you learn is uncertain.
They know how to update the
probabilities of hypotheses
as they see more data.
They do that using probabilistic inference, which is essentially a collection of computationally efficient ways
to apply Bayes' theorem to
very large sets of hypotheses.
And finally, the Analogizers,
they can reason by similarity.
They can actually generalize
from just one or two examples.
All right?
Think of Niels Bohr.
He came up with the first
theory of quantum mechanics
by doing an analogy between
the atom and the solar system,
generalizing from one example.
None of the others can do this.
And probably the best
known analogizer algorithm
these days is kernel machines.
But the thing that
I want to point out
is that, precisely because
each of these problems
is real and important, none
of the individual algorithms
are enough.
What we really need
is a single algorithm
that solves all five
problems at the same time.
We need a "grand unified
theory" of machine learning
in the same sense that
the standard model is
a grand unified theory of
physics or the central dogma
is a grand unified
theory of biology.
And in fact, a bunch
of us have been working
on this problem for a while.
And we've actually
made a lot of progress.
We're still far from
the end of this,
but let me just give you a sense
of where we are at this point.
So if you think for a moment about this problem-- we have these five algorithms, or these five types of learning; how can we unify them all into one?-- at first, this seems like a very hard problem.
And in fact, some people
have claimed that it's
an impossible problem to solve.
It seems very hard,
because the algorithms all
look very different.
But if you look at them
closely, actually, they're
not that different.
They all have the same
three parts, representation,
evaluation, and optimization.
So let's look at
what those parts are
and then how we can
do the unification.
So representation is how
the learner represents
what it's learning, the model or
the program that it's learning.
Right?
It's, in some sense,
the programming language
in which the learner is going
to write the algorithm that it
discovered.
Typically, it's not
going to be Java, or C++,
or anything like that, it's
going to be something like
first-order logic.
All right?
But it could be
differential equations.
It could be a linear regression.
It could be all sorts of things.
So the first thing
that we need to do
is to unify the representations.
And a natural
thing to do here is
to start with the
representations
that the Symbolists use,
which are variations
on first-order logic,
and the representation
that the Bayesians use,
which are generally
known as graphical models.
Bayesian networks are one
type of graphical model.
Another type is Markov
networks and so on.
Each of these is already
extremely general.
If you can combine the
two, you can pretty much
represent anything you
might want to represent.
Any computer program
can, for example, already
be represented in
first-order logic.
Any way to deal with uncertainty
and weighing evidence
can be represented
in graphical models.
There's bazillions
of different models
that people have in
statistics that all fit
into that framework.
So if we can combine
the two, then we
have a very good
representation to start with.
And indeed, we have done that.
In essence, what we've
developed is various forms
of probabilistic logic.
So this is a logic
that also incorporates
probability and uncertainty.
And the most widely used is
called Markov logic networks.
It's essentially a combination
of logic and Markov networks.
And it's very simple.
It just starts with formulas
in first-order logic.
Think of a rule in logic,
like if this, then that,
for example.
And then what it does is it
gives each rule a weight.
So if you really
believe the rule,
you give it a high weight.
If you're not sure, you
give it a lower weight.
And then the probability
of a state of the world
goes up with the number
and the weight of the rules
that are true in that world.
OK?
So with this, we can
represent pretty much anything
that we'd like.
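In symbols, that is the standard Markov logic network distribution: the probability of a possible world x grows exponentially with the total weight of the formulas that hold in it,

$$P(x) \;=\; \frac{1}{Z}\,\exp\!\Big(\sum_i w_i\, n_i(x)\Big)$$

where $w_i$ is the weight of formula i, $n_i(x)$ counts how many of its groundings are true in x, and Z is a normalizing constant so the probabilities add up to 1.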
Now the next part of
every learning algorithm
is the evaluation.
The evaluation is the scoring
function that tells me
how good a candidate model is.
How well does it fit the data?
How well does it
fit my purposes?
In essence, the learning problem is to find, within the space defined by a representation, the program that maximizes my evaluation function.
OK?
So what should our
evaluation function be?
Well, one obvious candidate is
just the posterior probability
that Bayesians use.
And that, again, already has a lot of other things as special cases.
But more generally,
the evaluation
shouldn't really be
part of the algorithm.
It should be
provided by the user.
It's for the user to decide
what the learner should
be optimizing.
So if you're a company and your
purpose is to maximize profits,
then that's what the
evaluation function should be.
If you're a consumer
and your purpose
is to maximize your
happiness, then
that's what should be maximized: some measure of your happiness.
OK?
So what the master algorithm
should be able to do
is take anybody's
objective function
and then learn to optimize that.
OK?
Finally-- right?
I just said the word
"optimize"-- the third part
of this is optimization: how do we actually
find the model that
maximizes that function.
OK?
And here, there's a
natural combination
of ideas from
genetic programming
and backpropagation, namely,
to discover formulas,
we can use genetic programming.
Each formula in first-order
logic is a tree.
Right?
And now I can cross
these trees over
and apply the genetic
process to come up
with better formulas that
better describe my domain.
And then once I have
those formulas, of course,
if I'm using Markov
logic, I need
to come up with weights
for those formulas.
But of course, this is
where backprop comes in.
All right?
I have my big chain
of reasoning involving
many different formulas, and
facts, and different steps.
And all of those have weights.
And in order to
learn those weights,
I can naturally use
backpropagation.
OK?
So we're pretty
far along in this.
We haven't succeeded yet.
But some people think it's only
a matter of time before we do.
I'm actually a
little less gung-ho.
I think that even after we've
successfully unified these five
paradigms, there are still
major ideas missing, ideas
that we haven't had and
without which we will not
have a truly universal learner.
And in fact, part of my
goal in writing the book
was to open it up to people
who are not already machine
learning researchers so they
can think about the problem
and maybe have ideas that
people who are already
thinking along the tracks of
one of the tribes wouldn't have.
So if you figure out how
to solve this problem,
let me know so I can
publish your solution.
OK.
Let me conclude by talking a
little bit about what I think
that the master algorithm will
enable that we cannot do today.
I have four items here.
There are, of course, more.
But I think these
four give a good sense
of just how momentous a
development this would be.
The first one is home robots.
We would all love to have
robots that cook dinner for us,
and do the dishes, and
make the beds, and whatnot.
But why don't we
have them today?
Well, first of all, it can't be
done without machine learning,
right?
There's just no way
to program a robot
to do all the things
that it might have to do.
But second of all, the learning
algorithms that we have today
are not good enough.
Because a robot, a home robot,
in the course of a normal day,
will run into all five of
those problems multiple times.
So it needs to be able
to solve all of them.
So with a master algorithm,
we're on our way.
But without it, I think it'll
be much harder and much slower
if we ever get there.
Here's another one.
Everybody, including,
of course, Google,
has a project to try to turn
the web into a knowledge base.
Right?
Instead of issuing keyword
queries and getting back
pages, what I
would like to do is
ask questions and get answers.
But for that, all the
knowledge that's on the web
has to be represented in a
way that the computer can
reason with, something like,
for example, first-order logic.
On the other hand, the web
is full of contradiction,
and noise, and
gaps, and whatnot,
so it's going to be very noisy.
So I need probability.
OK?
So again, I need to be
able to combine those five
types of learning
in order to really
be able to extract this
knowledge base from the web.
OK?
Here's perhaps the most
important one, cancer.
Why haven't we cured cancer yet?
The problem is that
cancer is not one disease.
Right?
Everybody's cancer is different.
And in fact, the same cancer
mutates as it goes along.
So it's very unlikely
that there will ever
be one drug that
cures all cancers.
The real cure for
cancer-- or at least,
that's what an increasing number
of cancer researchers believe--
is something like
a learning program
that takes in the genome of the
patient, the medical history,
the mutations in
the tumor cells,
and predicts, for
that tumor, what
is the drug that's going
to kill it without harming
the patient's normal cells--
or maybe the sequence of drugs,
or a combination of
drugs, or perhaps even
a drug that will be
designed from scratch
for that particular patient.
OK?
In some ways, this is not
a very different problem
from recommender systems that
recommend a book or a movie
to you.
However, what gets recommended here is a drug. The problem, though, is that this problem is orders of magnitude harder than the problem of recommending a movie or a book.
You have to understand
how the cell works.
You have to understand
the interactions
between the genes and the
proteins that they make.
And then they go back
and regulate the genes.
And it's when this
machinery gets out
of whack that you get cancer.
OK?
The good news is
we actually have
a lot of data to do this,
things like microarrays,
and sequences, and whatnot.
But again, with the learning
algorithms that we have today,
we're not going
to be able to do it.
With something like
the master algorithm,
we will be able to do it.
OK.
Finally, apropos of
recommender systems,
let me mention this one.
What I would really like
to have as a consumer
is not 500 different
recommender systems
recommending 500
different things
to me-- Netflix
recommending movies
and Amazon recommending books
and Facebook selecting updates
and Twitter selecting tweets.
What I want is a complete
360-degree model of me.
Learn from all the
data that I generate.
And then that model knows
me much better than all
these tiny, little
models and, as a result,
can make much better
recommendations-- and not
just recommendations
of small things,
but recommendations of
jobs, recommendations
of houses, recommendations of
what to major in, or-- oops,
I guess I skipped
this slide here.
As the former CTO of the
federal government said,
if we use these
things, we can actually
have a recommender system
that is, in essence, your best
friend throughout your life,
recommending the things
that you need at every step.
OK?
And again, in order to do that,
we need not just the data,
which, increasingly, we
have, we need the algorithms
that are powerful enough to
learn that rich model of you.
OK?
More about all these
things in the book,
"The Master Algorithm."
And thank you for listening
and I'll take questions.
[APPLAUSE]
AUDIENCE: The representation
that you proposed,
the Markov logic network,
can it cover natural language?
Like, is there a
one-to-one mapping between
the representation and the--
PEDRO DOMINGOS: Yeah.
In fact, one of
the biggest areas
where Markov logic has been
applied is natural language.
It's a very good match
for natural language,
because natural language is
both very compositional-- right,
so you need the logic for
that-- and also very ambiguous,
very uncertain, very noisy.
So you need a
probability for that.
So people, at this point, have
applied Markov logic networks
to pretty much every major
problem in natural language,
and won competitions
using it, and so forth.
So in some ways,
natural language
is one of the killer apps
for Markov logic networks.
AUDIENCE: Is there a
way for us to learn it using the genetic programming algorithms?
PEDRO DOMINGOS: So the
question is, are we
able to learn Markov
logic networks using
genetic programming.
People haven't done that yet.
So we've learned-- there's
a wide range of algorithms
for learning the structure
of Markov logic networks.
They are similar to
genetic programming,
but without the crossover.
All right?
The Evolutionaries really
believe in crossover.
Everybody else in
machine learning
thinks something like hill
climbing, greedy search,
or beam search is
probably enough.
And there are many
different search
methods that have been used
and that work pretty well.
It's actually a good question.
That part has not been done
yet, whether actually doing
the crossover between
the formulas will help.
AUDIENCE: Related question
to genetic programming--
how is the solution
space limited?
Because it seems that
combining various sub-trees
would lead to a rapidly,
exponentially growing solution
space.
And has there been any
success with NP-hard problems?
PEDRO DOMINGOS: So something
that people have observed
in genetic programming is
that the trees tend to get
bigger and bigger over time.
People jokingly call it the
"survival of the fattest."
What you can do
to combat that is
to actually have
a bias, a penalty
on the size of the trees.
So if the trees are big, they're
less fit just because of that.
You put that into your
objective function.
AUDIENCE: NP-hard
problems-- any success
in that area with genetic
algorithms or programming?
PEDRO DOMINGOS: Well, the
short answer to that is yes
and the longer
answer is yes, but.
A lot of the
problems that people
are approaching with
genetic algorithms
are NP-hard problems.
The question that the people
who aren't Evolutionaries ask
is, did you really
solve-- so first of all,
they're NP-hard problems, right?
So you can't solve the
worst, the hardest instances
efficiently.
But there are many instances
that aren't the hardest.
And so, can you solve
them well enough?
And there are examples of
genetic algorithms solving
these problems.
But there are also
counterexamples
of people saying,
look, I could have
done that with hill
climbing and it
would have been just as
fast or just as good.
So the jury's still out on this.
AUDIENCE: So with your
360-degree recommender systems,
what's to keep that from being
self-fulfilling or guiding
a person into some particular
path that's not, perhaps,
[INAUDIBLE]?
PEDRO DOMINGOS: What's
going to keep that
from being self-fulfilling is
the way machine learning works,
which is the system
recommends something to you.
Or for example, it recommends
10 alternatives to you.
All right?
Or let's say it recommends
one, but then you say, no,
this was the wrong one.
Then it learns from that.
So the recommender
system is not something
that you learn one day
offline and then you use it.
It's something that
is continuously
learning from what you do.
So if it starts doing the wrong
thing, you start being unhappy
and you start displaying
that unhappiness
in one form or another.
And then it learns, from
that, to try something else.
And also remember,
these systems, they
can talk with each
other to the extent
that you decide that they can.
So it won't just be
learning from you,
it will be learning
from a lot of people.
All right?
So it's always the
question of, how far do
you have to generalize, right?
The more data that you have,
the easier the problem becomes.
And the more you have a
continuous loop of feedback,
the more robust the learning is.
AUDIENCE: The 360
recommender, how do you
square it with the need
for privacy for people?
PEDRO DOMINGOS: Yeah, so that's
a very important question
which I didn't allude to.
I want to have a
360-degree model of me,
but I want it to be
under my control.
Because if somebody
else has a model of me
that knows me better
than my best friend,
they have too much power.
So I think what needs to happen
is that-- right now, what
happens is that the
data that you generate
is spread all over the place.
It would be good to bring
it all to one place,
but that place has to
be under your control.
And I think one way
that this could work
is that you could have a
company that is to your data
like a bank is to your money.
Right?
You put your money in the
bank, and then the bank
invests it for you,
and so on and so forth.
But ultimately, the money
is under your control.
So I think the same thing
should happen with your data.
So this company
aggregates the data.
It learns the model.
It uses the model
on your behalf.
But at the end of
the day, it's for you
to decide whether,
maybe, for example, you
want to take the
data somewhere else
or you want to do
something else with it.
I think if you don't do
that, people won't trust
this enough for it to happen.
AUDIENCE: What are your
thoughts on TensorFlow?
PEDRO DOMINGOS: I haven't
played with TensorFlow myself.
I think, first of
all, it's great
that TensorFlow
has been released.
I think releasing
open source software
like this is a large part
of how we make progress.
And definitely, deep learning
and doing it on a large scale
are very important.
There are a number of other
alternatives out there.
We'll see how TensorFlow
compares with them.
And we'll also see-- like just
from my own point of view--
I have some of my students
working on deep learning--
the question with each
one of these systems
is, what does it
support well versus not.
I think if the learning
that you're trying to do
fits within the
paradigm of TensorFlow,
then it's probably
a good thing to use.
But if it doesn't, then you
may need something else.
Or maybe what's going to
happen is that there will
be an extension of TensorFlow to
do these things in the future.
AUDIENCE: One of
its goals, actually,
is to become a universal system
for expressing AI solutions.
PEDRO DOMINGOS: Yeah, and I very
much sympathize with that goal.
I think that in order to
be a universal system,
you have to cover
these five paradigms.
But one way to cover them
is to start from one--
let's say deep
learning-- and then try
to absorb another, and
another, and another.
And there are people
doing this kind of thing.
Like for example, we
have the Alchemy System,
which combines the Symbolist
and the Bayesian learning.
We've also developed,
for example,
combinations of
symbolic learning
and instance-based learning
and this with neural networks.
So what I hope to see from
a platform like TensorFlow
is see it absorbing more and
more of these capabilities.
And absorbing them
doesn't necessarily
mean going and giving
us some primitives
to do what that school does.
That's OK, but it also
increases the complexity.
The ideal thing
would be you still
have something that is simple,
but yet with that simple thing,
you can do all these
combinations of things.
And I do think that's possible.
And I'm curious to
see how this is all
going to evolve in the
next several years.
MALE SPEAKER: And
with that, please
join me in thanking
Pedro for this talk.
PEDRO DOMINGOS: Thank you.
[APPLAUSE]
