It's a great pleasure for me to
be here and to lecture here.
I don't know if you know that,
but there is really this close
connection between DeepMind and
UCL, which started with two of
our founders, Demis and Shane,
meeting as postdoctoral fellows
at UCL and then going on to
found DeepMind.
So, really at the very root of
things, DeepMind and UCL are
connected, and I think it's great
that we can celebrate that and
use it
by putting out this lecture
series and sharing these
thoughts together. I will only
be the first of several
lecturers.
Here are all the lecturers
within the series, and you will
get to know these people if you
stick through their lectures, and
they have a lot of interesting
and wonderful things to say.
And, towards the end of this
lecture, I'm going to go through
the topics that they will cover
and try to motivate them in the
context of the larger field of
deep learning.
Here's the plan for the lecture.
First we'll solve intelligence,
and then we'll do some other
things.
When I have the words "Solving
Intelligence" here, that refers
to the first part of DeepMind's
mission. DeepMind's mission has
these two parts. First, solve
intelligence and, second, use it
to solve everything else.
And, while that is, of course,
an audacious mission statement,
we do believe that it's a great
North Star to guide our research,
and so far it has created a lot
of great momentum.
So, I could talk about deep
learning in a very broad sense,
giving examples from all walks
of life, but I'm sure you're
familiar with most of those.
And so I'll take a more personal
view, a more DeepMind-centric
view also, because that's the
work that I can authentically
speak about.
And so what I would like to do
is go through three case studies
of successful deep learning
applications, to show you the
power of deep learning and to
motivate you to study it
further.
And, in these three case studies,
a lot of things will come up
that you might not understand at
this point in time fully, but
then I can assure you that the
subsequent lecturers will fill
in those gaps and make you
appreciate what's happening
here.
So, the first case study is
AlphaGo and AlphaZero.
The second one has a little more
action to it:
learning to play the game of
Capture the Flag. And the third
one goes beyond games and is
about folding proteins with
AlphaFold, a deep learning-based
system. And it takes us into the
realm of biology and science.
And then, finally, in the last bit of
the lecture, I want to go over
the pieces that the subsequent
lectures will deliver and put
them into a greater context,
to tell you what's out there and
why it's worth learning about
these things.
So, let's start with solving
intelligence.
The hallmark of human
intelligence is its generality,
and nobody has expressed this in
a crisper way than the science
fiction author Robert A.
Heinlein.
So, he says, "A human being
should be able to change a
diaper, plan an invasion,
butcher a hog, conn a ship,
design a building, write a
sonnet, balance accounts, build
a wall, set a bone,
comfort the dying, take orders,
give orders, cooperate,
act alone, solve equations,
analyse a new problem, pitch
manure, program a computer, cook
a tasty meal, fight efficiently,
die gallantly. Specialisation is
for insects." Now, nothing
against insects because they are
actually smart in their very own
way, and I'm not sure we have
actually fully reached that
level of intelligence yet, but
for the purpose of this
definition, the idea is: The
ability to do a wide range of
things well is a
characterisation of
intelligence.
Now, my colleague Shane Legg is
very passionate about the
definition of intelligence, so
passionate that he sifted
through over 70 definitions of
intelligence before he arrived
at his own synthesis, and his
definition is: "Intelligence
measures an agent's ability to
achieve goals in a wide range of
environments." You see how
that's connected to the Heinlein
quote and to what we think of as
intelligent behaviour.
Now, why is this important?
Well, if we want to create
artificial intelligence, we
better have some kind of idea of
how to measure success, how to
know when we have an intelligent
agent.
Now, Shane also has mathematics
close to his heart, and so he
has a formal theory of this
definition of intelligence.
And we will not go into the
details, but I would just like to
briefly point it out.
So, this measure of intelligence
here, on the left, is a function
of pi, a policy, where a policy
determines what action to take
in a given state. Now, this
measure of intelligence is
expressed as the sum over
environments.
And this term represents the
breadth of all the things that
an intelligent agent should be
able to do.
And he formalises this in the
framework of algorithmic
information theory.
And so he talks about the sum
over all computable environments.
Now we need something that
indicates success.
And that's this term here.
The value that policy pi creates
in environment mu.
So, how successful is that policy
when we expose it to that
particular task or environment?
And here, this thing in the
middle is a complexity penalty,
a weighting term: K of mu is the
Kolmogorov complexity of the
environment mu.
And so what this says is that if
this complexity is low, then this
term is large.
And if this complexity is high,
then this term will be smaller.
And so the definition gives more
weight to simple environments
and then progressively less
weight to more complex ones.
Of course, there are many more
complex environments than simple
ones.
And so it also acts as a
normalisation.
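
For reference, this is the formula being described here, written out in Legg and Hutter's notation (a reconstruction from their paper; the symbols on the slide may differ slightly):

```latex
% Universal intelligence of a policy \pi:
% a complexity-weighted sum, over all computable environments \mu
% in the set E, of the value the policy achieves in each of them.
\Upsilon(\pi) = \sum_{\mu \in E} 2^{-K(\mu)} \, V^{\pi}_{\mu}
```

Here K(mu) is the Kolmogorov complexity of the environment, so the weight 2^(-K(mu)) is large for simple environments and small for complex ones, exactly as just described.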
Now, the notion of the value and
the policy come from an AI
framework called reinforcement
learning, and in many ways we
think of it as a very general
purpose framework for AI.
The idea is that there's an
agent, and the agent interacts
with an environment, and that
environment poses some kind of
task or problem to the agent, if
you like.
And so the agent here,
represented as a neural net,
observes the state of that world
and can then take an action in
that world and influence the
world.
And, once it has taken that
action, it receives the
subsequent observation of what
has happened as a consequence of
that action, and it receives the
reward, symbolised by the star
here: some kind of positive
impact on the agent, some
measure of success. And the goal
of this agent is to learn a
policy,
you know, the pi from the
previous definition, such that
it maximises long-term reward.
So, ideally, it doesn't just go
for the immediate reward, but it
plans ahead.
It tries to act in such a way
that in the long term it will be
successful in this environment.
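
As a minimal sketch of that interaction loop, here is what it might look like in code. The environment follows the common Gym-style step/reset interface, and `agent` is a hypothetical object standing in for any learning policy; neither name comes from the lecture.

```python
# Sketch of the agent-environment loop described above (one episode).
def run_episode(env, agent):
    observation = env.reset()  # the agent observes the initial state
    total_reward = 0.0
    done = False
    while not done:
        # The policy pi maps the current observation to an action.
        action = agent.act(observation)
        # The environment returns what happened and a reward (the "star").
        observation_next, reward, done, info = env.step(action)
        # The agent updates itself to maximise long-term reward.
        agent.learn(observation, action, reward, observation_next, done)
        observation = observation_next
        total_reward += reward
    return total_reward
```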
Now, the beautiful thing is that
this framework is so general -
some people would say overly
general - that it encompasses
things like unsupervised
learning and supervised learning
as special cases.
But, if you want to learn more
about it, there is actually a
module that colleagues of mine
are teaching at UCL here,
on reinforcement learning. So,
this framework will be important
going forward
in those first two applications
of deep learning that I'm going
to talk about. And the reason is
that the combination of deep
learning and reinforcement
learning -
we also refer to it as deep
reinforcement learning -
is such a powerful combination
that we can use to solve
interesting interactive problems
out there in the world.
Now, we started a lot of our work
with games, and you might be
aware of the work on Atari games
very early on.
And why do we do that?
Well, first of all, games are a
bit like this reinforcement
learning set-up: you interact
with this world and try to solve
problems. Really often, they're a
microcosm of the real world.
If you think about typical games,
you know, they are about value.
Monopoly is about money and
buying and selling, and chess
has spatial dimensions and
time built into it.
This is a war game.
They have been designed to
stimulate intelligence.
The designers of these games
specifically want to stimulate
human intelligence.
So, clearly they must have some
aspect to them that is of
interest when we want to build
intelligence.
The great thing is we can
simulate games.
You know, we can set up
simulations, large-scale
computer simulations, and learn
very quickly by playing these
games.
And, finally, games are great for
us to measure progress, because
often there's some kind of
winning or score or success
measure associated with games.
Just think of video games with the
little score indicator, you want
to ramp up that score.
That's great.
That's a way of measuring
progress.
And, in the context of RL,
of reinforcement learning, it can
also serve as a reward.
So, that's what it looks like
then if we apply deep
reinforcement learning to a
game,
in this case Pong. You know, the
ideas are the same, the agent
observes the environment and
takes actions, and then the score
here is the reward that the
agent gets, the reward that it's
trying to maximise in the long
term.
And then, in deep reinforcement
learning, the policy here, the
thing that decides in a given
state what action to take,
joystick up, joystick down and so
on, based on seeing this pixel
image in this case, that is
parameterised by a neural
network whose parameters we're
trying to adapt so that the
agent has success in the long
term.
For example, optimising a long-term
discounted reward.
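
To make "long-term discounted reward" concrete, here is a small sketch of how a discounted return is computed from a sequence of per-step rewards. The discount factor gamma is a standard notational choice, not a value quoted in the lecture.

```python
def discounted_return(rewards, gamma=0.99):
    """G = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
    Later rewards count for less, but they still count, so the
    agent is encouraged to plan ahead rather than be greedy."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: a single point scored at the third time step.
assert abs(discounted_return([0.0, 0.0, 1.0]) - 0.99 ** 2) < 1e-12
```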
So, in this particular
application,
my colleagues ran these
reinforcement learning
algorithms over close to 50
different Atari games and
achieved a superhuman level in a
lot of these games, really by
putting the controller into the
hand of the reinforcement
learning algorithm, letting it
play these games, observing the
screen, if you like, the pixels
on the screen, and training the
system to maximise reward or
game score in these games.
And I just want to use this as
an example for how to think
about Shane's definition.
You remember there was this sum
over environments and the value
that the agent generates in
these environments.
Now think of the set of
environments as these games.
And then, if we have an algorithm
that can do well in all of these
games, then according to Shane's
definition we might be tempted
to say that agent has
acquired some degree of
intelligence.
Of course, not anywhere close to
human intelligence.
But, you know, it's just that
ability to solve many different
tasks to a high standard that is
the hallmark of intelligence.
Okay,
so, what's the role exactly of
deep learning here?
Well, in previous machine
learning work, prior to the deep
learning wave, if you like, for
every problem that you wanted to
solve with machine learning, you
first needed to define features
that describe the state of the
problem. For documents, there
would be bag-of-words features
and, for visual problems, there
would be particular filters that
people defined, edge detectors
and so on.
And the new thing with deep
learning, not new now,
but, you know, it was back then,
is to enable end-to-end
learning, to put the raw
features, the pixels, the raw
description of the problem in
and learn the desired input-
output mapping,
just given the loss, how you
measure success, and the
architecture of your neural
network. That's really what I
would call the definition of
deep learning.
Now, one beautiful thing about
deep learning is that, through
the architecture, we can put
prior knowledge into the
solution of our problem, and that
makes learning easier.
In other words, it requires less
training data if we can do that
and hence also less compute.
And we'll talk a little more
about this later,
but this prior knowledge could,
for example, be about space and
be about time, those basic
Kantian notions, if you like.
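
As an illustrative sketch of that idea (my example, not the lecture's): in a framework like PyTorch, picking a convolutional layer builds in a prior about space, and picking a recurrent layer builds in a prior about time.

```python
import torch.nn as nn

# A convolution applies the same small filter at every spatial position:
# the prior that a local pattern means the same thing anywhere in the image.
spatial_prior = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)

# A recurrent layer applies the same update at every time step:
# the prior that the same dynamics govern each moment in a sequence.
temporal_prior = nn.LSTM(input_size=16, hidden_size=32)
```

Because the weights are shared across positions or time steps, there are far fewer parameters to learn, which is why these priors reduce the data and compute needed.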
Now, what makes deep learning
possible and attractive
now? I would argue it's the
great computational power that
we have available now, GPUs,
TPUs, and so on.
It's the large amount of data
that we now have, generated by
mobile devices, online services,
distributed sensors, labels
generated by crowdsourcing, by
people, if you like.
And, finally, our better
understanding of algorithms and
architectures. And there's a
great opportunity here because a
lot of these algorithms are out
there, you know, they're on
GitHub, you can download and play
with them, and a lot of the
papers are on arXiv, you know,
as soon as they're written,
people upload them on
arXiv, and there's a huge
repository of information there
to get started in deep learning.
Okay,
I would like to move on now to
these case studies and start off
with AlphaGo and AlphaZero.
Some of you may have heard about
these projects, and so I hope I
can deliver some more details on
these and give you the general
gist.
The core paper that I want to
talk about is this paper: "A
general reinforcement learning
algorithm that masters chess,
shogi, and Go through self-play"
with my great
colleagues David Silver, Thomas
Hubert,
Julian Schrittwieser, Ioannis
Antonoglou,
and others. What you see here is
the scene where Aja, who we also
call "The Hand of AlphaGo" because
he acts as AlphaGo's hand when
it plays,
and here Lee Sedol on the other
side in this 2016 match, which
was captured in this Netflix
documentary that you might want
to check out if you haven't seen
it.
So, what's the problem with Go?
Go is a beautiful game, complex,
with beautiful strategies. It
doesn't take long to learn but a
lifetime to master, and the
problem is that there are so
many different moves in any
given position. There are 361
vertices on which you can take
turns to place black and white
stones to surround territory, and
there's just so many different
ways in which games can develop.
And that's where deep learning
kicks in, and in particular we
use two neural networks to reduce
the size of the search space,
the space of possible games in
which we need to do our
planning.
The first one we call the
policy network, and the policy
network takes as input a raw Go
position, characterised by empty,
black, and white points on this
19 by 19 grid, and maps it to a
probability distribution over
moves.
So, given a Go position, this
thing gives you a probability
for each move being played in a
particular position.
Now, we have a second neural
network we
call the value network, and the
value network also takes a given
position but it just produces
one number, basically -
the evaluation of that position.
Is this position good for black,
or is it good for white?
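
As a minimal sketch of those two networks, assuming a flattened encoding of the 361 board points (the real AlphaGo networks were much deeper convolutional networks, so treat the shapes here as illustrative):

```python
import torch
import torch.nn as nn

BOARD = 19 * 19  # 361 points, each empty, black, or white

# Policy network: position in, probability distribution over moves out.
policy_net = nn.Sequential(
    nn.Linear(BOARD, 256), nn.ReLU(),
    nn.Linear(256, BOARD), nn.Softmax(dim=-1),
)

# Value network: position in, a single evaluation out.
value_net = nn.Sequential(
    nn.Linear(BOARD, 256), nn.ReLU(),
    nn.Linear(256, 1), nn.Tanh(),  # -1 to 1: good for one side or the other
)

position = torch.zeros(1, BOARD)   # an empty board, as one example input
move_probs = policy_net(position)  # shape (1, 361): one probability per move
evaluation = value_net(position)   # shape (1, 1): who is ahead here?
```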
And how was this trained in
AlphaGo?
Well, we were lucky, we had
access to a lot of human game
records that people had recorded
from very strong players.
And so the first thing we could
do was imitation learning.
We could use deep learning to
learn the policy network to
really just learn to imitate the
human players.
And that gave us the weights for
the policy network.
You know, the network observes a
position, it observes the
professional or the highly
skilled move played in that
position,
and now it's a simple mapping
from the input board
representation to that label.
Now, at that point we had a
policy network that was able to
play in a similar way to very
strong human players.
It could imitate them. So, we
could use that neural network to
very quickly generate more
games.
And so we generated a lot of
games that then allowed us to
train the value network, because
what the value network requires
is, again, an input representation for
the board and the outcome of the
game.
Did black win, or did white win?
And, from very many such pairs, it
can then learn the probability
for any given position for black
or white to win. And this is
already a form of reinforcement
learning because we're learning
the value function that I talked
about earlier.
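
Both of those training steps are plain supervised learning. A sketch, under the same illustrative assumptions as the earlier snippet (here the policy network is taken to output raw logits, and the loss functions are the standard choices, not necessarily the exact ones used in AlphaGo):

```python
import torch.nn as nn

policy_loss_fn = nn.CrossEntropyLoss()  # target: the human move, as a class label
value_loss_fn = nn.MSELoss()            # target: the game outcome, e.g. +1 or -1

def supervised_step(policy_net, value_net, optimiser, position, move, outcome):
    """One training step on a single (position, move, outcome) example."""
    optimiser.zero_grad()
    loss = (policy_loss_fn(policy_net(position), move)      # imitate the move
            + value_loss_fn(value_net(position), outcome))  # predict the winner
    loss.backward()
    optimiser.step()
    return loss.item()
```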
Now, how do we use these neural
networks? We actually use them
much in the way that humans
would use their intuition when
they approach the game.
The problem is there's this huge
search tree when you expand from
a given position: all the
different ways black can play,
then all the counters by white,
and black, and white, and so on.
It's a huge search space, and it
would be hopeless to just try
and plan within that space if
you didn't have any guidance.
But these two neural networks,
they give us that guidance. The
policy network allows us to be
smart about the moves that we
choose.
We don't need to check all the
moves that start from this
position.
We can focus on the promising
ones, on the ones where
the professional or strong Go player
would be likely to play, and that
biases the search in the right
direction.
Now, the problem that remains is
that the game tree is still very
deep.
The typical game of Go can last
200 moves, 250 moves, even longer
sometimes,
so how do we deal with that?
That's where the value net comes
in, because we don't need to go
all the way to the end of the
game to observe its outcome,
if black wins or white wins. We
can stop somewhere in the
middle,
after a few moves, and use the
trained value network to give us
an estimate of how good the
position is for black or for
white. And so, together, the
policy network and the value
network reduce the size of this
huge search tree and allow us
to traverse it and find good
plans in it.
And that's what AlphaGo does.
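
A toy sketch of how the two networks tame the tree: the policy net cuts the breadth (only the most promising moves are expanded) and the value net cuts the depth (positions are evaluated instead of played out). AlphaGo's real search is Monte Carlo tree search, which is considerably more sophisticated; `top_moves` and `play` are hypothetical helpers.

```python
def guided_search(position, depth, policy_net, value_net, top_k=5):
    """Depth-limited search from the side to move's point of view (toy)."""
    if depth == 0 or position.is_over():
        # The value net stands in for playing the game to the end.
        return value_net(position)
    best = float("-inf")
    # The policy net stands in for checking all 361 moves.
    for move in policy_net.top_moves(position, top_k):
        child = position.play(move)
        # Negate: what is good for the opponent is bad for us.
        best = max(best, -guided_search(child, depth - 1,
                                        policy_net, value_net, top_k))
    return best
```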
And it worked.
We weren't always sure it would
work,
but in 2016 we had the match
against Lee Sedol, a phenomenal
9-dan professional Go player
from Korea.
And, at that point, no program
had ever beaten a professional
player at that level
in a match. And, in a very
exciting match, AlphaGo ended up
winning four games to one.
And, if you want to share that
excitement, I really recommend
that you take a look at this
Netflix documentary, "AlphaGo",
which details our path there and
also the drama of the match.
Good.
So much for AlphaGo.
But we weren't quite happy with that
because Go is just one
game, right?
And so we said that intelligence
requires the agent to be
able to solve more than one
game, maybe three games, you
know, not much, but a little
better.
And so I'd like to talk about
AlphaZero and how that
manages not only to play two
more games but also use much
less human knowledge, because
remember AlphaGo was still using
those professional game records
and also some additional
features that we had designed.
And so what's interesting is
that a short story by Stefan
Zweig, "The Royal Game", can
give us some insight into how
AlphaZero approaches this
problem.
You see here on the right,
Stefan Zweig, the author, and
the book tells the story of Dr.
B, an innocent man who has been
arrested and is being held in
solitary confinement. Not unlike
our learning agents,
Dr. B is alone in his small
world and starved for
stimulation.
I quote, "They did nothing other
than subjecting us to complete
nothingness for, as is well known,
nothing on earth puts more
pressure on the human mind than
nothing." While waiting for an
interrogation, Dr. B manages to
steal a book from one of his
captors, a book about the game
of chess.
Eager to engage his mind, Dr. B
devours the book and learns to
play chess. On a makeshift board
in his cell,
he replays the master games
from the book over and over
again.
But after a few weeks, the games
from the book have lost their
novelty. Desperately looking for
further diversion, Dr. B
attempts to play chess against
himself, but he soon realises
that he can only play against
himself if he splits his mind
into two halves, an I (Black)
and an I (White). Only now,
with two agents in play, can true
interaction and learning
happen.
Years later, on a cruise ship,
Dr. B meets the World Chess
Champion of the time, one Mirko
Czentovich, an expert at chess
and only at chess.
In a stunning demonstration of
his skills, Dr. B manages to do
the impossible.
He wins at chess against the
World Chess Champion.
Now, fast forward 80 years and
Stefan Zweig's story becomes
reality in a way that not even
the author could have imagined,
and he could imagine a lot. The
modern Czentovich? Stockfish, the
World Computer Chess Champion
2016, a good old-fashioned AI
for playing chess and only
chess. The modern Dr. B I would
argue? AlphaZero,
an artificial agent that learns
to play chess solely by playing
against itself.
Now, here you see some results.
As white, AlphaZero wins almost
30 percent of its games against
Stockfish, and as black, it
manages to draw most of the time
and even wins a few more games
than Stockfish does.
You have to imagine Stockfish is
a good old-fashioned AI program
that has been designed by
people, by chess experts and so
on, and uses an enormous number
of heuristics to cut down the
search tree and uses all kinds
of domain knowledge about chess.
Now, here you see the development
over time as AlphaZero trains.
These are thousands of steps, and
you see here the Elo number.
That's how we measure success
here.
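
For context, Elo ratings come from chess: the gap between two ratings maps to an expected score. The standard Elo expected-score formula below is general rating-system background, not something derived in the lecture:

```latex
% Expected score of player A against player B under the Elo model:
% a 400-point rating gap corresponds to roughly a 91% expected score.
E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}
```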
And after roughly four hours of
training, AlphaZero surpasses
Stockfish in its chess skills.
So, how does this work?
The trick, of course, is a form
of reinforcement learning and
self-play,
as you may have inferred from
the story. AlphaZero is also all
alone in some sense playing
chess against itself.
And here's how it works.
So, initialised with the policy
and value network, AlphaZero
plays by evaluating the search
tree from a given position and
making its best move, then
taking that next position, again
using policy and value net and
tree search to evaluate its next
move, and so on and so forth.
It plays and plays and plays
and generates a lot of games at
its current level of chess,
which is very low at the
beginning because it's just
starting, and P and V are at this
point almost random. But then -
now we have games generated, and
we can train the policy network
because now we have a position
and we know which move was made,
and we can train the policy
network to imitate that move.
We can do that for the next
position, and the next position,
and so on.
The move made by AlphaZero
previously is the label from
which the policy network learns.
It's basically imitating itself,
augmented by search.
Similarly, we can train the value
network and predict the winner
of these games because we know
those games. We've played them
all the way to the end.
So, for a given position, we know
how it will end.
We can train a neural network
that estimates that. Now, finally,
we put these new policy and
value networks into the tree
search and generate new games.
Let AlphaZero play against
itself,
but now at a higher level
because the new policy network
and the new value network are
better and hence, together with
the tree search, they generate
better moves and we can generate
higher quality games of chess.
That's how it works.
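
In pseudocode, that self-play cycle might look like the sketch below. The helper functions are hypothetical stand-ins for the steps just described, not DeepMind's implementation.

```python
def alphazero_loop(policy_net, value_net, iterations, games_per_iter=1000):
    for _ in range(iterations):
        # 1. Self-play: tree search guided by the current networks
        #    (close to random at the very start) generates games.
        games = [self_play(policy_net, value_net)
                 for _ in range(games_per_iter)]
        for game in games:
            for position, chosen_move, winner in game:
                # 2. Policy target: imitate the move the search chose.
                train_policy(policy_net, position, chosen_move)
                # 3. Value target: predict the eventual winner, which we
                #    know because every game was played to the end.
                train_value(value_net, position, winner)
        # 4. The improved networks now play better games; repeat.
```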
And it actually works not only
for chess but also for Go and
for shogi, and these are just
comparisons to how long it takes
to reach the level of the more
traditional contenders for being
the best programs in this
space.
Now, one thing that I found very
interesting is: How does
AlphaZero reason about chess
positions? In order to
appreciate that, you need to
understand that a classical
chess program on the left
evaluates tens of millions of
positions before it makes each
move.
It expands the tree to that many
nodes. Now, AlphaZero only
expands tens of thousands
of positions, fewer by a factor of
a thousand.
So, its search is much more
focused.
And, of course, that's still a far
cry from how human Grandmasters
operate because they only
evaluate hundreds of positions.
Their intuition for chess is so
great that they are very good at
both selecting the lines they
look at and evaluating the
resulting positions.
But you see here that, in some
sense, we've made a move from
this brute-force approach
towards the smarter way of
solving these problems that
humans employ.
One thing I like is that we
actually discovered some chess
knowledge here - or not we,
actually, AlphaZero did.
For example, a traditional
opening like the English
Opening was discovered by
AlphaZero, and it continues to
play it.
There are other openings, also
known to humans, that were
discovered but then discarded,
you know, not good enough.
AlphaZero understood that that
line, although it's been played
by humans for a long time, is
just not good enough.
I want to give you one example
of play of AlphaZero.
This is my favourite game.
It's called The Immortal
Zugzwang Game. "Zugzwang" is a
German word that indicates a
situation in which it's not
actually advantageous for a side
to move but they'd rather stay
still and do nothing.
But the rules of chess don't allow
that.
So, what you see here is
AlphaZero as white and Stockfish
as black. And you see here that
white's pieces are very active
and black is very crammed into
this corner, with the queen in
the corner, blocked by that
rook,
the king also blocked in, and these
two rooks protecting that pawn.
It doesn't look good. And now we
can take a look at which moves,
or rather which pieces, black
cannot move
in this situation. And the
tragedy of the position, if you
like, is that moving any of
these pieces leads to a loss for
black.
So, you see that AlphaZero has a
real appreciation of these
positional advantages, the
mobility of the pieces and
dominating the board.
Okay, let's conclude. Deep
learning helps us conquer this
huge search space, and self-play
produces the large amount of
data that we need to train these
deep neural networks.
There's also another thing at
play, which we call an automatic
curriculum because, you know, at
the beginning the system
starts to play in a very weak
fashion and, as it becomes
better, its opponent also
becomes better, and so it always
trains against an opponent
that's just at the right level,
because it's training against
itself. And that automatic
curriculum leads to stronger and
stronger play. And the system
discovers new knowledge, which I
think is a beautiful property of
AI systems.
There are still many open
questions. It's just a game, in
relatively small spaces, as
opposed to real-world situations,
but it's an interesting first
step.
There's more material that you
can look at, if you're
interested.
Now, let's move to a more action-
filled game:
learning to play the game of
Capture the Flag.
This is based on a recent
Science paper with Max Jaderberg,
Wojciech Czarnecki, and Iain Dunning,
and it's about playing the game
of Capture the Flag.
So, it's a large-scale
decentralised multi-agent
learning problem because
Capture the Flag is an objective
game where multiple agents need
to learn to interact to play it
well. And you see here a first-
person perspective and here a
top-down view of a typical game
situation.
How does this game work?
Well, we play it as a 2 v 2 game.
You run to the opponent base and
pick up the flag.
You want to bring it back to
your own base, but you need to
make sure that, when you capture
it, your own flag is at
your own base.
Let's take a look at this from
the game's perspective here.
This is the agent perspective,
and this is the top-down
perspective and you'll see how
this game works.
So, you see the blue agents
there.
They're going towards the red
flag to capture it, and now they
want to
bring it back to their own base.
And the trick is that you really
need to have your flag at your
base to score,
and that's why you need
coordination. For example,
now their flag has been stolen
by the red agent, and so they
need to tag that red agent and
get their flag back before they
can score,
score a flag of their own.
Now, the types of environments
that you see here, we have two,
we have an outdoor version, you
know, one that's placed in some
kind of desert setting,
and here we have an indoor
version.
We wanted to show that the
system can handle these two
types of very different
terrains.
And, specifically, what's
interesting here is we use
procedural generation. Whereas
you and I, we would probably
play this game on the same map
every time or maybe just
change maps a few times, what we
require the agents to do here
is to play on a different map
every single time.
And here you see a sample of
these maps.
They all look different.
And what that does is, when
the agents learn to play on
these varied maps, they learn to
generalise.
They learn robust strategies
that work in all kinds of maps,
rather than just rote learning a
particular map.
And so the trick is also to

English: 
train a population of agents, and
you see our training set-up.
There's a population of agents
down here, and they connect to
what we call arenas.
Each one of these is a little
game simulation, and some sample
of this population connects to
these games. Four agents, two on
blue side, two on red side.
They play.
They learn.
They get feedback.
They win or they lose.
And then that stream of
experience is routed back to
those agents, and that's where
the neural networks are updated,
and they learn. And they all
train pretty much independently,
other than through their
interaction in the arena.
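To make the shape of this set-up
concrete, here is a rough sketch
in Python; the Agent class,
play_episode and the skill
updates are hypothetical
stand-ins, not the actual system:

import random

class Agent:
    def __init__(self, agent_id):
        self.agent_id = agent_id
        self.skill = 0.0  # stand-in for the agent's network weights

def play_episode(blue_team, red_team):
    # Stand-in for one Capture the Flag match in an arena on a
    # procedurally generated map; returns +1 if blue wins, else -1.
    blue = sum(a.skill for a in blue_team) + random.gauss(0, 1)
    red = sum(a.skill for a in red_team) + random.gauss(0, 1)
    return 1 if blue > red else -1

population = [Agent(i) for i in range(30)]
for step in range(1000):
    # Each arena samples four agents from the population: 2v2.
    blue_team = random.sample(population, 2)
    rest = [a for a in population if a not in blue_team]
    red_team = random.sample(rest, 2)
    outcome = play_episode(blue_team, red_team)
    # The stream of experience is routed back to the sampled agents
    # only; in the real system this is where the networks are updated.
    for a in blue_team:
        a.skill += 0.01 * outcome
    for a in red_team:
        a.skill -= 0.01 * outcome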
Here's the neural network
architecture that we use.
It's a two-level hierarchy of
recurrent neural networks, a
fast one and a slow one so we
can cover different timescales.
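As a minimal sketch of the
two-timescale idea, assuming the
slow network ticks once every
few steps and feeds its state
back to the fast one (an
illustration, not the published
architecture):

import torch
import torch.nn as nn

class TwoTimescaleRNN(nn.Module):
    # Hypothetical sketch: a fast GRU runs every timestep, a slow GRU
    # ticks every `period` steps and conditions the fast one.
    def __init__(self, obs_dim, hidden, period=10):
        super().__init__()
        self.period = period
        self.fast = nn.GRUCell(obs_dim + hidden, hidden)
        self.slow = nn.GRUCell(hidden, hidden)

    def forward(self, observations):  # observations: (T, obs_dim)
        h_fast = torch.zeros(1, self.fast.hidden_size)
        h_slow = torch.zeros(1, self.slow.hidden_size)
        outputs = []
        for t in range(observations.shape[0]):
            x = torch.cat([observations[t:t + 1], h_slow], dim=1)
            h_fast = self.fast(x, h_fast)
            if (t + 1) % self.period == 0:  # slow timescale update
                h_slow = self.slow(h_fast, h_slow)
            outputs.append(h_fast)
        return torch.stack(outputs)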
And it also learns about the
reward signal because this
problem is very difficult if you
only ever get a reward signal at
the end.
You know, imagine you play a
five-minute game of capture the

English: 
flag, and in the end someone
tells you "oh, you just won" or
"you just lost", and you now need
to figure out, going back in the
game,
what was it that made me win or
what was it that made me lose?
Wouldn't it be much better if
you had some intermediate
rewards like "oh, I just tagged
the opponent,
that must have been a good
thing" or "I just got the flag
off the opponents".
And so that's what happens at
this top level when we learn
those rewards.
We also use a population of
agents because that allows us to
get some robustness towards
different playing styles.
Different agents learn to play
the game in different ways.
And so, when the agents train
against these different agents,
they again need to develop
robust strategies in order to
succeed in this game.
So, here you see the results.
And again we measure agent skill
in Elo.
So, the higher it is, the better.
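Elo is the standard rating
system from chess; for
reference, a minimal version of
its expected-score and update
rules:

def elo_expected_score(r_a, r_b):
    # Probability that a player rated r_a beats a player rated r_b.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, outcome, k=32):
    # outcome: 1.0 for a win, 0.5 for a draw, 0.0 for a loss (player A).
    return r_a + k * (outcome - elo_expected_score(r_a, r_b))

# A 200-point gap means roughly a 76% expected win rate:
print(round(elo_expected_score(1400, 1200), 2))  # 0.76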
And this is how they develop
over time.

English: 
You see here, this is the
baseline, a random agent that
doesn't do much, it's just
jittering about, you know,
randomly. Naive self-play doesn't
quite work, as you can see here,
it's almost as bad as the
random agent.
Here is the baseline of an
average human.
And here's the baseline of a
strong human.
And you see that the agent, the
deep-learning agent developed in
this work, learns and learns and
learns, surpasses the average
human, surpasses the strong
human,
and ends up with a much higher
skill than all of them.
Now, one thing that we found
particularly nice is, when we
tested with humans, we also let
them fill in some questionnaires,
and we asked them, "so, which
one of your teammates was most
collaborative?" you know, "with
whom could you work the best?"
And it turns out that the human
players indicated that they
liked it best to play with the
AI.
You know, they found it reliable

English: 
and strong.
And, you know, they wanted to play
with the AI. It's a nice
finding.
Now, I'd also like to use this
example to show you something
that is maybe a little
underappreciated, and it's the idea of
understanding how these trained
agents that behave in these
arguably quite clever ways in
these environments represent
the world.
And in order to do that, we've
here done a t-SNE embedding,
like a two-dimensional embedding,
of the internal states of the
agents as they play this game,
and we can colour the points by
our knowledge of the game
situation.
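A sketch of this kind of
analysis with scikit-learn's
t-SNE; the agent states and
situation labels below are
random stand-ins for the
recorded data:

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Stand-ins: 2000 recorded internal agent states (e.g. RNN
# activations) and an integer label for the game situation at each.
states = np.random.randn(2000, 256)
situation = np.random.randint(0, 5, size=2000)

# Project the high-dimensional states down to two dimensions.
embedding = TSNE(n_components=2).fit_transform(states)

# Colour each point by the situation we read off the simulator.
plt.scatter(embedding[:, 0], embedding[:, 1], c=situation, s=2)
plt.show()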
For example, here we know that
these points represent
situations in which the agent's
flag is at the base, the
opponent's flag is at their base,
and the agent is in their home
base.
You know, we know that because
we can look at the game, but the
agents also know that. You know,
they know that this is a
particular type of situation, and

English: 
they are represented by internal
activations that are all
similar to one another and that
are different from a different
situation, like here where the
agent's flag is taken, the
opponent's flag is held by the
teammate, and so somehow in their
internal representations they've
learned what a given situation
is like.
And as a consequence, they can
use that to make good
decisions about how to play in
those situations. And good
decisions they make.
Here are some patterns of
behaviour that we've observed, and
here, for example, the home base
defence.
They try to defend their own
home base.
They sometimes camp in the
opponent's home base.
You know, they just wait there
until the flag respawns and they
can steal it, or they
can follow the teammate and
really just work together in the
same area.
And it turns out that this most
advanced agent is actually most
similar in playing style to the
humans that we observed playing.
And I just want to emphasise
that there's a whole dimension

English: 
of research out there where we
not only train these systems, but
we look at their internal
representation.
How do they represent the world?
And we look at their behaviour
and really understand how
they solve the problem, so we
can learn from that but also
ensure that they behave in safe
ways,
for example, if it's a safety
critical problem. Now, we've seen
that in this more complex
multiplayer game the agents can
clearly reach human level.
It takes a lot of compute to
achieve that,
no doubt. Training populations is
important because you want a
diverse training signal, and the
self-play that still worked for
AlphaGo, for AlphaZero, didn't
actually work here, because they
just learned one strategy and
there wasn't enough diversity
and robustness in there.
The second thing, of course, that
makes things robust is that we
have diversity of environments

English: 
because we procedurally generate
them, and we can begin to
understand how these agents
behave and why. There's papers
and blog posts on this that you
can read to pick up more detail.
Now, as the third case study I
would like to go beyond games
and talk about how we can use
deep learning to learn to fold
proteins.
The particular project I'll talk
about is called AlphaFold and
the paper is called "Improved
protein structure prediction
using potentials from deep
learning" and is work by my
colleagues Andrew Senior,
Richard Evans, John Jumper and
James Kirkpatrick, and many
others.
So, what is this about?
What is protein folding?
First, let's understand what
proteins are.
Some of you may know more about
this than I do, but here's the
gist.

English: 
They are the fundamental
building blocks of life.
They carry out all kinds of
functions in our bodies:
they catalyse reactions;
they transduce signals across
the cell membrane;
they regulate genes;
they do cellular transport;
they provide antibodies; and
they're very important for
clinical drugs.
Often they are the target of
particular drugs, but also many
drugs are proteins, and the key
thing we need to know to
understand a protein is its
shape. What's the shape of a
given protein? On the right here
you see this amazing animation
of proteins in action.
And you can see they really act
as molecular machines, and they
can fulfil an amazing diversity
of functions in the body,

English: 
depending on their shape.
Now, the interesting thing is
that, in some sense, the
specification of a protein is
just a chain of amino acids.
It's a sequence of amino acids
from an alphabet of 20 different
amino acids, and what happens is
that these amino acids, they
interact locally to form these
shapes, for example, helices or
sheets, and then these helices
and sheets, they interact more
globally to form the 3D shape of
the overall protein, and then
proteins can interact to do all
of these amazing things.
Now, the problem that we want to
solve is that of protein
structure prediction, and you
have to imagine that these
proteins,
they have a backbone.

English: 
It's basically the
main chain that determines their
shape.
And they have these little side
chains that influence how this
backbone interacts with itself
as it folds back on itself.
And if we can figure out from
the sequence, you know, this is
the sequence of amino acids,
what the 3D shape of the
protein is, then we can
understand what this protein
does, because it acts when it's
in this 3D shape.
You can also imagine that if we
have a shape in mind that we
want to create, it would be
really good to have this kind of
mapping available because then
we could invert it and, from the
desired shape, devise the
sequence of amino acids that we
would need to create in order
to build the thing that folds
into that particular shape. It's
known as the inverse problem.
How can we think about the shape
of these proteins?

English: 
The goal really is to predict
for every atom in this protein
exactly where it ends up when it
folds. But one way of
parameterising this is through the
torsion angles, and once we have
all these 2N torsion angles,
which are the angles at which
these different bonds
connect to one another,
then we know the 3D shape.
Just imagine, if you wanted to
determine the shape of these
things, that someone told you
which way they rotate at each point.
Then you would be able to figure
out where they are.
So, those are the parameters that
we're looking for.
Once we have these angles, then
we should be done. But there is a
paradox, and that is called
Levinthal's paradox and is
basically the following: "Many
naturally-occurring proteins
fold reliably and quickly to
their native state
despite the astronomical number
of possible configurations."

English: 
So, how? If there are so many
ways in which these things can
fold,
how do they find the right way
so that they end up in exactly
the shape that's necessary in
the living organism?
And I've done a little example
here.
So, suppose we have a chain
length of 361 amino acids and
just at any point there are
three different ways in which
they could fold.
Then we would have 3 to the
power of 361, which is roughly
10 to the 172,
configurations in which they
could fold.
Now, imagine these proteins can
wiggle really quickly and they
can explore 10 to the 13
different configurations per
second or 10 to the 20 per year.
That seems like a really quick
search through that space,
right? But it would still take
10 to the 152
years to sample all the possible
configurations.
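The arithmetic is easy to check:

from math import log10

configurations = 3 ** 361
print(round(log10(configurations)))  # ~172, i.e. about 10^172 states

per_year = 10 ** 20                  # ~10^13 per second, as in the lecture
years = configurations / per_year
print(round(log10(years)))           # ~152, i.e. about 10^152 years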
It's a huge space, and for those
of you who know the game of Go,
I've chosen 361
as the example on

English: 
purpose here because that's the
number of vertices on a Go board,
and 3 to the power of
361 is an upper
bound on the number of legal Go
positions, of all the kinds of
ways in which you can have a Go
position. And John Tromp actually
calculated how many legal
positions there actually are in
Go. And, you see, I tend to get
sidetracked to Go because I just
love the game. And he represents
that number as a ternary number,
and a ternary number - because any
of these points can have three
states, empty, black or white - can
be represented as a Go position.
And so this is the Go position
that, if you read it as a ternary
number, represents the number of
legal positions in the game of
Go.
Now, you might be wondering: is it
a legal position?
It is an illegal position.
You notice the black stone here
where it shouldn't be.
And another one here.

English: 
And that's actually accurate
because the majority of these
configurations of a Go board are,
in fact, not legal Go positions.
But back to Levinthal's
paradox, there's a huge search
space but we know that deep
learning can do something about
deep search spaces, so let's do
that.
So, why do we want to use deep
learning for protein folding?
There are experimental methods,
of course, to determine the
structure of proteins, but it's a
very difficult modelling
problem.
We have data available,
150,000 proteins
in the protein data bank, which
was founded in 1971.
So, this is a long, ongoing
process, but we have much less
data than for some of the other
tasks like speech recognition or
image recognition,
so it's a little harder.
There's another advantage. There
is CASP, which is an assessment,
some kind of competition, that

English: 
provides a benchmark for protein
folding,
so there's a way of testing how
well the system does.
So, what should we predict?
It turns out that the 3D
structure of such a protein is
fully described by a
pairwise distance matrix.
So, if you have all of these
points in space, if you know all
the pairwise distances between
them, then you know what that
configuration in space looks
like.
And so the main thing that's
being predicted in this system
is this distance matrix.
Conveniently, it looks a bit like
an image and, you know, images
we're good at addressing with
deep learning.
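That claim, that pairwise
distances pin down the structure
up to rotation, reflection and
translation, is easy to play
with; a small sketch:

import numpy as np

# Stand-in backbone: 64 residue positions in 3D.
coords = np.random.randn(64, 3)

# The pairwise distance matrix: entry (i, j) is |x_i - x_j|.
diff = coords[:, None, :] - coords[None, :, :]
distances = np.sqrt((diff ** 2).sum(-1))

# It is symmetric with a zero diagonal and, like an image, it is a
# 2D array, which is why convolutional networks are a natural fit.
assert np.allclose(distances, distances.T)
print(distances.shape)  # (64, 64)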
Here's how the AlphaFold system
works.
The sequence comes in here, and
we generate certain features
from databases, so it's not quite
just using the raw data but it's
pulling in features about these
sequences from databases. And
then it does its distance
predictions,

English: 
also some angle predictions, and
it produces a score function, and
this score function is a number
that measures, for a given
folding configuration for that
sequence,
how good that folding
configuration is. And it's
differentiable. And if it's
differentiable, we can do
gradient descent and we can
optimise it.
And that's really the key idea
behind this work.
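As a toy illustration of that
key idea; the score function
below is a made-up
differentiable stand-in, not
AlphaFold's learned potential:

import torch

def score(angles):
    # Any smooth, differentiable score of the torsion angles will
    # do for the sketch; lower is better by convention here.
    return torch.sin(angles).sum() + 0.1 * (angles ** 2).sum()

angles = torch.zeros(2 * 361, requires_grad=True)  # phi/psi per residue
optimiser = torch.optim.SGD([angles], lr=0.01)

for step in range(1000):
    optimiser.zero_grad()
    loss = score(angles)
    loss.backward()    # gradients flow because the score is differentiable
    optimiser.step()   # wiggle down in configuration space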
I will not go into the details
of this deep dilated
convolutional residual
network, but my colleagues in the
future lectures will discuss
architectures in greater detail,
and you will have no trouble
understanding what's going on
here
once you've been through their
lectures. Let just one thing be
said: it's a very, very deep
neural network with 220 of these
blocks,
one after the other.
Good. How accurate are the

English: 
predictions?
The first thing we can compare
is ground truth of this distance
matrix with the estimates of the
system, and we find that the
system does a good job at
capturing not only short-range
interactions but also the long-
range interactions.
And then we can compare the
foldings themselves, and these are
good examples where the folding
worked well: you can see here
that blue, which are AlphaFold's
predictions, and green are
reasonably well aligned.
So, the system understands the
gist of these proteins and how
they fold.
The second step is this gradient
descent that we can do. Now that
we have a potential, an energy
function, if you like, that we
want to optimise,
we can use gradient descent on
these angles and just wiggle
down in this configuration space
and you see it here in
action. It's literally optimising
the coordinates of all the

English: 
different bits and folding in
the process, trying to optimise
the score that it has estimated
to be a good potential for this
problem.
You start it multiple times to
discover different local optima, but
overall it's a gradient descent
operation.
Now, I said earlier that one
reason this was such a nice
set-up for deep learning is that
there's a good assessment
exercise, and this is, in this
case, CASP 13, and
it is a competition in which
there are 82 chains that
fold into some 3D structures,
but they are not known to the
community yet. They are kept
secret.
And so these chains are released
one chain per day for a period
of time, and then the
participating teams have three
weeks to return five guesses or
predictions of what they think
that particular chain will fold

English: 
into.
And it's a very popular
competition. Over 90 groups from
different labs across the world
participate, and then they have a
particular scoring mechanism,
which boils down to measuring
how close the predictions are to
the ground truth that has been
determined by those experimental
techniques.
And this is a fantastic piece of
work designing this competition,
and we're indebted to decades of
work here by those who run the
competition and by the
participants,
but, of course, also by those
who do the experimental work of
producing this data, because
that's hard work.
Some people estimate that, for
some proteins, it takes an entire
PhD thesis to get the 3D
structure.
That's a lot of work.
Okay.
So, here are the results.
These are the different teams
that participated.
And, in fact, the deep-learning

English: 
system comes out as the best
performing system in this
exercise.
So, this deep learning-based
distance prediction clearly
works.
It gives more accurate
predictions of contact between
residues, even though it
estimates distance. Previously,
a lot of systems predicted
contact: whether two residues
are close enough to one another,
closer than eight angstroms.
But it also delivers richer
information because, you can
imagine, there's more information
in a distance than in the binary
signal of whether something is
close together or further apart,
right,
we have more information.
And, finally, because it's a
smooth prediction and we get
these distances as real numbers,
it's also a smoother potential
that is easier to optimise, and
we see that's why the gradient
descent for finding the right
configurations works.
There are many limitations
still, of course, the accuracy

English: 
of the system is still limited.
It doesn't work so well on
some proteins or protein
templates. The method depends on
what we pull out of that
database.
So, only because there's a
database of similar structures
can we get the features that
allow us to make those
predictions.
And, also, it only predicts the
backbone, and we then use a tool
called Rosetta to fill in the
side chains.
So, it's one small step in a
problem that has been thought
about and worked on for many
decades, but it shows that deep
learning has something to
contribute to these problems in
science and, specifically, in
biology.
So, those were the three case
studies that I wanted to present:
AlphaGo, the classic, if you
like,
with its extension into
AlphaZero, a board game, a video

English: 
game, the game of Capture the
Flag with more players,
richer interactions, harder-to-
process visuals, and so on.
And, finally, an example from the
world of science, where deep
learning can really make a
contribution to scientific
progress in a field where that
matters because it might help
develop new cures.
Now, I'd like to use the rest of
the time to give you an overview
of the field of deep learning by
going through the different
lecture topics that my
colleagues are going to talk
about.
And here they are, and we start
with number two because we're just in
the process of doing number one.
The first lecture,
after this one, is on the
foundations of neural networks
and will be delivered by
Wojciech Czarnecki.
And it will really answer the
questions: What are neural

English: 
networks? What kinds of
functions can they represent?
How are they trained? You know,
backpropagation and a review of
those ideas; many of you will be
familiar with them.
But I can tell you, when Wojciech
explains these things, then they
become clearer than they have
ever been.
And, of course, also what are the
limitations of neural networks?
Limitations of neural networks
have led to the first neural
network winter many decades ago,
when neural networks with only a
single layer
were not able to solve the
simple XOR problem, and when
only by adding a second layer to
the neural network could the
system solve this kind of
problem.
So, these things are
important to know.
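For reference, a tiny network
with one hidden layer learns
XOR, which no single-layer
network can represent; a sketch:

import torch
import torch.nn as nn

# XOR is not linearly separable, so a single layer fails;
# one hidden layer is enough.
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

model = nn.Sequential(nn.Linear(2, 4), nn.Tanh(),
                      nn.Linear(4, 1), nn.Sigmoid())
optimiser = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.BCELoss()

for step in range(2000):
    optimiser.zero_grad()
    loss_fn(model(X), y).backward()
    optimiser.step()

print(model(X).detach().round().squeeze())  # expect tensor([0., 1., 1., 0.])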
The second lecture after that or
lecture number three will be on
convolutional neural networks
for image recognition, given by
Sander Dieleman. And the idea
here is really to pick up on the
thought that we want to imbue
the neural network with a form

English: 
of prior knowledge, because that
will make learning more
efficient and more data
efficient, and introducing
convolutional neural networks is
one way of doing that.
Convolutions encode a particular
weak prior about how images
behave.
For example, they can encode
translation invariance, you know, no matter
where an object is in the image,
it will still be the same thing.
And we'd like to encode these
things in our neural networks.
And convolutional neural
networks, the brainchild of Yann
LeCun in his LeNet-5 work, really
revolutionised
image recognition because they
made these neural networks
competitive for these tasks.
And, nowadays, every neural
network application in the area
of vision makes use of
convolutional neural networks.
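Strictly speaking, convolutions
are translation equivariant
(shift the input and the output
shifts the same way), which
pooling then turns into
invariance; a quick numerical
check:

import numpy as np
from scipy.signal import correlate2d

# A small image with a bright blob, and the same image shifted right.
image = np.zeros((8, 8))
image[2, 2] = 1.0
shifted = np.roll(image, 3, axis=1)

kernel = np.ones((3, 3)) / 9.0  # any filter works for the point

# Filtering then shifting equals shifting then filtering:
a = np.roll(correlate2d(image, kernel, mode="same"), 3, axis=1)
b = correlate2d(shifted, kernel, mode="same")
print(np.allclose(a, b))  # True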

English: 
Now, in the next lecture, Viorica
Patraucean will talk about
vision beyond ImageNet and
object recognition and talk
about more advanced models.
And this might involve object
detection, semantic
segmentation,
you know, what is what in a given
scene, estimation of optical
flow.
She will also talk about
analysis of videos.
Videos can be viewed just as
stacked images in time, and there
are interesting tasks there.
For example, action recognition:
From a single frame, you might
not know what someone is doing,
but if you see a video of it,
then there's an action going on
that you might want to
recognise.
For example, someone smashing in
a window or doing a sports
exercise.
The next thing is self-
supervised learning.
One problem that deep learning
is plagued with is that
its supervised variant
depends on labels.
You know, when we learn object

English: 
recognition, we need a photo and
a label of what's in it, so to speak,
but in self-supervised learning,
we can learn a lot about the
world without labels.
And this is particularly
interesting if you have multiple
modalities. Think of a video on
YouTube. There's a video stream,
but there's also an audio stream,
and the audio can tell us
something about the video, and
the video can tell us something
about the audio.
And so we can learn
representations from that.
The next lecture is on
optimisation for machine
learning given by James Martens.
So, you can think of optimisation
as the engine that drives the
learning process and, of course,
it's a very old field,
how to optimise functions, and he
will focus in particular on
gradient-based optimisation
methods and their application to
training neural networks.
He'll cover gradient descent,
momentum methods, second-order
methods, and stochastic methods.
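For a flavour of what that
lecture covers, here is gradient
descent with momentum on a
deliberately ill-conditioned
quadratic; all the numbers are
arbitrary choices for the
sketch:

import numpy as np

A = np.diag([1.0, 100.0])  # ill-conditioned: curvatures differ 100x

def grad(w):
    # Gradient of the quadratic loss f(w) = 0.5 * w^T A w.
    return A @ w

w = np.array([1.0, 1.0])
velocity = np.zeros(2)
lr, momentum = 0.009, 0.9

for step in range(200):
    velocity = momentum * velocity - lr * grad(w)
    w = w + velocity

print(w)  # close to the optimum at the origin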

English: 
And, just to give you an example,
this is a visualisation of the
error surface or the loss
surface of a neural network.
You see it's a complex beast.
It has this fine substructure
everywhere, with local optima. And
there's a super interesting
result in particular in this
paper here,
"mode connectivity", that there
are all of these little optima,
these local optima, but there's a
path between them where, when you
walk along the parameter space
along this path, all the
solutions along that path are
relatively good solutions that
generalise well. Those paths connect
those modes and, you know, there's
a lot of complexity here in
those loss surfaces, and we need
to understand optimisation to
become good at training these
neural networks. We'll then move
on to sequences and recurrent
networks, delivered by Marta
Garnelo.
And this is again the idea that
we want to imbue the neural
network with prior knowledge.

English: 
But this time it's knowledge
about time. It's knowledge about
sequences,
maybe the idea that in a
sequence, for what happens now,
more recent stuff matters more
than stuff that happened a long
time ago. And why is this
important?
Well, if you think about it, a lot
of data comes in the form of
sequences: speech, text, DNA
sequences, video, audio. They're
all sequences. We might think of
vectors.
That's like the first thing that
we learn in machine learning, but
almost all the interesting stuff
comes in sequences. And so she
will discuss recurrent neural
networks and also the famous
LSTMs,
long short-term memory, which is
a way of dealing with the
problem of vanishing gradients
in training recurrent neural
networks.
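The vanishing-gradient problem
is visible in a few lines:
backpropagating through T steps
of a simple RNN multiplies the
gradient by the recurrent
Jacobian T times (the tanh
derivative, at most 1, would
shrink it even further):

import numpy as np

rng = np.random.default_rng(0)
# Recurrent weights scaled so the largest singular value is below 1.
W = 0.3 * rng.standard_normal((32, 32)) / np.sqrt(32)

grad = np.ones(32)
for t in range(1, 101):
    grad = W.T @ grad  # one backprop step through the recurrence
    if t % 25 == 0:
        print(t, np.linalg.norm(grad))
# The norm collapses exponentially: the vanishing gradients that
# gated architectures such as the LSTM were designed to address.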
Another interesting task here is
sequence-to-sequence learning.
Suppose you want to translate
one language to another.
That's one sequence to another.

English: 
And so there's ways of training
neural networks to do that as
well.
We'll then move on to deep
learning for natural language
processing,
in some sense a special case of
the previous work. And Felix will
discuss why deep learning is a
good technique for language, and
he'll discuss simple recurrent
neural networks, applied to
language, but also more complex
models up to transformers, which
is one of the most successful
models.
He'll also talk about
unsupervised learning because
not every piece of text is
labeled.
So, what can we learn if we just
have a dump of Wikipedia or of
Reddit? A vast amount of text.
Can we learn something about
language from that without
having labels?
And, finally, he'll also talk
about subtle things like
situated language understanding.
What does the situation of a
particular agent tell us?
The grounding, the interaction

English: 
of language and the world.
Here's just a tiny example of
some recent very exciting work
in language modelling.
You know, where they train this
huge model,
I think on the Reddit corpus of
a lot of data, and it's a system
that you can prompt with some
input text, and it'll then produce a
continuation of that text, and it
produces very fluent text.
I'll just read the completion
but, you know, it's about, the
prompt is about scientists
discovering unicorns, you know,
pretty absurd, but the model just
keeps going.
"The scientist named the
population after the distinctive
horn of each unicorn.
These four-horned, silver-
white unicorns were previously
unknown to science."
You know, it kind of
superficially makes sense.
You listen to it and it sounds
like English, right?
What's going on?
So, then Alex Graves will turn
your and his attention to
attention and memory in deep

English: 
learning,
which are emerging topics that
are very important.
We know, of course, that for human
cognition, attention and memory
are very important, but can
neural networks embody these
ideas?
It turns out that they can, and
even for a normal neural network
you can see that it will pay
particular attention, if you
like, to some parts of the input
rather than others to solve
particular tasks.
We call that implicit attention.
But we can also make that
explicit.
We can build mechanisms that
allow the neural network to zoom
in on particular parts of the
input, to put its attention
there, if you like.
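The core of such an explicit
attention mechanism fits in a
few lines; a sketch of scaled
dot-product attention over ten
input positions:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, keys, values):
    # Scores say how relevant each input position is to the query;
    # the output is the attention-weighted mixture of the values.
    scores = query @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores)  # explicit "zooming in" on the input
    return weights @ values

query = np.random.randn(1, 64)
keys = np.random.randn(10, 64)    # 10 input positions
values = np.random.randn(10, 64)
print(attention(query, keys, values).shape)  # (1, 64)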
And the highlight of that lecture might
well be the idea of the Neural
Turing Machine or differentiable
neural computer, where
there's a neural network
controller that has access to an
external memory.
It can write to it and read
from it, and it can solve
problems in this way. For example,

English: 
it can write the graph of the
London Underground system into
this memory and then answer
questions about it, about the
connectivity of that graph.
So, that should be an interesting
one.
Then we move to unsupervised
learning, in particular
generative latent variable
models and variational
inference, delivered by Andriy
Mnih.
And I already mentioned:
Unsupervised learning is very
important because we don't have
labels for a lot of tasks, and in
particular an interesting model
that he will consider is the
autoencoder model,
the variational autoencoder
model. And the idea here is that,
from data, you might want to
infer some latent variable
responsible for generating that
data.
For example, there's the image
of a digit, and you might want to
determine what the label of that
digit is. But then, given the
label of that digit, you might
want to generate images of those

English: 
digits. And training these
jointly is the model of the
variational autoencoder, and
it's a very powerful model for
unsupervised learning, when you
don't have labels but when you
need your system to learn the
underlying representation.
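A minimal sketch of that joint
training objective; the sizes
are arbitrary, and the loss is
the negative evidence lower
bound:

import torch
import torch.nn as nn

# Minimal VAE sketch for flattened 28x28 digits: the encoder predicts
# the mean and log-variance of a latent; the decoder reconstructs.
enc = nn.Sequential(nn.Linear(784, 200), nn.ReLU(), nn.Linear(200, 2 * 16))
dec = nn.Sequential(nn.Linear(16, 200), nn.ReLU(), nn.Linear(200, 784))

def vae_loss(x):
    mu, logvar = enc(x).chunk(2, dim=-1)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterisation
    recon_loss = ((dec(z) - x) ** 2).sum(-1).mean()
    # KL divergence between N(mu, var) and the standard normal prior.
    kl = 0.5 * (mu ** 2 + logvar.exp() - 1 - logvar).sum(-1).mean()
    return recon_loss + kl  # the negative evidence lower bound (ELBO)

x = torch.rand(32, 784)  # stand-in batch of flattened images
vae_loss(x).backward()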
Then, in lecture 10, we'll continue
with work on unsupervised
learning, with a focus on
representations.
And this will be delivered by
Mihaela Rosca and Irina Higgins,
and they'll ask the question:
What is a good representation?
Which, of course, is task-
dependent, but can we
characterise how we would like
to represent the world? And
they'll argue that unsupervised
learning has the potential to
address a lot of the open
problems that deep learning is
struggling with, for example,
the large amount of data, of
labelled data, that is needed, and
they'll discuss different
approaches to this.
Here's a little example.
There's a data set of 2D
projections of 3D chairs, and, you

English: 
know, they just come in,
and the system is just confronted
with those. How would it
structure its perception of
those?
It turns out that this beta
variational autoencoder is
able to find disentangled
dimensions in this chaos where,
for example, here is a
dimension along which the
rotation of the chair is
discovered as one independent
dimension, the width of the
chair is discovered as another
independent dimension, and the
leg style is discovered as yet
another independent dimension.
So, this algorithm is thrown at
this collection of data, and it
discovers what we would call
natural factors in this data, and
that, of course, can be very
useful for downstream tasks.
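In code, the only change from the plain VAE loss sketched above is that the KL term is scaled by a factor beta greater than one (the value 4.0 below is just an illustrative assumption); this is the pressure that encourages disentangled latent dimensions.

```python
import torch
import torch.nn.functional as F

def beta_elbo_loss(x, x_logits, mu, logvar, beta=4.0):
    # Same reconstruction term as the VAE sketch above...
    rec = F.binary_cross_entropy_with_logits(x_logits, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # ...but the KL term is weighted by beta > 1.
    return rec + beta * kl
```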
In lecture 11,
we'll talk about generative
adversarial networks.
This is delivered by Mihaela
Rosca and Jeff Donahue. And
this is a particularly fascinating recent development,
namely a model, a generative model
for data, that is based on a
little game that is being
played.
It's really a multi-agent thing, a two-agent thing. There are two players: the generator, which generates data, and the discriminator, which tries to find out whether what the generator generated is a genuine image from the data or just something the generator made up. And, by playing
this game together in gradient
space, if you like, the
discriminator becomes better and
better at distinguishing what
the generator generates from
real world data.
But the generator becomes better and better at fooling the discriminator, generating data that really looks like real data.
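A minimal sketch of that game in code might look as follows; the architectures, sizes, and learning rates are illustrative assumptions, not any particular published GAN.

```python
# Minimal GAN training step; all sizes and hyperparameters are assumptions.
import torch
import torch.nn as nn

z_dim, x_dim = 64, 784
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real):
    batch = real.size(0)
    fake = G(torch.randn(batch, z_dim))
    # Discriminator: real images labelled 1, generated images labelled 0.
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(batch, 1)) + \
             bce(D(fake.detach()), torch.zeros(batch, 1))
    loss_d.backward()
    opt_d.step()
    # Generator: try to make the discriminator output 1 on fakes.
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(batch, 1))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```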
And there are now many
variations of this.
Ian Goodfellow started this work
in 2014, and it's one of the very
interesting and hot topics in deep learning.
Now, finally - and I'm very happy that
we have this in the program -
the last lecture will be on the
topic of responsible
innovation and will be given by
Iason Gabriel and Chongli Qin.
And the thought here is that AI
provides powerful tools that are
shaping our lives and our
society, but that, of course, with
the great power that we can
wield here also comes great
responsibility. And we would like
to address that in this final
lecture in two ways.
The first one is about building
safe, robust, and verified AI
systems that do what we expect
them to do.
For example, here you see an adversarial example: a classifier classifies this image correctly as a deer.
But if you add just a tiny bit
of noise to it, the right kind of noise, adversarial noise, it
will then misclassify almost
the same image as a bird.
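The lecture doesn't say which attack produced this deer-to-bird example, but a sketch of one standard method, the fast gradient sign method, shows how little is needed to construct such noise:

```python
# Fast gradient sign method sketch (illustrative; eps is an assumption).
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, label, eps=0.01):
    # Perturb each input dimension by +/- eps in the direction
    # that increases the classification loss the most.
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()
```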
So, these systems are not robust,
and we need to understand the
boundaries of what these systems
can do.
There's a whole field developing
around this idea that we need
our AI systems to be a lot like
normal engineering, reliable and
robust and so on, and we're
beginning to understand how to
do this better and better.
And the second aspect of this
lecture is how to think about
the ethical consequences of
building AI systems. And, more as a joke and a starting point, I've put up here the Three Laws of Robotics by Asimov, the science
fiction writer, because, you
know, already 80 years ago he
was thinking about the
consequences of deploying AI
systems in the world and had
some idea of how they would
need to be thought about, the
ethics of this, and what kinds of
laws they should be following.
And, of course, now we are
confronted with systems that
actually do influence the world.
You know, you have questions of
employment where there's
automation, and AI systems can
now do a lot of the jobs that
humans can do, more and more in
the future probably.
You have the questions of bias
and fairness, that these systems
inherit certain properties from
the data sets that they learn
from that we might not agree
with and that we might want to
change.
Maybe an opportunity to change
some things here.
And so Iason will talk about
these aspects of our
work in AI to close the
series of lectures.
That's all I had to say.
Thank you very much.
Now we have some time for
questions.
Do you have any?
Yes, please.

I wonder: what are your thoughts on what the most important limitations of deep learning are, if there are any?
Yes. So, the question is: what are the most important limitations of deep learning, and are there any? There are plenty, I would say. It's an emerging technology.
If I have to name my number one,
it would be the lack of data
efficiency.
So, with all of these examples
I've shown we had a lot of data
available. In these reinforcement
learning scenarios,
we generated a lot of that data
ourselves.
But in other settings we might
need to draw on the existing
data sets, and they might just be
limited in size.
For example, in the protein
folding situation, we have these
150,000 folded proteins
with a lot of duplicates, but
that's not actually that much
data for what deep learning
requires.
And my colleagues used a lot of
data augmentation techniques to
squeeze the maximum amount of information out of this.
So, that would be my number one,
data efficiency, because, as
humans, we're very data
efficient in our learning
and, if we could approach that
with deep learning, I think that
would be great.
Another one of my concerns that
is a little related to it is
energy consumption because these
computational systems also
consume a lot of energy, and we
know that our brain works on
roughly 20 watts, like, that's a
dim light bulb.
So, how can we bridge that gap?
That's another one of those.
And then, on the side of AI, I
think there's a way to go on
questions like flexibility and
common sense,
you know, what we might
characterise as fluid
intelligence, our ability to
quickly adapt to a new question,
to a new situation, to
understand it, to act
appropriately.
Those are the big areas to work
on.
Plenty of PhD theses out there.

Of the four examples that you gave,
the environments were very much controlled.
Yes, there was a wide variation in the options
that you could choose, but the environments were controlled.
So, the first question or preface that you put forward
was intelligence with all of these different options
and all of these different types of intelligence.
How are these four scenarios or case studies helping you
achieve that wider goal that you prefaced at the beginning?
Yeah,
so, I think they go in the right
directions but they're really only
the beginning of the
path.
The goal, as you correctly characterise it, is that we want this
generality, these many different
things.
And how do these settings help
us? In the Atari games,
there clearly is a set of very
different games that, you know, require very different behaviours.
So, there is some diversity
there. Is it enough diversity?
No, ideally, we would have a less
constrained setting, more real
world, noisier, with more different
challenges.
That's true.
But I think you have to see the
trend.
You know, for example, in
AlphaGo we started with a single
game and a lot of human
knowledge that came into it, and
then we generalised it to three
different games that the system
can now play and, of course, it
could be trained on other games
as well.
So, we expanded the class of
things.
And I think the game we're
playing here is we want to
always expand the class of
things that we can do with our
available methods and, eventually,
have a very large pool of things
that can be done.

Can I ask one follow-up question?
Do you believe that automated driving,
in terms of driverless cars,
if we can solve that -
and we are still some way away from that,
yes, we can do that on the motorways
and highways, but all the different
scenarios you are looking at,
the generality, the nuances, the common sense,
the judgement you need to actually
be operating within that environment -
do you think that if we solve that
we would be a long way to solving
the preface that you outlined
at the beginning?
Yeah, that's a good question.
So, does autonomous driving take
us somewhere near real
intelligence
because it requires the right
reaction in so many different
scenarios?
Another way that we think about this: is autonomous driving AI-complete? If you can flawlessly drive, if you have a self-driving car, does that mean that you have full intelligence?
And I think it's kind of in
between
because there are still
some constraints there.
But it's almost like that
because there's such a wide
variety of different scenarios
as you're describing, and
currently systems don't have the
flexibility to react to these.
So, I don't think the right
approach here is to just add
data and try to sample that
space of situations.
We need to come up with new ideas for how the system can reason about the world, you know, maybe with physical models, maybe with multi-agent models, because
a lot of the predictions we do
in traffic are based on our
understanding of what the other
agents want.
You know, does this agent,
this car, want to overtake me?
What does that bicycle rider
want to do?
Do they want to turn? All of
these things.
So, I think, as we improve our
understanding of how to model
physical systems, how to model
multi-agent systems, and how to
acquire common sense, then we'll
be approaching autonomous
driving.
It's a beautiful example because
it's so clear what we want to do
here in autonomous driving, yet
it is so hard.

I was wondering, will DeepMind be entering the new CASP round this year?
That's a good question.
Are we entering into the new
CASP competition?
I don't know.
But my colleagues are certainly
very passionate about the
problem, and work has continued
on protein folding. For us, this was one step in that work, and we're in it for the long term, so they are certainly interested.

With reinforcement learning, you have two agents learning, one from the other. Isn't there a risk that actually they may not
explore the full space?
So, the question is, if we do
reinforcement learning with, say,
two agents that train against
one another,
will they explore the whole
space of possibilities or could
they get stuck in some
particular part of strategy
space?
So, that's a very real concern.
Even in a single-agent
reinforcement learning setting,
the problem of exploration
versus exploitation is a huge
problem.
So, in order to reach long-term
reward, maximise long-term
reward, the agent needs to
explore the system and figure
out what works and what doesn't.
But, at the same time, it cannot just exploit what it already knows about the system, because, if it only exploits, it will not find out about other parts of that space.
So, even in single-agent RL,
exploration is a huge problem.
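As a minimal illustration of that trade-off, here is the textbook epsilon-greedy rule (an assumption for illustration, not the exploration scheme of the agents discussed here):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon, explore: pick a random action.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    # Otherwise exploit: pick the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```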
In a two-player or self-play kind
of scenario,
it can be a problem.
But we were able to overcome it
with a bit of randomisation in
AlphaZero that somehow led it to
explore a lot of other options.
But in the Capture the Flag work
we went one step further, and we
created a pool of diverse
players that played differently.
And that was an attempt to address exactly the problem we are pointing out, because that forces a single player to have a strategy that can deal with a number of other opponents' strategies and even teammates' strategies.
And that's the same type of
robustness that we're looking
for in other intelligent systems, you know, being able to deal with a large variety of situations, be they environments or opponents or teammates.
That's exactly the question: how can you verify that? How can you check that the space is properly explored?
How do we measure that? One way
of measuring it is by doing
experiments with humans because
humans are incredibly good at
figuring out the patterns and
finding counterstrategies.
And that's why, both in AlphaGo and AlphaZero and also in the Capture the Flag work, we benchmarked with humans, and they did find some interesting strategies. An alternative is
to train yet another agent that
is just designed to exploit the
given agent that we're trying to
test.
It's just designed to find the
weaknesses in it, and if
reinforcement learning works,
then that agent will find
weaknesses in it. And in the
AlphaStar work, where my
colleagues applied deep learning
and RL to the game of StarCraft,
they had explicit exploiter
agents in the pool they were training against, whose sole role was to exploit the agent being trained, to prevent it from developing degenerate strategies.
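As a toy sketch of that exploiter idea, consider matching pennies, where the main agent plays heads with probability p; everything here is an illustrative assumption, far simpler than the actual AlphaStar league:

```python
import random

def best_response(p_main):
    # The exploiter mismatches: if the main agent mostly plays heads,
    # always play tails, and vice versa.
    return 0.0 if p_main >= 0.5 else 1.0

def main_agent_wins(p_a, p_b):
    # The main agent wins a round when the two coin choices match.
    return (random.random() < p_a) == (random.random() < p_b)

p_main = 0.9  # an exploitable, nearly deterministic starting strategy
for _ in range(500):
    p_exploiter = best_response(p_main)  # the exploiter hunts the weakness
    if not main_agent_wins(p_main, p_exploiter):
        # Losing to the exploiter nudges the main agent toward the
        # unexploitable mixed strategy (play heads half the time).
        p_main += 0.01 if p_main < 0.5 else -0.01
```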
Any other questions?

It seems to me that correctly
specifying the reward function
in reinforcement learning is very
important, especially in cases
where you have intermediate
rewards for subgoals, such as in the
Capture the Flag case where you
can get a reward for tagging a player
because, if you get too many
points for that, your agent might
decide to ignore the game and
just go around tagging enemies.
If that's true, how do you think
we can get to a point where we
have a more general approach, where
humans do not have to look at the
game and very specifically create a
reward function?
Yes, that's a good
question.
So, it is a big problem:
how do we specify reward functions such that the
resulting system that optimises
them actually solves the problem
that we're interested in?
And the answer is sometimes
simple. In a game of Go, you know, you want to win rather than lose, so there it's clear. In the game of
Capture the Flag, it's also clear.
You also want to win that game.
But it's such a sparse reward
signal that you want
intermediate rewards.
And the way we did it in that
work - I didn't dwell on that - is
we actually learned a weight for
the different game events that
we can observe, you know,
capture a flag, tag an
opponent, be tagged, your flag
being stolen, and so on. There's
a whole list of game events.
Those are the events that the
real players of this game get
points for in the game.
And we learned a weighting
function for these, and the
weighting function was learned
to optimise for the final
outcome goal.
And so there is a way of
bootstrapping the learning
process, from the final outcome
to denser, more informative
reward signals that the RL
agents can then pick up on.
That's one way of solving it, I
think.
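A bare-bones sketch of that idea, with hypothetical event names and without the outer optimisation loop, might look like this:

```python
import torch

# One learnable weight per observable game event (names are assumptions).
EVENTS = ["flag_captured", "opponent_tagged", "was_tagged", "own_flag_stolen"]
event_weights = torch.zeros(len(EVENTS), requires_grad=True)

def internal_reward(event_counts):
    # event_counts: tensor of shape (len(EVENTS),) saying how often each
    # event occurred at this timestep; the dense reward is their weighted sum.
    return torch.dot(event_weights, event_counts)

# In an outer optimisation loop (not shown), event_weights would be adjusted
# so that agents trained on this dense internal reward win more final games.
```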
Okay,
I'm afraid we need to wrap up.
Our time's up.
I hope we can welcome you to the
next lecture on the basics of
neural networks by Wojciech.
Have a great evening.