Hello everyone and welcome to
the next edition of the UCL
DeepMind lecture series.
I'm Mihaela, I'm a Research
Engineer at DeepMind and a PhD
student at UCL and together with
Jeff, I'm going to talk to you
today about generative
adversarial networks.
So let's start with an overview.
Why are we interested in generative
adversarial networks?
Well, generative adversarial networks
are a type of generative model
and generative models learn a
model of the unknown underlying
data distribution from a set of
samples from our dataset.
So, imagine this very simple one-dimensional example: this is our dataset. We have our points here
and we're trying to answer the
question, what kind of
distribution could have
generated this data? And we can
answer this question in two
ways. Firstly, we can learn an
explicit model of the data, this
kind of probability distribution
shown here, and then we can
answer questions using this
model. We can ask, well, how
likely is it that this point
comes from the original
distribution? And the answer in
this case would be not very
likely. We haven't seen any
samples here and our model thus
has no mass here, but we can
also then sample from this model
just like we usually sample from
probability distributions and
generate new types of data. And
this type of model that models
the probability distribution
directly is what is called an
explicit model. On the other
hand, we can also
learn implicit models. In
implicit models, we don't model
the probability distribution
explicitly. What we learn is a
type of simulator that is able to generate new samples
that have the same statistical
properties as our original data
without being able to model the
distribution explicitly. So now
we have some new data points
shown here in blue that match
the properties of the data. And
importantly we've generalised: we don't simply reproduce the points in red. We are able to generate new points, and we want these new points to capture the statistical structure of our data. And very
likely you've seen generative
models before and you've
probably seen explicit
likelihood models. So the kind
of model that has access to a
probability distribution. And
often these models are trained
by what's called maximum
likelihood. In maximum
likelihood, we train a model to
maximise the probability
distribution of the data under
our model. And such models are
probabilistic PCA, factor
analysis, mixture models and so
on. And you can also train neural-network-based models using maximum likelihood: things like PixelCNN, PixelRNN, WaveNet, autoregressive language models and so on. But when you want to
train latent variable models
with maximum likelihood, things
get a bit more tricky. And
that's when in practice we often
use approximate maximum
likelihood. And in another lecture, Andriy talked to you about how to train variational autoencoders using approximate maximum likelihood. But today
we're going
to talk about implicit models: simulator-like models that just generate new samples without giving us access to likelihoods. And we're going to
focus on one type of implicit
model, specifically generative
adversarial networks. And why
would we want to focus on
generative adversarial networks?
Well, one practical reason is
that they're able to generate
samples that look like this. So
these are samples from BigGAN, a model that Jeff is going to talk to you about later, which was trained on ImageNet, a dataset that has a lot of variety: it has images of boats, birds, dogs, food and so on. The model is able to learn the statistical properties present in the data and generate samples that match them, samples that look very photorealistic. So
these are all generated from
BigGAN.
And this is specifically
remarkable, if we think of how
this progress has been made
through the last few years. So
the original GAN paper in 2014
showed that we can go from
simple images of digits to
images of faces, black and
white, small resolution. But from there a small revolution started, moving faster and faster and generating better and better images. So we go from black and white to coloured images. Then we go to higher and higher resolution pictures of faces. Then we break the ImageNet barrier in 2018: these
are the first models that are
trained on
a dataset that contains such
variety as we've seen in
ImageNet. Then we start to
generate faces at very, very
high resolution with progressive
GANs. This is starting to look
quite photo realistic. Then
BigGAN comes along and we're
generating ImageNet samples, not
only on a high diversity dataset
but also at high resolution,
very high quality. And then we
move on to StyleGAN, published last year, in which the authors showed that you can really generate very high quality samples that look indistinguishable from photos to the human eye. So if you were to ask
me whether this person here
exists and this is a photo or is
this a sample from a GAN I will
not be able to tell the
difference. This looks
incredibly, incredibly
realistic. So this really
inspires us to think, well how
are GANs able to learn this
probability distribution so
accurately that we're able to
generate this high quality data.
And the answer is that they learn to generate data through an implicit model, one without explicit likelihoods, via a two-player game. And our players are a discriminator, which learns to distinguish between real data from our dataset and data generated by our model, and a generator. And the generator learns to generate data that fools the discriminator into thinking it's real. So it has to generate really good quality data, such that the discriminator thinks, well, this looks as good as real data.
So let's look
at our players in a little bit
more detail. So our players are both going to be modelled using deep neural networks. So our generator is
going to have as input latent
noise. So what do we mean by
that? We need in some sense to
model the entropy and the
variety of our data
distribution. And the way we do
that is that we have a
distribution on the input of our
model. Because remember on the
output, the generator now will
not have any distribution. It
will just produce samples as the
output. So if you've seen
something like a variational
auto-encoder you're used to
having a distribution on the
output of the model. Here, we
have absolutely no distribution.
So in order to model the entropy
of the data, we have to have a
distribution on the input. And often this is multivariate Gaussian noise. And
interestingly here, this noise
is often much lower dimensional
than the data, the data is going
to be a high resolution image
while the noise is going to be
something like a hundred or two
hundred Gaussian units. We take
a sample from our latent noise
distribution, we pass that
through our deterministic deep
neural network, that transforms
that distribution to generate a
sample. And that sample can be
images or texts and so on.
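As a minimal sketch of this idea, in PyTorch-style Python (the architecture and sizes here are illustrative assumptions, not the lecture's model): the only distribution lives on the input, and the network is a deterministic transform of that noise.

```python
import torch
import torch.nn as nn

latent_dim = 100
generator = nn.Sequential(                  # deterministic network
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Tanh())         # e.g. a flat 28x28 image

z = torch.randn(16, latent_dim)             # sample from the latent prior
samples = generator(z)                      # samples; no likelihoods anywhere
```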
The discriminator on the other
hand, has a different task. The
discriminator has to answer the
question, given some set of
samples from our data and given
some set of samples from our
model, are these real or are
these generated? So it has to
answer the question of
distinguishing between these two
distributions, the data
distribution and the model
distribution. And perhaps in a
less adversarial view, we can
think of the discriminator as a
teacher, a teacher that learns what you're doing well and what you're not doing well, and tells you how to improve such that you get better and better at generating realistic data, from the generator's perspective. And
from this perspective we can
think of the discriminator as
some sort of learned loss
function, because the discriminator guides the training of our model. But while it guides it, it also improves itself. And in the original GAN paper, this was done via a two-player game.
So now we have a minimisation
with respect to our generator,
this is our model and a
maximisation problem with
respect to our discriminator of
the same value function. And
this value function says, well
make sure that the discriminator
is very good at distinguishing
between real and fake data in a
classification sense. So we're
trying to train a discriminator
as a classifier to maximise the
log probability that the real
data is real and to maximise the
log probability that D predicts
that the generated data is
generated. So D, so far, is a classifier. And this is what the min-max game is telling us: once the discriminator has been updated, we need to train the generator.
And the goal of the generator is
the opposite of the
discriminator. It's a
minimisation problem with the
same objective as the
discriminator, but with a
different sign. And the goal of the generator is to minimise the prediction accuracy of D, in order to make sure that the data it generates is classified as real as opposed to fake.
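Written out, this value function from the original GAN paper is the familiar binary cross-entropy classification objective:

$$\min_G \max_D V(D, G) \;=\; \mathbb{E}_{x \sim p^*(x)}[\log D(x)] \;+\; \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$$

And if we think about this from an algorithmic perspective, how would we implement it? Well, we'll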
implement our discriminator and
our generator as deep neural
networks and we will
train them using stochastic
gradient methods. So to do that, we first have to train our discriminator for a few steps; in practice, this is one or two. So
remember that the min-max game
said, well, I have to maximise
with respect to the
discriminator before training
our generator. That would entail
doing multiple steps of
optimisation. But in practice we
don't really have the resources,
the computational resources to
do that, to update the
discriminator to optimality
every time we want to update the
generator. So we only do a few
steps of gradient descent for
the discriminator. And the way
we do that as well, we sample a
mini batch of data, we sample a
mini batch of noise latents from
our prior. We pass that through
the generator. Now
we also have a mini batch of
samples from the generator and
we update the discriminator by
stochastic gradient methods to
make sure that our loss is being
maximised. So we want again to
make sure that we maximise the
probability that real data is
real and maximise the
probability that fake data
generated by the generator is
classified as generated. Once
we've done this small inner loop
of updating the discriminator,
we can move on and update the
generator. And now the generator aims to make sure that the data that is now generated (we sample a new batch of noise samples, pass it through the generator, and get a new set of generated data) is classified as real by this new, improved discriminator, which we've kept improving in our last stage of the training
of the discriminator. So we have this game in which we alternate: improve the discriminator at distinguishing between real and generated data, then use this new discriminator to update the generator such that the generator generates data that the discriminator deems real.
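A minimal sketch of this alternating loop in PyTorch-style Python; `generator`, `discriminator` (assumed to end in a sigmoid), the optimisers, and `data_iterator` are hypothetical placeholders, and the losses implement the classification objective above:

```python
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, g_opt, d_opt, data_iterator,
             latent_dim=100, d_steps=2):
    # Inner loop: a few discriminator updates (one or two in practice).
    for _ in range(d_steps):
        real = next(data_iterator)              # mini-batch of data
        n = real.size(0)
        z = torch.randn(n, latent_dim)          # mini-batch of latents
        fake = generator(z).detach()            # mini-batch of model samples
        # Maximise log D(x) + log(1 - D(G(z))), i.e. minimise the BCE loss.
        d_loss = (F.binary_cross_entropy(discriminator(real), torch.ones(n, 1)) +
                  F.binary_cross_entropy(discriminator(fake), torch.zeros(n, 1)))
        d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator update: make a fresh batch of samples look real to D.
    z = torch.randn(n, latent_dim)
    g_loss = F.binary_cross_entropy(discriminator(generator(z)), torch.ones(n, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```

So the take-home message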
so far is that GANs are able to
generate high quality samples
through this implicit generative
model, trained as a two player
game, a discriminator that
learns to distinguish between
real and generated data and a
generator that learns to
generate data that looks so good that the discriminator can no longer distinguish between real and generated data.
And we've seen that this is done as a zero-sum game. We have a
minimisation with respect to G
maximisation with respect to D
of the same value function. And
this has a lot of connections
with game theory literature. We
can think of Nash equilibria, we
can think of strategies that the
two players might employ. We can
use things such as fictitious
play to improve our game. But in practice it is perhaps also interesting to think of GANs from the perspective of distance or divergence minimisation. And
that is because we often think
of generative models as doing
distance or divergence
minimisation, and very often our loss function explicitly is a distance or divergence. So
we've already talked about
maximum likelihood. Maximum
likelihood maximises the
likelihood of the data under the
model which is the same as
minimising the KL divergence
between the data and the model.
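Writing the data distribution as $p^*$ and the model as $p_\theta$, this standard equivalence reads:

$$\arg\max_\theta \; \mathbb{E}_{x \sim p^*}[\log p_\theta(x)] \;=\; \arg\min_\theta \; \mathrm{KL}(p^* \,\|\, p_\theta)$$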
And why would we want to do
divergence or distance
minimisation? Well, divergences
and distances give us some
really nice connections to
optimality. If the distance
between two distributions is
zero then we know that the two
distributions are the same. So
from the perspective of
learning, if we've trained our
model to minimise this distance
and our distance is zero, we
know that our model is a perfect
fit of our data distribution,
which gives us a very nice
guarantee. And again, if we look at maximum likelihood, its objective is to minimise this KL divergence, which is the expected value, under the data distribution, of the log ratio between the data distribution and the model. Because we minimise with respect to the parameters of the model P of X, while p-star, the data distribution, is fixed, making this expectation as low as possible means making P of X assign high likelihood to our data, which is very intuitive if we want a model that is able to explain our data. And yes, the KL divergence has the same property: if the KL
divergence between two
distributions is zero, then our
model has learned our data
distribution. But one question that you might have here is: well, if we are able to say this for a lot of distances and divergences, that if they're zero then our model has learned the data distribution, why are we concerned with which divergence or distance we use? And
the answer is that well in
practice our model might be
misspecified and it might not be
able to model the true data
distribution. And this can even be the case for very deep neural network models, because it might still be that our dataset, for example ImageNet, is so complex that we're not able to model the data distribution exactly. And in that case we might want to ask, well, what kind of trade-offs do different divergences have? So for
example, here our data is a
mixture of two Gaussians and our
model is going to be a Gaussian
distribution and the Gaussian
distribution cannot model our
full data distribution because
it's a misspecified model. And
one question that we might have
is well, what will happen if we
train, for example, using the
maximum likelihood KL, so the KL
between the data and the model
and the reverse KL between
the model and the data, because the KL divergence is not symmetric.
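For reference, the two directions are:

$$\mathrm{KL}(p^* \,\|\, p_\theta) = \mathbb{E}_{x \sim p^*}\!\left[\log\frac{p^*(x)}{p_\theta(x)}\right] \qquad \text{versus} \qquad \mathrm{KL}(p_\theta \,\|\, p^*) = \mathbb{E}_{x \sim p_\theta}\!\left[\log\frac{p_\theta(x)}{p^*(x)}\right]$$

And what we see here is that the behaviour is very different. When we use the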
maximum likelihood KL, the objective, remember, is to be able to explain all of the samples from our data. If we sample from our original data distribution, we'll have samples here and samples here, and for a Gaussian distribution to explain both of these peaks, it will have to put mass all around them. Which means that, yes, it will be able to explain the data, but it is also going to have a lot of mass here, where we actually don't have any mass under the original distribution. On the other hand,
if we use the reverse KL,
this is not what we will see.
What we will see
is that the model is going to
focus only on one of the modes.
It's going to be able to explain
that very well, but it's going
to completely ignore the second
mode and if you then query your
model to say, is it likely that
data here comes from the
original data distribution, it's
going to wrongly answer, no,
because it's not able to capture
anything about this mode. So
even with this very simple
example of one dimensional data,
we can see the trade-offs of the kind of divergence that we choose, and that's going to guide us as we go forward. So a natural question now might be: are GANs doing divergence minimisation? We
talked about this two-player game, an optimisation between the discriminator and the generator.
How is that connected to doing
divergence minimisation? And the
original paper showed that yes,
it is connected if the
discriminator D is optimal, so
if we've trained a perfect
classifier to distinguish
between samples from the data
and samples from the model, then
the generator G is minimising
the Jensen-Shannon divergence
between the true and the
generated distributions. And
this is great because it also
gives us this connection to
optimality that we talked about
before. Now if the
Jensen-Shannon between two
distributions is zero, then the
two distributions are the same.
And now we want to understand a
bit more about the
Jensen-Shannon divergence. How
does it behave, for example, in
the case of
the misspecified Gaussian when
our original distribution is a
mixture of two Gaussians? And the
answer is that well, it does a
bit of maximum likelihood and a
bit of the reverse KL because
by definition it is a mixture of
the two. And in practice, the answer depends on how you initialize your model. So if you don't initialize your model close to one of the two peaks, then it's going to find the maximum likelihood solution; otherwise, if you initialize it very close to one, it will revert to the reverse KL. However, in
practice the discriminator is
not optimal. As we've seen from
the algorithmic perspective, we
often have limited computational
resources. We can't train the
discriminator to optimality
every time we update the
generator so that at each step,
the generator is minimising the
Jensen-Shannon divergence. And
even if we did, even if we were to train D to optimality given our data, we still don't have access to the true data distribution, just a finite set of samples from it, our dataset, so we would still not have a truly perfect discriminator. And
we're going to see why that is
important later on. But let's
look at more properties of the
KL and the Jensen-Shannon
divergence and here for
simplicity, I'm going to focus
on explaining this on the KL
divergence, but the same can be
said about the Jensen-Shannon since, as we've seen, the Jensen-Shannon is a mixture of two KLs.
And this property is important because
this has really sparked the
field to perhaps look beyond the
Jensen-Shannon divergence to get
other divergences that we can
use to train GANs. And why is
that? Well, we see here our example, which we're going to run with throughout: a case where we have two distributions with no overlapping support. So what do
I mean by that? Here we have our
data distribution in red and our
data distribution produces
samples here and its PDF is
given by this truncated Gaussian
here shown also in red and we
have our model and our model is
also a truncated Gaussian and we
have a few samples from it here.
One thing that we observe is that there is no place in one dimension where both of them assign non-zero probability: the data only assigns non-zero probability here, but there the model says, well, this is not really likely under the model.
And what happens in this case is
that the KL divergence and the
Jensen-Shannon are going to be
constant. So the KL is going to
be infinity and the
Jensen-Shannon is going to be
log2. And why is that? Well
remember the KL divergence
definition is the expected value
under the true data distribution
of a log ratio. And this log
ratio is the ratio between the
data distribution and the model.
And if we look at this ratio under the data distribution, because this is our expectation, we see that we will have the probability of a sample here under the data distribution, which we can query from here, divided by the probability of that sample under the model. This is where the problem comes from. This probability is zero because the model assigns zero mass here. So this ratio is
infinity so our KL divergence is
going to be infinity. And this
is especially a problem from a
learning perspective because
when we learn a model, we want
to get rewarded if we do
something good, right? So
imagine the case where I've
moved my model a little bit from
here, a bit closer to the data
here. So this is good. The model
is doing something good. It's
going closer to my data
distribution and we would want
the type of loss function that
says, yeah, good job. You're
going in the right direction,
you're doing well. But the KL and the Jensen-Shannon can't do that, because the property that the ratio is infinity still holds: even though you've moved your model closer to the data, you're still at a point where this ratio is infinity, because there's still no overlapping support. So this is why people
thought, well, perhaps we should
try to train GANs that are
inspired by a different
divergence. So the question is: can we choose a different value function V for our min-max game, and will it correspond to a different distributional divergence? And to do that, we have to look at other divergences and distances and see whether we can somehow turn them into a game that we can use for our GAN training. And one very nice
distance is the Wasserstein
distance between two
distributions. It looks slightly
different than the KL. We
already see that there's no
ratio. We have a difference of
expectations here and a
maximisation: just to estimate the distance, we have to solve a maximisation problem, and this maximisation is over 1-Lipschitz functions. 1-Lipschitz functions have to be relatively well behaved, which means that the absolute difference of the function values at two points has to be smaller than or equal to the distance between the two points. So the function can't grow too fast in a particular region, meaning it has to be relatively smooth.
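For reference, this is the Kantorovich-Rubinstein dual form of the Wasserstein-1 distance:

$$W(p^*, p_\theta) \;=\; \sup_{\|f\|_{\mathrm{Lip}} \le 1} \; \mathbb{E}_{x \sim p^*}[f(x)] \;-\; \mathbb{E}_{x \sim p_\theta}[f(x)]$$

And here, when we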
maximise with respect to this set of functions, we're trying to maximise the difference in expectations: the expected value of the function under the data distribution minus the expected value of the function under the model. So
let's look at an example here.
This is our example from before.
Only that here we're not going
to use the PDFs themselves, but
we're going to use samples from
the model. So these are our
samples from our data
distribution, these are our
samples from our model and we're
trying to find the function f
that can separate these
expectations as much as
possible. So here we can see that we can make the function f positive around the data distribution. That expectation is positive because we are evaluating the function at all the points we sample, and all those values are positive. So this expected value is going to be positive. We do the same for the model, but under the model, the function is negative.
So when we take the difference,
the difference is going to be
large, it's going to be
something, a positive, a
positive number minus a negative
number. And importantly, the Wasserstein distance goes down if we have a model that moves closer to the data, even when we don't have overlapping support. Because, remember, this function has to be 1-Lipschitz: it can't grow too fast in a small neighbourhood. So as we're moving closer to the data, we have restricted the amount of growth that this function can have, and thus the difference in expectations is small. So we now
have a distance that has this
property that if we're doing the
right thing, we're getting
rewarded for it, which is great.
Now the question is how do we
turn this into a GAN? So we've
talked so far about estimating
Wasserstein distances and we've
seen that this itself involves
an optimisation over one
Lipschitz functions, but
what we're interested in ideally
is in learning. How do we use
this to learn a probability
distribution or a model that can
generate data for our problem?
So we have now our minimisation
with respect to our generator
again, but now we want to do it with respect to the Wasserstein distance. And if we keep the minimisation in place and just substitute in the definition of the Wasserstein distance that we've seen above, we have this form, and this form already looks very familiar: we have a minimisation and a maximisation. So if we think of our function, which now learns to distinguish between data samples and model samples from an expectation perspective rather than from a ratio perspective like we've seen before, then this function can be thought of as our discriminator.
So now our minimisation problem
with respect to G stays the
same. But we have a maximisation
problem with respect to our
discriminator, subject to the discriminator being well behaved. And this value function looks different, because we're no longer starting from a classification objective, and we no longer arrive at the Jensen-Shannon divergence but at the Wasserstein distance.
But it's something that looks
very similar, right? So now we
have something that allows us to
distinguish between the data
samples and the model, but in a
Wasserstein sense, and we can
use that to train a GAN. And this is what's called the Wasserstein GAN.
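Written out, the Wasserstein GAN objective looks like this (in practice the 1-Lipschitz constraint on D is only enforced approximately: via weight clipping in the original WGAN paper, and via a gradient penalty in later work):

$$\min_G \; \max_{D:\, \|D\|_{\mathrm{Lip}} \le 1} \; \mathbb{E}_{x \sim p^*}[D(x)] \;-\; \mathbb{E}_{z \sim p(z)}[D(G(z))]$$

And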
we can look at other divergences
and distances. One of them is
MMD: maximum mean discrepancy.
And it looks very similar to the
Wasserstein case. Only that now
the optimisation is with respect
to a different class of
functions, functions that are
part of a reproducing kernel
Hilbert space. And if we look at
the behaviour of MMD on our
standard example, we see that it
does the same: the value of the function is positive under the data, and the value of the function is negative under the model. Only the shape of the function looks different, because we're now looking at a different family of functions to estimate our distance. And just like in the
case of the Wasserstein
distance, we can try to turn
this into a GAN. We have a
supremum over a class of
functions, we turned that into a
maximisation over our discriminator, only that now the discriminator has to be part of a reproducing kernel Hilbert space, and we have the loss function as a difference of expectations. And remember we
started talking about the KL
divergence. We started with
maximum likelihood as a very
common objective of training and
KL divergence is a type of
f-divergence, and f-divergences look like this: there's an expected value, an f which is fixed (we know this function for the KL, for example, and so on), and a density ratio. The problem here is that
if we want to train something
like a GAN inspired by
f-divergences, we will encounter
issues because we don't have
access to P of X, we don't have
access to the probability
distribution. So how do we get
around this? Well, we can't just start training models using the f-divergences directly, but we can find a variational lower bound on our f-divergence objective and use that instead. So if you've
seen VAEs before, variational
autoencoders, then there too,
we use a variational lower
bound, and we replace that in
our training objective. And in this case, the f-divergence case, the variational lower bound is telling us to optimise this objective instead. And this
objective now should look very
similar. We have a supremum over
class of functions and we have a
difference in expectations only
that now we also have the convex conjugate of the function f from here.
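Concretely, as in the f-GAN paper, the f-divergence and its variational lower bound can be written as:

$$D_f(p^* \,\|\, p_\theta) \;=\; \mathbb{E}_{x \sim p_\theta}\!\left[f\!\left(\frac{p^*(x)}{p_\theta(x)}\right)\right] \;\ge\; \sup_{D} \; \mathbb{E}_{x \sim p^*}[D(x)] \;-\; \mathbb{E}_{x \sim p_\theta}[f^*(D(x))]$$

where $f^*$ is the convex conjugate of $f$. Things are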
looking very, very similar to
what we've seen before and the
optimal D here is actually the
density ratio that we talked
about before and that we saw
that can cause problems in
practice. And we're going to go
back to this density ratio in a
bit. But importantly now,
because we have the same form
that we've seen in the
Wasserstein and MMD case, we can
also turn this into a GAN, with a just slightly different objective. We still have the convex conjugate of f here, but we can use this to train our model. So, so far, what
we've seen is that we can train
GANs using multiple criteria
which are inspired by multiple
divergences and distances. We started with the original GAN, which corresponds to the Jensen-Shannon divergence. Then we looked at
the properties of the
Jensen-Shannon divergence and
based on that we looked at other
distances and divergences that
maybe have different properties.
Those were Wasserstein and MMD.
And at the end we also asked the
question, well, okay, but how
about the KL divergence,
something that's very used in
practice? Can we train a GAN
inspired by the KL divergence?
And the answer there was also yes.
Now one question that you
might have is why would I train
a GAN instead of doing
divergence minimisation, if
divergence minimisation gives me all these optimality and convergence properties. And the answer is, well, it depends. In practice you
might not be able to do
divergence minimisation or you
might not want to do divergence
minimisation because GANs have
some advantages and we're going
to talk about this now.
So, firstly, remember how we
mentioned just now that the KL
divergence requires knowledge of
this model P of X which we don't
have in the case of implicit
models, of models like GANs. So
if we want to train a GAN
inspired by the KL divergence,
we have to use f-GANs, but now
at least we can train models
that don't have an explicit
likelihood using the KL
divergence,
which is something that we
couldn't do before. Right. So by
using GANs, we've expanded the
class of models that we can
train using KL divergence.
There's also the computational
intractability factor. We've
talked about the Wasserstein
distance and how just finding
the value for the Wasserstein
distance requires an
optimisation problem over a
class of functions, but that is
intractable for complex cases so
you wouldn't be able to do this
at each iteration step to find
the Wasserstein distance and
then use that for training. But
if you use the Wasserstein GAN,
which now will have the same
type of algorithmic
implementation as we've seen for
the original GAN: update the
discriminator a few times, two
or three, four or five times,
and then update the generator,
then you can get around that.
Yeah, you're not doing exact
Wasserstein distance
optimisation anymore because you
haven't solved this optimisation
problem, but you're still doing
something inspired by the
Wasserstein distance, but you
can now train a model. And
remember our problem with the
smooth learning signal, our
problem with the KL divergence
and the Jensen-Shannon and how
that inspired us to look at
other distances and divergences,
but perhaps that's not as big of
a problem in the GAN case as we
originally thought. This idea
that they will not give you any
signal to learn when there's no
overlapping support between the
data and the model. And why is
that? Well, remember our
example, the problem that we
have is that this density ratio
was infinity here. And that
meant that if I moved my model
closer to my data, I'm still not
getting any useful signal. But
in the case of GANs, I'm
approximating this ratio. So
perhaps we're not going to have
the same problems. So if you
look empirically, we can see
that GANs still learn. So in
this paper we showed that if the
data is here and the model is
here, so at initialization
there's no overlapping support
and we train our GAN, the model after a bit of training still learns to match the data distribution. So why is that?
Well, a simple way to think
about this is again, inspired by
the KL divergence, because
that's a simple divergence to
look at, but similarly we can
think about the Jensen-Shannon.
So if we look at the KL
divergence, we look at this
definition again, we have this
true ratio here. That's
problematic, right? Because this
is why we're getting these
problems with the KL divergence.
But in the case where we train GANs, we actually use this lower bound instead; remember, when we talked about f-GAN, we used the bound because we don't have access to P of X? But now we estimate this ratio using our discriminator, and we ask our discriminator to be in a particular family of functions, because we have to represent it somehow.
So that's either a deep neural
network or a function in a
reproducing kernel Hilbert space
and so on. And these functions
are relatively smooth. So we're
approximating our true ratio
with something smooth. And what
happens in practice is that the
smooth functions won't be able
to jump from zero to infinity or
to represent infinity as the
underlying ratio would. So our
standard example, again, we have
our data here. Our model here,
the true ratio here goes to
infinity. It's zero everywhere
else. But our MLP that is used
to approximate our ratio will
not go to infinity. It starts
low and then it starts growing
and growing and growing. It
needs, it knows that it needs to
be higher here, but it won't be
infinity. And the nice thing
about this is that if I move my
model closer to my data, it will
know because there's no jump of
exactly here, you need to go to
infinity. And this is similar if
I use another function class to
represent our ratio. So here if
we're using reproducing kernel
Hilbert space, we see the same
type of behaviour around the
data. We're going to have a
higher ratio, but it's not going
to be infinity. And again, if I move my model closer to my data, then I'm going to get a useful learning signal that says, yeah, good job, you're going in the right direction. And this is why
empirically we've seen that the
GAN could learn, even though we
initialize the
models to not
have overlapping support. So the
crucial idea here is that the
discriminator is a smooth
approximation to the decision
boundary of the underlying
divergence. And we've seen that
with some experiments and with
an explanation of what happens
in the case of the KL
divergence. So in practice GANs
do not do divergence
minimisation because the
discriminator is not optimal. It
doesn't really represent that
true density ratio for example.
But this also means that GANs do not fail in cases where the underlying divergence would, like we've seen in the Jensen-Shannon case. And perhaps another way to
think of discriminators is as
learned distances. So the
discriminator is providing a
loss function to our generator,
but it's something that itself
is learned to provide useful
gradients to our model. And this
is the case both for the
original GAN, the Wasserstein
GAN and so on; they all have this form: minimisation with respect to G and maximisation with respect to D of our value function. But if we think of this bit here, this is the loss function for G, and it is itself learned, parameterised by the discriminator's parameters. Now
the crucial bit here is that we
can use this to tell the
generator through our loss
function what we actually care
about and the way we do that is
by putting the right neural
network features into the
discriminator. So we know that if we're training on image data, we want to use convolutional neural networks, because those are very good at distinguishing between images and learning the right features for that. If we're using audio, we might want to use recurrent neural networks, and so on. So the crucial bit here is that we no longer just use neural network features in our model, but we also use them in our loss.
And now the loss can provide
additional signal to the model
to focus on the right aspects of
the data. And this is something a true divergence cannot do. The learnt "divergence" is not a distance or divergence in a mathematical sense, but it is able to provide you some useful learning signal that you maybe wouldn't get if you were using the KL divergence or something else. So to answer the question
of, well, why would I want to do
GANs as opposed to divergence
minimisation? Well, we see that
GANs provide very good samples
and they are using this learned loss function, where you have this additional knob to tell your model what to focus on. But they're hard to analyse in practice; you have to reason game-theoretically and so on. And in practice there are
no optimal convergence
guarantees because again, the
discriminator won't be optimal.
However, if you do divergence
minimisation, there are optimal
convergence guarantees and easy
to analyse loss properties, but
it's harder to get good samples.
And the loss functions
don't usually correlate with
human evaluations because they
focus on aspects pertaining to
the statistical properties of
the divergence rather than the
modality of the data.
So, the take-home message here is
that, in practice, GANs do not do
divergence minimisation and the
discriminator can be seen as a
learned distance. It's something
that's learned to distinguish
between the data and the model
samples and to provide useful
learning signal to the
generator. And one question that
you might have as well, which
GAN should I use? We've talked
about Wasserstein GAN, MMD GAN,
the Jensen-Shannon GAN, that's
the original GAN and so on.
And empirically it has been observed
that the underlying loss, so the
underlying divergence matters
less than the neural
architectures, the training
regime and the data. And I think, if you're thinking of the importance of the features that the discriminator is learning, the convolutional or recurrent architectures underlying them, and the kind of information that they provide to the generator, that's somewhat intuitive, because you're focusing really on the features that are useful at distinguishing between data and samples. And Jeff is
going to tell you a lot more
about this and give you plenty
of examples of neural
architectures that are used for GANs.
And so far we've talked
about unconditional generative
models. So far we're asking our
generator: generator, please
generate a sample. I'm giving
you some latent noise, generate
something out of it.
But we might want to have a knob
to tune and we might want to
tell the generator: generator,
please generate a cat or
generator, please generate a dog
and so on and for then we have
to change our model a little
bit. So, so far we've talked
about deterministic deep neural
networks that are able to
transform Gaussian noise into
data. But what we want now is to
provide additional input to the
generator to say, well please
generate a dog or please
generate a cat. And we often
provide that in the form of
a one-hot vector. If our conditioning information is a label, we're going to say 1 0 0 0 for dog, 0 1 0 0 for cat, and so on. And this is going to tell the generator what it needs to generate. And
the reason it will listen to
that is because in practice we
also change how the
discriminator's trained. And now
the discriminator also knows
that the generator should have
generated a dog. And if it
generates a cat the generator is
not going to get a good loss for
that. So now it has to listen to the conditioning information as well, because the discriminator's training itself has changed, and this in practice leads to better samples.
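A minimal sketch of this conditioning in PyTorch-style Python (the architecture and sizes are illustrative assumptions): the one-hot label is simply concatenated with the latent noise at the generator's input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, latent_dim = 10, 100

class ConditionalGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # The network sees the latent noise and the one-hot label together.
        self.net = nn.Sequential(
            nn.Linear(latent_dim + num_classes, 256), nn.ReLU(),
            nn.Linear(256, 784), nn.Tanh())     # e.g. a flat 28x28 image

    def forward(self, z, labels):
        one_hot = F.one_hot(labels, num_classes).float()
        return self.net(torch.cat([z, one_hot], dim=1))

# "Generator, please generate class 3":
g = ConditionalGenerator()
samples = g(torch.randn(8, latent_dim), torch.full((8,), 3, dtype=torch.long))
```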
And the BigGAN model that I've shown, for example, which is able to generate very high quality samples on ImageNet, is a class-conditional model.
But sometimes when you train GANs,
even class conditional GANs you
might get something like this.
This is what's called mode
collapse. So here the model, instead of capturing the diversity of the data, is now focused only on a few examples, a few faces. And it's generating them again and again.
And what we would like is a way to
automatically know whether our
model has collapsed or not.
We want to evaluate our samples without looking at them manually at every iteration, and so on.
And in practice that's a bit hard
because the generator loss is
not something very
interpretable. So often when we
train our models, we're used to
our loss going down smoothly.
But because we have this two-player game here, where the generator improves, the discriminator improves and so on, the loss itself, shown here, doesn't really tell us much. So there's been a lot of work trying to answer the question: how can we evaluate GANs?
And this is a very difficult question; even the broader question of how to evaluate generative models is extremely hard. We currently have no metric that is able to capture all the desired properties that we want from our model. So some of these
properties are sample quality,
we want to be able to generate
high quality samples, but we
also want to be able to
generalise. We don't want our model to just give us samples from the original data, because for that we could have just used a hash table and said, give me a sample from the original dataset. And as Irina and I are
going to talk in another
lecture, we're often also using
these models for representation
learning and we might want to
answer the question, how good is
this GAN at representation
learning or how good is this VAE
at representation learning and
so on. And perhaps what we actually want is to evaluate based on the end goal. So what are we trying to do with the generative model? Are we using it for
semi-supervised learning? So are
we using the features for
classification? Then maybe we
should use classification
accuracy. Are we using it for
reinforcement learning? Then
maybe we should use the agent
reward and so on. But in
practice, because that is hard
to do and also more expensive
and complex and it makes it
harder to compare models, what
people often use are
log-likelihoods. So you're
asking your model to explain
validation data that it hasn't
seen and based on that you are
assessing how good your model
is. But GANs are implicit so
we're not able to use
log-likelihoods to evaluate our
GANs. So people have come up
with other metrics to try to
understand how good our samples are.
And one such metric is the
Inception score. So in the
Inception score, what we're
trying to see is that the model
is preserving
the class ratio that we've seen
in the data. So imagine that we
have a dataset that has 50% dogs
and 50% cats. Then you want our model in practice to also generate around 50% dogs and 50% cats. And notice here that the Inception score doesn't care about the individual dogs and the individual cats: they can all be the same; as long as on average we get 50% cats and 50% dogs, the Inception score is happy.
So the way this is done in
practice is that we can use a
pre-trained classifier often on
ImageNet to compare the
distribution of labels obtained
from data with the distribution
of labels obtained from samples
in a KL divergence sense.
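The standard form of the Inception score compares each sample's predicted label distribution $p(y \mid x)$ to the marginal label distribution $p(y)$ over samples:

$$\mathrm{IS} \;=\; \exp\!\Big(\mathbb{E}_{x \sim p_\theta}\big[\mathrm{KL}\big(p(y \mid x)\,\|\,p(y)\big)\big]\Big)$$

And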
this metric is able to capture
sample quality because if the
model is generating garbage, you
won't be able to get anything
useful out of the pre-trained
classifier, so the distribution
of labels coming from samples is
going to be very different than
the distribution of labels
coming from data. It's able to know whether you're fully dropping a class; remember mode collapse, where we've seen that the model can focus on only one or two aspects of the data. So if
you're dropping classes,
for example you're not
generating any cats, the
Inception score's going to
penalise you for this. And it's
also going to penalise you if
you're generating a lot more
dogs than cats for example. It
correlates well with human
evaluation, but it doesn't
really measure anything beyond
class labels. So as we've seen, if you're generating the same dog again and again, the Inception score is going to say: good, I'm happy. And because of
this, people have looked at
other metrics, for example,
Frechet Inception distance and
Frechet Inception distance is
not happy if you're generating
the same dog again and again. It looks both at the labels, in terms of whether we are generating 50% cats and 50% dogs, but also inside the class. And the way it does that is by looking at features of a pre-trained classifier rather than just the output distribution of labels.
So now, instead of comparing in a KL sense, we compare in a Fréchet distance sense: the distribution of layer features obtained from the data and the distribution of layer features obtained from the model. Now we're getting a more fine-grained metric.
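FID fits a Gaussian to those features; with $(\mu, \Sigma)$ denoting the feature mean and covariance for data and model samples respectively, the distance is:

$$\mathrm{FID} \;=\; \|\mu_{\mathrm{data}} - \mu_{\mathrm{model}}\|_2^2 \;+\; \mathrm{Tr}\!\Big(\Sigma_{\mathrm{data}} + \Sigma_{\mathrm{model}} - 2\big(\Sigma_{\mathrm{data}}\,\Sigma_{\mathrm{model}}\big)^{1/2}\Big)$$

So again, we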
can see sample quality because
we're also using a pre-trained
classifier. We're also able to see if we're dropping classes altogether, because the features on average are going to look very different than they would if we were only generating dogs and forgetting about cats.
But it also goes beyond that and
it captures higher level
statistics. But there's a problem with this metric: it has been shown that it's biased for a small number of samples, and KID has been proposed as a fix in practice; see this paper from ICLR 2018 for the fix.
But we also want to
go beyond this. We want to make
sure that our model has not
overfitted and has not just
memorised the data. We want
generative models that are able
to capture the essence of the
underlying distribution and the
statistical properties of the
distribution, but generalise
beyond that. And one way to check this is to search in the data for the closest samples to a given model sample. But we don't want to do this in pixel space, because that's very noisy and not really representative in a semantic sense. So again, just
like we've seen with loss functions, when we used features in our training, or just as we do in our model, we're going to use neural network features for evaluation. So again, you're
using a pre-trained classifier
and we're going to search not in
pixel space, but in the feature
space of this classifier for the
closest images in our dataset to
our sample. So here we have an
example of a sample from BigGAN
and we're answering the question: well, what are the most similar ImageNet images to this sample? And the answer is that, well, there are images of dogs in ImageNet, but this exact dog does not exist in ImageNet. So we have dogs of the same colour, different shapes, different sizes. We have dogs on green backgrounds, but this exact same dog does not exist in
the dataset. So the model has used the training data to learn how to generate dogs, but it generalises beyond what it has seen. And the take-home
message of this part is to
remember that we need multiple
metrics to evaluate GAN samples
because we don't just care about
sample quality, we also care
about overfitting and so on.
And with this, I'm going to hand it
off to Jeff who's going to talk
to you about the GAN Zoo.
Hi. I'm Jeff Donahue. I'm a
researcher at DeepMind and I've
been working on developing and
improving adversarial networks
at scale. I am particularly interested in the application of GANs and other generative models for representation learning, a topic I'll be discussing a little bit later in this lecture.
So now that Mihaela has
given you an overview of the
theoretical underpinnings of
GANs, my goal for the rest of
the lecture is to take you on a
tour of the GAN Zoo to give you
an idea of the kinds of things
that people have been doing to
improve these models from where
they started to the
state-of-the-art now and all the
different domains and problem
settings where these models are
being applied.
A lot of GAN
research has focused on image
synthesis. So we'll start by
walking through the path that
has taken us from applying GANs
to small datasets like MNIST to
large scale image databases like
ImageNet.
And a good
place to start is the original
GAN paper from Ian Goodfellow
and his collaborators. In this
paper, they used relatively
simple data like the MNIST
digits that are referred to in
the title of this part of the
lecture and other datasets like
this faces dataset and the CIFAR
datasets. But they're all
pretty small images with
resolutions of about 32 by 32 or
smaller. In this paper, they
used relatively simple models.
In fact, for these top two
images that you see here,
the models were
multi-layer perceptrons or MLPs.
So they weren't convolutional
and they treated the images as
flat vectors, completely
ignoring the spatial structure
of the images. So there's
essentially no inductive biases in these models. And when you have data that's as relatively simple as this, that turns out to work pretty well: you can see that the digits you get are relatively convincing imitations of the real digits, which you see highlighted in yellow here. These are the
kinds of digits that you can
generate with these kinds of
models. So it worked reasonably
well, but it was mostly just a
proof of concept that this sort
of model could work at all. And
it wasn't really meant to be a
demonstration of everything
these kinds of models were
capable of, which we'll get to
later. So moving on from that,
an extension that you can do to
these models as Mihaela
mentioned in her part
of the talk is to make them
conditional on a class label.
This early work on GANs called
conditional GANs generalises
GANs to the conditional setting
where we have some extra
information associated with each
piece of data, such as a
category ID in this case.
Instead of a category ID, this
could be something as
complicated as an image in
another domain, although in this
work, the conditioning was just
a category ID like cat or dog.
So when you do this on MNIST with the 10 digit labels, zero, one, two, three, four, five, six, seven, eight, nine, you get results like this, where every row is a different conditioning, in this case a digit label. And it turns out that when you give it a two as the label, it produces results that look like a two. So it's great; this works.
Next, we're going to
look at some early work that
actually managed to tackle some
pretty high resolution images
with GANs. There's this work
called LAPGAN by Emily Denton
and her collaborators. So
this work was really cool for a
couple of reasons, but just to
give you an idea of what it does
in terms of the generation
process: basically, they'll start from a tiny image, like a four by four or eight by eight image, and they'll upsample it via Gaussian-style upsampling. So that gives
you a blurry image at
a twice as large resolution. And
from there, what you can do to get a final image is generate the Laplacian. So basically, you can see that to go from this tiny image here to this image, all you have to do is some trivial upsampling operation. But then, to actually fill in the details, you have to produce the Laplacian, which is the difference between the blurry image and the final higher resolution image. So you can add these up to get the final higher resolution image: the blurry image plus the Laplacian. So the discriminator's
job is to take both the blurry
high resolution image and the
difference image, either the
real one or the generated one
and decide whether that pair of
images is real or generated.
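A minimal sketch of this recursive generation process in PyTorch-style Python (the generator interface is a hypothetical simplification; each generator is assumed to output a Laplacian residual given the blurry image and fresh noise):

```python
import torch
import torch.nn.functional as F

def lapgan_sample(generators, image, latent_dim=100):
    # `generators` go from coarsest to finest; `image` is the starting
    # low-resolution image (itself generated in the first LAPGAN stage).
    for g in generators:
        blurry = F.interpolate(image, scale_factor=2,        # trivial 2x
                               mode="bilinear", align_corners=False)
        z = torch.randn(image.size(0), latent_dim)           # fresh noise
        laplacian = g(blurry, z)                             # generated detail
        image = blurry + laplacian                           # add details back
    return image
```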
So this was a really interesting formulation for a couple of reasons, in that it decomposes the problem down to a multistep generation process with multiple discriminators and generators, each one operating at a different resolution. And the discriminators and generators are also conditional, as you have the same piece of conditioning information, the blurry image that we're interested in upsampling. So what you have in the end is this recursive way of going from a small image to a high resolution image. So this
was pretty exciting at the time,
especially because it was the
first GAN
paper to produce relatively high
resolution and convincing
images. And one of the other nice things was that it's not a deterministic upsampling. So you can see on this slide that it's not producing the same high resolution image for each low resolution input image on the left; it's actually producing a full distribution of high resolution images for each low resolution image. And so you have this tiny starting image on the left, then you upsample, upsample, upsample with a GAN until you get to 64 by 64 resolution or whatever. And because it's using random noise at each stage, as in any standard GAN, you wind up with a slightly different high resolution output for whatever tiny input image you started with, every time you re-sample the noise, which is what you want if you have a properly trained and generalising GAN.
Another cool thing architecturally is that this was a fully convolutional generator. So it's taking a blurry, say, 32 by 32 image as input and maintaining that 32 by 32 resolution throughout the network to produce a 32 by 32 Laplacian as output. And the nice thing about that is it allows you to apply the generator to actually any resolution, although it's only going to work really well at the resolution you trained it on. So for
example, in this case they only
trained it on up to 32 by 32
images. But you can keep
reapplying this
recursive upsampling and
Laplacian generation operation,
with the highest resolution
generator that you trained. And then in the end, if you keep doing this, you get what continues to look like higher resolution images, although obviously it's a little bit blurry and not necessarily the best fidelity; but you can't really expect too much more when the model's only ever seen 32 by 32 images.
Moving on to this paper called
Deep Convolutional GANs or DCGANs
from Alec Radford and his
collaborators. So this was
another really exciting paper at
the time because it was a very
simple architecture. It was basically
very similar to the original,
GAN framework, but with
deeper ConvNets. And it used a
batch normalisation, which made
this sort of notoriously
difficult, GAN training
process much smoother than it
was without batch normalisation.
The two networks, the
generator and the discriminator
were both ConvNets, so the
generator was a deconvnet or an
upsampling ConvNet and the
discriminator was a downsampling
ConvNet and it's basically a
five-ish layer network, not too
dissimilar from something like
AlexNet at the time. So when you
apply DCGANs to a dataset of
indoor scenes, you get results
that look like this, which were
at the time at least quite
impressive and exciting. And,
one of the cool things that you can do with a network that's trained this way is take two noise samples, or two Z samples, Z1 and Z2 on this slide. For example, one of them might produce an image of a desk that looks like this, and one of them might produce an image of a bed that looks like this. And
then you can interpolate between
these two Zs in Z space and at
every point in between, you get
what looks like a relatively
realistic and semantically
meaningful result. So of course
it's not perfect, but one thing
that the shows is that the model
is able to properly generalise.
So it's able to turn a dataset
of a hundred thousand or ten
thousand discrete examples into
a continuous distribution of
images. And this also showed
that the model isn't simply
memorising the dataset, because obviously the dataset wouldn't contain an example of the interpolation for any given pair of images.
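As a sketch, interpolation of this kind is just a linear walk in latent space (`generator` here is a hypothetical pre-trained DCGAN generator):

```python
import torch

z1 = torch.randn(1, 100)   # a latent that produces, say, a desk
z2 = torch.randn(1, 100)   # a latent that produces, say, a bed

# Each intermediate point should decode to a plausible image
# if the model has generalised rather than memorised.
images = [generator((1 - a) * z1 + a * z2)
          for a in torch.linspace(0.0, 1.0, steps=8)]
```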
And this is what happens if you
do that same kind of
interpolation thing for faces.
Again, obviously it's not
perfect and there's some kind of
creepy looking results in this
case, but it's still
interesting. One really
interesting observation from
this work is that there appear
to be some meaningful semantics
in the latent space. So basically, in this example, they observed that if you take a latent that produces a man with glasses from a pre-trained GAN model, another latent that produces a man without glasses, and another latent that produces a woman without glasses, and you compute man-with-glasses minus man plus woman, you get a woman with glasses. And that might
remind you a little bit of the
Word2Vec results for language
embeddings if you're familiar
with that work. But what this shows for GANs is that there are directions in the DCGAN latent space that correspond to the presence or absence of glasses, as well as the gender of the subject, which is not something that the model was ever explicitly trained to do; it just sort of learned to sort out these semantic properties and represent them in the latent space in some way, which is really interesting.
We'll talk more about that later.
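In code, the arithmetic from the slide is just vector operations on latents (the `z_*` vectors are hypothetical latents that decode to the images described; to make this robust, the DCGAN paper actually averaged several latents per concept):

```python
import torch

# Hypothetical latents known to decode to the images described.
z_man_with_glasses = torch.randn(1, 100)
z_man_no_glasses   = torch.randn(1, 100)
z_woman_no_glasses = torch.randn(1, 100)

z = z_man_with_glasses - z_man_no_glasses + z_woman_no_glasses
image = generator(z)   # hypothetically decodes to a woman with glasses
```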
Jumping ahead a little bit,
there was a paper in 2018
called Spectrally Normalized
GANs from Miyato and
collaborators and this was
really exciting too. It was the
first real crack at using a
single GAN, a single generator
and a single discriminator to
model this ImageNet dataset with
a thousand classes and 1.2
million images. The main trick in this paper was intended to stabilise GAN training by clamping the first singular value of each discriminator weight matrix to one, so that all the weights of the network had a top singular value of one, which basically means that, no matter what the input to a layer is, the output magnitude is not increased. And the way it's
implemented is that every time you run the discriminator's forward pass, you calculate an estimate of the first singular value of each layer's weights. And because this is a linear operation, you can simply rescale the weight, as shown here, by dividing by its singular value, to get a normalised version of the weights with spectral norm one. So this regularises the discriminator, and they're actually using essentially a linear loss function here, the hinge loss in this
case. So without this regularisation, the discriminator could basically improve its objective just by increasing the magnitude of its weights. But because you do have the spectral norm regularisation, the discriminator has to improve its objective in ways that actually meaningfully improve the gradient signals it passes back to the generator, which is what we want from a discriminator.
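Here is a minimal sketch of that normalisation step, using power iteration to estimate the first singular value; the shapes are illustrative assumptions. (PyTorch also ships a ready-made version as torch.nn.utils.spectral_norm.)

```python
import torch
import torch.nn.functional as F

def spectrally_normalise(W, u, n_iters=1):
    """Estimate the first singular value of W by power iteration, then
    divide W by it so its spectral norm is approximately one."""
    W_mat = W.reshape(W.shape[0], -1)     # flatten conv kernels to a matrix
    for _ in range(n_iters):
        v = F.normalize(W_mat.t() @ u, dim=0)
        u = F.normalize(W_mat @ v, dim=0)
    sigma = u @ (W_mat @ v)               # estimated first singular value
    return W / sigma, u                   # u is reused across forward passes

# Example with an assumed conv weight of shape (out, in, kH, kW):
W = torch.randn(64, 32, 3, 3)
u = torch.randn(64)                       # persistent power-iteration vector
W_sn, u = spectrally_normalise(W, u)
```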
So when this is applied to ImageNet, you get images that look like this, which at the time was particularly impressive because nobody had successfully taken on the full ImageNet dataset with a single GAN before.
In some followup work from the same group, they added this idea of a projection discriminator to handle conditioning. Previously, they either used the kind of input conditioning we saw before, where you would feed the class label, like pizza, as an input to the very first layer, or there was this other variant called ACGANs, or Auxiliary Classifier GANs, where you would train the discriminator as a classifier directly. What this paper proposed instead is the projection discriminator: they learn a class embedding with the same dimension as the discriminator's last hidden layer, and they project the class embedding onto the hidden representation, that is, take a dot product. That gives you a class-conditional realness score as the discriminator's output. So rather than feeding the label in as an input, it becomes part of the output in this case.
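A minimal sketch of that projection output; the feature dimension and class count are illustrative, and features stands for the discriminator's pooled last hidden representation:

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Class-conditional realness score: an unconditional linear score on
    the features, plus a dot product with a learned class embedding."""
    def __init__(self, feat_dim=512, n_classes=1000):
        super().__init__()
        self.linear = nn.Linear(feat_dim, 1)            # unconditional part
        self.embed = nn.Embedding(n_classes, feat_dim)  # class embedding

    def forward(self, features, labels):
        uncond = self.linear(features).squeeze(1)
        proj = (self.embed(labels) * features).sum(dim=1)  # dot product
        return uncond + proj  # the label enters on the output side, not as input

head = ProjectionHead()
scores = head(torch.randn(4, 512), torch.tensor([0, 1, 2, 3]))
```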
And there's a pretty interesting theoretical justification, based on the underlying probabilistic model, for doing it this way. It not only makes sense theoretically but also performs very well empirically, and you see results that look like this, which were even more impressive than the results we saw with SNGAN alone.
One more pretty interesting
innovation in the GAN
architectural space was what's called self-attention. Self-attention is a technique for giving networks the ability to do some sort of global reasoning. It's been applied in a lot of domains, especially in language modelling and machine translation. In the image domain, it basically allows you to learn to measure global statistics about the image. So for example,
this was used in both the
generator and the discriminator.
And if you're the discriminator, you might want to be able to ask questions like: if the tail of the dog is on the left side of the image, is the face of the dog on the right side of the image? That's something you might want to know if you want to tell whether the image is real or fake, and you typically couldn't do something like that with a single convolutional layer, because the kernels are just too small to capture that much of the image. So this resulted in better global coherence across the images that the GAN would generate. And they
also have these nice qualitative
results to visualise what the
model ends up looking at. For example, in this case, it looks like the model decided to compare this area around the head of the dog to this area near the tail of the dog, to make sure that all the dog's body parts are in the right place. You can imagine how that would help the generator learn to produce images with better global coherence. And then at the end
of the day, you get results that
look like this on the ImageNet
dataset, which again was another
advance both qualitatively and
quantitatively in terms of
inception score, compared to
the previous results that we've seen.
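As a rough sketch, here is what a self-attention layer over image features can look like, in the spirit of this line of work rather than an exact reproduction; the channel counts are illustrative:

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """Each spatial position attends to every other position, so the layer
    can relate, say, a dog's tail on the left to its face on the right."""
    def __init__(self, channels=64):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // 8, 1)
        self.k = nn.Conv2d(channels, channels // 8, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # start as the identity map

    def forward(self, x):
        B, C, H, W = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)   # (B, HW, C//8) queries
        k = self.k(x).flatten(2)                   # (B, C//8, HW) keys
        v = self.v(x).flatten(2)                   # (B, C, HW) values
        attn = torch.softmax(q @ k, dim=-1)        # (B, HW, HW) global weights
        out = (v @ attn.transpose(1, 2)).reshape(B, C, H, W)
        return x + self.gamma * out                # residual connection

layer = SelfAttention2d(64)
y = layer(torch.randn(2, 64, 16, 16))
```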
So finally we get to
this project from our group at
DeepMind called BigGAN led by
Andy Brock. The main idea of
this work, which I think I'm
allowed to say
because I was a coauthor on this
paper, was to make GANs
really, really big. And we
wanted to do a big empirical study and sort of digest all of the image GAN research that had been done so far, scale it up as much as we could, and just kind of see where it would take us. So, yeah, BigGANs: we had big batches, big models, big datasets, big high-resolution images. So
the batch size that we used for
our main results was 2048
compared to batch sizes of
roughly 256 that were being used
before our work. And this turned out to be a particularly important hyperparameter, really critical to making these models work as well as they did. And one
hypothesis for why this might've
been so important is that the
ImageNet dataset has a thousand
classes. And if you're doing
mini batch SGD, especially in a
setting that's as unstable as
GAN training still can be, you
really want ideally each class
to be represented in each batch
so that the model doesn't end up
sort of forgetting about classes
that it hasn't seen in a while.
And with a batch size of 2048, it's fairly likely that in any given batch, almost all of the thousand classes will appear, whereas with a batch size of 256, it's obviously impossible for all thousand classes to be in the batch. So we not only
trained on ImageNet, but also on this internal Google dataset called JFT, which has 300 million images. So we sort of used ImageNet as our development dataset when designing these models throughout the course of the research, and then we directly applied the same models to JFT and found that they worked pretty well there too, even though that dataset is, you know, 200 or 300 times larger. So you can see on
the right, the type of
images we get from this kind of
model and another few of them
are here. And so overall
this paper was a really big
empirical study to build up a
reliable kind of recipe for
large scale GAN training. So we inherited quite a few tricks from prior work, but we like to think you can be confident that each one was ablated really well and really turned out to be the best choice in terms of the image fidelity and the quantitative scores that you get. So, among these tricks,
we had the hinge loss, which is basically a linear loss, except that it clamps to a minimum value once the discriminator is correct and sufficiently confident in its correctness; and spectral norm, which we just discussed; as well as self-attention and projection discriminators.
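A minimal sketch of the hinge loss in its standard GAN form, where d_real and d_fake are the discriminator's raw scores:

```python
import torch
import torch.nn.functional as F

def d_hinge_loss(d_real, d_fake):
    # Linear in the score, but clamped at zero once the discriminator is
    # correct with margin 1 on an example (real score > 1, fake score < -1).
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def g_hinge_loss(d_fake):
    # The generator's loss stays purely linear: push fake scores up.
    return -d_fake.mean()

d_loss = d_hinge_loss(torch.randn(8), torch.randn(8))
g_loss = g_hinge_loss(torch.randn(8))
```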
And finally, some tricks that we added to the toolbox relative to previous work included orthogonal regularisation, which sort of enforces that the rows of each weight matrix are orthogonal to one another, so that they're doing different things. We also used skip connections from the noise: there was a direct connection from the noise Z to every layer in the generator's convolution stack. And similarly for the class label in the generator, we learned an embedding that was shared across the different layers, with each layer again having a direct connection from the class conditioning as well.
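A minimal sketch of an orthogonal regularisation penalty of this flavour, pushing the rows of a weight matrix towards pairwise orthogonality; the shapes and the coefficient are illustrative assumptions:

```python
import torch

def ortho_penalty(W, beta=1e-4):
    """Penalise off-diagonal entries of the Gram matrix of W's rows, pushing
    rows towards pairwise orthogonality without constraining their norms."""
    W_mat = W.reshape(W.shape[0], -1)            # flatten conv kernels
    gram = W_mat @ W_mat.t()                     # (out, out) row inner products
    off_diag = gram * (1 - torch.eye(W_mat.shape[0]))
    return beta * off_diag.pow(2).sum()

penalty = ortho_penalty(torch.randn(64, 32, 3, 3))
# Added to the generator's loss alongside the main GAN objective.
```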
One interesting trick that we introduced with this paper was what we call the truncation trick. It's an inference-time trick, so it doesn't affect training at all; it's something you can do with any pre-trained generator when you want to produce samples. Basically, we change the standard deviation of the noise input to the generator, changing the scale of the noise distribution as you can see in the figure here, so it sort of shrinks closer and closer to zero. So if
you watch the animation, we
start with this, you know, wide
distribution and the resulting
images produced for each class
at the beginning of this
animation, like now, are
quite different, but as the
distribution gets
skinnier, the images become more
and more uniform for a given
class. Basically, when you make the distribution really small and near zero, it's going to give you a kind of prototypical or modal example of each class. In this case, for these dogs, that's typically a very well-centred, camera-facing example of each dog, which is inherited from the biases of the dataset, because most people, you know, take pictures of their dogs when the dogs are facing the camera. Whereas if you
keep the noise as it was at training time, as you can see here with Sigma equals one for the Gaussian input to the generator, you get quite a bit more variety. So the truncation trick is really a way to trade off between the variety and the fidelity of the samples that you can generate with these models.
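A minimal sketch of truncated sampling at inference time; resampling out-of-range coordinates is one common implementation, and the threshold here is arbitrary:

```python
import torch

def truncated_normal(shape, threshold=0.5):
    """Sample z ~ N(0, I), but resample any coordinate whose magnitude
    exceeds `threshold`. Smaller thresholds give more prototypical, less
    varied samples; a very large threshold recovers training-time noise."""
    z = torch.randn(shape)
    while True:
        mask = z.abs() > threshold
        if not mask.any():
            return z
        z[mask] = torch.randn(int(mask.sum()))

z = truncated_normal((16, 120))   # 120-D latent, as in BigGAN
# images = G(z)  # hypothetical pre-trained generator
```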
And here's just another example of
what happens with the truncation
trick for some bugs, some
butterflies, kind of the same
thing as we saw for the dogs. So
as I said, the truncation trick
is really a way to trade off
between variety and fidelity. So
what you can do is compute the
inception score and the FID at
every point along this curve of
Sigma values that you can
produce via the truncation
trick. So as Mihaela
explained
earlier, when she was talking
about evaluating GANs, the
inception score doesn't really care about how diverse the samples you produce are within each class; it really just cares how good the samples are for each class, how confident the classifier is in its classifications for each class. So if you just want to maximise inception score, setting the scale to roughly zero is really the best thing you can do. And when you do that, you end up maximising inception score down at this point on the curve here, at around 210 in this case.
But when you do that, you get a relatively bad FID of 30-plus, and higher is worse for FID. On the other hand, if you leave Sigma equal to one, at the other end of the curve here, which is the default Z distribution as it was at training time, you get relatively bad inception scores, roughly 105 or 110, but very good FIDs, as you're capturing more of the intra-class distribution, which FID is a little bit better at measuring.
So, as an alternative and more detailed way to evaluate GANs, you can look at this full truncation curve, whereas previous work had just looked at individual points using the default distribution. It sort of gives you a full frontier of the inception and FID scores across this entire curve.
One more thing that we played with, sort of late in this work, was this different architecture called BigGAN-deep, which you see here. This is a deeper yet more efficient architecture. You can see that a single block has twice as many convolutions in the main block, so there's four of them instead of two, and we had twice as many of these blocks in the BigGAN-deep architecture, so overall it's four times as deep.
The key thing that makes this even more efficient than the original BigGAN is that, and it's not a new idea, we added these one-by-one convolutions that go down to a lower channel count, so the three-by-three convolutions operate in this lower channel-count space. Since the three-by-three convolutions are the most expensive part, it all ends up being a little bit more efficient than the original architecture at the end of the day. And the
nice part is that it also performs better, with inception scores of over 240 at full truncation down here, and FID around six with minimal truncation. Now, this model is definitely not perfect, and a lot of the time the failures are kind of fun to look at as well. For example, this image on the left is what we sort of affectionately refer to as DogBall.
And this is an example of
what we call class leakage. So
according to BigGAN, this image
is an example of a tennis ball.
The reason we think this happens for ImageNet specifically is that there are just so many dogs in the ImageNet dataset; there are roughly a hundred dog classes. So the model is very accustomed to seeing dogs, and it sees them roughly a hundred times as often as tennis balls. So when it sees a tennis ball, it says, you know, hey, that's fuzzy, it's probably a dog, I'm going to put some eyes and a snout on it. This happens at some point in training, and it's not actually from the final converged model, but it's kind of fun to see what happens as the model learns to generate better and better images throughout training. And
other failure modes include classes that are difficult, particularly any class that includes the human face. Now, it could be a little bit that these just seem particularly bad because humans are very sensitive to how good or realistic human faces look, so there's kind of this Uncanny Valley effect, although we're quite a bit off here, I think you'd probably agree. And classes with
really complex structure, like the image of this band here, are also really hard when they have a lot of different objects in the scene. And classes that are underrepresented in the dataset and also have complicated structure, like this image of what I think might be a tuba or a French horn, are hard too; it's just really hard for the model to capture this sort of complex structure without many examples, and especially to generalise to new instances of the class, as you're sort of asking BigGAN to do.
More recent followup work that we did is this work called LOGAN, or latent optimisation GANs. Latent optimisation is an idea intended to improve the adversarial dynamics of the GAN game between the generator and the discriminator. Basically, it uses what's called natural gradient descent to optimise G's latent inputs, the Zs. So it changes the Zs at training time to make the discriminator happier: it does one natural gradient descent step inside the training loop to change Z, and it actually backprops through this entire process. So it's a little more expensive than a standard GAN; it takes roughly twice as much computation time per step, but it results in really significant improvements in BigGAN in terms of the variety and the fidelity that you can get. And it's
particularly noticeable when you
compare along the truncation
curve. So for example, if we
truncate such that
the inception score is roughly 259, you get much better FIDs when you train using LOGAN than with a standard BigGAN-deep: LOGAN is at about FID 8 versus BigGAN-deep at about 28 at the same point. And it's also obvious if you just look at the samples: at this point on the truncation curve, BigGAN-deep is producing basically uniform samples per class, whereas LOGAN still has pretty diverse samples.
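As a sketch of the inner latent update, here is the simpler plain-gradient variant; LOGAN itself uses a natural gradient step, and G, D and the step size here are placeholders:

```python
import torch

def latent_step(G, D, z, step_size=0.9):
    """One inner optimisation step on the latent z before the usual GAN
    update. LOGAN uses a natural-gradient step; this sketch uses a plain
    gradient ascent step on the discriminator's score, for simplicity."""
    z = z.detach().requires_grad_(True)
    score = D(G(z)).sum()                 # discriminator's realness score
    (grad,) = torch.autograd.grad(score, z, create_graph=True)
    # Move z towards higher realness scores; create_graph=True lets the
    # outer training step backprop through this whole inner update.
    return z + step_size * grad
```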
So a parallel line of work to
the BigGAN work and all of the
ImageNet work was this line of
work from Nvidia. The first work in the series was called Progressive GANs. The idea was sort of similar to what they did in LAPGAN, although it's formulated quite a bit differently. The idea here is that, both for efficiency and to get the model to converge dependably, they start off generating at a very low resolution, like four by four. Then, after your tiny image generator has converged, you can add an extra upsampling layer like you see here, and a few extra convolutional refinement layers, to get an eight-by-eight image generator. You wait for that one to converge, then repeat for 16 by 16, 32 by 32 and so on, until you get up to the final resolution you would like to generate. And
in their case, they went to very high resolutions of up to 1024 by 1024, and in the end this resulted in extremely compelling images, at least in this restricted domain of celebrity human faces. You get what look like pretty much photorealistic results of human faces at this very high resolution. You know, at least for me, it's very hard to tell the difference between most of these faces and real human faces. The followup
work from this team was called StyleGAN. StyleGANs were also shown to be capable of generating remarkably photorealistic face images, and in this case they used what was probably a more challenging dataset than the last one, with a lot more variation in the images. The dataset used in the previous work, Progressive GANs, was mostly images of celebrities, whereas this dataset was a lot more diverse and mostly consisted of images of not-so-famous people. So the
interesting thing about the
architecture that they used in
this work was that they had
these structured latent inputs.
So they had the usual global latents, the usual Zs, that you have as inputs to the generator, but they also have these spatial noise inputs. You can see in the image that each column has sort of the same global structure, the same global semantics; this middle column, for example, seems to be a latent corresponding to, you know, young children, and this column seems to correspond to being centred on the right side of the image and looking towards the centre. That's because each column uses the same global latent, while the spatial latent is the same in each row, and the spatial latent seems to mainly control, in this case, the sort of background of the image as well as the skin tone.
So, what the architecture looks like is on this slide. On the left we have the usual flat vector Z, which they explicitly call the latent. It's passed through a sequence of eight fully connected layers, an MLP, to get the final latent vector down here, and then this latent is input into every hidden layer of the generator. But the interesting new piece here is that they also have these pixel noise inputs over here. At every layer you have a single channel of random noise of the appropriate resolution, so four by four, eight by eight and so on, and that noise gets reincorporated at each of these layers. As we saw before, the model ends up using the global latent to control the overall global appearance of the image, while these pixel noise latents are used to control the local variation of the image.
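A minimal sketch of that per-layer noise injection, broadly in the spirit of StyleGAN; the per-channel scaling and the shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class NoiseInjection(nn.Module):
    """Add a single channel of spatial noise, scaled per feature channel,
    to a generator layer's activations; this controls local, stochastic
    detail while the global latent controls overall appearance."""
    def __init__(self, channels):
        super().__init__()
        self.scale = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x, noise=None):
        B, _, H, W = x.shape
        if noise is None:
            noise = torch.randn(B, 1, H, W)   # one noise channel per layer
        return x + self.scale * noise

layer = NoiseInjection(128)
y = layer(torch.randn(2, 128, 8, 8))   # e.g. the 8x8 stage of the generator
```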
And another example of what this
looks like in action is on this
slide. If you freeze the global latents and the coarse-level pixel noise, you can change just the fine, high-resolution pixel noise to get stochastic variations; in this case, controlling the fine differences in how this toddler's hair looks. So I
hope that what you can take away
from this part of the talk is a
couple of things. First, there's been pretty rapid progress in the span of about five years, scaling GANs up from the MNIST digit images that we saw in the original GAN paper to these pretty large scale datasets of high resolution images like ImageNet and the Flickr-Faces-HQ dataset. And the improvements
occurred really in a variety of
different places. It wasn't just
about changing the architecture
or changing the objective. It
was really all of these things
combined. The G and D
architectures have gotten better
and deeper. The conditioning
structure has changed, the
normalisation has improved. We
saw that batch normalisation and
spectral normalisation were
quite helpful. The parameterisation of the discriminator has changed: we started off taking the conditioning vector as an input; now, with the projection discriminator, we project the class embedding onto the hidden representation of the image.
The latent space structure has changed, for example in the StyleGAN paper, where we had the pixel noise latents to control local appearance. The loss functions have changed, which we saw more in Mihaela's part of the lecture. And the algorithms have changed, for example in LOGAN, where we have an inner optimisation of the latents. But while we can produce some pretty convincing images, I'd say the problem is still pretty far from solved.
For example, the
state-of-the-art methods take a
good amount of time and quite a
bit of computation to converge.
And even with BigGANs, you know,
we're still not great at every
single image category. So I hope
this gives you a good idea of
how the research has taken shape
into what the state-of-the-art
is today. And you know, maybe
even inspires you to try your
own ideas and make these methods
work even better.
So, next I want
to talk about an application of
GANs that I'm particularly
interested in, which is the use
of GANs for representation
learning. You'll hear a lot
more about the topic of
unsupervised representation
learning in the next lecture
from Mihaela and Irina. But
for now I'm going to address a
few of the directions that
people have been thinking about
in terms of using GANs in
particular for representation
learning. So just to give a
couple of motivating examples
for why it might be interesting
to use GANs for representation
learning. This is a slide that we saw before, but just to remind you: in the DCGAN work, Alec Radford and collaborators noticed that in the latent space of a deep convolutional GAN, or DCGAN, you can do these kinds of arithmetic operations, indicating that certain directions in latent space correspond to high-level semantic attributes in the observation space, in this case human faces, such as the presence or absence of glasses or the gender of the subject. And all of this arises without the GAN ever being explicitly told about these concepts of glasses or gender. As another motivating
example, I took the BigGAN
architecture and I added an
extra latent variable to the
generator input. So this is
a categorical latent variable
with 1,024 outcomes and it's
just fed into the generator
as a one-hot variable in
conjunction with the regular
continuous latent variable, the
120D Gaussian. And the kind of
things that you get out of this
are pretty interesting. So I trained this without class information; it's unsupervised and unconditional, but it does use this categorical latent variable in place of the usual explicit class label that you'd get in the conditional supervised setting. And it seems to learn to associate this categorical variable with high-level semantic groupings that almost look like image categories. So in this slide, you see about eight randomly chosen outcomes of the categorical variable.
For example, one value of this categorical variable, shown in the first row, corresponds to what looks like sea anemones. Another looks like a certain breed of dog on a sort of grassy green background. Another looks like these kinds of mountainous landscapes.
So this is really cool: you can imagine that, in a sort of idealised case, the dream might be that it learns a clustering all on its own that looks exactly like, say, the 1000 ImageNet categories, or at least that each of these categories might be represented by some combination of these categorical variable outcomes. And if that were to happen, then training a model that could predict this latent variable given an image would be exactly like training a fully supervised ImageNet classifier. And of course, all of this comes for free because it's unsupervised; it's not like the ImageNet dataset, where we had to manually label each of the images with its category ID or, you know, pay somebody to do that. So going towards that
dream, there've been many
attempts to get models that
fulfill this promise of learning
representations using GANs,
completely unsupervised. And
I'll discuss just a couple of
them here. One of the first interesting papers, from a few years ago, was called InfoGAN, or information maximising GAN. Compared to a regular GAN, it adds an inference network to recover the latent code Z given the generator output G of Z, which in this set of experiments is an image of an MNIST digit. What this does is force the generator to use each of its input latent variables meaningfully, in order to maximise the information content about those variables in the images that it outputs. And when you train it with these latent codes, it learns to associate each outcome of the categorical latent variable with a different digit value, and it uses the continuous-valued variables to vary the style, the size and the rotation of the digit. So basically it's using the discrete latent to capture the discrete variation in the dataset and the continuous latent to represent the continuous variation in the dataset. So that's pretty cool.
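A minimal sketch of the InfoGAN-style auxiliary term for a categorical code; the Q head, feature size and category count here are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical pieces: a generator G(z, c), and a Q head that predicts the
# categorical code c from features computed on the generated image.
n_categories = 10                       # e.g. one outcome per MNIST digit
q_head = nn.Linear(128, n_categories)   # on top of shared discriminator features

def info_loss(features, c_true):
    """Cross-entropy between Q's prediction and the code actually fed to G.
    Minimising this (w.r.t. both G and Q) maximises a lower bound on the
    mutual information between the code and the generated image."""
    logits = q_head(features)           # features: (B, 128) from D's trunk
    return F.cross_entropy(logits, c_true)

c = torch.randint(0, n_categories, (8,))     # sampled categorical codes
loss = info_loss(torch.randn(8, 128), c)     # added to the usual GAN losses
```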
One sort of disadvantage of this approach when it comes to representation learning is that you don't have a ground-truth latent associated with real images like you do for generated images. So the inference network you've added here only ever gets to see generated images, where you do have the latent. That might be okay when you have a very simple dataset like MNIST, where the generator is able to capture it almost perfectly, as you can kind of see on this slide. But when you go to something more complex like ImageNet, your generator probably won't be perfect, because ImageNet is still really hard, and then when you go to apply the learned representations, there's going to be a kind of domain shift between the generated images that the inference network has seen and the real images that you want to get feature representations for. So then
comes this other class of
methods that was called either
Adversarially Learned Inference,
ALI, or bi-directional GANs or
BiGANs. And this is sort of an
adversarial approach to jointly
learning to generate data and
learn representations from it.
Compared to a regular GAN, the setup adds an encoder network, which we'll call E for most of this, that learns the inverse mapping of the generator G. So whereas the generator maps from features or latents to images, G of Z, the encoder does the opposite: it maps from images or data X to latents E of X. The other difference from a regular GAN is that you have a joint discriminator, so it sees not only an image or data point, X or G of Z, but also the latent, Z or E of X. These (X, Z) tuples can come either from taking a data point X and passing it through the encoder to get a predicted latent E of X, or from sampling a latent Z and passing it through the generator to get an image G of Z. The discriminator's job is to figure out which of the two generating processes each of its input tuples came from, and the generator's and encoder's job is to fool the discriminator into picking the wrong process.
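A minimal sketch of the joint discriminator's two kinds of input tuples; E, G and D are placeholder modules, and the loss shown is the standard cross-entropy GAN objective applied to tuples:

```python
import torch
import torch.nn.functional as F

def joint_d_loss(D, E, G, x_real, z_prior):
    """Joint discriminator on (X, Z) tuples: (x, E(x)) should be scored as
    one process, (G(z), z) as the other. E, G, D are placeholder modules."""
    enc_scores = D(x_real, E(x_real))       # data passed through the encoder
    gen_scores = D(G(z_prior), z_prior)     # latent passed through the generator
    ones = torch.ones_like(enc_scores)
    zeros = torch.zeros_like(gen_scores)
    return (F.binary_cross_entropy_with_logits(enc_scores, ones) +
            F.binary_cross_entropy_with_logits(gen_scores, zeros))

# The encoder and generator are trained to flip these labels, i.e. to fool
# the discriminator; note that E never sees G's outputs and vice versa.
```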
It might be a little confusing when you first look at this, because it's not entirely clear what the encoder's job is; like, why does it have to produce anything in particular for a given X? Well, it turns out that under this objective of discriminating between these two different types of tuples, there's a global optimum where, if you have a perfect discriminator and the generator and encoder are perfectly satisfying the discriminator, the encoder and generator have to invert one another. So if you pass an image X through the encoder to get a predicted latent E of X, and then you pass that back through the generator, it should perfectly reconstruct the input X. That's the global optimum of this model. And unlike in, say, autoencoders, where you explicitly train for this property by minimising a squared error, in this case the encoder and the generator don't communicate at training time; they never see each other's outputs. It's all done through the joint discriminator: the encoder never sees the outputs of the generator, and the generator never sees the outputs of the encoder. So one thing
that makes this interesting for
feature learning is that the
encoder never suffers from the
domain shift problem I mentioned
before, of having to see these kinds of weird, bad, or at least initially bad, generated images that the generator gives you. It
only ever sees real data, which
is exactly what we want for
representation learning because
it means that there is no domain
shift when we go to apply the
encoder to real images. So in practice, this inversion property, which we proved to be true at the global optimum, doesn't actually hold perfectly. But what you see is that the reconstructions you get from passing X through the encoder and the result back through the generator often capture quite interesting semantics of the inputs. For example, if we look at the digits here, the digit identity between the original data X and the reconstruction G of E of X is often the same: two goes to two, three goes to three, et cetera. So what that
tells you is that the
representation the encoder gives
you is capturing the digit
identity at least to some
extent. And this is all just
from looking at the data. We
never explicitly tell it what a
five looks like and so on.
So you can scale these models up, because the original work we just looked at was at the DCGAN scale. If you apply this in the BigGAN setting, where you have the same generator and discriminator architectures as in BigGAN, and you add an encoder model, something like a state-of-the-art image recognition model, a ResNet-style model, at least as of a few years ago, some very interesting things happen. We call the resulting models, with a few other tweaks that you can read about in the paper, BigBiGANs, naturally. So for
example, if you pass this dog
through the BigBiGAN encoder and
back through the generator to
get a reconstruction, the
reconstruction that you get is
what looks like a pretty similar
dog although with its tongue
stuck out and kind of facing in
a slightly different direction.
This person in a red coat in
the winter becomes a slightly
more zoomed in person, in a red
coat in the winter. So in
general, many of these semantic
properties of the input get
maintained in the
reconstructions, even though the
model is never told what
semantic properties are
interesting. And all of this is happening because the structure of the discriminator is essentially shaping an implicit reconstruction error metric in semantic ways; at least, this is kind of my intuition for what's going on. So the
discriminator is a convolutional network, and we know that convolutional networks are good at predicting semantic attributes of images. So the reconstruction error that we're minimising, implicitly, not explicitly, mind you, emphasises the semantics remaining the same, even if the individual pixel values change quite a lot.
So for example, the model isn't
going to remember exactly what
kind of pizza you gave it, but
it will remember it was some
kind of pizza and it was roughly
in this part of the image. So
it's almost kind of human-like
in terms of what it remembers
about the input image, it has
this sort of fuzzy semantic
memory of what it saw without,
for example, having to remember,
you know, the exact position of
every single blade of grass. And
this is in contrast to the
standard pixel wise
reconstruction objectives where
it's basically forcing the model
to remember every single pixel
value. So this is in some
sense exactly what we want in a
representation learning
objective, which, at least in my opinion, is what makes this an interesting method. And when you evaluate this quantitatively, in the sort of standard setup where you take the encoder, use it as a feature representation, and train a supervised linear classifier on top of that, you get something pretty close to state-of-the-art results compared to all of these self-supervised methods that are very popular these days, and which you'll, I think, hear about in the next lecture. And
another way to see what
representations are being
learned by this method is by
looking at nearest neighbours in
the dataset. So you can take images from the validation set as queries, shown in this left-hand column here, and find the training set images that are closest to them in
BigBiGAN feature space. So in
general, you can see that the
nearest neighbours tend to be
very semantically relevant to
the input image. In fact, you know, with this dog from the validation set here, its nearest neighbour in the training set, shown here, is, I think based on the background, in fact exactly the same dog, even though it's obviously facing a different direction.
And if you just looked at the pixel values, these would be quite different. So it was kind of cool that, out of 1.28 million images in the training set, the nearest neighbour ended up being that same dog at a different angle; it was probably a little bit lucky, but still fun.
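A minimal sketch of that nearest-neighbour lookup in feature space, with made-up feature matrices standing in for encoder outputs:

```python
import torch
import torch.nn.functional as F

# Hypothetical: feats_train holds encoder features E(x) for the training
# set, feats_query for the validation queries; dimensions are illustrative.
feats_train = torch.randn(1000, 2048)
feats_query = torch.randn(5, 2048)

# Cosine similarity in feature space, rather than raw pixel distance.
train_n = F.normalize(feats_train, dim=1)
query_n = F.normalize(feats_query, dim=1)
sims = query_n @ train_n.t()          # (queries, training examples)
nearest = sims.argmax(dim=1)          # index of each query's neighbour
```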
Finally for the last part of the
talk, I just want to give you a
taste of some of the other
modalities and the different
problem settings that people are
trying to tackle using
generative adversarial networks.
So, starting with a couple of these
in the image space, one of the
coolest lines of work in my
opinion, started with this paper
called Pix2Pix by Phil Isola and
his collaborators. And what they
did in this setting was train a
generator to translate between
images from two different
domains. So for example, if you
had satellite images like these
and you wanted to be able to
automatically translate those images into roadmap-type images like you see here. The way they do this in Pix2Pix is to take all of these paired examples of images, the satellite image view and the corresponding map view of the same area, and train a conditional GAN that takes the aerial view as input and produces the map view as output. So the way you train
this thing is you have a
standard GAN objective. You have
a discriminator that says, does
the output of the generator look
like a map view that I've seen
before? But you also have this L1 reconstruction error: since you have a ground truth for what the map view is supposed to look like, you can use an L1 pixel-wise reconstruction error to tell the generator exactly what its output should look like for this input.
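A minimal sketch of the combined generator objective; the module names are placeholders, the L1 weight of 100 follows the paper's setting, and the real system also uses a PatchGAN discriminator, which this glosses over:

```python
import torch
import torch.nn.functional as F

def pix2pix_g_loss(D, G, aerial, real_map, l1_weight=100.0):
    """Generator loss: fool the conditional discriminator, plus an L1
    reconstruction term against the ground-truth paired output."""
    fake_map = G(aerial)                       # translated image
    scores = D(aerial, fake_map)               # conditional discriminator
    adv = F.binary_cross_entropy_with_logits(scores, torch.ones_like(scores))
    recon = F.l1_loss(fake_map, real_map)      # pixel-wise L1 to ground truth
    return adv + l1_weight * recon
```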
So basically it's kind of like a traditional supervised learning setup, and you can see that this works in a number of domains: as you can see on the slide, labels to street scenes, or edges to photographic images of purses, for example. So it's quite cool. But in
the more general setting, you might not actually have paired examples. For example, if you want to train a GAN that translates between images of horses and zebras, or vice versa, you're probably not going to have paired images where all the horses and all the zebras are in the exact same positions in the image, like we assumed we had in the Pix2Pix work that we just
talked about. So enter this method called CycleGAN, where you want to be able to translate between two different domains in an unsupervised way, without paired samples between the domains. The high-level idea is to enforce a property they call cycle consistency, in addition to all the normal GAN objectives, so it's still a GAN. You start with an image X in domain A, say an image of zebras, then you translate it to domain B, say an image of horses, and then translate back to domain A, back to zebras; the zebra image you get after that round trip should look pretty much exactly like the zebra image you started with. That gives you an idea of how the method works. And as a result, you
can basically translate between
any two domains that have sort
of reasonably similar
information content, such as
going from summer scenes to
winter scenes, horse scenes
to zebra scenes, or photographs to the styles of different artists. So this is a really cool approach; it's almost, you know, a little bit magical that it works, and it produces some really compelling results.
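A minimal sketch of the cycle-consistency term; G_ab and G_ba are placeholder translators between the two domains, and the weight is the commonly used value from the paper:

```python
import torch
import torch.nn.functional as F

def cycle_loss(G_ab, G_ba, x_a, x_b, weight=10.0):
    """Translate A->B->A and B->A->B; each round trip should reconstruct
    the original image. Added on top of the usual GAN losses for both
    translators."""
    forward_cycle = F.l1_loss(G_ba(G_ab(x_a)), x_a)    # zebra->horse->zebra
    backward_cycle = F.l1_loss(G_ab(G_ba(x_b)), x_b)   # horse->zebra->horse
    return weight * (forward_cycle + backward_cycle)
```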
Now, I'm going to touch on a little bit
of work using GANs for audio
synthesis. WaveGAN, on the left here, was one of the first attempts to produce raw audio waveforms using GANs, and they showed that, for example, you can train unconditional GANs to produce reasonable one-second clips of piano music or human speech. MelGAN was work on text-to-speech that takes Mel spectrograms as input and produces raw speech audio as output. And then there was this other text-to-speech work from our team at DeepMind called GAN-TTS, where we take linguistic features aligned in time as input and also produce raw speech audio as output. Both of these text-to-speech methods work reasonably well for speech synthesis, which is pretty exciting because they're also quite efficient relative to many existing state-of-the-art approaches to text-to-speech.
So, in addition to images, people
have also used GANs to generate
videos and predict future frames
of videos. You can apply a lot of the same toolbox that we've used for images to videos as well, since within a frame you have the same two-dimensional structure that we have for images; a frame is an image. But you also have a third dimension, time, and that turns out to make this problem a bit different and arguably quite a bit harder than it is for images: partially just because of the computational resources it takes to store and generate videos versus still images, but also because humans are quite sensitive to unrealistic motion, so it's important to get that right in order to have reasonably convincing results. In all three of the methods on the slide, a lot of work has gone into making this computationally feasible.
One thing that we did in DVD-GAN, for example, in the middle here, and which was further developed in TriVD-GAN, was to decompose the discriminator into two separate discriminators, neither of which sees all of the pixels in the video, so it ends up being computationally feasible that way. There's one discriminator that we called the spatial discriminator: it operates on individual full-resolution frames, but it only sees a few of them, a subset, and it basically ensures that each frame looks coherent independently. And then there's another discriminator, the temporal discriminator, that sees multiple frames, but spatially downsampled, so it also doesn't see all the pixels. That one ensures fluidity over time. Together, that takes the problem from almost computationally infeasible to fairly feasible.
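A minimal sketch of how the two discriminators can see different slices of a video tensor; the frame count, downsampling factor and tensor layout are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def split_video_views(video, n_frames=8, scale=0.25):
    """video: (B, T, C, H, W). The spatial discriminator gets a few random
    full-resolution frames; the temporal one gets all frames, downsampled."""
    B, T, C, H, W = video.shape
    idx = torch.randperm(T)[:n_frames]          # random subset of frames
    spatial_view = video[:, idx]                # (B, n_frames, C, H, W)
    flat = video.reshape(B * T, C, H, W)
    small = F.interpolate(flat, scale_factor=scale)
    temporal_view = small.reshape(B, T, C, int(H * scale), int(W * scale))
    return spatial_view, temporal_view          # neither sees every pixel

sv, tv = split_video_views(torch.randn(2, 16, 3, 64, 64))
```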
And finally, just to give you a final taste of the many domains in which people are applying GANs, there's reinforcement learning. This is work on using GANs for imitation learning, called Generative Adversarial Imitation Learning, or GAIL. Essentially, it uses a GAN-like method to learn a generator, which in this case ends up being a policy, that learns to imitate expert demonstrations by fooling a discriminator whose inputs are state-action pairs. It addresses many of the typical problems people see with standard behavioural cloning methods in reinforcement learning. There's
work on using GANs for image editing, so that amateur artists, for example, can specify just the coarse layout of a scene without having to actually paint every single detail, and then the GAN can go in and fill in the low-level details, with some pretty nice looking results. And they have a pretty fun demo that you can try out online if you're interested.
There's work on using GANs for program synthesis. There's this work from DeepMind called SPIRAL, where you have a generator that, instead of specifying each pixel value, has to specify individual actions, like the brush strokes in a painting program. So it has to produce these discrete instructions, and you can't directly backprop through this generation process like you can in standard image generation GANs, so you end up having to use a reinforcement learning approach. And you can imagine applying this to all sorts of different types of programs, not just drawing ones.
There was a really cool piece of work recently called Everybody Dance Now, which was used for motion transfer. You could take photos of somebody who's not a very good dancer in different positions, and map the movements of a professional dancer onto their body, so it looks like they have, you know, professional-level dance skills. If you haven't seen the video demo of this already, you really have to go look it up and watch it, because it's pretty amazing and super entertaining. GANs have also
been applied to domain
adaptation. Domain adaptation, if you don't know, is the problem where, say, you might have a bunch of labelled images of things happening during the day, in daylight, and you want to train a classifier on that data and then apply it to images of things happening at night. By default this won't work very well, as there's going to be a domain shift between day scenes and night scenes, and there are different methods of alleviating that problem, some of which use GANs, like this one here.
And finally, there are a number of artists using GANs for different kinds of human-machine collaborative artwork, and they produce some really compelling art this way. This is just one example of that, called Learning to See, from the artist Memo Akten, whose work you should definitely check out if you're interested. Cool. So
thank you. I hope this lecture
has given you a good idea of the
broad array of things that
people are doing with GANs. And
I hope this might even inspire
you to look further into some of
these applications or try some
new applications of your own.
Thanks.
