Hello and welcome to
lecture 11 in the UCL DeepMind
lecture series on topics in deep
learning. My name is Andriy Mnih
and I'm a Research Scientist at
DeepMind. I work on generative
modelling, variational inference
and representation learning.
This lecture will cover latent
variable models as well as
various types of inference and
in particular variational
inference.
So the lecture is
structured as follows: I will
start by introducing generative
modelling and covering the three
major types of generative models
used in deep learning.
Then I will focus on
latent variable models,
explain what they are
and why inference is so
important for them.
Then we will cover
a special case of latent
variable models, invertible
models, where we can do exact
inference.
Then we will move on
to intractable models where
exact inference is not an option
and we will look at variational
inference for training those
models.
Variational inference
requires estimating gradients of
expectations, which is not a
trivial problem so then we will
look at how to estimate these
gradients.
And finally, we will
look at a modern application of
variational inference to
powerful models, which results
in Variational Autoencoders.
So, let's look at
generative modelling.
What are generative models?
Well, generative models
are simply probabilistic models
of high-dimensional data.
So, conceptually they describe the
process, the probabilistic
process of generating an
observation. And we can think of
them as describing mechanisms of
generating more data points.
And the key
distinction between generative
models and other probabilistic
models is that the
distributions we are
modelling are really
high-dimensional.
So, in classic
settings like classification or
regression, you're basically
modelling a one-dimensional
output distribution while in
generative modelling, you are
dealing with a high-dimensional
distribution, and often you
essentially don't have an input.
So you're just modelling the
distribution of the output.
For this particular reason
generative modelling has been
seen as a subarea of
unsupervised learning because
we're simply modelling the joint
distribution of the data, and we
don't have any labels.
On the other hand,
if you think about
generative models as including
conditional generative models,
which also have a context, which
is quite a bit like an input,
then the boundary becomes rather
blurry. So it's really more
about the technology rather than
the actual application it is
used for.
And there are many
types of generative models and
they can handle essentially any
type of data from text to
images, to video and so on.
So, let's look at some uses of
generative models.
The most established and maybe
traditional one comes from
statistics and it's called
density estimation. And here we
simply fit a generative model to
the data in order to get a
probability distribution that we
can evaluate at any given
datapoint. And once we have this
probability distribution, we can
use it to actually tell, is this
given data point from the same
distribution as
the training data? Or is it an
outlier from some other rather
different distribution?
This kind of model
can be used for
applications like fraud
detection.
There's also a close
connection between probabilistic
modelling and data compression.
So there's actually an exact
duality between these two areas.
So if you actually have a
probabilistic model of the data,
you can use arithmetic coding to
produce a data compressor.
We can also use generative models
for mapping between two
high-dimensional domains, for
example, between sentences in
one language and their
translations in another
language. So here the sentence
in the original language will be
the context and the model will
capture the distribution of
possible translations for the
given sentence. And typically
there will be many possibilities
rather than just a single
correct translation.
Another exciting application of
generative modelling is in
model-based reinforcement
learning, where the generative
model essentially acts like a
probabilistic simulator of the
environment. So then the
algorithms can actually use this
simulator to plan optimal
sequences of actions rather than
actually having to try them in
the environment to see what
happens. And once we've done
this planning, we can actually
execute the sequence of actions
in the real environment.
Some types of generative models
are really useful for representation
learning where we would like to
condense the observations down
to some essential features, some
sort of low-dimensional
representations of them that
capture the essence and
these low-dimensional
representations might be more
useful than the original
observations for downstream
tasks, such as classification.
And often we don't actually know
what the downstream task will be
so it's important to summarise
the data in a generic way and
generative models provide a way
of doing that.
And finally,
there's this idea of
understanding the data that also
comes from statistics and this
is the area where the generative
model will have a particular
meaning to its structure.
So, the latent variables will
potentially be interpretable or
the parameters will have some
real world significance.
So, once we train
such a model on the
data, we can look inside of it
using inference, for example, or
look at the parameter values and
that will tell us something
about the data distribution,
something that we can't easily
see just by looking at the
individual data points directly.
So the next few slides are meant
to give you a sense of rapid
progress that has happened in
generative modelling in the last
few years.
So the individual models
are not very important,
so I'm just showing you samples
from models trained on datasets
typical for that particular year.
So, we start in 2014 where
the typical dataset was MNIST,
which contained low-dimensional
binary images of digits.
Then one year later
there's been already some
progress, and now we can have
models that capture to some
degree, the distribution of
natural images,
still low-dimensional, but now
they're in colour and they're
considerably more complicated
than digits.
The images are
indeed blurry, but we can see
some global shapes and maybe
some of these objects might be
recognisable to you.
And then four years later,
we can model
much higher-dimensional images
with much better results.
So these are not perfectly
photorealistic, but the local
detail is very convincing and
the global structure is quite
good as well. There's clearly
room for improvement but it's a
long way from the binary images
of digits.
So, let's look at the
popular types of generative
models in deep learning.
You have seen actually many of these
mentioned before in the
preceding lectures in the series
so I'll just give a very brief
overview.
So, autoregressive
models are most prominent for
language modelling where they're
typically implemented using
Recurrent Neural Networks or
Transformers.
Then we have
latent variable models, which are
subdivided into tractable, such
as invertible or flow-based
models and intractable ones like
Variational Autoencoders.
And this is the kind of model we
will cover in depth in this
lecture. And finally there are
implicit models most notably
Generative Adversarial Networks
and their variants. So let's
look at each one of these types
in slightly greater detail.
So, autoregressive models solve the
problem of modelling the joint
distribution of observations, x,
by subdividing it into simpler
subproblems. So instead of
modelling p of x directly, we
actually model the
one-dimensional conditional
distributions corresponding to
this joint distribution.
The resulting
model is tractable and can be
easily trained using maximum
likelihood.
So, why is this a
good approach?
Well, one-dimensional distributions
are actually quite easy to model
because we can use off-the-shelf
classifier technology that
has been very successful in deep
learning and such models are
simple and efficient to train as
we don't need to do any kind of
sampling of random variables at
training time.
On the other hand,
because we're modelling the
dimensions with a sequence of
conditional distributions,
sampling from such models is
inherently a sequential process,
which means it is slow. We have
to go through one dimension at a
time and we cannot easily
parallelise this.
The other weakness of
such models is that
they naturally focus on the
local structure rather than
global structure. So, unless you
build some sort of inductive
bias towards capturing the
global structure into the model
directly, you are likely to have
less success with modelling the
global structure with these
models.
Then we have latent
variable models, which are also
likelihood-based like
autoregressive models, but they
take a different approach to
modelling the joint distribution.
So, they do it by introducing an
unobserved latent variable
that, in some sense, explains or
generates the observation.
So we start with the latent variable
and then we also define the
transformation that maps the
latent variable value to the
particular observation.
These models are also trained using
maximum likelihood
or more
typically some approximation to
maximum likelihood because often
maximum likelihood is
intractable here
and
latent variable models provide a very
powerful and well understood
framework, a mature framework
that has been around for a long
time in statistics.
They make it really
easy to incorporate prior
knowledge and various structural
constraints into the model.
So, if you would like to model some
sort of statistical or physical
process, and you have some ideas
about how it's structured,
this is typically the model type
you will use.
And because generally
they don't use autoregressive or
sequential sub-components,
sampling from such models is
efficient.
On the downside,
these models require
understanding the concept of
inference, which is the reverse
of generation. So this means
going from the observation to
the plausible latent values that
could have generated it.
So you need to understand
and implement
this concept in order to use
these models, which makes them
somewhat more complex than
autoregressive models.
And as I mentioned previously,
for many such models
inference is intractable.
So, either we have
to introduce the additional
complexity of using
approximations for inference, or
we have to restrict ourselves in
what kind of models we can use
in order to ensure that
inference remains tractable.
And the third class of popular
generative models in deep
learning are Generative
Adversarial Networks.
And unlike the previous two types,
these are not likelihood-based.
These are so-called implicit
models because they don't
actually assign probabilities to
observations, they just give you
a sampler that generates
observations.
So, the model here
that we are training is simply a
neural network that takes a
vector of random numbers and
maps it to the observation.
And unlike the other two classes of
generative models we just looked
at, these models are trained
using adversarial training
rather than maximum likelihood.
So, adversarial training works by
introducing an auxiliary model,
a classifier, that is trying to
discriminate between samples
from the generator, the model,
and the training data.
And the gradients
from this classifier
provide a learning signal that
we can use to train the model or
the generator.
So, the main appeal
of these models is that
they are by far the best ones
for modelling images. So the
images they generate are
extremely realistic.
They are also relatively easy to
understand conceptually because
you don't need to understand the
concept of inference and you're
training a model simply by
backpropagating through a
classifier.
And like latent variable
models, they provide
fast generation because
generating an observation
involves simply performing a
forward pass in a neural
network.
On the other hand,
Generative Adversarial Networks
don't give us the ability to
assign probability to
observations.
So, this means that
we can't use them for many
applications of generative
models, such as
outlier detection or
lossless compression.
They also suffer from
so-called mode collapse.
And this is the case when a
model trained on the dataset
ignores some part of the
training data and models only a
subset of the training data,
which is a bit worrisome and not
something that you see with
likelihood-based models, because
they are essentially obligated
to model every datapoint.
And
the other difficulty with mode
collapse is that we don't
actually have control over which
part of the data distribution
will be ignored.
On the other hand,
if you just want realistic
samples from some part of the
data distribution then GANs do
it really well.
And the other
difficulty with GANs is that
optimisation is actually a
saddle point optimisation
problem. And as a result,
training is often unstable and
requires a lot of small tricks
to get it right.
So in this lecture,
we will focus on latent
variable models and inference.
So let's look at this generative
modelling framework.
So a latent
variable model defines a
distribution over
observations, x, by introducing
a latent variable, z.
Along with z we specify its prior
distribution, as well as the
likelihood, p of x given z, that
connects the latent variable to
the observation. So p of x given z
essentially tells us how to
map a configuration of the latent
variable to a distribution over
the observations. And even
though I say a latent
variable, typically z is a
vector or it can be a tensor or
anything like that, conceptually
it doesn't really make much of a
difference.
So, once we have the
prior and the likelihood we have
specified the model completely
and the model is completely
characterised by the joint
distribution, p of x comma z,
which we obtained simply by
multiplying the likelihood by
the prior.
And there are two
distributions that we can derive
from the joint distribution that
will be of interest to us for
latent variable modelling.
So the first such distribution
is p of x, which is the marginal
likelihood of an observation.
And it tells us how probable the
observation is under the model.
And this is the quantity that we
would optimise if we're doing
maximum likelihood learning.
And then there's the posterior
distribution, p of z given x,
and this is the distribution of
plausible latent values that
could have generated the given
observation x.
So we can think of the
latent variable as some
sort of explanation for the
observation.
So how do we
generate observations from a
latent variable model?
It's actually quite simple. We start
by sampling the latent
variable z from the prior,
p of z, and then we sample x
from the likelihood distribution
p of x given z, which is
conditional on the configuration
of the latent variable.
And much of this lecture will be
concerned with inference which
is the process of going back
from the observation x to a
distribution over the latent
variable z.
So in this lecture,
inference will specifically
refer to computing the
posterior distribution given the
observation.
So, computing p of z given x.
How is p of z given x
defined? Well, we simply use the
definition of conditional
probability, which says that p
of z given x is the ratio of the
joint distribution under the
model p of x comma z, divided by
the marginal probability of x, p of x.
So this means that in
order to compute the posterior
distribution, we first need to
compute the marginal probability
of x, p of x, or the marginal
likelihood.
How do we do that?
Well, we do that by starting
with a joint distribution p of x
comma z and marginalising out
the latent variable z.
In the continuous case, it will be
integration so we will integrate
over z, the joint distribution.
In the discrete case, it will be
a summation, but typically in
this lecture I will use
integration.
And now we will see
that inference is in a very
specific, formal sense, the
inverse of generation.
So let's think about two ways of
generating the observation slash
latent variable pairs x/z.
So, one way to generate such pairs
is to start by sampling the
latent variable z from the
prior, and then sampling the
observation from the likelihood.
This is what we've done two
slides ago. This gives us a
distribution of x/z pairs, but
we can also sample x/z pairs in
a different way. First, we
sample x from the model
using the same process, and then just
discard the original latent
configuration that led to x.
And now that we have this x
we can perform inference
and sample
a z from the posterior
distribution for this x,
from p of z given x.
This gives us
another way of generating pairs
x and z.
And because the product
of the distributions we are
sampling from in both cases is
exactly the same, it's the joint
distribution p of x comma z,
it means that the distribution of
these pairs is exactly the same.
So, this means that sampling from
the exact posterior is a
probabilistic inverse of
generation.
So why is inference important?
Well, inference is important in its
own right because once we've
trained a model, we can use
inference to explain
observations in terms of latent
configurations. So, it might
potentially allow us to
interpret observations in terms
of some latent variable values.
Moreover, as we will see a bit
later, inference comes up
naturally in maximum likelihood
training of latent variable
models. It's a subproblem that
we'll need to solve over and
over in the inner loop of
optimisation.
So let's look at
an example of inference in a
very simple latent variable
model: a mixture of Gaussians.
You have probably seen this
model before, it's perhaps the
simplest latent variable model
you can imagine. So it has a
single latent variable, it's a
discrete one and it takes on K
values between one and K.
The probability of z being i is
simply πi and then each latent
variable value corresponds to a
mixture component, which is
Gaussian.
And the mean and the
standard deviation of this
Gaussian are determined by the
value of the latent variable.
So we can think of this as
having a vector of means and a
vector of standard deviations
for the mixture components and
then the latent variable simply
selects which dimension of these
vectors we will use to define
the Gaussian.
Let's compute the
marginal likelihood or the
marginal probability of the
observation x. So, as we saw
before, this requires
marginalising out z from the
joint of the model. And the
joint is simply the product of
the prior, p of z, and the
likelihood p of x given z.
Since it's a discrete model we're
performing summation to
marginalise out z by summing
over its values from one through K.
Now that we have computed the
marginal likelihood we can
compute the posterior
distribution because p of z
given x is just the ratio of the
joint probability of x and z
divided by the marginal
probability of x.
And we computed the marginal
probability above, and the joint
probability appears there as a
subproblem. So now we have
an expression for the posterior
probability of z given x.
As you can see,
we can compute this
posterior distribution in linear
time in the number of latent
variable values. So this model
is clearly very tractable.
Now let's look at maximum likelihood
learning, which is how we would
like to train latent variable
models.
Maximum likelihood is a
very well established estimation
principle for probabilistic
models in statistics.
And the
basic idea behind it is that we
should choose those parameters
of the model that make the
training data most probable.
So this corresponds to maximising
the product of probabilities of
data points in the training set,
or, for computational convenience,
we can maximise the
sum of log probabilities of the
data points. Because we're
looking for the optimal
parameters rather than the
objective function value, these
two approaches are exactly the
same: they give us the same
parameter values.
Unfortunately
for latent variable models, we
can't solve this optimisation
problem in closed form.
So as a result,
we use various iterative
approaches either based on
gradient descent or expectation
maximisation.
So, let's look at
the gradient of the marginal
log-likelihood for a single
observation.
So, the gradient of
log p of x is equal to, now we
recall that the derivative of
log is the derivative of its
argument divided by the
argument. So here we have the
derivative of the marginal
probability divided by the
probability itself. Then we
expand the marginal probability
in terms of the joint
distribution and
integrate over the latent value z
and we exchange the derivative
and the integral. On the next
line, we replace the derivative
of the joint by the joint times
the derivative of the log
probability of the joint using
the identity in the yellow box.
And this is the same identity we
used on the first line of this
derivation.
Now that we have
reformulated the integral
that way, we can see that we have
a ratio of probability of the
joint configuration x, z divided
by the probability of the
marginal x.
This corresponds to
the posterior distribution
p of z given x
so we write it like that.
And now we can see that
the gradient of the log marginal
probability is simply an
expectation with respect to the
posterior distribution of the
gradients of the log joint.
So, this means that in order to
compute the gradient of the log
marginal probability, which is
what we need for maximum
likelihood estimation, we need
to compute the posterior
distribution somehow. So this is
basically an essential
subproblem.
And the other thing
we can see here is that the
posterior probabilities modulate
the contributions from the
gradients of the log joint to the
gradient of the marginal
log-likelihood. So it basically
upweights the configurations z
that were more likely to
generate this observation and
downweights the configurations
that are less likely. So this
basically means that inference
performs credit assignment
among
latent configurations for the
given observation.
So, unfortunately exact inference is
hard in general.
To see why this
is the case, let's think about
computing the marginal
likelihood of an observation,
which is as we've seen an
important part of computing the
posterior distribution.
So, if our latent variables are
continuous then computing the
marginal likelihood involves
integrating over
high-dimensional space.
And typically the integrand we
will be integrating over will be
a non-linear function. So
analytical integration will not
be an option, and numerical
integration, in order to get a
reasonable level of accuracy,
will also not be an option
because the complexity of
integration will grow
exponentially in the number of
latent variables.
In the discrete case,
the situation is
slightly better because now
instead of integrating over the
latent configurations, we are
summing over a finite
number of them.
So, we know that we could
conceivably enumerate all those
configurations and compute the
marginal probability like that.
But the issue is the same as in
the continuous case, the curse
of dimensionality. So if the
number of latent variables is
more than a handful then the
number of possible joint latent
configurations will be so large
that we will never be able to
compute this sum exactly.
There are some exceptions
where we
have interesting models with
exact, tractable inference.
We've already seen one example,
it's a mixture model where
inference is basically linear in
the number of mixing components.
The other important subclass is
linear Gaussian models. So these
are models with Gaussian latent
variables and linear mappings.
In these models all the induced
distributions are Gaussian and
as a result inference is
tractable.
And finally, we have
the interesting case of
invertible models.
So, these
models are special because
they're actually quite powerful
and yet they allow exact
inference through clever
constraints on their structure.
And we will see these models a
bit later in this lecture.
So, how can we avoid these
intractable computations that
exact inference involves?
Well, there are two general
strategies here.
The first one is simply to
restrict ourselves when designing
the model so that the resulting
model will be tractable.
This will give us easier training
because we can do exact maximum
likelihood without any
approximations, but it will make
model design more complicated,
and, in a sense, considerably
restrict the modelling choices we
can make.
On the other hand, if
we're interested in creating a
model that represents our
knowledge about the task, then
we might want to just build the
model with all the
required properties that we
would like, and then worry about
the inference later. And almost
certainly we will end up with an
intractable model, but that's
okay because there are
approximate inference methods.
We will be willing to
pay the price of using
approximate inference, with the
extra complexity that it
entails, but then we will be
able to use more expressive
models.
So, let's look at the
first strategy of working with
tractable models and exact
inference.
So, we will look at
these modern tractable, but very
powerful models, called
invertible models, also known as
normalising flows and they're
special and interesting because
they combine high expressive
power with tractability, which
is rather rare.
And the basic idea
behind these models is
simply starting with some prior
distribution, like in any latent
variable model, and then
applying an invertible function
to it to obtain the observation.
And the parameters of the model
are all incorporated in this
invertible function and by
warping the prior distribution
in various ways, we can
approximate the data
distribution.
So, because the
function is invertible this
constrains the structure of the
model in a very specific way and
makes inference and maximum
likelihood tractable in these
models.
So, let's look at the
generative description of an
invertible model.
So, to specify
an invertible model, we need the
prior distribution as before,
p of z, and here we will assume
that it has no parameters, but
it doesn't make much difference,
this is just for convenience.
And then we use an invertible
differentiable transformation,
f of z, which has parameters
theta, to transform samples from
the prior into observations.
So, all the model parameters here
will be
in this function f.
And because we
use f that's invertible,
having this setup gives us
a one-to-one correspondence
between latent configurations
and observations. So there's
absolutely no ambiguity about
which latent configuration
generated the given observation,
because the function is
one-to-one.
So this means that
we can simply compute
the latent
configuration by inverting f and
applying it to x. So we apply f
inverse to the observation and
we exactly recover the only
latent configuration that could
have generated this observation.
So, this is very nice. Inference
is very easy and fully
deterministic.
So, now how do we
compute the marginal likelihood
that we need for maximum
likelihood training?
To do that
we need to somehow relate the
prior probability and the
probability of the observation x.
And it turns out that because
we use an invertible,
differentiable transformation to
connect z to x, we can apply the
change of variables formula,
and then the densities
p of z and p of x
differ by just a scaling factor.
And this scaling factor is the
absolute value of the
determinant of the Jacobian of
the mapping from x to z.
This might seem a bit
counterintuitive or surprising:
where does this factor come
from? And this factor simply
accounts for the fact that when
we apply a function to go from
z to x or x to z, it will change
the infinitesimal volume around
the point where it's
being applied. And so if we want
the resulting distribution to
normalise to one, just like the
original distribution, we need
to take into account that volume
rescaling factor. And this is
exactly what the determinant of
the Jacobian takes into account.
So, we would like to get rid of z
in that expression, because we
want to evaluate the probability of
x just on the datapoint x, and we
can get rid of z by
remembering that z is simply f
inverse of x. So, wherever we
have z, we replace it with f
inverse of x and now we have an
expression for the probability
of x that makes no reference to z.
So now, conceptually at
least, we can compute the
marginal probability of x and we
can perform maximum likelihood
training.
So, from the practical
angle to do maximum likelihood
estimation, we still need to
have some requirements for f.
So in particular, we need to be
able to compute f inverse of x,
as well as the determinant of
its Jacobian because it's used
in the expression of the
marginal probability of x. And
we also need to compute their
gradients because this is what's
required for maximum likelihood
estimation.
And finally,
these
computations need to be
sufficiently efficient for
maximum likelihood to be fast.
So, let's look at a very simple
invertible model, perhaps the
simplest
and maybe the oldest, called
Independent Component Analysis.
So, this model starts with a
factorial prior. So each latent
dimension is modelled as a
univariate distribution
independently of the other
dimensions. And the latent
values are mapped to the
observation using a square
matrix, A, so this is a linear model.
Since inference in an
invertible model involves
inverting f, inference here is
simply multiplying by the
inverse of A so to compute z
from x we simply multiply x by A
inverse.
And once we've trained
such a model, we can use it to
explain our observations in
terms of latent independent
causes that explain the data
linearly.
And the typical
application for this model is
solving the so-called cocktail
party problem where you have,
n sound sources around the room,
for example, people talking, and
then you also have n sensors or
microphones, and you would like
to isolate individual people
from this mixed recording, and
because sound acoustics ensures
that mixing is approximately
linear, this is an
appropriate model. So inference
on recordings from microphones x
will allow us to recover
individual sources z.
And in order for this to identify
independent sources there's an
interesting constraint: the
prior cannot be Gaussian because
Gaussian latent variables are
rotationally symmetric in high
dimensions so we cannot actually
recover independence, we can
only recover decorrelation.
So, typically the prior we use here
is some sort of heavy tail
distribution, like a Logistic or
Cauchy.
So, how do we construct
general invertible models?
Well, the strategy is simple:
because a combination or composition of
invertible transformations is
itself invertible, we simply use a
library of simple invertible
transformations and chain a lot
of them together to obtain a
more expressive invertible
transformation.
And here each of these
simple building blocks can
be parameterised either in the
forward direction, mapping from
z to x, or in the reverse
direction from x to z, whichever
one we would like to be more
efficient when using the model.
So depending whether we want
training or inference to be more
efficient, we parameterise the
appropriate mapping.
And one
interesting detail here is that
we don't actually need f to be
analytically invertible.
It is fine if f can be inverted only
numerically with an iterative
algorithm as long as we have a
reasonably efficient algorithm
that recovers the inverse to
numerical precision.
And in terms of building blocks,
there's a rapidly growing list
of them. This is an active area
of research, and I give a few
examples there on the slide.
So
invertible models are very
appealing because they are both
powerful and they are tractable
so easy to train.
So why don't we
use them all the time?
Well they do have a number of
limitations which make them not
always appropriate.
So, one obvious limitation is that
the dimensionality of the
latent vector and of the
observations has to be the same.
And this is a consequence of
requiring the function f to be
invertible, there's no way
around it.
So, if we would like a
lower-dimensional latent space
for some sort of a
low-dimensional representation
of the observation, we simply
can't easily do this with an
invertible model.
The other
requirement is that the latent
space has to be continuous.
And this is because
we use the change of variables
formula for densities to compute the marginal
probability of x. There has been
some initial work on discrete
flows, so this limitation might
be relaxed in the future.
The other consequence of using
continuous latent variables and
applying invertible
transformation to them is that
it makes it hard to model
discrete data because the output
of such a transformation will
also be a density, so unless our
observations are continuous or
quantised, which means that they
were discretised based on some
underlying continuous
distributions, we can't really
apply invertible models to such
data.
And because the models are
constructed by chaining a lot of
simple transformations together,
the resulting models tend to be
quite large in order to have
high expressive power. So this
means that we will need to store
a lot of activations and
parameters, which makes it easy
to run out of GPU memory when
training such models. So in
terms of expressiveness per
parameter, or per kilobyte of
memory, these
models are less expressive than
more general latent variable
models.
And finally compared to
general latent variable models,
it's hard to incorporate
structure in invertible models
because we have to retain
invertibility.
So that removes a
lot of options for model design.
On the other hand, because
invertible models are tractable
and powerful, they make very
useful building blocks to
incorporate into other models,
in particular intractable latent
variable models. They provide a
very useful abstraction that
basically gives you a
distribution that can be trained
exactly and gives you the exact
marginal likelihood. So that
makes them very composable and
appealing as building blocks.
In the second half of the lecture,
we will look at intractable
models and variational inference
as a way of training them.
So
why would we want to use
intractable models?
Well, sometimes the structure of the
model, or its latent variables,
have some sort of intrinsic
meaning for us. We might be
modelling some real world process
and the underlying quantities
have some grounded meaning.
And we would like to structure the
model in a particular way that
captures that.
So, this is different
from thinking of a
model as just some sort of black
box that produces predictions or
merely generates samples, so we
want some sort of interpretability.
Then the basic
question is, and I like this
quote from David Blei: 'Do you
want the wrong answer to the
right question or
do you want the right answer to
the wrong question?'
And this basically highlights a dilemma
we have, do we want to use the
right model with approximate
inference or potentially the
wrong model with exact inference.
And in many situations when we
take modelling quite seriously,
it makes sense to go for the
wrong answer to the right
question.
So, in many cases we
will end up with an intractable
model that captures our desired
properties, and we will just
have to use approximate
inference.
So, here's an example
of how easy it is to end up with
an intractable model even though
the starting point is tractable.
So as we've seen, the ICA model
with the same number of latent
dimensions as observation
dimensions is tractable, it's a
very simple linear model. So
what would happen if we changed
this model slightly? Suppose we
would like to model a bit of
observation noise to indicate
that our microphones are not
perfect. So, adding observation noise
to the model makes the model
intractable because the mapping
is no longer invertible.
If we use more latent
dimensions than observations,
the model once
again becomes intractable.
And even if we use fewer latent
dimensions than observation
dimensions, the model becomes
intractable once again. So, it
really doesn't take much to go
from a simple tractable model to
an intractable one.
And once we
have an intractable model in
order to use it or train it, we
need to use
approximate inference and there
are two broad classes of
approximate inference.
The first class
is Markov Chain Monte
Carlo methods. And here we will
represent our exact posterior
using samples from it, and these
will be exact samples. And to obtain an
exact sample from the true
posterior we set up a Markov
chain, which we run for quite
some time. And at some point it
converges to the right
distribution, which is the true
posterior, and then the sample
from it is a sample from the
true posterior.
So, the advantage
of this method is that it's very
general.
We really don't need to
restrict our model essentially
in any way, we can use Markov
Chain Monte Carlo for inference.
And this method is also exact in
the limit of essentially
infinite time and computation.
So, if we spend enough time
generating samples, they will be
from the right distribution, and if
we generate enough samples, we
will basically have our answer
to an arbitrary degree of
precision. So, in some sense
it's the gold standard for
inference.
Unfortunately, in practice, it's very
computationally expensive and so
doing Markov Chain Monte Carlo
is not really an option in many
cases. Also convergence,
actually knowing when we are
sampling from the right
distribution, is really hard to
diagnose.
So, often we just wait
for some time until we're tired
of waiting and then we use the
sample at that point, hoping
that it's from the right
distribution. But doing this can
actually introduce a subtle
error because it might still not
be the true posterior that we're
sampling from and we have no way
of quantifying or controlling
for this.
So, the other class of
approximate inference methods is
variational inference. And here,
the idea is rather different.
Instead of sampling from the
true posterior in some free
form, we say we will approximate
the true posterior with a
distribution with some
particular simple structure.
So, for example, we will say, we
will approximate the true
posterior with a factorised
distribution which models each
latent dimension independently.
And then we fit this
approximation to the true
posterior using optimisation.
The advantage of this approach
is that it's much more efficient
than Markov Chain Monte Carlo as
optimisation is generally more
efficient than sampling.
On the other hand,
we cannot trade
computation for greater accuracy
as easily because once we've
chosen the form of this
posterior approximation, once
we've converged running for
longer doesn't give us any more
accuracy.
But unlike in Markov
Chain Monte Carlo, we have
something that guarantees that
we are performing reasonably
well at every point, because we
have a bound on the marginal
log-likelihood, so we can
essentially, at least
hypothetically, quantify the
approximation error.
So let's look at variational
inference in detail.
So the one-line
description of variational
inference is: it turns inference
into an optimisation problem.
And it's called variational
because we're essentially
optimising over a space of
distributions
and as a result, we are
approximating some unknown
posterior distribution with a
distribution from some
particular family.
And the
distribution that will be
approximating the exact
posterior will be called the
variational posterior, we will
denote it as q of z given x,
and it will have parameters phi,
which are called the variational
parameters. And they're there
just to make sure that our
variational posterior
approximates the true posterior,
p of z given x, as accurately as
possible.
And what are the
restrictions on the choice of
the variational posterior? Well,
our hands are pretty much free
as long as we can sample from
this distribution and we can
compute the probabilities or
log-probabilities under it and
the corresponding parameter
gradients that we need in order
to fit this distribution to the
true posterior.
So a classic and
default choice is simply using
the fully factorised
distribution q, where each
dimension is modeled
independently from all others.
Variational inference allows us
to train models by approximating
the marginal log-likelihood,
which in itself is intractable
because the model is
intractable, so we can't compute
the marginal log-likelihood. But
introducing this simplified
form of the variational
posterior allows us to define an
alternative objective, which is
closely related to the marginal
log-likelihood. And this
objective is a lower bound on
the marginal log-likelihood.
And we train the
model by optimising this lower
bound with respect to the
parameters of the model, theta,
and the parameters of the
variational posterior, phi.
And because this is a lower bound,
it's guaranteed to be below the
value of the marginal
log-likelihood, so when we
maximise the lower bound we are
usually also pushing up the
marginal log-likelihood, even
though we can't actually compute
it exactly.
So, how do we obtain
this variational lower bound on
the marginal log-likelihood?
So, let's consider any
density q of z;
the only requirement is that
this density is positive
wherever the prior distribution
is positive.
Then we start
by expanding the marginal
log-likelihood in terms of the
joint distribution, where we
integrate over the latent
variable.
And then we introduce
this density that we chose by
both multiplying and dividing
the model joint by it.
So this doesn't do anything because
multiplying and dividing by the
same quantity has no effect, but
once we've done this, we can
apply the Jensen inequality
which states that the log of the
expectation of some function is
always greater than or equal
to the expectation of the log
of this particular function.
So, this allows us to push the log
inside the integral and take the
integral with q outside the log.
And we know that the resulting
quantity is
less than or equal to the
preceding quantity because of
the Jensen inequality. And now
we recognise that this new
expression is simply the
expectation with respect to this
distribution q that we
introduced of log-density ratio
between the joint distribution,
p of x z, and this density.
And the important thing to recognise
is that because this density q
that we used in this derivation
is arbitrary, for any setting of
parameters of the density phi,
we will have a lower bound on
the marginal log-likelihood.
Which basically allows us to get
as tight of a bound as possible
simply by maximising this
expression with respect to the
parameters phi. And thus getting
a closer approximation to the
marginal log-likelihood.
So, there are several possible
variational lower bounds. And in
this lecture, we will focus on
essentially the bound we
derived on the previous slide,
where instead of the arbitrary
density q, we will use the
variational posterior q of z
given x.
And this is both the
simplest and by far the most
widely used variational bound.
So this is the bound you will
see in most variational
inference papers.
There's a more
recent option called the
Importance-Weighted Lower Bound,
also known for historical
reasons as IWAE.
And this is
simply a multisample
generalisation of the Evidence
Lower Bound. And its interesting
feature is that it allows you to
control the tightness of the
bound, that is, the accuracy of
the approximation to the
marginal likelihood, by
increasing the number of samples
you use in the bound. So this is
not quite as flexible as Markov
Chain Monte Carlo, where you use
more computation to get more
accurate results because the
scaling is such that you get
rapid improvement as you go from
one sample to ten samples, but once
you go beyond that, the
improvement quickly levels out,
but still you can get some easy
gains without changing the form
of the variational posterior.
But for simplicity, we will use
the ELBO in the rest of this
lecture.
So, let's review a
concept important for
variational inference and this
concept is Kullback-Leibler
divergence.
KL divergence
provides us with a way of
quantifying the difference
between two distributions
and KL divergence
between q and p is
defined as the expectation under
the distribution q of the
log-density ratio of q to p.
And it has a few important
properties we will need for the
rest of the lecture. So, first
of all, the KL divergence is
non-negative for any choice
of q and p.
The KL divergence is zero
if and only if q and p are the
same almost everywhere. So we
can basically think of q and p
being the same distribution as the
only case when the KL divergence
is zero.
And finally, it's
important to remember that KL
divergence is not a metric, so
it's not symmetric in its
arguments.
So KL from q to p
is not the same as KL from
p to q in general.
So, let's look
at optimising a variational
lower bound with respect to the
variational parameters, phi of
the variational posterior q.
So, let's start by rewriting the
ELBO. So, on the first line,
we factor the joint distribution
into the marginal probability of
x and the posterior probability
of z given x.
This is just
another factorisation of the
joint density of the model.
On the next line, we simply split
out the marginal
log-likelihood into the first
term and keep the rest as
the second term, giving us an
expectation under q of the
log-density ratio of the true
posterior to the variational
posterior.
Now, in the first
expectation on that line, we see
that log p of x actually does
not depend on z so its
expectation under the
variational posterior is just
itself.
So log p of x.
And then we recognise
the second quantity,
the second expectation,
simply as the minus
KL from the variational
posterior,  q of z given x, to the
true posterior, p of z given x.
So, let's look at that
decomposition of the variational
lower bound. So we have two
terms, the marginal
log-likelihood and the KL.
So, the marginal log-likelihood
depends on the model parameters
theta, but it does not depend on
the variational parameters phi.
So, when we maximise the
variational lower bound with
respect to the variational
parameters phi, the first term
is unaffected.
Therefore, maximising
the ELBO with respect to the
variational parameters is the
same as minimising the KL
divergence from the variational
posterior to the true posterior.
This KL quantifies the distance
from the variational posterior
to the true posterior, and it is
known as the variational gap,
because we can also express it
as the difference between the
marginal log-likelihood,
log p of x, and
the variational bound, L of x.
So, this means that when we are
maximising the ELBO with respect
to the variational parameters,
we're actually minimising the KL
divergence from the variational
posterior to the true posterior.
So we're making sure that
variational posterior is a better and
better fit to the true
posterior.
This is actually remarkable
because this is a
model which is intractable, so
we cannot actually compute the
true posterior at all and we
can't even compute this KL
divergence from the variational
posterior to the true posterior
because it involves the true
posterior, which we can't
compute in the first place.
So, if we look at that decomposition
of ELBO from the previous slide,
as the difference between the
log marginal likelihood and the
KL from the variational to the
true posterior, we realise that
the ELBO is actually a
difference between two
intractable quantities
and yet it is tractable.
So, it means
that both of these quantities
are intractable in the same way
so they have this intractable
part that's exactly the same and
when we take the difference
between them it cancels out.
Also looking at this
decomposition and remembering
that the KL divergence is
non-negative, and it's zero if
and only if the two
distributions are effectively
the same, it means that the best
value of the variational lower
bound we can get is actually the
same as the marginal
log-likelihood, log p of x and
that happens when the KL is
zero.
And this can only happen
if q is a very expressive
distribution that can
approximate the true posterior
exactly. So, that's good for
understanding variational
inference, but in practice, it's
not going to happen with a
typical variational posterior.
Now let's think about
maximising the
variational bound with respect
to the other set of parameters,
the model parameters, what
happens when we update these
parameters to increase the
variational lower bound?
Well, looking at the same
decomposition,
we see that
well, either the first term,
the marginal log-likelihood will
increase or the second term will
have to decrease. That's the
only way to get the increase in
the variational lower bound.
So, let's look at the first option.
When we update the parameters
and the marginal log-likelihood
increases, this is good because
this is the same as what maximum
likelihood learning
parameter update does, we're
increasing the marginal
log-likelihood.
But what happens when
the variational lower bound
is increased because we actually
decreased the variational gap?
Well, there are two ways of
decreasing the variational gap.
So, we've seen the first one a
couple of slides ago when we
were updating the variational
parameters. And because that was
equivalent to minimising the KL
from the variational posterior
to the true posterior that was
decreasing the variational gap
as well. And doing this was
clearly good because the
variational posterior was
becoming a better and better
approximation of the
true posterior.
And the model
was not affected by these
updates because the model is not
affected by the variational
parameters.
On the other hand,
now if we update the model
parameters and the variational
gap decreases it means that the
model has changed.
So, how did it change?
Well, the variational inference
in this model did become
more accurate, because the
variational posterior remained
the same, but the true
posterior moved towards
it so now they are closer
together. But when this happens,
this is actually not always
desirable because it means we're
spending some of the model
capacity to actually approximate
the variational posterior rather
than to model the data.
So, in a sense, the model
is trying to contort itself, so
that inference in it is easy.
And if we only have so much
capacity in the model, it will
probably make it less good of a
model of the data.
So, this means
that if we are worried by such
effects, if we would like to
have as faithful approximation
to maximum likelihood as
possible, we should use
as expressive of a variational
posterior as possible because
this will reduce the variational
gap and there will be less of a
pressure for the model to
distort itself like that.
And
one particular manifestation of
this effect in the models
trained using variational
inference is called variational
pruning.
And this is when the
model refuses to use some of the
latent variables, so they're
essentially not used to generate
the data, which means that their
posterior and their prior are
exactly the same.
And when I say
posterior, I mean, both the true
posterior and the variational
posterior, because when the
model is unused, its true
posterior is the same as the
prior and it's very easy to
approximate with the variational
posterior. And this is in fact
why variational pruning happens
because when you prune out some
variables it becomes easier to
perform variational inference.
So, there's this extra pressure
on the model to be simpler in
that way.
And variational pruning
is also known as
posterior collapse in the
variational autoencoder literature.
So, is variational pruning a good
thing or a bad thing?
Well, it depends how
you think about it. In some
circumstances, it can be a good
thing because you can think of
it as choosing the
dimensionality of the latent
space automatically based on
your data distribution.
On the other hand, it takes away
some of our freedom
to overfit to the data.
So, sometimes in deep learning you
would like to have a very
accurate model of the training
data, when you are not concerned
with overfitting, and you can
easily achieve this by giving
the model many, many hidden
units, so making the hidden
layers wider, and then you are
guaranteed to overfit to the
data, often driving, say,
classification error to zero.
Well, if you're training a
generative model and you would
like to achieve something
similar, overfitting to the data
arbitrarily well by giving it
lots and lots of latent
variables, then if you're using
variational inference, the model
will actually refuse to use
extra variables after some point,
and the number of variables it
will use can be surprisingly
small and sometimes clearly
suboptimal. So you would like
the model to use more variables,
but because the variational
posterior is too simple compared
to the true posterior, it will
simply disregard the rest of the
latent variables.
And how do we choose
the form of the variational
posterior?
Well, the default choice,
as I mentioned before,
is
a fully factorised distribution,
with each dimension modelled
independently, and this form is
known as the mean field
approximation for historical
reasons, because the method
originated in physics.
We can make
the variational distribution
more expressive and
we have several choices for
doing that.
So, one possibility
is to use a mixture model.
So instead of a unimodal
distribution we will have a
multimodal distribution now.
If we're using a variational
posterior that's a diagonal
Gaussian, which is a very common
choice, we can introduce a
richer covariance structure, so
we can, for example, have a low
rank or full covariance Gaussian
as the variational posterior.
We can make the variational
posterior autoregressive which
will make training more
expensive, like many of the
other choices, but will provide
much more modelling power.
Or alternatively we can take an
invertible model and use it to
parameterise the variational
posterior as well.
And this works
very nicely because invertible
models are tractable.
And ultimately we're making
this tradeoff between the
computational cost of training
the model on the one hand, and the
quality of the variational
approximation, and perhaps the fit
to the data, on the other.
Some of these
choices for the more expressive
posteriors also have some
practical downsides because you
might run into numerical
instability problems.
So you have to be careful
and watch out for that.
And sometimes when you
use a richer variational
posterior you actually get worse
results.
And this should not happen in
theory if optimisation is
perfect but due to various
stability issues and learning
dynamics issues, this can
actually happen.
Alright, so
let's think about what we're
doing when we're fitting a
variational distribution.
So, first of all, the posterior
distribution of course is
different for every observation
x because,
each x is generated
by some latent configuration
that's more probable than
others, so we have a
distribution over plausible
explanations for x.
This means that we need to
fit a different
variational posterior for each
observation and in classical
variational inference this means
that we simply have a separate
set of distribution parameters
for each observation that we
optimise over.
And this also means
that we perform a separate
optimisation run for each
data point, whether it's a
training observation or a test
observation
to fit the
corresponding variational
parameters.
This can be
inefficient because basically
we learn nothing from
fitting the variational parameters
for one datapoint about all the
other data points.
So, we can actually
amortise this cost by
replacing this separate
optimisation procedure for each
datapoint with some sort of
functional approximation. So we
will train a neural network that
will take the observation and
output an approximation to its
variational parameters.
And we will
train this network, which
we'll call the inference
network, basically to serve as
the approximation to all those
independent variational
posteriors we were training
before.
And as a result now, instead of
performing a potentially costly
iterative optimisation for each
data point to obtain its
posterior, we simply perform a
forward pass in the inference
network that gives us the
variational parameters and these
are the ones that we use for the
variational posterior.
So, now we
replaced all these independent
variational parameters that were
data point specific with a single
set of neural network parameters
that are shared between all
observations. And we amortise
the cost of solving these
optimisation problems among all
observations.
So, once we've
trained such an inference
network, we can compute the
variational posterior for a new
datapoint simply by feeding the
datapoint to the network and it
will produce the corresponding
variational posterior.
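As a rough illustration, here is a minimal PyTorch sketch of such an inference network; the architecture, the names and the dimensions are all made up for illustration and are not from the lecture:

```python
# A minimal sketch of amortised inference: one network maps each observation
# x to the parameters of its diagonal Gaussian variational posterior.
import torch
import torch.nn as nn

class InferenceNetwork(nn.Module):
    def __init__(self, x_dim, z_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mean_head = nn.Linear(hidden, z_dim)      # posterior mean
        self.log_var_head = nn.Linear(hidden, z_dim)   # posterior log-variance

    def forward(self, x):
        h = self.body(x)
        return self.mean_head(h), self.log_var_head(h)

# One forward pass replaces a per-datapoint optimisation run.
encoder = InferenceNetwork(x_dim=784, z_dim=32)
x = torch.rand(16, 784)                 # a batch of observations
mu, log_var = encoder(x)                # variational parameters for each x
```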
So, this is
a very powerful idea because
it allows us to easily scale up
variational inference to much
bigger datasets and models than
before. And this idea of
amortised inference was
introduced in the context of
Helmholtz Machines in the mid
nineties and it was popularised
recently by Variational
Autoencoders that rely on it.
And, as mentioned before, the
variational parameters are
trained jointly with the model
parameters simply by maximising
the ELBO with respect to both.
And now we basically have two
sets of neural network
parameters, one for the model
and one for the inference network.
Let's step back and
think about what we gained and
what we gave up by performing
variational inference.
Well, now we can train
intractable models in a
principled way and relatively
efficiently.
This lets us choose
any kind of model we want and
incorporate any kind of prior
knowledge into the model.
So that's great from the modelling standpoint, and inference is quite fast compared to MCMC methods, especially if we use amortisation. Some models are simply infeasible for MCMC, and variational inference makes it possible to train them.
And what did we lose?
Well, we do typically give up some of the model capacity because we are not using an expressive enough variational posterior.
But perhaps that's fine because
essentially in many cases,
variational inference is the
only option for training a model
this large on a dataset of a
particular size.
So, we either
have a slightly suboptimal fit,
or we have to resort to a much
simpler model.
So, we saw that
training a model using
variational inference requires
computing the gradients of the
variational lower bound with
respect to the model parameters
theta and the variational
parameters phi.
Well, the ELBO
is actually an expectation so
computing gradients of an
expectation might not be so
straightforward so let's look at
how we can do this.
Well, in
classic variational inference
the expectations were typically
computed in closed form and then
optimisation did not involve any
kind of noise in the gradient
estimates because the objective
function was analytically
tractable.
On the other hand, to actually have expectations that you can compute in closed form, the models had to be very simple and the variational posteriors generally had to be fully factorised, because otherwise you couldn't compute the expectations.
So variational inference
in its classic form
was applicable to only a small
set of models.
On the other hand,
recent developments in
variational inference replaced
exact estimation of the
gradients with Monte Carlo based
estimation.
And here we don't
try to compute the expectation
or its gradients in closed form,
instead we use Monte Carlo
sampling from the variational
posterior to estimate it. And that
gives us much more freedom in
terms of what kinds of models we
can handle. And the answer is
essentially we can handle almost
any kind of latent variable
model.
So, let's look at how we
can estimate the gradients of
the ELBO with respect to the
model parameters.
This is actually the easy case.
So,
expanding the definition of the
ELBO there, we see that only the
joint distribution of the model
depends on the model parameters
inside the expectation and the
variational posterior does not
depend on it.
Also, the expectation the ELBO involves is an expectation with respect to the variational posterior, which does not depend on the model parameters. This means we can safely move the gradient inside the expectation, so the gradient of the ELBO with respect to the model parameters is simply the expectation under the variational posterior of the gradient of the log-joint of the model.
And this quantity is
really easy to estimate.
We simply sample from the
variational posterior, evaluate
the gradients of the log-joint
based on the resulting samples
and then we average them.
And in practice,
even one sample can be
enough to train a model.
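To make this concrete, here is a toy PyTorch sketch of the procedure; the model, the numbers and the names are made up purely for illustration:

```python
# A minimal sketch of the Monte Carlo ELBO gradient with respect to the
# model parameters: sample z from the variational posterior, evaluate the
# log-joint, and average the gradients over the samples via autodiff.
import torch

theta = torch.tensor(0.5, requires_grad=True)       # a single model parameter
x = torch.tensor(1.2)                                # one observation
q = torch.distributions.Normal(loc=0.3, scale=0.8)   # a fixed variational posterior

num_samples = 10
z = q.sample((num_samples,))                          # q does not depend on theta
# Toy log-joint: standard normal prior on z, Gaussian likelihood with mean theta*z.
log_joint = (torch.distributions.Normal(0.0, 1.0).log_prob(z)
             + torch.distributions.Normal(theta * z, 1.0).log_prob(x))
log_joint.mean().backward()                           # Monte Carlo gradient w.r.t. theta
print(theta.grad)
```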
So, one
thing to mention here is since
we're using sampling to estimate
gradients, there is some noise
in the gradient estimates.
And
basically gradient estimate
noise can be a bad thing because
it prevents us from using larger
learning rates. So if the noise
level is too high, we have to
use a sufficiently low learning
rate to avoid divergence, which
makes training models slower. So
generally we would like to have
gradient estimates that are
relatively low variance.
Increasing the number of samples
we take is an easy way of
reducing this variance.
Now, let's look at
the case of the
gradient for the variational
parameters.
This is a more
complicated situation because
now the gradient we're computing
involves the parameters of the
distribution the expectation is
over. So we can't simply take
the gradient inside the
expectation because this will
result in incorrect estimates.
So, what do we do here?
Well, it turns out that computing gradients of expectations of this form is a well-known research problem, and there are several good methods available for estimating these gradients. So, let's look at the two major types of unbiased gradient estimators for such expectations.
So, here we will look at the
general case of an expectation
of a function f. In variational inference, this f will just be the log-density ratio of the joint to the variational posterior.
So, the first type of the gradient
estimator is called REINFORCE or
likelihood-ratio estimator.
And it's very general: it can handle both discrete and continuous latent variables, and it does not place any stringent requirements on the function f, so f can be non-differentiable. So that's nice, it's a very general estimator.
The price to pay for
this is that the resulting
gradient estimates are
relatively high variance.
So, unless you perform some additional variance reduction, in almost all practical situations you need to use an extremely tiny learning rate, which makes training essentially infeasible. So using REINFORCE without variance reduction is essentially hopeless.
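As a rough illustration, here is a toy PyTorch sketch of the REINFORCE / likelihood-ratio estimator; the distribution, the function f and the numbers are made up, and for simplicity this f does not itself depend on the parameters, unlike the ELBO integrand:

```python
# A minimal sketch of the score-function (REINFORCE) estimator:
# grad_phi E_q[f(z)] = E_q[ f(z) * grad_phi log q_phi(z) ].
# Note that f is only evaluated, never differentiated, so it may be
# non-differentiable (or z may even be discrete).
import torch

phi = torch.tensor([0.0, 0.0], requires_grad=True)   # mean and log-std of q
f = lambda z: (z > 1.0).float()                       # a non-differentiable f

num_samples = 1000
q = torch.distributions.Normal(phi[0], phi[1].exp())
z = q.sample((num_samples,))                          # plain samples, no gradient path
log_q = q.log_prob(z)                                 # depends on phi
surrogate = (f(z) * log_q).mean()                     # score-function surrogate objective
surrogate.backward()                                  # gradient estimate lands in phi.grad
print(phi.grad)                                       # typically noisy without variance reduction
```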
The other type of
estimator is called
reparameterisation or pathwise
estimator.
And this estimator is
considerably less general.
It requires us to use continuous
latent variables and it supports
only some continuous latent
variable distributions but the
class is quite large.
It also requires
the function inside the
expectation to be differentiable
but this is fine because in
variational inference this is
typically the kind of function
that we get.
But the big advantage
of this estimator
is that out of the box it gives
you fairly low gradient
variance. So, you don't need to
worry too much about variance
reduction and you can still
estimate the gradients with
sufficiently low variance and
train the model sufficiently
quickly.
So, let's look at the
reparameterisation trick which
is essentially how pathwise
gradients are known in the
modern machine learning
literature.
And the high level idea
here is simply to take the
parameters of the distribution
the expectation is with respect
to and somehow move them outside
the distribution and inside the
expectation.
And once we've done that,
we're in the same situation
as for the gradient of
the model parameters for ELBO
because now the distribution of
the expectation will not have
the parameters we're
differentiating with respect to
so we can just take the gradient
inside.
So how do we achieve this?
We do this by reparameterising samples from the distribution q of z.
And we do that by thinking of them as a transformation of samples from some fixed distribution with no parameters; we will call these samples epsilon. We then apply a deterministic, differentiable transformation, which we will call g, that incorporates the dependence on the parameters into the sample.
So, epsilon, which comes from p of epsilon, does not depend on any parameters, but once we transform it using g of epsilon and phi, the resulting z depends on the parameters phi through this function g.
So, we have factored the randomness and the parameter dependence into separate pieces.
So, now that we've done this
factorisation, we can rewrite
the expectation of f with
respect to distribution q in
terms of g.
So, now we replace z, the argument of f, with g of epsilon and phi, because that's how we compute z. And because we generate z by first sampling from p of epsilon, the expectation is now with respect to epsilon rather than z.
So, now the expectation is with respect to a distribution that does not depend on the variational parameters, so we can safely take the gradients with respect to phi inside the expectation.
And now we compute
the gradient of f of g with
respect to phi by using the
chain rule and remembering that
g of epsilon phi is simply z.
So, we evaluate the gradient of f at z, where z is equal to g of epsilon and phi, and then multiply it by the gradient of the sample z as a function of the parameters phi.
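To summarise the derivation just described in one equation (written here as an editorial aid, in the lecture's notation with z = g(epsilon, phi) and epsilon drawn from p(epsilon)):

```latex
\nabla_{\phi}\,\mathbb{E}_{q_{\phi}(z)}\!\left[f(z)\right]
  = \nabla_{\phi}\,\mathbb{E}_{p(\epsilon)}\!\left[f\big(g(\epsilon,\phi)\big)\right]
  = \mathbb{E}_{p(\epsilon)}\!\left[\nabla_{z} f(z)\big|_{z = g(\epsilon,\phi)}\,
      \nabla_{\phi}\, g(\epsilon,\phi)\right]
```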
And this expectation
has the same
form as the gradient of the ELBO
with respect to the model
parameters so we can estimate it
by sampling from the distribution p of epsilon and averaging the gradients over the samples. And we get a low variance gradient estimate that way.
So as I explained before, the reparameterisation trick
essentially moves the dependence
on the parameters of the
distribution from the
distribution itself
into its samples and thus
inside the expectation.
The main requirement here is that the resulting mapping that takes epsilon to z has to be differentiable with respect to the parameters phi, because once we have separated the randomness and the parameter dependence, we are essentially propagating gradients through z and into the function and its parameters.
So, let's see how we
can reparameterise the
one-dimensional Gaussian random
variable z that comes from a
distribution with mean mu and
standard deviation sigma.
Well, if we started with a standard
normal epsilon we can scale it
by sigma and then add the mean
mu and then we get exactly the
right distribution for z.
So, we can see that the mapping that we use, mu plus sigma times epsilon, is differentiable with respect to both mu and sigma, so it satisfies the requirements of reparameterisation. So this is a valid reparameterisation, and this is how Gaussians are reparameterised in practice.
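Here is a minimal PyTorch sketch of exactly this Gaussian reparameterisation; the particular values and the choice of f are made up for illustration:

```python
# A minimal sketch: z = mu + sigma * epsilon, with epsilon from a standard
# normal, so gradients flow back to mu and sigma through the sample.
import torch

mu = torch.tensor(0.5, requires_grad=True)
log_sigma = torch.tensor(-1.0, requires_grad=True)    # parameterise sigma > 0

epsilon = torch.randn(100)                            # parameter-free noise
z = mu + log_sigma.exp() * epsilon                    # reparameterised samples of q(z)

f = (z ** 2).mean()                                   # some differentiable f, e.g. E_q[z^2]
f.backward()                                          # low-variance pathwise gradients
print(mu.grad, log_sigma.grad)
```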
So, what about other distributions? Well, many distributions in the location-scale family, such as Laplace and Cauchy, can be reparameterised in exactly the same way, or with some simple generalisation of this approach.
For some other continuous distributions, such as Gamma and Dirichlet, there's actually no way to factor the randomness out of the parameter dependence, so we can't separate the two. But there is a generalisation of reparameterisation, called Implicit Reparameterisation, that still allows us to propagate gradients through samples from such distributions.
On the other hand, there are some continuous distributions that cannot be reparameterised, and all discrete distributions cannot be reparameterised, for the simple reason that even though we can factor out the randomness and the parameter dependence, the function we end up with is not differentiable. So, applying the reparameterisation trick will not give us the right gradients.
The good news is that if you want to use reparameterisation for continuous distributions, modern deep learning frameworks, such as TensorFlow and PyTorch, implement it for you. So all you have to do is indicate that you want your sample reparameterised when you're generating it from one of the standard distributions, and automatic differentiation will take care of everything. So, implementing variational inference this way is now very easy.
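In PyTorch, for example, this amounts to calling rsample() instead of sample() on a distribution object; TensorFlow Probability offers similar functionality. A minimal sketch:

```python
# Asking the framework for a reparameterised sample: rsample() keeps the
# gradient path to the parameters, sample() does not.
import torch

mu = torch.tensor(0.0, requires_grad=True)
sigma = torch.tensor(1.0, requires_grad=True)
q = torch.distributions.Normal(mu, sigma)

z = q.rsample()          # reparameterised: gradients flow back to mu and sigma
z_no_grad = q.sample()   # plain sample: no gradient path to the parameters

(z ** 2).backward()
print(mu.grad, sigma.grad)   # populated thanks to the reparameterised sample
```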
So, now let's look at
perhaps the most successful
application of variational
inference in recent years and
that's Variational Autoencoders.
So, Variational Autoencoders are
simply generative models with
continuous latent variables
where both the likelihood,
p of x given z,
and the variational
posterior are parameterised
using neural networks.
Typically the prior
and the variational
posterior are modelled as fully
factorised Gaussians and VAEs
are trained using variational
inference by maximising the
ELBO, using both amortised
inference and the
reparameterisation trick.
And this combination
of using expressive
mappings for the likelihood and
the variational posterior and
amortised inference
and reparameterisation
made VAEs very popular because
they are highly scalable
and yet expressive models.
So, let's look
at a slightly more detailed
description of a Variational
Autoencoder.
So, we start with
a prior, p of z, which is
typically a standard normal, and
then our decoder, which is
another term for likelihood in
VAE speak, will simply be
either a neural network
computing the parameters of a
Bernoulli distribution if we are
modelling binary data or a neural
network computing the mean and
the diagonal variance of a
Gaussian distribution if we are
modelling real valued data.
And for the variational posterior,
once again, we use a neural
network that outputs the
parameters of the variational
posterior after taking the
observation x as the input.
And the type of neural network we use to parameterise these models doesn't really matter; it doesn't change the mathematical structure of the model.
So, you can easily
use ConvNets, ResNets
or any kind of neural network
you would like.
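Putting these pieces together, here is a minimal PyTorch sketch of such a VAE; the architecture, the dimensions and the names are assumptions made purely for illustration:

```python
# A minimal VAE sketch: the encoder is the amortised variational posterior,
# the decoder is the likelihood p(x | z), here a Bernoulli over binary data.
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=32, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * z_dim))  # mean and log-variance
        self.decoder = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, x_dim))      # Bernoulli logits

    def forward(self, x):
        mu, log_var = self.encoder(x).chunk(2, dim=-1)
        z = mu + (0.5 * log_var).exp() * torch.randn_like(mu)       # reparameterised sample
        return self.decoder(z), mu, log_var
```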
And, when
training VAEs, the ELBO is
typically written in a slightly
different way from the one that
we've seen before. So the ELBO
is decomposed into two tractable
terms this time.
So, the first term is the expectation over the variational posterior of log p of x given z, so this is the log-likelihood term.
And the second term is just minus the KL divergence from the variational posterior to the prior. Because here the second argument is the prior rather than the true posterior, this term can actually be computed, and in fact it is often computed in closed form, which is easy to do for distributions such as Gaussians.
So, the first term essentially measures how well we can predict or reconstruct the given observation after sampling from its variational posterior. And this term is typically known as the negative reconstruction error, so high values of it are good.
The second term we can think of as a regulariser that pushes the variational posterior towards the prior, to make sure that we do not put too much information into the latent variables in order to reconstruct the observations well.
And this KL
is essentially an upper bound on
the amount of information about
the observation we have in the
latent variables under the
variational posterior.
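As a rough illustration of this two-term ELBO, here is a minimal PyTorch sketch of the training objective for the VAE sketched earlier, assuming binary data, a diagonal Gaussian posterior and a standard normal prior; the function name is made up:

```python
# Negative ELBO = KL(q(z|x) || p(z)) - expected log p(x|z), estimated with
# one posterior sample and the closed-form Gaussian KL per dimension.
import torch
import torch.nn.functional as F

def negative_elbo(logits, mu, log_var, x):
    # Negative reconstruction error: log p(x | z) under one posterior sample.
    recon = -F.binary_cross_entropy_with_logits(logits, x, reduction='none').sum(-1)
    # KL( N(mu, sigma^2) || N(0, 1) ) in closed form: 0.5 * (mu^2 + sigma^2 - 1 - log sigma^2).
    kl = 0.5 * (mu ** 2 + log_var.exp() - 1.0 - log_var).sum(-1)
    return (kl - recon).mean()    # minimise the negative ELBO over the batch
```

Here logits, mu and log_var would come from a forward pass of the VAE sketched above, and x would be the corresponding batch of binary observations.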
So,  the VAE model
has been around for
quite a few years and it has
been extended in many, many
ways. So, now it's really more of
a framework than an actual
model. So, the VAE framework generally
means that this is a model with
continuous latent variables
trained using amortised
variational inference and the
reparameterisation trick.
And the extensions that have been
discovered for VAEs are
numerous. So, for example, here
I covered only a single latent
layer, well you can have
multiple latent layers, you can
have latent variables that are
non-Gaussian. You can have much
more expressive priors and
posteriors. So, for example you
can use invertible models for
both. You can use richer neural
networks, for example, ResNets,
or you can have autoregressive
likelihood terms so that you
combine some of the properties
of autoregressive models with
latent variable models. And
people have also worked on improving variational inference itself, for example by making it slightly closer to classic variational inference: instead of being one-shot, it becomes slightly iterative, where you do only a couple of updates. People have also worked on variance reduction, in order to get lower variance gradients so we can train the models faster.
So, to conclude, this lecture has covered two modern approaches to powerful latent variable models, both of which are likelihood-based. And they make rather different decisions about what's important, whether it's exact inference or freedom in model design.
And this classification of models into these different types is useful for presentation purposes, but some of the most interesting work is actually about combining models of different types, which allows you to basically take advantage of their complementary strengths. So, I mentioned, for example, using autoregressive decoders in Variational Autoencoders; you can also use autoregressive posteriors and so on.
And you get the extra
modelling power of autoregressive
distributions and yet you still
retain potential
interpretability with latent
variables. And what's exciting
about this area is that it's
still relatively new
and developing very rapidly.
So, there are many substantial
contributions that remain to be made.
