So today we'll be finishing up some of the things that we didn't have
time to cover last time on optimization based meta learning,
and then we'll also be covering non-parametric approaches to few shot learning.
Before we get started with the technical content, a few reminders.
Homework 1 is due on Wednesday, and Homework 2 is coming out this Wednesday.
In Homework 2,
you'll implement MAML,
as we covered last week,
and you'll also be implementing prototypical networks,
which is a non-parametric approach to few-shot learning that we'll talk about today.
We also posted a form for you to fill out on poster preferences.
These are preferences for the date of the poster session, so please fill it out.
I think the last time we checked, only about five people had filled it out.
We're trying to schedule a venue for the poster session now. We have
a better venue that's booked for Tuesday the 3rd, but we want to make
sure that people are available that day,
since it's not during the normal course session time.
So please fill out that form.
We also posted details on the course project,
including instructions for the proposal,
the milestone, and the final
project and poster session,
so please take a look at that.
We also posted some project suggestions from the broader AI community on Piazza.
The proposal, which is the first part of the project, is due on Monday, October 28th,
but we encourage you to get started early,
and to either figure out what you want to do on your own or to find a group to work with.
Okay, so the plan for today:
first we'll cover optimization-based meta-learning, briefly recapping
what we covered last week
and then discussing some more advanced topics.
Then we'll cover non-parametric few-shot learning.
This will probably be the bulk of the lecture,
and it will include things like Siamese networks,
matching networks, and prototypical networks, as well as some other hybrid approaches.
And then lastly, we'll be
covering properties of meta-learning algorithms: basically,
how we can think about comparing the classes of approaches that we've seen so far,
and the types of things that we might want when developing new meta-learning algorithms.
Okay. So that's a summary of what we'll go over today.
So first, let's recap from last time.
We talked about how fine-tuning is a very effective way of
leveraging information from previous datasets:
pre-train the parameters on those datasets, then fine-tune by running
gradient descent, or
your favorite optimizer, to reuse the features that were learned
when learning on your new dataset.
Then we asked:
can we think about the pre-training process in a principled way,
especially if we want to be able to fine-tune with
very small amounts of data at test time?
And we talked about one way to do that,
which is basically to embed this fine-tuning procedure into
the meta-learning process: take the fine-tuning procedure, evaluate how
well the resulting fine-tuned parameters do on held-out data,
and then simply optimize this objective with respect to a set of
pre-trained parameters across a wide range of tasks.
Okay. So this was the model-agnostic meta-learning (MAML) algorithm that we discussed,
which optimizes for an effective initialization for fine-tuning.
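As a concrete reference, here's a minimal numpy sketch of that objective for a toy 1-D linear model; the function names and the squared-error toy loss are my own illustration, not from the lecture.

```python
import numpy as np

def inner_update(theta, x, y, alpha=0.1):
    # One gradient step of fine-tuning on the task's training data
    # (MAML's inner loop), for a 1-D linear model y ~ theta * x
    # with squared-error loss.
    grad = np.mean(2.0 * (theta * x - y) * x)
    return theta - alpha * grad

def maml_objective(theta, tasks, alpha=0.1):
    # Evaluate the fine-tuned parameters on each task's held-out data
    # and average: this is the quantity meta-training minimizes with
    # respect to the shared initialization theta.
    total = 0.0
    for (x_tr, y_tr, x_val, y_val) in tasks:
        phi = inner_update(theta, x_tr, y_tr, alpha)
        total += np.mean((phi * x_val - y_val) ** 2)
    return total / len(tasks)
```

In a real implementation, meta-training would differentiate this objective through `inner_update` back to `theta`; the sketch only shows the forward computation.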
We also discussed how well this
performs on extrapolated tasks, and we found that it works
quite well in comparison to the black-box adaptation approaches.
And we looked at the expressive power, and showed that
the expressive power of these algorithms
is quite substantial if you have a deep enough neural network,
although it does require a bit more
expressive power from the architecture in comparison to black-box approaches.
Okay. So now
I'd like to talk a bit about some other properties of
these kinds of algorithms, and different ways that we could extend the algorithm
to address various challenges.
First,
one thing that we talked about
towards the beginning of the course is how you can view the
meta-parameters Theta as serving as a prior for task-specific adaptation,
where this prior
encapsulates the knowledge in your meta-training dataset.
So can we make this more formal?
It turns out we can
make a deeper connection than just saying that
the initialization loosely acts as a prior for fine-tuning.
In particular, to see this,
let's look at the following graphical model.
Theta represents our
meta-parameters, and phi_i denotes the task-specific parameters,
the parameters for each task.
The shaded circles represent the data points that we have for
each task; they're shaded because we can
observe them during the meta-training process.
Now, think about how
you might go about
doing inference in this graphical model.
We have
this dataset, and we want to maximize the likelihood
of our dataset given our meta-parameters.
That's essentially how you do inference in
this graphical model with respect to the meta-learning parameters.
Okay. You can also write this out as a sum of log-likelihoods.
And from there,
we can introduce our task-specific parameters phi_i.
These are integrated out,
because we're trying to optimize over our meta-parameters Theta.
So we're just expanding the
probability of the data given the meta-parameters into the probability of
the data given the task-specific parameters, times the
probability of the task-specific parameters given Theta.
That second term represents the prior that Theta
is imposing on our task-specific parameters.
This corresponds to
empirical Bayes approaches for
optimizing this kind of Bayesian model.
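Written out, the objective being described is:

```latex
\max_\theta \; \log p(\mathcal{D}_1, \ldots, \mathcal{D}_n \mid \theta)
  = \max_\theta \; \sum_i \log \int p(\mathcal{D}_i \mid \phi_i)\, p(\phi_i \mid \theta)\, d\phi_i
```

Here the first factor inside the integral is the likelihood under the task-specific parameters, and the second is the prior that Theta imposes on them.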
Now, from here,
this integral over all possible task-specific parameters is intractable,
particularly when we have large numbers of parameters, right?
You don't really have a good way to perform
this optimization exactly.
What we can do is fairly crudely
approximate this integral with
a point estimate for phi_i: namely,
the maximum a posteriori (MAP) estimate of those parameters.
This is of course a crude approximation, but
the MAP estimate is at least a better choice
for these parameters than other choices, because it has the maximal probability.
Now the question comes up: okay,
if we're going to make this approximation to
represent our objective at the top,
how do we compute the MAP estimate?
Well, it turns out that,
under specific conditions, gradient descent with early stopping
corresponds to MAP inference under a Gaussian
prior with mean at the initial parameters, and a variance that is
determined by the number of gradient steps and the step size.
This is exact in the linear case and approximate in the nonlinear case.
And so what this means is that
getting the MAP estimate by running gradient descent with early
stopping corresponds to the inner loop of the MAML objective.
You can then loosely view MAML as
approximating hierarchical Bayesian inference in this graphical model.
Of course, this involves several approximations:
one is using
the MAP estimate, and another is treating
gradient descent with early stopping as
MAP inference in the nonlinear case with neural networks.
But I think this interpretation is helpful for
getting some intuition for what these kinds of approaches are doing. Yeah?
What is the stopping [inaudible]?
The stopping criterion for early stopping?
In the case of MAML, we just pick a certain number of gradient steps;
we pick one gradient step or five gradient steps.
And the variance of this Gaussian prior
is determined by the number of gradient steps that you
use and the step size that you use for those gradient steps.
So, unlike typical neural network training,
we're just
picking the number of gradient steps, rather than choosing
a stopping criterion based on validation error, for example.
So you can essentially view
the initialization of these parameters as serving as
a prior in an actual Bayesian model.
Okay, so you can view this form of initialization
plus a few gradient steps as one form of implicit prior on the task-specific parameters.
And there are other ways to think about priors that
we could impose on the optimization process.
For example, instead of having this implicit
prior that's imposed by only doing a small number of gradient steps,
we could have an explicit Gaussian prior, where we're
actually regularizing the inner optimization to stay close to our meta-parameters Theta.
This corresponds to an explicit Gaussian prior
with mean Theta and a variance that is a function of lambda;
the regularizer is basically the form of
the log-likelihood of the Gaussian.
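A minimal numpy sketch of that explicitly regularized inner loop, assuming a simple gradient-descent inner solver on a user-supplied loss gradient (the function names are my own illustration):

```python
import numpy as np

def regularized_inner_loop(theta, loss_grad, lam=1.0, alpha=0.05, steps=200):
    # Inner optimization with an explicit Gaussian prior: minimize
    #   L(phi) + (lam / 2) * ||phi - theta||^2
    # so phi is pulled toward the meta-parameters theta, rather than
    # relying on early stopping to keep it nearby.
    # loss_grad(phi) returns dL/dphi.
    phi = np.asarray(theta, dtype=float).copy()
    for _ in range(steps):
        phi = phi - alpha * (loss_grad(phi) + lam * (phi - theta))
    return phi
```

With a quadratic loss, the result lands between the loss minimizer and theta, with lambda controlling how strong the pull toward theta is.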
Another form of prior we could use is to be even more
explicit, and actually
do Bayesian linear regression on top of learned features, representing
the mean and variance of
that Bayesian linear regression
as meta-parameters themselves.
So these are two forms of
gradient-based meta-learning algorithms that have
tried to place explicit priors on the inner optimization.
Another class of methods has looked at
having the prior be
imposed by the feature space that you're learning in.
Similar to this last approach,
which does Bayesian linear regression on the last layer,
there are also a number of approaches that run
an optimization on top of learned features,
such as performing ridge regression or
logistic regression on top of learned features in the inner loop,
or a support vector machine
on those learned features.
Essentially, these correspond to different inner loops
of the meta-optimization algorithm.
And then the meta-training process involves differentiating through
these inner loops, either by treating them as a closed-form optimization,
as a convex optimization problem that you can differentiate through,
or as other optimizations.
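For instance, the ridge regression inner loop has a closed form; here is a small numpy sketch (the function name is my own, and a real system would feed in features from a learned network):

```python
import numpy as np

def ridge_head(features, targets, lam=1.0):
    # Closed-form ridge regression "inner loop" on top of learned
    # features: w = (F^T F + lam * I)^{-1} F^T y.
    # Because the solution is a differentiable closed form, meta-training
    # can backpropagate through it into the feature extractor.
    F = np.asarray(features, dtype=float)
    y = np.asarray(targets, dtype=float)
    d = F.shape[1]
    return np.linalg.solve(F.T @ F + lam * np.eye(d), F.T @ y)
```

The regularizer lam keeps the solve well-conditioned in the few-shot setting, where the number of examples can be smaller than the feature dimension.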
Okay. This last approach,
which may be out of date now,
was as of a few months ago the state of
the art on few-shot image classification benchmarks.
Although, to do that, they introduced a number of bells and whistles
in order to reach that level;
it wasn't just
the approach itself.
In many ways, those bells and whistles are often important for getting
state-of-the-art performance on benchmarks.
Okay, cool.
So now let's go over a few of the challenges that come up with these types of approaches.
One challenge is that, as
we just talked about, with MAML you may need much deeper networks in order to
get an expressive gradient update
that can represent a number of different update rules to your parameters.
So how do we think about choosing
an architecture that is effective for MAML-like algorithms?
One idea comes from a paper that asked: can we do
neural architecture search on the MAML architecture such that meta-learning works well?
They called this AutoMeta, in the sense that you're doing both AutoML and meta-learning.
One of the interesting things about this paper is that they found that
highly non-standard architectures were actually effective for MAML,
in contrast to their effectiveness for
standard supervised learning problems.
For example, they found that deep and narrow architectures tend to work well,
which is different from the architectures
that work well in standard supervised learning.
So for example,
if you take MiniImagenet with the basic architecture,
that reaches around 63% performance, while
MAML with this optimized architecture saw an 11%
absolute improvement in performance, which is pretty substantial. Yeah?
[inaudible]
Like, specifically what the changes were?
It was
a fairly complex architecture, as many of these architecture search methods give you.
It was much deeper,
maybe two or three orders of magnitude deeper,
and from what I remember it was also narrower.
They also had some operations that are a bit nonstandard in neural networks,
something like one-by-one convolutions,
but it's been a while since I read the paper.
Okay, so another challenge that comes up is that you have
this bi-level optimization procedure that you need to perform for meta-learning,
and this can exhibit some instabilities,
particularly if you don't have
as much expressive power as you'd like in your architecture.
There have been a few different ideas for trying to mitigate this.
One approach is to automatically learn
the inner learning rate of the algorithm,
basically learning that alpha parameter that was in those equations.
Specifically, one of the things these papers found to
be particularly important here is to learn
a learning rate that is different for each parameter, or
different for each layer of the network.
This is because things like biases
and weights may want different learning rates:
biases may want larger learning rates,
weights may want smaller learning rates,
and you want to be able to decouple those choices
across the different layers so that they don't have conflicting optimization pressures.
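A sketch of that idea in numpy, with a separate meta-learned step size stored alongside each parameter group (the dictionary layout and function name are my own illustration):

```python
import numpy as np

def inner_step_per_param(theta, grads, alphas):
    # Inner gradient step with a separately learned step size per
    # parameter group: alphas mirrors the structure of theta and is
    # itself meta-learned, so e.g. biases and weights can adapt at
    # different rates without a single shared learning rate.
    return {name: theta[name] - alphas[name] * grads[name] for name in theta}
```

The same pattern extends to per-step learning rates: keep one such `alphas` dictionary for each inner gradient step.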
There's also an approach that tried to
tune the outer learning rate as well.
There are approaches that optimize
only a subset of the parameters in the inner loop,
such as
affine transformations of each of the layers.
There are papers that have looked at decoupling
the inner learning rate, as I mentioned before,
as well as the batch norm statistics, per gradient step,
so that you have a different learning rate,
or a different set of
batch norm statistics, for each gradient step.
And then lastly, there are some papers that have introduced context variables for
increased expressive power: these introduce additional
variables into the neural network that are appended onto the activations at each layer,
and allow the gradient steps to store information in those variables
in a way that doesn't interfere with the other parts of the network computation.
The main takeaway from these papers,
and the one that I think would be most helpful for
you, is that there's a range of simple tricks that can help the meta-
optimization process significantly.
Are there any questions on these challenges before I move on to the next set?
Okay. So one more challenge that I'd like to go into in a bit more depth is
that backpropagating through many inner gradient steps
is going to be very compute-intensive and very memory-intensive.
If you have one inner gradient step, or a few inner gradient steps,
it's generally quite practical to
backpropagate through them.
But if you have an extended optimization process in the inner loop,
then it's very challenging to
backpropagate through it in a way that doesn't
require storing the entire optimization process in memory,
and that ideally doesn't require backpropagating through
that entire optimization process at all.
There are two approaches that have been proposed for dealing with this.
The first is a very crude one:
just approximate the Jacobian of the task-specific parameters phi_i
with respect to the meta-parameters Theta as the identity.
This is a very crude approximation that basically corresponds to
a kind of truncated backpropagation, in some regard, where you just take the gradient at
the task-specific parameters and copy it over to Theta.
Somewhat surprisingly, this actually works pretty well
on a number of simple few-shot learning problems, like
MiniImagenet and Omniglot, which we've discussed in this class.
But anecdotally, from what I've found, it doesn't
work on more complex meta-learning problems,
such as meta-imitation learning problems.
Still, it's probably worth trying if you have
a setting where you
are compute-bound or memory-bound.
It's also, I think, aesthetically not as pleasing, because it's a bit of a hack:
we know that this matrix is not actually close to the identity.
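That first-order approximation is simple enough to sketch in numpy for the same toy 1-D linear model; the function name and toy loss are my own illustration:

```python
import numpy as np

def fomaml_meta_grad(theta, x_tr, y_tr, x_val, y_val, alpha=0.1):
    # First-order MAML: take the inner gradient step, then use the
    # validation-loss gradient evaluated at phi directly as the
    # meta-gradient, i.e. approximate d(phi)/d(theta) by the identity
    # instead of differentiating through the inner update.
    grad_tr = np.mean(2.0 * (theta * x_tr - y_tr) * x_tr)
    phi = theta - alpha * grad_tr
    return np.mean(2.0 * (phi * x_val - y_val) * x_val)
```

No second derivatives appear anywhere, which is exactly why this variant is cheap in compute and memory.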
So I guess one way to think about this is:
can we compute the meta-gradient without
differentiating through this optimization process,
and in a way that doesn't approximate this Jacobian as the identity?
This is where I want to go to the whiteboard to
discuss how we might do this.
As you remember from last time,
if you write down the meta-gradient,
you get a form that looks like a single backward pass at your
task-specific parameters, and then
a term that is differentiating through the update rule.
So you basically need to be able to compute the derivative of u
of Theta with respect to your parameters Theta.
This requires storing all of the iterates of the update rule
if you do it with standard backpropagation approaches,
and so what we'd like to be able to do is compute
this derivative without differentiating through
the entire optimization path that got you there.
The first thing worth noting
is that we're going to have to compute the
forward process of the update rule no matter what;
we're still going to have at least one full forward pass through the update rule.
But there are things that we can do
to mitigate the cost of the meta-gradient computation.
So first, let's
write down what this update rule looks like.
In particular, we're going to use the update rule that
has an explicit Gaussian prior on
the parameters,
and we'll see why this matters in a second.
Let's say that
Phi is equal to the output of our update procedure.
This update procedure
takes as input a set of training data points.
And let's say that this is equal to
the solution to some optimization problem
on the parameters with respect to our training dataset,
together with an
explicit Gaussian regularization that
tries to keep our task parameters close to the parameters Theta.
So this is just a Gaussian regularization with mean Theta
and a variance that's a function of Lambda.
Note that, in this case, we're going to be looking
at an inner optimization that is actually run to convergence,
actually taking the full arg min, rather
than just running one or a few steps of gradient descent.
And if we're going to do that, then having
this regularization term is really important, because
if we ignored this term, just initialized at Theta,
and found the minimum of this function,
that minimum wouldn't actually be a function of Theta.
And if Phi is not a function of Theta,
then we aren't actually imposing any
prior on the inner optimization process.
So that's
what this regularization is doing:
it's imposing this prior on the inner optimization.
All right, so let's refer to this function right here
as G of Phi prime and Theta,
where Phi prime is just our optimization variable.
If we actually find the arg min of this function,
then we know that the gradient of G with respect to Phi prime,
evaluated at Phi prime equals Phi, is equal to 0,
because we've found a local optimum, right?
So if we know this,
then we can plug in the gradient of G.
We know that the gradient of
the first loss term (I'm going to start dropping
the D train for notational convenience),
plus the gradient of the second term,
which is just Lambda times Phi minus Theta, is equal to 0.
And from there we can rearrange terms a bit.
Let's bring Phi to the left-hand side of the equation;
then we can write that Phi is equal to
Theta minus 1 over Lambda times the gradient of L at Phi.
Here we're just rearranging terms from
this equation right here.
Then, if we differentiate this with respect to Theta, because we're
ultimately trying to get d Phi d Theta up there,
we get that d Phi d Theta is equal to the identity,
which is just the derivative of the first term,
minus 1 over Lambda times the derivative of the second term, which is
the Hessian of L at Phi times, by the chain rule, d Phi d Theta.
Okay. And so, interestingly, what we can do here is
rearrange the terms a little bit and solve for d Phi d Theta,
and the result is going to be
something that doesn't depend on the optimization path;
it only depends on the final point of that optimization.
In particular,
let's put d Phi d Theta on one side on its own.
Moving the Hessian term over to the left, we get:
the identity plus 1 over Lambda times the Hessian of L at Phi,
all of that times d Phi d Theta, equals the identity.
So this is just moving that term over to the left.
And then what we get is that
d Phi d Theta equals
the inverse of that term.
One of the nice things about this is,
well, this is exactly what we wanted to compute,
and it only depends on the Hessian at the final optimization point.
That means this is something we can compute without actually
differentiating through the inner optimization process,
which is quite nice.
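Collecting the whiteboard steps (writing L for the training-data loss, with the D train argument dropped as in the lecture):

```latex
\phi = \arg\min_{\phi'} \; \mathcal{L}(\phi') + \tfrac{\lambda}{2}\,\|\phi' - \theta\|^2
\;\;\Longrightarrow\;\; \nabla \mathcal{L}(\phi) + \lambda(\phi - \theta) = 0
\;\;\Longrightarrow\;\; \phi = \theta - \tfrac{1}{\lambda}\,\nabla \mathcal{L}(\phi)

\frac{d\phi}{d\theta} = I - \tfrac{1}{\lambda}\,\nabla^2 \mathcal{L}(\phi)\,\frac{d\phi}{d\theta}
\;\;\Longrightarrow\;\;
\frac{d\phi}{d\theta} = \Big( I + \tfrac{1}{\lambda}\,\nabla^2 \mathcal{L}(\phi) \Big)^{-1}
```

Every quantity on the right-hand side is evaluated at the final point phi, which is what makes the meta-gradient independent of the optimization path.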
The assumptions that we made to get there are,
basically, that this inner solve is accurate,
that it gets to a point where the gradient is 0,
and, of course, that we can compute this inverse.
There are ways to estimate it
using the conjugate gradient algorithm,
which only needs Hessian-vector products.
And of course this term is just
something that looks like a Hessian;
the
actual meta-gradient that we'll compute is the gradient
that we derived last time, which corresponds to the
back-propagated vector at that optimization point times
this term right here. Yeah?
[inaudible]
Yeah. So, as I was mentioning,
you can use the conjugate gradient algorithm to estimate this inverted matrix term,
and the more iterations of conjugate gradient you run,
the more accurate your estimate of this term will be.
If you could run it for an
infinite number of iterations, you'd
approach an exact estimate.
If you do zero iterations of conjugate gradient,
you just estimate this term as the identity,
which is exactly
the first-order MAML algorithm that's written up there.
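A small numpy sketch of that solve, assuming we already have the validation-loss gradient and the training-loss Hessian at phi (a real implementation would use Hessian-vector products rather than forming the Hessian; the function name is my own):

```python
import numpy as np

def implicit_meta_grad(hessian, val_grad, lam=1.0, cg_steps=50):
    # Meta-gradient via the implicit function theorem: solve
    #   (I + H / lam) g = val_grad
    # with conjugate gradient, so g = (I + H/lam)^{-1} val_grad is
    # computed from the final optimization point only, never by
    # backpropagating through the inner optimization path.
    A = np.eye(len(val_grad)) + np.asarray(hessian, dtype=float) / lam
    g = np.array(val_grad, dtype=float)  # zero CG steps -> first-order MAML
    r = val_grad - A @ g
    p = r.copy()
    for _ in range(cg_steps):
        Ap = A @ p
        denom = p @ Ap
        if denom == 0.0:
            break
        step = (r @ r) / denom
        g = g + step * p
        r_new = r - step * Ap
        if np.linalg.norm(r_new) < 1e-10:
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return g
```

Initializing the solve at `val_grad` makes the zero-iteration case coincide with the identity approximation mentioned above, and each extra CG step trades compute for meta-gradient accuracy.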
Okay, so what does this give us?
This used what's called the implicit function theorem,
which is a way to differentiate functions implicitly
to get this form of the gradient.
Looking at this algorithm, one of the nice things you get is that
it's very memory-
efficient, since you don't have to store the optimization process.
And computationally, as you
increase the number of conjugate gradient steps,
it becomes more compute-intensive,
which lets you trade off how accurate
you want the meta-gradient to be against how much compute you want to spend.
In particular, here's what these plots are showing.
The first shows the GPU memory on
a very simple
meta-learning problem.
As we change the number of inner gradient steps,
both first-order MAML, shown in purple, and
implicit MAML, as this algorithm is called, shown in green,
are constant in memory,
because they don't
store the optimization procedure, whereas the full
MAML algorithm increases linearly in
memory as you increase the number of inner gradient steps. Yeah?
Since we derived this using the [inaudible],
does that imply that we want more inner gradient steps so that the
gradient corresponds more accurately?
Yeah. In theory you'd want more inner gradient steps in order to
make this more accurate.
In practice, we found that
you can still run the algorithm
without that condition holding exactly and get
gradients that are reasonably accurate.
And then the second plot shows computation time.
We can see that
for different numbers of conjugate gradient steps, shown in green,
gray, and red, there's an increase in the amount of computation.
With MAML, because you're computing the
full gradient,
you don't have an effective way to trade off how much computation you spend
against the accuracy of the meta-gradient.
Okay. The other
benefit of this approach is that you can use
second-order optimizers in the inner loop; you can
even include
basically non-differentiable optimizers in the inner loop, because this really just
depends on the
final point, not on the optimization process itself being differentiable.
So, for example, we used a Hessian-free optimization approach
in combination with this algorithm, and we were able to outperform methods that use
just gradient-descent-based inner optimizers.
Okay. The last thing worth mentioning about
this algorithm is that it's a very recent development,
so all the typical caveats with recent work apply:
not a lot of people have tried to
play around with this approach and fully
test its
capabilities and its limitations.
Okay. Any questions on this before moving on?
Okay. So I guess the takeaway for optimization-based methods is that
you build them by
constructing a bi-level optimization procedure,
where the inner optimization is something like gradient descent,
SVMs, or Hessian-free optimization procedures,
and then either differentiate through
the optimization procedure, or do something like we did
here in order to compute the meta-gradients.
The benefit of this type of approach is that you
get a positive inductive bias at the start of meta-learning:
because you already have this optimization procedure in the loop,
you can expect it to do something reasonable at initialization time,
in contrast to black-box approaches, where
the learned procedure is just some neural network.
Another positive note about
these procedures is that they're consistent,
in that the procedure you run at test time
corresponds to an actual optimization method.
As a result, they tend to extrapolate better to
learning problems that are
outside the distribution of what you've seen during meta-training,
and we'll get back to this point a bit more at the very end of the lecture.
They're also maximally expressive if you have a
deep enough neural network.
And one last thing:
they're model-agnostic, so they're pretty easy to combine with different model architectures.
Now, in terms of limitations,
they typically require second-order optimization,
either by differentiating through that optimization
or by doing something like this, where second-order terms still pop up.
As a result, they're usually pretty compute- or memory-intensive.
Okay. So now that we've talked about
embedding gradient-based optimization procedures into the inner loop,
one question you might ask is:
can we embed some other learning procedure into the meta-learning process
without requiring second-order optimization?
This is where non-parametric methods fit in.
So far, we've
been thinking about how we can learn parametric models,
but there's this whole other class of machine learning methods,
namely non-parametric methods,
that are simple and work very well in low-data regimes.
These are things like
nearest neighbors, for example.
If you have a small amount of data, these methods are actually
quite effective at learning.
And at meta-test time,
few-shot learning is precisely a low-data regime,
so these non-parametric methods are likely to perform pretty well.
But of course, during meta-training we still want to be
parametric, because we want to be able to scale to large datasets.
So the key idea of these types of approaches is: can
we use parametric meta-learners to produce effective non-parametric learners?
One
other side note here is that a lot of these non-parametric methods
preceded some of the parametric approaches that I've been talking about,
but for the sake of this lecture we're presenting them afterwards.
Okay. Um, so the key idea here is to use some sort of non-parametric learner.
Um, and one kind of non-parametric learner is to think about doing nearest neighbors.
So if you want to be able to perform this few-shot learning problem,
one very natural approach that actually may have-
someone may have mentioned this earlier in the course is to think about well,
well we how would we just like take this test datapoint,
and compare it to all the training dataset data points.
And look at each of these training data points and find the one that looks the most
similar and then return the one-
return the label corresponding to the one that looks the most similar.
Um, basically compare the test image with your training images.
Uh, now, the kind of the key questions that comes up is how do you compare them?
With what metric do you compare your test image to your training image?
And if you weren't using something like meta-learning,
what you might do with these types of methods is use something like
L2 distance between your datapoints.
Unfortunately, with images,
L2 distance works very poorly.
One really nice example of this: if you take
this query image on the right and compare it with the two images on the left,
L2 distance is going to return the image on the left,
and that doesn't correspond well with
perceptual or more semantic distances between images.
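To make this concrete, here's a minimal sketch of the naive baseline: 1-nearest-neighbor classification with raw L2 distance. The 4-pixel "images" and labels are made up for illustration.

```python
import numpy as np

def nearest_neighbor_predict(x_test, x_train, y_train):
    """Label x_test with the label of the closest training point under L2 distance."""
    dists = np.linalg.norm(x_train - x_test, axis=1)  # L2 distance to each training point
    return y_train[np.argmin(dists)]

# Toy example: two 1-shot "classes" in a 4-pixel image space.
x_train = np.array([[0.0, 0.0, 1.0, 1.0],
                    [1.0, 1.0, 0.0, 0.0]])
y_train = np.array([0, 1])
print(nearest_neighbor_predict(np.array([0.1, 0.0, 0.9, 1.0]), x_train, y_train))  # -> 0
```

On raw pixels this comparison is exactly what fails in the example above; the rest of the lecture is about replacing the raw pixel space with a learned embedding space.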
And so the key idea of these methods is instead of
comparing in the space of your observations,
can you learn how to compare using
your meta-training data in a way that is effective for new tasks?
Okay. So the first type of approach that we might imagine doing here
was proposed by Koch in 2015.
And what they did is they trained a Siamese Network to
predict whether or not two images are of the same class.
So you're essentially just learning to compare pairs of images,
and saying whether or not they're the same class or not.
And what this lets you do is learn a more semantic
distance between two images.
And so you could take these two images,
we know that in our meta-training dataset these are of different classes.
So the label for this Siamese neural network would be 0.
For this pair of images, these are two images from the same semantic class,
and so this would correspond to 1.
And you repeat this for different pairs of images in your meta-training dataset,
asking the neural network to predict whether or not they're from the same class.
Okay. So at meta-training time we are doing these pairwise comparisons.
And then if we want to be able to do few-shot classification at meta-test time,
what we do is compare the test image x_test
to each of the images in the training dataset for that task.
Basically just like we mentioned a few slides ago.
Uh, and then you output the label corresponding to the closest image.
So if your classifier outputs, for example, a probability of 0.9
for the third image, 0.2 for the second image,
and something lower than 0.9 for all the others,
you'd output the label of the image with the highest likelihood of matching your test image.
Okay. So this is pretty nice, and it's also really simple:
at meta-training time, we're training this binary classifier,
and then at meta-test time,
we're performing N-way classification by doing each of these pairwise comparisons.
Now, one thing you might ask is: well,
we talked a lot about meta-training and meta-testing,
and trying to match what happens at meta-training and meta-testing.
And here these are different procedures, right?
We're not actually training it to do N-way classification,
we're training it to do something else:
binary classification.
So is there a way that we can try to match what
happens during meta-training and what happens during meta-testing?
Such that we're training it to be good at N-way classification,
rather than training it to be good at binary classification.
So this was kind of the,
the key idea introduced in,
uh, the matching network paper.
And in particular if we're gonna be doing nearest neighbors at test-time,
in order to match our test query image to each of our training data points,
how about we train an embedding space such that
nearest neighbors produces accurate predictions?
Um, so here's an example of what this looks like.
So we take each of the images in our training dataset,
and embed them into a learned embedding space.
We then take our test query image and also embed it into that embedding space.
And we compare each of these embeddings to make a prediction.
So each of these black dots here corresponds to a comparison between the
test embedding and a training embedding.
And then we take the label corresponding to each of our training images,
and our prediction corresponds to weighted nearest neighbors:
basically the training labels weighted by their similarity scores.
Okay. And then, uh, once you do this,
you can then train your neural network end to end in order to
make effective predictions on your test data points.
So the particular architecture they used in this paper
was a convolutional encoder to embed the images, and they also used
a bi-directional LSTM to produce the embeddings of each of the training datapoints,
although in practice you could choose simpler things for each of these models.
As I mentioned, the model is trained end-to-end,
and most critically, meta-training mirrors what's happening at meta-testing.
During meta-training, you're training it to make
comparisons to all the images in your training dataset,
and at meta-test time you're doing the same thing:
making predictions for the N-way classification problem that you're going to be doing. Yeah.
[inaudible]
So in this case, I think the motivation was that
information about the different classes
can be shared with one another.
Like, if you're classifying between
four different types of dogs, for example,
versus between a dog and a cat,
the way that you represent your embedding might be different.
If you're trying to classify between dogs and cats,
your embedding of a dog should represent
something that is general to all types of dogs.
Whereas if you're trying to do a more fine-grained classification
between two types of dogs,
then you want that embedding space to be more discriminative based on the type of dog.
Yeah.
[inaudible]
Right. So because of the LSTM,
it does actually impose an order for this particular architecture.
In practice, there are
other non-parametric methods that aren't order-dependent,
and the next method that we'll talk about is not order-dependent.
I think that for this particular paper,
they chose the order arbitrarily.
Okay. So how do we actually go about training this?
So the general algorithm looks basically the
same as the algorithms that we were looking at before.
So if we take the algorithm corresponding to the
amortized, black-box approach,
and think about how you'd do this for matching networks:
we first sample a bunch of tasks.
We sample a train dataset and a test dataset for each of those tasks.
We then compute predictions using this learned similarity metric.
Uh, and note here that unlike the parametric methods,
we don't have these parameters, Phi.
They're essentially integrated out into this comparison.
And hence, it is a non-parametric approach.
And then once we have these predictions,
we update the parameters of this learned embedding
function with respect to
a loss function measuring how accurate our predictions are on the test set.
Although note here that I'm abusing notation a bit,
in that this loss function would be something like cross-entropy,
for example, and we'd use the predicted distribution over test labels,
rather than only the max-probability label.
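A rough sketch of the matching-network forward pass just described: embed the support set and the query, take a softmax over similarity scores, and output the attention-weighted label distribution. A linear embedding `theta` stands in for the convolutional encoder and bi-LSTM from the paper; in the actual training loop, you'd backpropagate a cross-entropy loss through this prediction to update the encoder. The toy task here is made up.

```python
import numpy as np

def matching_net_predict(theta, x_support, y_support, x_query, n_classes):
    """Matching-network prediction: softmax attention over embedded support points.

    theta is a linear embedding standing in for the learned encoder f_theta."""
    emb_s = x_support @ theta                      # embed support set
    emb_q = x_query @ theta                        # embed query
    sims = emb_s @ emb_q                           # similarity scores (dot products)
    attn = np.exp(sims - sims.max())
    attn /= attn.sum()                             # attention weights over support set
    onehot = np.eye(n_classes)[y_support]
    return attn @ onehot                           # predicted label distribution

# Meta-training would sample tasks, compute cross-entropy of this predicted
# distribution against the true query label, and update theta by gradient descent.
theta = np.eye(2)
x_support = np.array([[1.0, 0.0], [0.0, 1.0]])
y_support = np.array([0, 1])
p = matching_net_predict(theta, x_support, y_support, np.array([0.9, 0.1]), 2)
print(p.argmax())  # -> 0
```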
Okay. So any questions on how you would go about training matching networks?
Okay. So now, one thing we might think about is: well,
if we're doing one-shot classification where we have one example per class,
this is pretty straightforward,
because we're making comparisons to each of those classes.
But what if we're in the case where we have more than one shot?
If we have more than one datapoint per class,
then what matching networks is going to do is
perform these comparisons independently.
So if we're doing dogs versus cats, for example,
and we have two dogs and two cats,
it's just going to find the closest image,
and output the label corresponding to that.
Or, basically, it will do a weighted average of those,
looking at the two dogs independently
and the two cats independently in our training dataset.
And so one thing you might think about is:
well, maybe it makes sense to aggregate information per class in
a way that is smarter than just performing these independent comparisons.
Uh, and that's what,
uh, prototypical networks do.
So they think about how we can aggregate
class information to create a prototypical embedding of that class,
and then perform comparisons to each of
those prototypical class embeddings
in order to predict the label corresponding to our test image.
Okay. So what this more concretely looks like
is we'll have a number of images for different classes.
So here, different colors correspond to
different image classes in our training dataset for a particular task.
And then we embed those- each of our training images into this embedding space.
And then take the average in this embedding space
in order to compute the prototypical embedding for class 1,
class 2, and class 3.
And then we embed our test image into that same exact space,
compute the distance to each of those prototypical class embeddings,
and output the class to which it is closest in this embedding space.
So what this looks like in equations is: we embed
each of our images for a particular class
into this embedding space and then take the average of those embeddings.
This gives us the prototypical embedding c_k for class k. Then,
to compute the class of our test datapoint,
we take the distance between the embedded test datapoint and each of those class prototypes,
and perform a softmax over those
negative distances, in order to compute the class probabilities for the test datapoint.
Um, and then in this case,
d can correspond to Euclidean distance or cosine distance.
Or whatever your favorite distance metric is,
but computed in this learned embedding space.
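In code, the prototype computation and the softmax over negative distances might look like the following sketch. An identity function stands in for the learned encoder f_theta, and the 2-way, 2-shot toy task is made up for illustration.

```python
import numpy as np

def proto_net_predict(embed, x_support, y_support, x_query, n_classes):
    """Prototypical-network prediction: average each class's embeddings into a
    prototype c_k, then softmax over negative Euclidean distances to the query.

    embed stands in for the learned encoder f_theta."""
    z_s = embed(x_support)
    z_q = embed(x_query)
    protos = np.stack([z_s[y_support == k].mean(axis=0) for k in range(n_classes)])
    neg_d = -np.linalg.norm(protos - z_q, axis=1)     # negative distances to prototypes
    p = np.exp(neg_d - neg_d.max())
    return p / p.sum()                                 # p(y = k | x_query)

# 2-way, 2-shot toy task with an identity "encoder".
x_support = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
y_support = np.array([0, 0, 1, 1])
probs = proto_net_predict(lambda x: x, x_support, y_support, np.array([0.1, 0.1]), 2)
print(probs.argmax())  # -> 0
```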
Okay. So this is an algorithm that
you'll be implementing in Homework 2.
Are there any questions on how it works?
Okay. Cool. So this is prototypical networks.
Basically, what many of these approaches correspond to is:
embed your datapoints,
and then do nearest neighbors in that learned embedding space.
Now, one challenge that might come up is: well,
what if you want to reason about more complex relationships between datapoints,
rather than just doing nearest neighbors in your embedding space?
Um, in principle, if you have an expressive enough encoder,
uh, in your embedding space,
then nearest neighbors should be able to represent a wide range of complex relationships,
uh, particularly for high-dimensional embedding spaces.
But in practice, people have found it to be useful to think about
more expressive ways to perform these types of comparisons.
So for example,
relation networks basically take prototypical networks and
learn a non-linear relation module on top of those embeddings.
This corresponds to learning the function d in prototypical networks,
instead of using a Euclidean or cosine distance metric.
So it's learning both the embedding and the distance metric.
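To illustrate the idea of a learned distance, here's a sketch of such a relation module: instead of a fixed metric, a small MLP scores a pair of embeddings. The weights here are random placeholders just to show the shapes; in relation networks they would be meta-learned jointly with the encoder.

```python
import numpy as np

def relation_score(z_query, z_proto, W1, W2):
    """Learned relation module: concatenate two embeddings and pass them
    through a small MLP instead of using a fixed distance metric."""
    h = np.maximum(0.0, np.concatenate([z_query, z_proto]) @ W1)  # ReLU hidden layer
    return h @ W2                                                  # scalar relation score

# In relation networks, W1 and W2 are meta-learned jointly with the encoder;
# here they are random placeholders.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=8)
s = relation_score(np.array([0.1, 0.2]), np.array([0.1, 0.3]), W1, W2)
print(np.ndim(s))  # -> 0, i.e. a single scalar score per (query, prototype) pair
```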
Another approach is, instead of having a single prototype per class,
to have a mixture of prototypes per class.
This allows you to represent
more multimodal class distributions.
For example, if you have the class dog,
maybe dogs are often seen either on snow or on grass,
and you want to be able to represent both of those modes of your class in your embedding space.
It may be easier to allow your embedding space to have a multimodal distribution,
rather than trying to warp your embedding space in
a way that puts them all in the same part of it.
And then lastly,
another paper has looked at whether we can
embed all of our datapoints,
and then do some sort of message-passing
scheme to reason about how these different
datapoints relate to each other,
in order to make the predicted output.
It uses graph neural networks to
perform this message passing, and differentiates through it.
Okay. So that's mostly it for non-parametric methods;
they're quite simple.
We'll talk a bit about some of the takeaways of these methods
in the last part of this lecture,
where we talk about how we can compare these approaches.
Okay. Um, and so,
I guess, as a more meta-point, we have all these algorithms,
they're really these kind of three classes of approaches that we've talked about;
black-box adaptation, optimization-based approaches, and non-parametric approaches.
Uh, and so how should we think about how these different methods compare?
Uh, I think that there are a few different ways to think about this.
And so I'll present two different ways,
uh, to think about this.
And the first is the computation graph perspective:
what do these different algorithms look like as different computation graphs?
We kind of talked about this viewpoint earlier,
where the black-box approaches
represent this computation graph in a completely black-box way.
Whereas the optimization-based approaches,
you can view them as embedding an optimization into your computation graph.
Uh, and for non-parametric approaches,
you can also take this view.
And in particular, what the computation graph will look like
for prototypical networks,
for example, is something that takes your test datapoint,
embeds it, and compares it to each of your per-class prototypes,
where those per-class prototypes are computed
using the embedding of each of
that class's datapoints.
So you can essentially just view it as another kind of computation graph,
where we're embedding the soft nearest neighbors into the computation graph.
Okay. So with this view, um,
we can also think about how we can mix and match components of the computation graph,
uh, to get hybrid types of approaches.
Yeah.
[BACKGROUND]
Yeah.
So you could essentially view all of these methods as a computation graph.
And the optimization-based methods and non-parametric methods are
essentially imposing a certain type of structure inside your computation graph,
corresponding to things like gradient descent and nearest neighbors,
whereas black-box methods tend to not impose any structure on that process.
And there is a bit of a gray line between
which computation graphs look more like
non-parametric methods versus which look more like black-box methods.
I think it's helpful to think about
this classification of methods,
because it allows us to think about certain properties of these methods.
Although in practice, um,
there isn't a very clear cut line between them.
Yeah. [BACKGROUND]
Yeah, that's a good point. So it should be over x y such that y equals k. Yeah,
exactly. And I'll try to fix that.
I noticed that on the previous slide and I'll try to fix that on the,
um, before we post the slides online.
Okay, is there a question in the back? All right. So, again,
because we can think of these as computation graphs, it's also
pretty easy to think about how we might try to mix and match components of these.
So one approach, which is a bit of a hybrid of black-box
and optimization-based approaches,
or maybe of optimization-based and non-parametric approaches depending on how you view things,
is an approach that conditions a model on an embedding of
your training dataset and also runs gradient descent on that model.
In principle, these two sources of information, conditioning on
the data directly as well as using it through
gradient descent, could potentially be redundant,
although in practice it seems like this method found some benefit in doing both.
Another idea you could pursue is to have some sort of
embedding of your function and then do gradient descent in that embedding space.
In particular, this paper uses a relation network
to embed your training dataset and reason
about how different datapoints relate to one another.
It then decodes this embedding into the parameters of
a neural network that makes predictions about new datapoints.
And then instead of running gradient descent on the parameters of that function,
they run gradient descent in the learned embedding space Z,
uh, which produces, uh, different functions.
So you can essentially view this as running gradient descent in
a lower-dimensional space of your weights, rather
than running gradient descent in the original space of your weights. Okay?
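A toy sketch of that last idea: gradient descent in a low-dimensional latent space z that is decoded into weights. The decoder, loss, and finite-difference gradients here are all simplifications for illustration; the actual paper learns the decoder and differentiates analytically.

```python
import numpy as np

def adapt_in_latent_space(z0, decode, loss_fn, lr=0.02, steps=20):
    """Run gradient descent in a low-dimensional latent space z; each z is
    decoded into model parameters, so we adapt far fewer numbers than the
    raw weight space. Gradients are taken numerically for simplicity."""
    z = z0.astype(float).copy()
    for _ in range(steps):
        grad = np.zeros_like(z)
        for i in range(len(z)):          # central finite-difference gradient w.r.t. z
            dz = np.zeros_like(z)
            dz[i] = 1e-4
            grad[i] = (loss_fn(decode(z + dz)) - loss_fn(decode(z - dz))) / 2e-4
        z -= lr * grad
    return z

# Toy setup: decode a 1-D latent into 3 weights; the loss prefers weights near 1.
decode = lambda z: np.array([z[0], 2 * z[0], 3 * z[0]])     # hypothetical decoder
loss = lambda w: np.sum((w - 1.0) ** 2)
z_star = adapt_in_latent_space(np.array([0.0]), decode, loss)
print(round(float(z_star[0]), 2))  # -> 0.43 (minimizer of the decoded loss is z = 3/7)
```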
And then the last approach that I'll cover
looked at doing something exactly like MAML,
but initializing the last layer of the network to correspond to prototypical networks.
So it's basically a particular choice of
the network architecture for MAML that
initializes it to do a comparison-based prediction.
Okay, so that computation graph view is one way to think
about how these different algorithms compare.
And the other way that I like to think about how these algorithms compare is to think
about the different properties of the individual algorithms.
And in particular, I think there are two properties that are really important as
we start to think about developing new meta-learning algorithms.
The first is thinking about expressive power of these algorithms.
So we talked about this a bit before: it's basically
the ability of that function f that I showed on
the previous slide to represent a range of learning procedures as a function
of your training dataset.
And the reason why this is important is that
as you get larger and larger meta-training
datasets, you'll be able to produce a more flexible range of learned optimizers.
And as you apply these types of algorithms to more challenging optimization problems,
you'll be able to do better than the kind of
the standard optimization procedures we have today.
So essentially it has to do with scalability, and where these methods will
end up in the future. If you can only represent
a small class of algorithms, then you may not be as
effective when moving towards broader meta-training datasets.
Okay so this is the first property and the second property which I
alluded to very briefly before is the property of consistency.
And in particular, what I mean by consistency is
that the meta-learning algorithm will produce
a learned learning procedure that will solve
the task given enough data, regardless of the properties of that task.
What I mean by a consistent learning procedure is one that will
asymptotically solve the task given enough data for that task.
And so for example things like gradient descent correspond to
a consistent learning procedure because we're just running gradient descent at test time.
And you can expect that,
given enough data for that test task, you'll be able to
solve the task regardless of what your meta-training data was.
Now, why is this
important? Well, first, getting meta-training data
that corresponds- oh, I'll get to your question in a second.
First getting meta-training data that
corresponds closely to what you'll be seeing at test-time is pretty hard.
So we haven't really talked about this much but
we've been assuming that we have this meta-training dataset.
And we can use this meta-training dataset in a set of
tasks in order to do well at new tasks.
But in practice, how do we determine what those tasks actually correspond to?
This is actually a really hard problem as we think
about where we are going to be applying these algorithms.
And so if we produce a consistent learning procedure, then we
can expect it to do something reasonable on tasks that aren't necessarily,
uh, especially close to the meta-training tasks that we trained it on.
And we can also basically get better out-of-distribution
task performance. There's a question.
[inaudible]
Yeah, so I guess the question was:
can't we have a catastrophic initialization
such that gradient descent doesn't actually give us a good solution?
And the short answer is yes.
One thing we can
assume with gradient descent is that we'll at least get to a local optimum.
Whether that local optimum is good or not is
another question, and it could be that we have an initialization
that puts us in a basin where the local optimum is actually very bad.
Um, so absolutely and
that's something I think that people haven't thought about quite as much yet.
And so thinking about how we might
try to tackle that sort of problem, and whether it actually happens in practice,
whether we actually get to local optima that are bad, is kind
of another question.
[inaudible]
Something like that. Yeah yeah.
So there are other ways to try to think about how
you might formalize this problem as well.
And something like basically monotonic improvement could be one of them as well.
Uh, yeah.
And I guess theoretically thinking about that sort of problem may also
draw very closely to some of the theoretical questions in deep learning in general,
like are local optima good with high probability,
things like that, and people have
looked at that sort of problem in the past.
Okay. And one thing I was going to mention here is:
recall that, in practice, things like gradient descent
run at test-time
do empirically tend to hold up with regard to getting better
out-of-distribution task performance, in comparison to black-box adaptation approaches.
Okay. Um, and so in my mind these are the kind of the two properties that are
most important for many different applications of meta-learning.
Not necessarily on the benchmarks that we have, because the benchmarks define
a nice set of meta-training tasks and
meta-test tasks that closely align with them.
But in practice if we're thinking about applying these algorithms on real problems,
these are the two properties that I think that we're gonna care about the most.
Okay. Um, and so let's think about how these- how
the different algorithms compare on these different axes.
So for black-box approaches, we know that they have complete expressive power,
in the sense that things
like recurrent neural networks are universal function approximators.
And we also know that they're not consistent:
if you aren't imposing any structure on
your black-box function,
then there's no guarantee that it will produce anything that is consistent.
Okay. For optimization-based methods, we know that
the procedure reduces to gradient descent,
at least for things like MAML.
And so in that sense, they're consistent,
for some definition of consistency with regard to things like monotonic improvement.
We also know that they are expressive if you have
deep enough models.
And in practice, we've found these methods to perform
well in settings where you
do want to be fairly expressive with regard to few-shot learning algorithms.
Ah, I put an asterisk here because this
actually doesn't hold up in some reinforcement learning settings,
and we'll potentially talk about this a bit later in the course.
It mostly holds in supervised learning settings.
Ah, it also depends on the particular reinforcement learning algorithm that you use.
So it's a bit of a nuanced thing that we'll discuss later.
And then with regard to non-parametric approaches,
these methods are expressive for most architecture choices.
For example, if you're using things like LSTMs, there's a wide range of
functions that you can represent,
although there's a bit of nuance
depending on the types of learning algorithms that you might want to learn.
And they're also consistent under certain conditions.
So they are consistent in the sense that, if your embedding
is not losing information
about the inputs
that's important for making decisions,
then as you accumulate more and more data,
asymptotically you'll eventually
see a datapoint that's
arbitrarily close to your test datapoint,
and then be able to make
the correct prediction for that test datapoint.
Okay. So beyond these properties, there are also other properties that are pretty
important to think about with regard to different applications.
Things like
being really easy to combine with a variety of learning problems.
This is true for black-box approaches, because it's
really easy to basically just
plug different loss functions or
different optimization procedures into these types of architectures.
The downside, as I mentioned
last time, is that it does involve a challenging optimization, in
that there's no good inductive bias at initialization to point it in the direction of
a real optimization procedure.
And as a result, they are often fairly data inefficient
because you have to learn how to learn completely from scratch.
Okay. With regard to optimization-based methods, as we talked about earlier,
we have this positive inductive bias at the start of meta-learning, because
we're initializing it with a real optimization algorithm.
One thing I didn't mention before is that they can handle
varying K and large K relatively well:
if you have more data than what you trained on, for example,
these approaches still tend to work well because they're consistent.
And they're also
model-agnostic, in the sense that you can plug in different architectures and apply them,
conceptually, without any difficulty.
The downsides, as I mentioned before:
it involves second-order optimization, and it's
usually compute-intensive and memory-intensive.
And these two points are quite
important for a range of applications where you care a lot about compute
and memory, particularly when you're scaling to large datasets.
Okay. And then with regard to non-parametric methods,
we haven't covered the pros and cons
of these yet, other than these two.
So the first is that
one of the nice things about these methods is that they're
entirely feedforward architectures;
they don't involve any
backpropagation within the computation graph.
And so as a result,
they tend to be very computationally fast and very
easy to optimize, in contrast to architectures that involve recurrence,
or gradients flowing backwards, etc.
And then some of the downsides of these approaches are that
they're hard to generalize to varying K. This is more
of an empirical observation that people have made:
if you test them on a larger K than what they were trained on,
they tend to underperform what
other algorithms are able to achieve.
It is also harder to scale these to very large datasets
at test-time, because they're non-parametric.
And so far these methods have also been limited to classification.
In principle, you could also apply them to things like regression,
but with the caveat that,
if you naively apply these approaches to regression,
you could only interpolate between the labels that you saw in your
task-specific training set,
because we're just doing a weighted average of those labels at test time.
Okay. And then at a higher level,
it's worth mentioning that, generally, well-tuned versions of each of
these algorithms tend to perform comparably on existing few-shot benchmarks,
and as I alluded to before, various bells and
whistles, like using ensembles or tuning the architecture,
are often the differentiating factor between these methods,
rather than the actual underlying method itself.
Um, this likely says more about the benchmarks than about the approaches themselves.
And I think that in many cases,
which method you want to use will depend heavily on your use case:
whether or not you care about things like consistency,
expressive power,
computational efficiency, etc.
Okay. Any more questions on these- kind of how these algorithms compare?
And when you might use one versus the other?
Okay. So I guess that's mostly it for today.
So, to recap:
we talked about two algorithmic properties,
and we also talked about a computation graph perspective.
One third property that's useful to
think about is uncertainty awareness.
By this I mean the ability to reason
about ambiguity during the learning process at test time.
Ah, and the reason why this is important is
that if you want to do things like active learning,
or have calibrated uncertainty
estimates when you're learning from small amounts of data,
or if you're in reinforcement learning settings and you want to
reason about what data you should collect in order to,
to reduce your uncertainty about the task,
then you need to have some notion of, of your uncertainty.
Ah, and uncertainty comes up especially in few-shot learning problems,
where you only have a small amount of data and your prior can't necessarily make up for,
ah, what the true task is.
And the other place where this comes in is
that we talked about this really nice Bayesian motivation at
the beginning of the course,
in the second or third lecture, and
we've since moved to fully deterministic approaches.
And in the next lecture, we'll talk about
more principled Bayesian approaches that get back to that initial motivation,
give us more calibrated uncertainty,
and allow us to think about how we could collect more data to reduce our uncertainty.
Ah, and we'll discuss all of those things on Monday.
On Wednesday we have
student presentations again,
covering various
algorithms and extensions of the things that we've been talking about.
And then a few more reminders: Homework 1
is due on Wednesday.
Please fill out the poster
presentation preferences for the dates;
we need to know when you're available.
And information about the course project is online. I'll see you on Wednesday.
