Welcome to the second lecture.
First we'll cover just some logistics.
So, uh, Homework 1 will be posted today, uh,
and it's gonna be due Monday,
October 7th and we'll cover actually some of the topics that will be,
uh, in Homework 1 today.
Uh, so pay attention if you wanna be able to do Homework 1.
Uh, fill out preferences for papers by tomorrow so that we can start,
uh, assigning papers to you.
We'll- we'll assign papers as quickly as possible especially for
the first group that's presenting on Thursday next week.
Uh, also a reminder that the TensorFlow review session is tomorrow
at 4:30 PM in Gates B03, that's gonna be led by, uh,
and then lastly, for enrollment,
if you filled out the form and, uh,
have not received a permission code, then at this point,
unless you filled out the form like a minute ago,
I think we've handled, uh,
basically all of the cases so just send us, um,
send the staff an email or a post on Piazza asking about that.
And we can, uh,
look at that on a case by case basis.
If you need to- if you still haven't filled out the form please do fill it out.
Although at this point, um,
maybe let us know if you fill it out because we may not be
monitoring it quite as frequently.
Okay, any logistical questions before we- yeah.
What time is the homework due? [NOISE]
The homework will be due midnight on Monday October 7th
and the date will also be posted- the time and date will be posted on the assignment.
Okay, great, so the plan for today,
uh, we actually have a lot of material to cover.
So, uh, first we'll be talking about some of the basics of multi-task learning.
Um, things like the type of models and architectures that these algorithms
typically use as well as how we
actually go about the training process for multi-task learning.
Uh, some of the challenges that arise when you try to train across multiple tasks and,
um, a case study of actually trying to use
multi-task learning in a real-world application.
Uh, and some of the challenges that come up, and some of
the results from it.
Uh, then we'll have a short break around like two minutes or so.
Uh, and then we'll be covering, um,
some topics- uh, some kind of basic topics in meta-learning.
We'll- we'll get into more advanced topics in future lectures.
This will include things like the meta-learning problem formulation, uh,
a general recipe of meta-learning algorithms, uh,
and kind of how you go about building these types of algorithms.
And then lastly we'll talk about a specific class of
meta-learning algorithms
that I'm gonna refer to as black-box adaptation approaches.
Uh, and these last two things, uh,
actually how you go about setting up a meta-learning algorithm,
and implementing a simple approach to meta-learning will be the topic of Homework 1.
Okay, so let's get started,
um, with multi-task learning.
So and actually- before we even get into multi-task learning let's look at some notation.
So, uh, say we have some neural network.
The neural network is gonna take in some input x and produce some output y.
Uh, and the input might be something like a picture.
Uh, maybe the output is the class corresponding to that picture,
uh, the class of the animal for example.
Another example of an input might be like the title of a paper for example,
maybe you're, uh, trying to decide which papers
to read or which papers that you wanna review.
And then you kind of want to decide whether or not the paper's gonna be too long or not.
Um, this is essentially a variety of classification type problems you might imagine.
Um, the weights of the network we'll use,
uh, theta to denote the weights.
And then you can basically view this neural network as
producing a distribution over outputs y given the input x.
Of course this could also be a deterministic function;
viewing it as a distribution is a generalization of a deterministic function.
Okay, um, so single-task learning. This should mostly be review for all of you.
In single-task learning,
at least the supervised single-task learning problem, we assume some dataset
where we have input-output pairs.
Uh, we probably have many input-output pairs in that dataset.
And our goal is to minimize some loss function that is a function of,
uh, the parameters and the data set.
We want to minimize that loss function with respect to the parameters.
And so one typical loss function
that we may use is the negative log-likelihood.
So our loss function is going to correspond to the negative of the expectation, over
the data points in our dataset, of the log probability
of the labels under our neural network predictor given the input.
Okay so this is kind of standard single-task learning.
The way that you would typically optimize this negative log-likelihood
is by back-propagating into the parameters of the network.
And running something like SGD or your favorite
optimizer: Adam, Adagrad, momentum, whatever.
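As a rough illustration of the negative log-likelihood just described, here is a minimal sketch in plain Python. The names `softmax`, `negative_log_likelihood`, `toy_network`, and `toy_dataset` are made up for this example and are not from the lecture; the "network" is a fixed two-class scorer standing in for a real model.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def negative_log_likelihood(predict_logits, dataset):
    """Average of -log f_theta(y | x) over the (x, y) pairs in the dataset."""
    total = 0.0
    for x, y in dataset:
        probs = softmax(predict_logits(x))
        total -= math.log(probs[y])
    return total / len(dataset)

# Hypothetical two-class "network": class 0 is favored when x > 0.
def toy_network(x):
    return [x, -x]

# Hypothetical dataset of (input, class label) pairs.
toy_dataset = [(2.0, 0), (-1.0, 1), (0.5, 0)]

loss = negative_log_likelihood(toy_network, toy_dataset)
chance_loss = negative_log_likelihood(lambda x: [0.0, 0.0], toy_dataset)
print(loss, chance_loss)
```

A predictor that always outputs equal scores gets a loss of log 2 on a two-class problem, so any useful predictor should come in below that.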
Okay. So now, what actually is a task?
Uh, we talked about this a little bit, uh,
last time on Monday and we'll
try to talk about this a little bit more formally this time.
So, uh, the way that we're gonna define a task,
for the sake of
basically the first half of this course before we get to reinforcement learning,
is something that looks like this.
Where we're gonna have some distribution over inputs.
A distribution over the labels given the inputs and a loss function.
Uh, and essentially these two distributions P are gonna
correspond to the true data generating distributions.
Um, and so of course we don't actually have access to these data
generating distributions, but they allow us
to kind of reason about what we're gonna be seeing at training and test time.
In particular what we're gonna be seeing, uh,
the kind of things that we will have access to, um,
is we'll have a training set and a test set for each task.
Uh, and we'll assume that both of these datasets are drawn from
the same data generating distributions p_i.
So essentially, different tasks may vary based off of the input distribution,
the label distribution given the input, or the loss function.
Okay. Um, one notational note
is that in future slides I'll be
using D_i as shorthand for the training dataset of task i.
And yeah we'll get to that in a couple of minutes.
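The slide with the task definition is not captured in the transcript. A reconstruction that matches the spoken description (a distribution over inputs, a distribution over labels given inputs, and a loss function), with the superscripts on the datasets being an assumed notation, would be:

```latex
\mathcal{T}_i \triangleq \{\, p_i(x),\; p_i(y \mid x),\; \mathcal{L}_i \,\},
\qquad
\mathcal{D}_i^{tr},\ \mathcal{D}_i^{test} \ \text{drawn from}\ p_i,
\qquad
\mathcal{D}_i \equiv \mathcal{D}_i^{tr}
```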
Any questions on kind of the basic setup? Yeah.
[inaudible]
Oh that just means that we're- we're defining the task to
be equal to the thing on the right.
Okay so, um, this is our task.
For each task we have corresponding datasets
sampled from some distribution.
Let's look at some examples of what different tasks may be.
In particular what different sets of tasks may be.
What- what kind of different problems they correspond to.
So for example, uh,
we may have a multi-task classification problem,
um, where we want to be able to perform different classification tasks.
Uh, and in this case,
the loss function will be the same across all of the tasks.
The loss function will be something like cross entropy loss for all the tasks.
Um, but the- the inputs like the images that you're seeing for each task or
the labels corresponding to those inputs may vary across different tasks.
For example, maybe different tasks correspond to
being able to recognize handwriting
on a per-language basis.
Maybe one task is to be able to recognize
characters corresponding to Braille, for example,
while another corresponds to another language.
Um, another multitask classification task might be like a personalized spam filter.
Uh, the emails that I get are probably very different than
the emails that
a high-schooler gets, for example,
or a professional athlete.
And so the types of spam will be different.
Uh, so classifying spam for one person could be
one task and classifying spam for another person could be a different task.
Um, in that case, the loss function is the same across all tasks.
The p(x) is different because the distribution of emails
that each person receives will be different,
so you'll have a different p_i(x) for each person.
And also, something that might be spam for
one person might not be spam for the other person,
and so you may have a different p_i(y | x).
Okay, um, so that's- that's one example.
Another example, uh, of
a multi-task learning problem is what's called multi-label learning.
In this case the loss function
and the distribution over the inputs are gonna be the same across tasks.
And what's gonna be different is that you want to predict
different labels, to basically
perform different kinds of output tasks,
given your input image.
So one example of this is attribute recognition.
So if you have a dataset of faces, the CelebA dataset for example,
one task might correspond to detecting whether or not the person is wearing a hat.
Another one might correspond to detecting their hair color, things like that.
So there p(x) is the same.
The loss function is the same, but p_i(y | x)
is different because you have different binary classification tasks.
Um, another example is something like scene understanding, where maybe
given an image you want to be able to predict
the depth of that image,
different key points,
and the surface normals of that image as well.
So this is a common problem in computer vision.
Um, so both these examples were settings where
the loss function was actually consistent across all tasks,
and what was varying was the data distribution.
So what are settings where the loss function might vary across the tasks?
So one setting is maybe one of your tasks
corresponds to predicting a discrete variable whereas
another one corresponds to predicting a continuous variable.
In that case,
one task might have
a cross-entropy loss function whereas another might use mean squared error,
or something more sophisticated than that.
A different way to represent the likelihood, essentially.
Another setting where it might vary is
if you care more about one task than another task.
Maybe you care more about
accurately predicting the surface normals than about
the depth, for example.
Then you might have a loss function weight
for one task that's higher than the weight of another task.
Okay. So here are some examples of
different multi-task learning problems that we might want to solve.
How do you actually go about solving them?
So, ah, we can go back to our neural network.
Ah, and if we want to- if we want our neural network to do multiple different things,
then we need to tell it what it should do.
And so in some ways we kind of- we need to condition on the task in some way.
And so what we're gonna have is we're gonna have some task descriptor Z.
And we're going to pass this task descriptor into the neural network in some way.
Ah, and so I'm just gonna draw an arrow towards the neural network to indicate that.
And in a minute we'll talk about different ways
that you can condition on that task descriptor.
And of course, now our neural network is not gonna be just a function of x
but it's also going to be a function of this task descriptor Z_i.
Okay. So for example,
in this case we can go back to
the paper title example, maybe x corresponds to paper titles.
Uh, and now instead of- oh, it's back.
Okay. So maybe
instead of trying to figure out
what paper you wanna read,
you now need to review the paper,
and produce a review for a conference and maybe you're a bit lazy.
So you, you ask your neural network to predict a summary of the paper for you.
So that you can then base your,
uh, review on the summary.
Or maybe, maybe you're even lazier,
and you just want to output the paper review directly.
[LAUGHTER] Uh, hopefully, these aren't the reviewers for your papers.
Uh, but, um, yeah.
Okay. So what is this task descriptor thing?
The very- the simplest example of what this task descriptor might be
is just a one-hot encoding of the task index,
uh, just a vector that tells you, uh,
if it's task 1, task 2, task 3, etc.
This is just kind of an index,
uh, corresponding to the task.
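A one-hot task descriptor is simple enough to write down directly. This is a small sketch; the function name `one_hot` and the zero-based task indexing are choices made for this example, not something fixed by the lecture.

```python
def one_hot(task_index, num_tasks):
    """Return z_i: a vector with a 1 at the task's index and 0 elsewhere."""
    if not 0 <= task_index < num_tasks:
        raise ValueError("task_index must be in [0, num_tasks)")
    return [1.0 if j == task_index else 0.0 for j in range(num_tasks)]

# Descriptor for the second of three tasks (zero-based index 1).
z = one_hot(1, 3)
print(z)
```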
Uh, but more generally, it can also be, uh,
whatever meta-data that you have about that task.
Um, if your tasks correspond to personalization,
then the task descriptor might include information about that person or that user.
Uh, or
maybe you have a language description of the task you wanna perform,
so you could provide that language description to the network.
Uh, or maybe you have some formal specification of the task. Maybe you're
trying to design circuits, like was mentioned on Monday.
Then you might have a formal specification of the type
of circuits that you want your neural network to be producing.
Okay. So this is the task descriptor.
And then, our objective now becomes,
uh, basically the same as what it was before,
but now we're going to be summing over all of the tasks that, um, that we have.
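The objective on the slide is not captured in the transcript. A reconstruction consistent with the description (the single-task objective, now summed over tasks), assuming T tasks and a network f_theta(y | x, z_i) that is conditioned on the task descriptor, would be:

```latex
\text{single task:}\quad \min_{\theta}\ \mathcal{L}(\theta, \mathcal{D})
\qquad\qquad
\text{multi-task:}\quad \min_{\theta}\ \sum_{i=1}^{T} \mathcal{L}_i(\theta, \mathcal{D}_i)
```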
Okay, so this is the basic setup.
Um, now we have a couple design decisions to make.
So we have a model and we have an objective.
Uh, and so we have a decision for both of them.
The first is, how should we design the model and the second is,
how should we optimize the, uh, the objective.
So, um, in particular the big model decision that we have to make in multi-task learning
is how do we condition on this task descriptor z_i, and the
algorithm decision, like I mentioned, is how do we optimize our objective.
So let's first talk about this,
um, this first question.
So, um, this is all about conditioning on the task.
Let's assume that
z_i for now is just gonna be the task index,
like a one-hot encoding of the task index.
And the only information we know about our task is this index, basically.
So I have a question for you which is,
how should you condition on Z_i, uh,
in order to share as little as possible between the tasks in your neural network model?
Essentially, how should you structure your neural network model such that it
shares basically as little as possible between the different tasks. Yeah.
[inaudible]
Yeah, exactly. So what you can do is you can have separate networks for
each task, where
each of them has completely separate weights.
And then the way that you condition on your task descriptor is
you pick the output of the corresponding network, which corresponds to this sort of
multiplicative gating where you're just selecting which of the outputs you're
going to produce based off of your task index.
Um, this is basically, uh,
a complicated way to say that we're just gonna be training independent networks.
Uh, and it's kind of one way to represent
independent networks within a single model essentially.
Okay. Um, so we get independent training within a single neural network,
uh, and there's no shared parameters across tasks.
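To make the "independent networks inside one model" picture concrete, here is a minimal sketch of multiplicative gating. The three "networks" are hypothetical one-parameter linear functions chosen only so the arithmetic is easy to follow; `make_linear`, `task_networks`, and `gated_output` are names invented for this example.

```python
def make_linear(weight, bias):
    """A stand-in for a per-task network with its own, unshared parameters."""
    return lambda x: weight * x + bias

# One separate network per task; no parameters are shared between them.
task_networks = [
    make_linear(2.0, 0.0),
    make_linear(-1.0, 3.0),
    make_linear(0.5, 1.0),
]

def gated_output(x, z):
    """y = sum_j z[j] * y_j: a one-hot z selects exactly one network's output."""
    outputs = [net(x) for net in task_networks]
    return sum(z_j * y_j for z_j, y_j in zip(z, outputs))

z_task1 = [0.0, 1.0, 0.0]          # one-hot descriptor for task 1
y = gated_output(4.0, z_task1)      # same as task_networks[1](4.0)
print(y)
```

Because the gate zeroes out every other network's output, the loss for a task only sends gradient into that task's own network, which is why this amounts to independent training.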
Now, what about the other extreme.
So the other extreme would be
something like this where we take our neural network and we just
concatenate the task index somewhere either at the activations or at the input.
Um, and then train the network as normal.
Um, and in this case,
all of the parameters are shared except for one tiny part of the parameter vector.
What is the, the kind of tiny part of the parameters that's not shared?
[inaudible]
Sorry, repeat that.
[inaudible]
Yeah exactly. So basically the weights that are
right after the z_i.
If you have a fully connected layer right after z_i concatenated with your features,
the part of that weight matrix that
multiplies z_i will have
different components
that are not shared across the tasks.
Other than that part of that matrix,
all the other parameters are shared. Yeah.
There's also kind of forces [inaudible]
Yeah. Yeah, exactly. So in this case,
we assume that all the inputs have the same size,
the same, uh, dimensions.
One thing that you could do,
if different tasks have different input sizes,
is to have some sort of
recurrent neural network
or some sort of attention-based model that basically
aggregates over the variable dimension; if one of them is time, for example,
it aggregates over that.
Whereas maybe some tasks are text and others are images.
In that case, you would probably wanna have different
first parts of that network to take in those different modalities of data.
And we'll show- we'll see like an explicit example of,
of how that has been done in the past. Yeah.
So [inaudible]
So this is, yeah, this is a good question.
It's a fairly nuanced point.
So basically
you can view
a fully connected layer as
a weight matrix times a vector.
And
the left columns of that matrix will correspond to
the features, and the right columns will correspond to the
task vector that's being concatenated in this input.
Uh, and if z_i is a one-hot vector,
then you'll have one entry in that vector that is 1,
and the other entries
in that vector will be 0.
And so basically all of the columns that line up with
an entry in z_i that is 0 will be
zeroed out and won't be used for that task,
whereas the column that lines up with the 1 will be used for that task. Yeah.
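The column-selection argument can be checked numerically. This is a small sketch with a made-up 2 x 5 weight matrix (two feature columns followed by three task columns); `matvec`, `layer`, and the numbers are all assumptions for illustration.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product for lists of rows."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

NUM_FEATURES = 2
# Hypothetical weights: columns 0-1 act on the features (shared),
# columns 2-4 act on the one-hot task vector (one column per task).
W = [
    [1.0, 2.0, 10.0, 20.0, 30.0],
    [3.0, 4.0, 40.0, 50.0, 60.0],
]

def layer(features, z):
    """Fully connected layer applied to the concatenation [features; z]."""
    return matvec(W, features + z)

features = [1.0, 1.0]
h_task0 = layer(features, [1.0, 0.0, 0.0])
h_task2 = layer(features, [0.0, 0.0, 1.0])

# The shared part is the same for both tasks; only one task column is added.
shared_part = matvec([row[:NUM_FEATURES] for row in W], features)
task2_column = [row[NUM_FEATURES + 2] for row in W]
print(h_task0, h_task2, shared_part, task2_column)
```

The columns belonging to the other tasks are multiplied by zero, so they neither affect the output for this task nor receive gradient from it.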
[inaudible]
Uh, technically, although, um,
if you, kind of mathematically go through the gradients,
the gradients for those columns will be zero for the tasks in which,
um, for the tasks in which the input is zero. Yeah.
[inaudible].
Yeah. Exactly. So if z_i was not a one-hot vector, for example,
then that part of the matrix would be shared parameters.
Uh, another
somewhat interesting fact about this is that if you concatenate
z_i at every single layer of the network, you'll have more task-specific parameters.
Whereas if you condition on it only at a single layer of the network,
you'll only have one set of task-specific parameters. Yeah.
[inaudible].
Yeah, exactly. So the output of
that part of the matrix is a linear transform of the one-hot, which produces a vector.
Basically,
the selected column of that matrix is the resulting vector,
which essentially gives you an embedding, a linear embedding,
of that task. Yeah.
So ah, my understanding of [inaudible] is through [inaudible] of these other points.
Right? But if you have the, um,
z i [inaudible] one linear versus multiple layers.
Um, in any case it doesn't really affect the weights [inaudible] um,
could you go in a little more detail on what kind of differences those make in terms
of-of that [inaudible] was his first [inaudible].
Because they're not really affecting the weights per se.
Yeah, so if you add z_i to every layer then
you're increasing the number of parameters per layer, of course.
And the parameters that you're adding are task-specific.
Well, they're not all task-specific,
they're mostly task-specific.
Um, maybe I [inaudible] it depends on the dimensionality of each vector.
Somewhere around half of the new parameters are
task-specific and half of the parameters are shared.
Um, so as you add weights to neural networks,
they become more expressive;
as you add task-specific weights, then in some ways they become even more
expressive, because for different tasks they can represent completely different things.
[inaudible].
We'll get to that in a second.
Yeah. Okay. Um, so we've been talking a lot about shared parameters and,
ah, and task-specific parameters.
Ah, and one kind of alternative view on
the multi-task learning problem is that you can split
your parameter vector into these shared parameters and task-specific
parameters, which I'll denote as theta^sh and theta^i.
Ah, and then
if you view that as your parameter vector, then your objective looks
something like this, which is basically exactly the same as the previous objective.
But it
makes it obvious that the task-specific parameters are
only optimized with respect to the objective for that task,
whereas the shared parameters are
optimized over all tasks.
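The decomposed objective on the slide is not in the transcript. Under the split into shared and task-specific parameters just described, and assuming T tasks, a reconstruction would be:

```latex
\min_{\theta^{sh},\, \theta^{1}, \ldots, \theta^{T}}\ \sum_{i=1}^{T} \mathcal{L}_i\big(\{\theta^{sh}, \theta^{i}\},\ \mathcal{D}_i\big)
```

Each theta^i appears in only one term of the sum, while theta^sh appears in every term, which is the point being made about which losses each group of parameters is optimized against.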
Uh, so one of the things that's somewhat interesting
to note, based on both of the examples that we saw,
the multiplicative gating example as well as the concatenation example,
is that choosing how you condition on z_i
in many ways is equivalent to choosing how and where to share parameters.
Um, yeah.
So essentially the, ah,
choosing how to condition on z corresponds to thinking about this sort of
optimization problem and thinking about where those shared parameters
should lie and where those task-specific parameters should lie.
Okay, um, so let's go
over a few more choices for how we might go about conditioning.
So we saw how we could do concatenation-based conditioning, where we basically take
that task descriptor and concatenate it with
the features and then do a linear layer after that.
Um, another option besides concatenation-based conditioning is to add
the task vector to your hidden units,
to your features, where you first have a linear layer
that makes sure that what you add
has the same dimensionality as your features, because
the dimensionality of z_i may not be the dimensionality of your features.
You run it through that linear layer and then add it.
Um, interestingly, I presented these as
two different options,
but they're actually just the same thing.
So you can essentially view
additive and concatenation-based conditioning
as the same. So for example,
on the left side of this figure you see that you're concatenating
x and z, and then you have this weight matrix that
corresponds to the weights that are applied to each of
those parts. And as you apply the matrix
multiply to those two parts of the vector,
you basically get
these two components that are added together,
which illustrates how the right-hand side of
this figure, which corresponds to additive conditioning, is
equal to the left-hand side, which is
concatenation-based conditioning.
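The equivalence is just block matrix multiplication: W [x; z] = W_x x + W_z z, where W_x and W_z are the left and right column blocks of W. A quick numerical check, using made-up weights and hypothetical helper names (`concat_conditioning`, `additive_conditioning`):

```python
def matvec(matrix, vector):
    """Plain matrix-vector product for lists of rows."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def concat_conditioning(W, x, z):
    """Linear layer applied to the concatenation [x; z]."""
    return matvec(W, x + z)

def additive_conditioning(W_x, W_z, x, z):
    """Project x and z separately, then add the two results."""
    hx = matvec(W_x, x)
    hz = matvec(W_z, z)
    return [a + b for a, b in zip(hx, hz)]

# Hypothetical weights: first two columns act on x, last two on z.
W = [
    [1.0, -2.0, 0.5, 4.0],
    [0.0, 3.0, -1.0, 2.0],
]
n_x = 2
W_x = [row[:n_x] for row in W]   # left block of columns
W_z = [row[n_x:] for row in W]   # right block of columns

x = [2.0, 1.0]
z = [0.0, 1.0]

h_concat = concat_conditioning(W, x, z)
h_add = additive_conditioning(W_x, W_z, x, z)
print(h_concat, h_add)
```

Nothing here relies on z being one-hot; the identity holds for any z, since it only uses how a matrix acts on a stacked vector.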
Okay. Um, yeah.
[inaudible].
Yeah so this is assuming that these are one-hot vectors.
[inaudible].
Yeah absolutely. So if you have more information about
how tasks are similar or dissimilar from one another,
then you can feed that into your network as opposed to z i.
Uh, interestingly, determining how similar two tasks are to each other
is actually a big part of the multi-task learning problem itself;
it's kind of determining how you should be sharing content,
sharing parameters, sharing structure across these different tasks.
Uh, and in many multi-task problems you don't necessarily have that a priori.
Um, but there are certainly many examples, and we'll actually
see a practical example, where you do have
some information about the task and you can use that
to condition your network on it
in a different way.
Okay, um, a couple other very common conditioning choices;
this is again for the one-hot case where you don't
have a lot of information about how two tasks are related.
One is this form of
multi-head architecture where you have some set of shared parameters
and then you split off the network
into different heads.
The citation here is for 2017;
that's just for the visual.
This has been something that people have used for a long time.
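A multi-head architecture can be sketched in a few lines: one shared trunk, then a separate output head per task. Everything concrete below (the trunk's fixed weights, the head names "hat" and "hair", the helper names) is hypothetical and chosen only to keep the numbers easy to verify.

```python
def shared_trunk(x):
    """Shared feature extractor: two ReLU features computed from a scalar input."""
    return [max(0.0, x + 1.0), max(0.0, 2.0 * x - 1.0)]

# One small linear head per task; these are the task-specific parameters.
head_weights = {
    "hat": [1.0, -1.0],
    "hair": [0.5, 0.5],
}

def multi_head_forward(x, task_name):
    """Run the shared trunk, then only the head belonging to the given task."""
    features = shared_trunk(x)
    weights = head_weights[task_name]
    return sum(w * f for w, f in zip(weights, features))

y_hat = multi_head_forward(1.0, "hat")
y_hair = multi_head_forward(1.0, "hair")
print(y_hat, y_hair)
```

Both calls reuse the same trunk features; only the last step differs, which is where the split into shared and task-specific parameters sits in this design.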
Um, and then one other
very common choice that you can make is
multiplicative conditioning, which is basically identical to additive conditioning,
except instead of adding you multiply.
So you project your task vector onto
a vector of the right size, and then you multiply that onto
the activations, either at the input or
at intermediate representations of your neural network.
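Here is a minimal sketch of multiplicative conditioning as just described: project z_i through a linear layer to get a per-feature scale, then multiply it element-wise onto the activations. The projection matrix `W_z` and the activations are made-up numbers, and the function names are invented for this example.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product for lists of rows."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def multiplicative_conditioning(W_z, activations, z):
    """Scale each activation by the corresponding entry of the projected task vector."""
    gate = matvec(W_z, z)
    return [g * a for g, a in zip(gate, activations)]

# Hypothetical projection: 3 features, 2 tasks (one column per task).
W_z = [
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 2.0],
]
activations = [0.5, -1.0, 3.0]

h_task0 = multiplicative_conditioning(W_z, activations, [1.0, 0.0])
h_task1 = multiplicative_conditioning(W_z, activations, [0.0, 1.0])
print(h_task0, h_task1)
```

With these weights, task 0 switches the third feature off entirely and task 1 switches the first one off, so a learned projection can route different tasks through different parts of the network.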
Okay, um, so why might this sort of multiplicative conditioning be a good idea?
There's actually a few different reasons.
So one is that it's more expressive than additive conditioning.
You can represent multiplicative interactions, whereas with additive conditioning
that's much harder, essentially.
Um, and the other reason is- if you think
about something like multiplicative gating that we talked about before,
this sort of gating is something that can very naturally be
represented by multiplicative conditioning.
Ah, and not just multiplicative gating,
but also things like a multi-head architecture, where
you want to basically be choosing which parts of the network
should be used for different tasks.
So essentially this sort of multiplicative conditioning, where
you're multiplying this embedding onto your activations,
allows it to modulate different features,
such that you can completely turn off features,
you can only use some heads for one task,
only use some parts of the network for one task versus another task. Yeah.
[inaudible].
That's a good question. I would say it's a very simple form of attention mechanism.
I think attention mechanisms usually involve dot products.
And in this case you're not actually summing across the dimensions,
you're just doing an element wise multiplication. Yeah.
[inaudible].
Yeah you can definitely do that and we'll see
an example of that in, in the case study actually.
Ah, and multiplicative conditioning allows you to do that.
Or it allows it to actually learn how to do that,
um, in a completely automatic way.
Yeah.
[inaudible].
Yeah.
[inaudible] wouldn't be as- as expressive as the previous architecture as you said
[inaudible] actually.
As expressive as the additive or
Or, yeah, additive with another linear layer or something like that that would actually use the input. [OVERLAPPING]
Right, so these diagrams have been,
ah, kind of looking at a single layer at a time.
Of course when you add layers,
a neural network is a universal function approximator,
and so if you concatenate two things and then want it to give you the product,
a neural network can represent that function.
Um, and so technically
the neural network can also represent this when you concatenate or add.
So technically all of
these are universal function approximators
in the multi-task setting.
Um, in many ways it's more about what you're readily allowing the network to do
and how the optimization process itself behaves,
although it's much harder to say things about
optimization than about
the types of things that an individual layer can represent, for example.
Yeah.
[inaudible]
Can you repeat the question?
Do you have to know like if we already like the number that you only train on [inaudible].
Are there methods where you can actually use the training model for [inaudible].
A task that you haven't seen any training data for?
Yes.
So yes, so that's a good question.
Um, in that case, if you want to be able to
generalize to an entirely new task that you haven't seen during training,
you'll need something other than a one-hot vector,
because if you only have a one-hot vector,
then you don't know how
that new task relates to anything that you've done before.
But if you have a language description, for example,
then in principle
these models can learn how to generalize:
given a new language description, they can perform that task,
given enough data of other language descriptions
that share commonalities with what you're seeing at test time.
Um, and then in the second part of this lecture
we'll be talking about how we can quickly learn new tasks from data
by using experience on previously seen tasks.
Okay. Um, there are also more complex choices,
uh, that we can go with as well.
So here are some architectures from, from the literature.
For example these things include various modules, various, um,
attention, uh, components, uh, different gating mechanisms.
We don't have time to go through
all of the possible choices that you have here,
although
I think one or two
of the readings that we'll have will cover a couple different options there.
Um, so the main takeaway that I'd like to give is that there are
a lot of
different choices we have to make when choosing the architecture.
Uh, and unfortunately, these design decisions
are basically like neural network architecture tuning,
Um, in that they are problem dependent.
So one architecture will work well on one problem but
won't necessarily work well on another problem or another set of,
um, another set of tasks.
Uh, in many ways they're largely guided by intuition or knowledge of the problem.
So if you know that two tasks are similar,
you might have them share more.
Whereas if you know that two tasks are different,
you might have them share less.
Um, and I would say that this currently is in some ways more of an art than a science,
and there isn't
any guide that can tell you exactly how to do this,
given a new set of tasks that you want to be able to learn.
But this is a very active area of research, and hopefully in the
upcoming years we'll have a better idea for how to do this.
Okay. So now that we've talked about the model architecture a lot,
what about actually optimizing this objective?
Um, in many ways,
this is fairly straightforward compared to standard neural network training.
A very basic version of optimizing this objective
would correspond to first sampling a mini-batch of tasks.
So we have different tasks.
We sample different tasks and then sample data points for each task.
So unlike the standard supervised setting,
we won't just be sampling once, we'll be sampling twice.
Then we're going to compute our loss function on our mini-batch,
which gives us a stochastic gradient
rather than the true gradient, because we
can't evaluate the gradient on our entire dataset.
We back-propagate the loss
to compute the gradient and then apply the gradient to our parameters using
your favorite neural network optimizer.
Um, so basically this is the same as standard single task learning except that we
will be sampling a batch of tasks and a batch of data points for each of those tasks.
Um, and note that these first two steps are there to ensure
that the tasks are sampled uniformly,
regardless of data quantity.
So if you have one task that has a lot more data than another task,
then this will ensure that you're sampling them at equal rates,
which is what we're optimizing for
in the objective on the top.
If you care more about the task that has more data then, of course,
it makes sense to weight those differently,
um, and sample at different proportions as well.
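The training loop described above can be sketched roughly as follows. This is a minimal sketch, not the lecture's code: the toy model (a shared slope `w` plus a per-task bias `b`), the task names "big" and "small", the learning rate, and the helper names are all assumptions made for illustration. It shows the two sampling steps (tasks first, then data points per task), followed by a plain SGD update.

```python
import random

def sample_multitask_batch(task_datasets, num_tasks, num_points, rng):
    """Sample tasks uniformly, then sample data points within each sampled task."""
    task_names = sorted(task_datasets)
    batch = []
    for _ in range(num_tasks):
        task = rng.choice(task_names)                 # step 1: sample a task
        points = [rng.choice(task_datasets[task])     # step 2: sample its data points
                  for _ in range(num_points)]
        batch.append((task, points))
    return batch

def train_step(params, batch, lr):
    """One SGD step for a toy model y = w * x + b[task] (w shared, b per task)."""
    total = sum(len(points) for _, points in batch)
    grad_w = 0.0
    grad_b = {task: 0.0 for task in params["b"]}
    loss = 0.0
    for task, points in batch:
        for x, y in points:
            err = params["w"] * x + params["b"][task] - y
            loss += err * err / total                 # mean squared error on the batch
            grad_w += 2.0 * err * x / total
            grad_b[task] += 2.0 * err / total
    params["w"] -= lr * grad_w                        # apply the stochastic gradient
    for task in grad_b:
        params["b"][task] -= lr * grad_b[task]
    return loss

rng = random.Random(0)

def make_task(num_points, bias):
    """Hypothetical noise-free regression task: y = 2x + bias."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(num_points)]
    return [(x, 2.0 * x + bias) for x in xs]

# One task has 100x more data than the other.
task_datasets = {"big": make_task(1000, 1.0), "small": make_task(10, -1.0)}
params = {"w": 0.0, "b": {"big": 0.0, "small": 0.0}}
task_counts = {"big": 0, "small": 0}

for step in range(2000):
    batch = sample_multitask_batch(task_datasets, num_tasks=2, num_points=8, rng=rng)
    for task, _ in batch:
        task_counts[task] += 1
    loss = train_step(params, batch, lr=0.1)
```

Because tasks are sampled before data points, `task_counts` ends up roughly balanced even though "big" has far more data. Sampling tasks in proportion to dataset size instead would be the natural change if you care more about the larger task.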
Okay, um, and one other tip here is that,
uh, if your different tasks correspond to regression problems,
this isn't really necessarily specific to the,
the optimization algorithm but, um,
it's pretty important to make sure your task labels are on the same scale.
Uh, otherwise your loss function is inherently
going to have a greater scale for the tasks whose
labels have a greater magnitude.
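One simple way to put regression labels on the same scale is to standardize them per task. The lecture does not say which normalization to use, so the per-task standardization below is an assumption, and the task names and label values are made up for illustration.

```python
import statistics

def standardize_labels(labels):
    """Rescale one task's labels to zero mean and unit standard deviation."""
    mean = statistics.fmean(labels)
    std = statistics.pstdev(labels)
    if std == 0.0:
        std = 1.0  # constant labels: avoid dividing by zero
    scaled = [(y - mean) / std for y in labels]
    return scaled, mean, std

# Hypothetical tasks whose raw labels differ in magnitude by orders of magnitude.
task_labels = {
    "watch_minutes": [1.0, 30.0, 12.0, 45.0],
    "rating": [0.2, 0.9, 0.5, 0.7],
}

scaled_labels = {}
label_stats = {}
for task, labels in task_labels.items():
    scaled, mean, std = standardize_labels(labels)
    scaled_labels[task] = scaled
    label_stats[task] = (mean, std)  # keep these to map predictions back to raw units
```

Keeping the mean and standard deviation per task lets you convert a prediction back with `prediction * std + mean`.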
Okay. So now that we've gone over kind of the basics
of the architectures and the algorithms,
uh, let's talk about some of the challenges that come up.
So the, in many ways,
the, uh, the kind of number one challenge that I view with
multitask learning is the problem of negative transfer.
Uh, and the symptom of this
is when you train independent networks and that's
doing better than your multitask learning method;
that means that you're getting negative transfer.
It means that the, uh,
kind of data and training from one task is
adversely affecting the training of the other tasks.
Uh, so for example,
uh, and this is actually a problem that is, is quite prevalent.
Uh, in some cases maybe you care more about
computational efficiency, it's not practical to train independent networks.
Um, but for example we were actually recently trying to write a paper on multitask learning.
Um, there's this common, uh,
multitask version of the CIFAR data set where you break it down into um,
into different tasks corresponding to the different kind of,
uh, upper-level categories of CIFAR.
Um, and we evaluated
multiple state-of-the-art approaches
for multitask learning
and got various levels of accuracy on this data set, and
then we trained independent networks and they were
performing better than all the recent papers.
Uh, so why, why does this happen?
Why does negative transfer occur? Like, why aren't we
actually getting benefit from training across multiple tasks?
Um, one is optimization challenges.
Uh, so if, um,
if basically gradients of one task are interfering with the,
uh, with the training of another task, that is, basically when you
compute the gradient for task one,
compute the gradient for task two,
if gradient one hurts
the weights for task two,
uh, then the optimization will be more difficult.
Um, and also tasks may learn at different rates.
So uh, if one task is learning a lot faster than another task,
you might end up learning, um,
one task very quickly and then it,
it might get stuck trying to learn task two because it's kind of already learned,
uh, something and it doesn't want to learn something else.
Essentially the optimization gets stuck;
it's kind of in this local optimum.
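One way to check whether two tasks' gradients are pulling against each other is to look at the cosine similarity between them on the shared parameters. This diagnostic is not described in the lecture; it is an assumed, illustrative way to make "gradient one hurts the weights for task two" concrete, and the gradient values below are made up.

```python
import math

def cosine_similarity(g1, g2):
    """Cosine of the angle between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # a zero gradient neither helps nor conflicts
    return dot / (norm1 * norm2)

# Hypothetical gradients of each task's loss with respect to the shared weights.
grad_task1 = [1.0, 0.5]
grad_task2 = [-1.0, 0.2]

similarity = cosine_similarity(grad_task1, grad_task2)
tasks_conflict = similarity < 0.0  # a step that helps task 1 moves uphill for task 2
```

A negative value means that, to first order, a gradient step for one task increases the other task's loss.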
Okay, um, and the other issue that you might run
into is maybe you have limited representational capacity.
So, um, multitask networks in
general often need to be much larger than single task networks.
This is, is very intuitive but it can be easy, uh,
easy to forget and in this case the, um,
the- if, if it isn't as large as it needs to be then you're going to see under fitting,
uh, and that's a kind of, uh,
a symptom of limited representational capacity.
Yeah. So that's a good question,
uh, about basically how to handle,
uh, like optimization challenges especially when one task has more data than another.
Um, one is kind of the thing that I mentioned before, which is that you sample
the tasks and basically balance them such that they have the same amount of data,
such that it doesn't favor one task over the other,
um, because oftentimes if you don't,
if you don't do that, then it will learn the task
that has a lot of data and it won't learn the other one.
Um, people have explored curriculum learning techniques and we'll talk a bit about that,
um, later on in this course.
In general, the- I would say that
the optimization challenges aren't completely well understood.
Uh, we're still trying to understand kind of in general
the single task optimization landscapes let alone the,
the multitask optimization landscapes.
Um, I do think though that there's been a lot,
like I'm actually pretty optimistic about, uh,
being able to understand these sorts of challenges very soon, um,
and we may even like later in the course cover, um,
try to get a better understanding of
these optimization challenges and how to mitigate them. Yeah.
So I'm confused about how you actually define this negative test [inaudible] because,
um, because we don't have perfect overlap
between the different tasks otherwise they would be different so.
So when you have one network versus,
um, like say 10, uh,
different networks trained on different tasks,
assuming that all the training are like perfect to the network's capacity.
You then, uh, you got to have like
worse performance with a single network versus like all 10 networks,
right, because you have, there's like
extra information that you know, maybe the network isn't going to be able to express.
You expect larger networks but how do you compare if your network is larger?
So you're saying that you're like,
why would you expect positive transfer? Why would you-
Yeah.
Yeah. So one reason to- one situation to expect
positive transfer is when you don't have a lot of data per task, um,
and so if the tasks are related, uh, enough,
the features learned for one task will be
useful or the representations learned for one task will be useful for the other task.
Um, so it's essentially a way to share data, um,
and as I'll talk about in like two slides, um,
you can basically kind of,
another way to view multitask learning is essentially a form of regularization,
uh, and if,
um, in your kind of problem setting,
regularization would be helpful, basically, uh,
and more data would be helpful then those are
some of the settings where you definitely expect positive transfer.
[inaudible]
Generalization essentially, yeah. In reinforcement learning,
we'll see other benefits as well relating to exploration. Yeah.
Um, so if all the tasks have exactly the same data,
is it usually [inaudible] to train independent networks?
That's a good question. So you mean all the same inputs,
but not- but different labels?
Um, all the- all of them had the same input but I
want to use the same input with the different things.
Yeah. So in that case, um,
you can definitely still see benefits from multitask learning.
Essentially, each task has its own form of supervision, and, uh,
when I was talking about data sharing,
it's- it's- it's not necessarily actually about sharing inputs,
it's about sharing supervision.
Uh, and so if you have, um, basically,
each of those tasks correspond to different amounts
of supervision and that those can be used for building,
uh, more flexible representations.
More powerful representations that can be used for,
um, for better solving one of your tasks. Ah, yeah.
So if you wanted to move [NOISE] problem data transfer, but still, uh,
like maintaining their leverage their representations to learn one task.
Uh, could you just do like
a simple transfer learning where it most likely to do that when, uh, kind of [inaudible]
Yeah. You can definitely do that.
Uh, in practice fine tuning.
I'll talk a bit about fine tuning in the next lecture.
In practice that makes, um,
f- fewer assumptions about the availability of data,
because you assume that you can't, kind of,
access that first data set,
and so in practice many times.
If you have both data sets available,
you- you should use both of them, uh,
and you'll be able to get better performance. Yeah?
Is there a simpler kind of like similarity metric that can you use,
before you do this kind of training.
Or do you just have to form
the optimizations to see if it performs well and that is sort of to measure,
how much shared structure there is?
Yeah. That's a very good question.
Uh, I don't think that the answer to this question is necessarily known,
and I think there maybe are techniques where we
could better get a sense for this before training.
Um, but my sense is that,
how similar two tasks are is intimately tied to how easy it is to learn them together.
Uh, and so, I guess the pessimistic side of me would
answer that it's actually not possible to
a priori know how much you should share ahead of time.
Um, but
the more optimistic side of me says that
the optimizer should be able to figure it out
in an online fashion throughout optimization,
um, during the learning process.
So I don't think we can probably do it before learning but I think it can
happen during the learning process. Yeah?
Yeah. I wanna understand better why, uh,
multitask networks need to be larger networks.
So assuming we have multi- like the same network, um,
but it's independently trained on different tasks,
and it works okay. But [inaudible].
It seems to me that [inaudible].
Because they're sharing the same like activation is
similar to the [inaudible].Or else in bigger networks would just be,
um, the backpropagate gradients, might be
more sparse within the network so the drug can't sell as much.
Yeah, so it certainly could be that larger networks help
because they alleviate optimization challenges.
Um, people have observed in general in single task learning that if you have
a larger network they actually sometimes optimize faster than the smaller networks.
Um, but separate from optimization,
there is also just the representational capacity:
if you're doing a single task, that's one function, and if you're doing another task,
assuming that task is different,
uh, the representational capacity needed to represent both of those functions
is greater than for just an individual function.
But it may be that actually the optimization challenges
potentially could even have a- a greater impact
on- on this sort of, uh, observation. Yep.
So sometimes, um, when we test the machine learning algorithms,
we sort of generate the data set where we know Bayes error.
So its like w-we know how well an algorithm in principle could do.
Are there similar sort of toy problems in this setting where
beforehand we actually know how much positive transfer is possible.
Not off the top of my head.
But there- there certainly could be literature
on that given the extent to which people have studied it.
Um, yeah, feel free to like post that on Piazza or something.
And we can try to find something.
[OVERLAPPING] I'm trying to look for that particular,
because it's been hard to even understand like
because then would be more obvious to find some clear example
Here there will be. Yeah, I do suspect that we will
be able to find an example with real data, uh, in real problems.
[OVERLAPPING] But I think it will generate the,uh-.
Yeah, I think that- that would be interesting. Yeah.
Uh, on the note of like meeting, you know,
obviously all those [inaudible] .
Um, If you have like records,
in vectors at every single stage in
the network and it's multiplying all these parameters,
like, some values that will typically be a hidden value.
Will that be equal to like in,
you know the equivalence [inaudible] we'll have
the equivalence rendered to like these different networks.
Um, in principle yes,
if you use mostly the kind of multiplicative conditioning.
Um, so in principle it does have
the representational capacity to represent completely separate networks.
Um, and that actually is one of the good reasons to use
that sort of gating.
In practice the optimizer might have a challenge finding that solution,
so it may not actually be able to find it.
But technically it should have that capacity.
Okay.
And I guess one thing I will add to that is we actually have observed very recently that,
uh, in many cases it's actually very
challenging for the optimizer to find that solution, um,
in tasks where the- where the type,
the things that you're learning are very distinct from one another. Yeah.
Um, if we have for example, uh,
like one neural network sharing one task and also sharing another task.
Shouldn't we combine them so that, uh,
it becomes a multitask network?
So you- you can combine them but very naively and
stupidly by just kind of using a task selector to pick which one you want,
what network you want to use.
Um, to my knowledge there aren't any techniques, well.
You actually use overlap so you get some advantage [NOISE]
I don't know if any w- uh,
we'll cover this when we get to reinforcement learning.
Yeah. So there are some techniques that- that do kind of take two
trained networks for tasks, and then try to learn a single network,
um, that fuses them in some way.
Okay, I should probably get onto the next few slides.
So, um, the takeaway here, and
I don't even know if I've said this yet,
but I think all of you have inferred it, is that if you
have negative transfer you can share less across tasks
or train a bigger network, um,
and one of the interesting things here is actually sharing versus not
sharing parameters is not just a binary decision.
Uh, you could actually also,
uh, do what's called soft parameter sharing.
Um, and so what this looks like or one version of this is where you have, uh,
you can break down your objective as I showed before into
shared parameters and task specific parameters and then
basically add this term that
encourages the task specific parameters to be similar to one another.
And you could do this at different layers of the network.
You could do it, um, for different pairs of tasks,
for some pairs but not other pairs.
You could also do it with respect to a single set of global parameters and tie
them all to a single set of global parameters
instead of doing this sort of pairwise loss.
Uh, and so what this looks like is you have two separate networks.
Um, and what you do is you kind of softly constrain the weights to be
similar to one another rather than actually having them represent the same exact value.
Um, so one of the benefits of this is it allows for
more fluid degrees of parameter sharing.
Uh, if you set the weight on this right-hand loss to 0,
then you have zero parameter sharing.
If you set it to be a very high
value, such that you're essentially constraining the optimization
to set them equal,
then that corresponds to hard parameter sharing.
Um, the downside is that it is kind of yet another set of decisions and hyper parameters,
that you need to tune, uh, and- and make sense of when you're,
uh, thinking about your- your algorithm.
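The pairwise version of soft parameter sharing can be sketched as an extra penalty added to the sum of task losses. The lecture only says the term encourages the task-specific parameters to be similar, so the squared L2 distance used here is an assumption, as are the function names, the task names, and the numeric values.

```python
def soft_sharing_penalty(task_params, weight):
    """weight * sum over task pairs of squared L2 distance between their parameters."""
    names = sorted(task_params)
    penalty = 0.0
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            a = task_params[names[i]]
            b = task_params[names[j]]
            penalty += sum((x - y) ** 2 for x, y in zip(a, b))
    return weight * penalty

def total_objective(task_losses, task_params, weight):
    """Sum of per-task losses plus the soft-sharing penalty."""
    return sum(task_losses.values()) + soft_sharing_penalty(task_params, weight)

# Hypothetical task-specific parameters (flattened) and per-task losses.
task_params = {"task1": [1.0, 2.0], "task2": [1.5, 1.0]}
task_losses = {"task1": 0.3, "task2": 0.7}

no_sharing = total_objective(task_losses, task_params, weight=0.0)    # penalty vanishes
soft_sharing = total_objective(task_losses, task_params, weight=0.1)  # mild coupling
```

With `weight=0.0` the tasks are trained independently; as `weight` grows, the optimum is pushed toward identical parameters, which approaches hard sharing. Tying every task to one set of global parameters instead of pairwise would replace the double loop with a single loop over tasks.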
Okay. Um, and so now that I've talked about negative transfer, um,
another problem which is in- in many ways somewhat orthogonal to
this chall- or kind of the opposite of this challenge is the challenge of overfitting.
Um, and so if you have only a small amount of
data per task and you find yourself overfitting to those tasks then,
uh, it actually may be that you're not sharing enough.
Um, it may be that you just need more data,
but it may be that, if you, um,
shared more that would act as a stronger form of regularization.
Um, and you may be able to overfit less. Okay.
So now, we wanna go into kind of a case study of actually applying
these sorts of multitask learning algorithms in practice,
and so this was actually a very recent paper that came out, um,
in the past two months and they were looking at,
uh, making recommendations to users for YouTube.
Uh, it's from Google and,
uh, to my knowledge,
some of what we'll be covering here are decisions that they
have made when actually making recommendations to users.
Um, yeah so here's an example.
You have, uh, you're kind of watching this, uh,
this lecture on deep learning and then on the right,
you might be getting some recommendations about,
uh, what video to watch next.
So there are a couple challenges that come up,
um, in this sort of, uh, problem.
And the way that they formulate it as a multitask problem is
making predictions about user engagement and about user satisfaction.
Uh, so you essentially have these conflicting objectives
to think about: should I recommend
a video that
a user will rate highly?
Can I predict whether or not the users will share
that video, or whether or not they'll watch it? Um,
and the types of things that you share are not necessarily
the types of things that you would watch for a long period of time for example.
Um, the second challenge that comes up that I
won't have a lot of time to talk about today, but, um,
is also a bit interesting is that in
these sorts of recommendation systems you also have a form of
feedback loop where users may have watched something simply because you recommended it,
uh, and this leads to bias in your data.
Uh, and so they also have an interesting technique that, uh,
they used to address this bias, uh,
and if you want to kind of find
out more information about that, you can look at the paper.
Um, Okay.
So how do they actually set up this problem in this framework?
So the input is what the user is currently watching which is called the query video,
and they also know some features for that user.
Uh, and then what they do is they first generate
a few hundred candidate videos that they think the user might wanna watch,
they then rank the candidates, uh,
and then they serve the top-ranking videos
to the user in that recommendations bar on the right.
Um, the candidate videos, um,
this wasn't really the main focus of the paper,
but the way that they do this is they pull videos from
multiple- multiple candidate generation algorithms.
These algorithms do things like try to match
topics of the video that you're currently watching.
Um, they also try to, uh,
look at the videos that are most frequently watched
together with the video that you're currently watching,
um, as well- as well as a number of other,
uh, forms of metrics,
uh, and then the central topic of this paper was: given
a large number of candidate videos,
can you rank
those candidates into an ordered list,
uh, where the top things correspond to the ones that you actually want to serve?
Okay. Um, so how do you actually formulate the ranking problem?
Uh, what does this look like?
Again, similar to before,
your input is gonna consist of the query video and features of the user.
The input is also gonna include one of
the candidate videos, so the model is gonna be operating on each of the candidate videos.
Um, and interestingly in this case,
things like the query video and the candidate video,
uh, these aren't actually literally, uh, the videos;
they correspond to information about those videos, like,
features of a video such as,
um, the topic, the title, um,
the upload time, uh,
and then also user features like, um,
user and context features such as their user profile,
the time of day, um, and a variety of other things.
So this includes both, uh,
textual data as well as non-textual data that they
use different shared embedding networks
to embed into a common representation space.
Then the output of this model is to determine the engagement and
satisfaction of the candidate video for the user after watching the query video.
Um, and so engagement includes things like whether or not they clicked on the candidate
video,
whether or not they, uh,
spent time, uh, watching that video.
So this is both classification and regression.
Uh, and the satisfaction corresponds to things like clicking like on that video,
uh, as well as,
um, rating that video for example.
So again, this is kind of a mixed discrete and continuous problem.
Uh, and then kind of the big question is how do you figure out the score from these,
uh, from these different metrics and, uh,
what they do is they actually just use a very simple weighted combination of
these predictions where the weights of those predictions are just manually tuned.
Um, and the
motivation that they had here is that the system is very complex
itself, and also
the metrics for whether or not
you're actually serving good videos are online metrics, and
they are very difficult to optimize with respect to.
And so by doing this in a way that's kind of done manually,
um, it's a bit simpler,
it- it- it's easier to kind of, um,
to- to iterate on the- on the model essentially.
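The scoring step described here, a manually weighted combination of the model's predictions, can be sketched as below. The prediction names (`click_prob`, `watch_minutes`, `like_prob`), the weights, and the candidate values are all hypothetical; the lecture does not give the actual predictions or weights used in the system.

```python
def combined_score(predictions, weights):
    """Weighted sum of the per-objective predictions for one candidate video."""
    return sum(weights[name] * predictions[name] for name in weights)

def rank_candidates(candidates, weights):
    """Order candidates from highest to lowest combined score."""
    return sorted(
        candidates,
        key=lambda candidate: combined_score(candidate["predictions"], weights),
        reverse=True,
    )

# Hypothetical hand-tuned weights over engagement and satisfaction predictions.
weights = {"click_prob": 1.0, "watch_minutes": 0.2, "like_prob": 2.0}

# Hypothetical model outputs for three candidate videos.
candidates = [
    {"id": "video_a", "predictions": {"click_prob": 0.9, "watch_minutes": 1.0, "like_prob": 0.1}},
    {"id": "video_b", "predictions": {"click_prob": 0.4, "watch_minutes": 6.0, "like_prob": 0.5}},
    {"id": "video_c", "predictions": {"click_prob": 0.6, "watch_minutes": 3.0, "like_prob": 0.2}},
]

ranked = rank_candidates(candidates, weights)
ranked_ids = [candidate["id"] for candidate in ranked]
```

With these made-up numbers the most clickable video ranks last, because it scores poorly on watch time and likes; changing the weights changes the trade-off without retraining the model, which fits the ease-of-iteration motivation given above.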
Okay. Um, so let's talk about
how we actually predict engagement and satisfaction.
So, um, they considered a basic option,
which corresponds to the multi-head architecture that we
talked about before where you have some shared layers and
also some layers that are predicting engagement and, uh, and satisfaction.
This was kind of their- their baseline.
Um, and one of the things that they
found is that when the correlation between the tasks is low,
that will harm the learning process.
As we've kind of talked about before it's sort of negative transfer, right?
Um, and so what they did instead, uh,
is they used, uh,
a form of soft parameter sharing.
And this looks like this,
uh, where I'll go,
I'll go through the details kind of step-by-step.
So you have some shared bottom layer,
um, and the goal of
this sort of mixture of experts model is to
allow different parts of the network to specialize.
So, um, you have these different expert neural networks,
uh, that- that you want to be specialized in different me- in different ways.
Uh, so you kind of- you can see, this example has two different experts.
Then you want to decide which expert to use for an input x and for a given task k.
Uh, and so what this corresponds to is you have a different set of
weight matrices for each task, and you take
the input and
do a linear combination and a softmax to produce
a distribution over which expert to use, and
then you compute the features from the selected expert.
So each expert corresponds to f_i of x.
You then multiply this with the gating function in order to produce
the selected expert output.
Uh, and then lastly, you compute the output of this.
So, uh, then you take another neural network h that's gonna
take the output of the expert and produce the result.
Um, so this is an example of an architecture that starts out
shared, branches out into multiple experts, where
it may use multiple or it may use just a single expert, and then comes back to a single,
um, part of the model.
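The gating computation walked through above can be sketched in a few lines. This is a minimal sketch under several assumptions: each expert f_i is a single linear layer with a ReLU, the per-task gate is a softmax over a linear map of the input, the per-task output network h is a single linear layer, and the shared bottom layer is omitted for brevity. The function and variable names, task names, and weight values are made up for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def mmoe_forward(x, expert_weights, gate_weights, tower_weights):
    """Per-task softmax gates mix expert features, then a per-task tower produces the output."""
    # f_i(x): every expert sees the same input.
    expert_outputs = [[max(0.0, z) for z in matvec(W, x)] for W in expert_weights]
    feature_dim = len(expert_outputs[0])
    outputs, gates = {}, {}
    for task, gate_matrix in gate_weights.items():
        g = softmax(matvec(gate_matrix, x))   # distribution over experts for this task
        mixed = [sum(g[i] * expert_outputs[i][d] for i in range(len(g)))
                 for d in range(feature_dim)]
        outputs[task] = sum(w * m for w, m in zip(tower_weights[task], mixed))
        gates[task] = g
    return outputs, gates

x = [1.0, 2.0]
expert_weights = [
    [[1.0, 0.0], [0.0, 1.0]],   # expert 0: identity
    [[2.0, 0.0], [0.0, 2.0]],   # expert 1: doubles the input
]
gate_weights = {
    "engagement": [[0.0, 0.0], [0.0, 0.0]],     # uniform gate: uses both experts equally
    "satisfaction": [[10.0, 0.0], [0.0, 0.0]],  # gate that strongly prefers expert 0
}
tower_weights = {"engagement": [1.0, 1.0], "satisfaction": [1.0, 1.0]}

outputs, gates = mmoe_forward(x, expert_weights, gate_weights, tower_weights)
```

Here the "engagement" gate mixes both experts equally while the "satisfaction" gate puts almost all of its weight on expert 0, which shows how a task may use multiple experts or effectively a single one.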
Okay. Um, so in the experiments,
the- the set up here was actually quite interesting.
Uh, they used TensorFlow and TPUs as you might expect.
Um, they trained in temporal order where they're feeding in the videos,
uh, over time, uh,
and training continuously to consume
the newly arriving data, and they said that this is really important
because the distribution of user engagement is changing over time,
as different trends become popular, et cetera.
Uh, they had various offline and online metrics.
For offline, they looked at AUC for classification tasks
and squared error matr- metrics for- for regression tasks.
Uh, but the most important things were actually
the online metrics because those were actually, um,
the ones that you ultimately care about, uh,
and so they did AB testing, um,
in comparison to their production system.
Uh, and the, uh,
the live metrics for this online testing
corresponded to things like the time spent on, um,
on YouTube, uh, survey responses that
different users filled out, and the rate of dismissals,
which I'd guess would
relate to whether or not they're actually dismissing suggestions
by the recommender system.
Uh, and then the last thing
worth noting here is that because this is a massive system, uh,
computational efficiency also really matters
and so this is one of the reasons why they didn't
consider independent training and why they really
wanted to do multi-task training, such that they didn't have to
basically compute everything separately for all of these different metrics.
Okay. Um, so here are the results.
So the top two rows are showing
the baseline network, which corresponded to
a multi-head network, and the bottom two correspond
to the mixture of experts architecture.
What we see is that there's a substantial increase
in engagement and in satisfaction,
at around 0.4% and 3% for the larger mixture of experts model,
which is nice to see.
Uh, and also, they looked at how different tasks are utilizing the different experts.
So each of these bars,
um, correspond to a different task.
So you can see that expert 7 is typically doing things for task 4 and for task 1.
Whereas, expert 2 is typically, uh,
computing features for task 2 and for task 3.
Um, and then lastly, they found a 20% chance of
this sort of gating polarization, where
you would basically not use an expert
at all, and what they found is that
if they used some sort of dropout
to prevent it from only using
one expert for all the tasks or not using one expert at all,
this was helpful for mitigating that.
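The lecture only says "some sort of dropout" was used on the gating, so the sketch below is one possible reading rather than the paper's method: during training, each expert's gate weight is zeroed with some probability and the remaining weights are renormalized, so no task can rely on one expert every time. The function name, the drop probability, and the fallback behavior are assumptions.

```python
import random

def gate_dropout(gate_probs, drop_prob, rng):
    """Randomly zero some experts' gate weights, then renormalize the rest."""
    kept = [0.0 if rng.random() < drop_prob else p for p in gate_probs]
    total = sum(kept)
    if total == 0.0:
        return list(gate_probs)  # every expert was dropped: fall back to the original gate
    return [p / total for p in kept]

rng = random.Random(0)
polarized_gate = [0.97, 0.01, 0.01, 0.01]  # hypothetical gate that has collapsed onto expert 0

# Over many training steps, the dominant expert is sometimes unavailable,
# which forces the other experts to receive weight and gradient signal.
steps_without_expert0 = 0
for _ in range(1000):
    dropped = gate_dropout(polarized_gate, drop_prob=0.1, rng=rng)
    if dropped[0] == 0.0:
        steps_without_expert0 += 1
```

At evaluation time you would use the gate as is, without dropping anything, as with ordinary dropout.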
Okay. Yeah?
Why would you not [inaudible] ?
Sorry, can you repeat the question?
If, if for example one task was different but the other one for some-
for some reason given to chose for specific action.
Yeah. So I think that what they mean by polarization here is that they're
actually using one expert for all of the tasks,
um, or not using one expert at all.
Okay, um, let's take a quick like one-minute break.
Uh, you guys have been asking a lot of questions,
which is awesome, uh,
and then we'll try to get into some meta-learning.
But [inaudible] in the, in the [inaudible] Do they have any way to compare how
well these approaches, like some [inaudible].
Give an example of the other recommenda- recommender system.
Let's say, um, somewhere I kind of agree [inaudible].
In reading the paper,
I guess I didn't get a sense for whether they did a comparison of
this approach versus other things like collaborative filtering.
Um, but I guess I would encourage you to take a closer look at the paper to see if they,
they talked about that [NOISE].
Okay. Let's get started on the second part.
Um, so let's talk about meta-learning.
We may not have time to get through everything,
but if we don't then I'll,
um, I'll cover what we didn't cover at the beginning of lecture on Monday.
So to start off,
in many ways I think there are kind of two ways to view meta-learning algorithms.
Uh, one is somewhat of a mechanistic view,
and another is somewhat of a probabilistic view.
So the mechanistic view is a view that I think is very helpful for figuring, uh,
out how you actually go about implementing one of these algorithms,
and really understanding kind of the underlying mechanics of,
of how things work and how the implementation works.
And the probabilistic view I find to be very helpful for understanding,
um, conceptually and intuitively what these algorithms are doing.
Uh, and so
the mechanistic view is that you have this deep neural network that
reads in datasets and makes predictions for new data points,
such that when you see a new dataset you can make predictions for new data points.
Um, and training for this network uses a dataset of datasets,
um, each for a different task.
And, as I said, this makes it
easier to implement meta-learning algorithms,
uh, whereas at a high level the probabilistic view is looking at
meta-learning as learning priors,
essentially learning priors over tasks in a way that
enables combining that prior with the small amount of
data for a new task in order to more efficiently learn.
Um, yeah, so learning a new task uses that prior and a small amount of data.
Uh, and I find that kind of thinking about it from
a Bayesian perspective can make it a bit easier to understand what's going on.
Uh, so I'm going to start with the probabilistic view, uh,
to give some intuition for what these algorithms are doing,
and then, uh, I'll go into
the more mechanistic view of actually how you implement these algorithms.
So for the first part,
don't worry about how you would actually go about implementing things.
Uh, it's more about some of the,
the conceptual aspects of things.
So, uh, we're gonna kind of go back to problem definitions here, uh,
and kind of redefine things
in order to view meta-learning from a Bayesian perspective.
So, uh, you can view supervised learning as, uh,
a maximum likelihood problem,
where you're maximizing the likelihood of your parameters given your data.
Uh, and in this case Phi is gonna be the model parameters and D is gonna be the dataset.
Uh, and of course,
it's kind of as before the dataset contains x-y pairs where you have inputs and labels.
Um, and you can also
rewrite this as
maximizing the probability of the data given your parameters
together with the marginal probability of your parameters.
Um, so the first term is your data likelihood,
and the second term may be something like a regularizer,
like something like weight decay for example,
which corresponds to putting a Gaussian prior on your weights with a fixed variance.
Okay. Um, and so then what this looks like is you have, uh,
some optimization over your dataset,
uh, and given your weights and then you have your regularizer.
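Written out, the decomposition described here looks like the following. This is a sketch in assumed notation: phi for the model parameters and D for the dataset of x-y pairs as in the lecture, with sigma squared introduced here as a symbol for the fixed prior variance. The second line additionally assumes the data points are independent given phi.

```latex
\begin{align*}
\arg\max_{\phi} \log p(\phi \mid \mathcal{D})
  &= \arg\max_{\phi} \; \log p(\mathcal{D} \mid \phi) + \log p(\phi) \\
  &= \arg\max_{\phi} \; \sum_{i} \log p(y_i \mid x_i, \phi) + \log p(\phi)
\end{align*}
% With a zero-mean Gaussian prior on the weights, p(\phi) = \mathcal{N}(0, \sigma^2 I):
\[
\log p(\phi) = -\frac{1}{2\sigma^2} \lVert \phi \rVert_2^2 + \text{const}
\]
```

The first term is the data likelihood and the second is the regularizer; with the Gaussian prior, maximizing the second term is the same as penalizing the squared norm of the weights, which is weight decay with a coefficient of 1 over 2 sigma squared.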
Um, unfortunately,
if your dataset is very small, as we talked about,
um, then you might overfit to that dataset.
Uh, and even if you have a regularizer there,
it may not be enough to
prevent you from overfitting, or it may
not be informative enough about
what you should be learning, or the model that you're trying to learn.
Um, so the question in meta-learning,
and kind of the key problem that it's trying to answer
is when we are solving supervised learning problems,
can we incorporate additional data in some way?
Like maybe we're not learning from scratch,
maybe we have experience from
before;
we've experienced the world before at this point.
Um, and that's what's gonna be called the meta-training data.
We want to be able to use the meta-training data such that when we perform a new task,
we can, uh, learn it more effectively and more efficiently.
Uh, and in particular, in
meta-learning, this additional data
is actually going to correspond to a set of datasets,
basically a set of datasets corresponding to tasks.
We wanna be able to use those datasets such that when we see data for a new task,
we can learn parameters for that task.
Um, and then of course each
of these individual datasets will have k data points.
Um, so for example,
if you want to solve a few shot classification problem,
maybe you're given five images and you wanna be able to classify new images,
uh, as being among one of those five classes.
Uh, if you try to train a neural network from scratch on this problem,
it will overfit massively or if your regularizer is too strong, it won't do anything.
Um, but if you have data from other image classes,
uh, like shown here,
then, then this is your meta-training data.
Then you may be able to use that meta-training data in a way that allows you to learn,
uh, from only five examples in a manner that is effective.
Um, and of course this doesn't need to be image classification problems.
This could be, um,
different kind of sinusoidal prediction problems.
It could be, um,
really a variety of other problems that,
like language classification or, or kind of,
yeah, uh, decoding text for example, etc.
Okay. Um, so that's kind of the problem that we wanna solve, um,
but what if we don't wanna keep around our past experience forever?
What if we, uh,
if we're gonna be training on,
on all our past experience across tasks then,
um, we don't wanna assume that we
have to access that every time we want to learn a new task.
Uh, and so what we're gonna do is we're gonna try to kind of compile
our meta-training dataset down to a set of parameters Theta,
which I'm going to call the meta-parameters.
Um, and Theta is basically gonna correspond to whatever we
need to know about the meta-training data in order to solve new tasks quickly.
So, um, basically if you kind of
view the- if you want to introduce this kind of intermediary variable, um,
if we write out the likelihood of our parameters for
our dataset given our past data- our past meta-training data,
we can write this out, uh,
in a way that integrates over,
uh, our meta-parameters Theta.
Um, so we first infer our meta-parameters Theta given our meta-training data,
and then we can make a prediction about our
task parameters given our data and our meta-parameters.
Uh, and note that this assumes that the, um,
our task-specific parameters and
our meta-training data are conditionally independent based,
uh, conditioned on Theta.
Uh, and we can basically approximate this,
uh, integral with- this is a fairly crude approximation,
but we can approximate this integral with
a point estimate for our meta-parameters, where the first,
uh, the- kind of the right-hand side of this is going to correspond to meta-training,
where we wanna learn a set of meta-parameters given our meta-training data.
And the left-hand side of this objective is gonna correspond to
adaptation where we wanna be able to learn new parameters for a new task,
given data from that task and our meta-parameters Theta-star.
Okay. Um, where Theta-star is equal to basically,
uh, argmax of log p Theta given our meta-training data.
Okay. So essentially this right-hand side is the meta-learning problem where we're
optimizing over our meta-parameters such that when we use them for,
uh, for adaptation, we can,
um, effectively learn parameters for that task.
Okay. Um, so meta-training is that cap- that second term and,
and, and adaptation is that, uh,
is the term now outlined in red.
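Written out, the decomposition described above looks roughly as follows. The symbols $\phi$, $\mathcal{D}$, and $\mathcal{D}_{\text{meta-train}}$ are assumed names for the task parameters, the new task's data, and the meta-training data.

```latex
\log p(\phi \mid \mathcal{D}, \mathcal{D}_{\text{meta-train}})
  = \log \int_{\Theta} p(\phi \mid \mathcal{D}, \theta)\,
      p(\theta \mid \mathcal{D}_{\text{meta-train}})\, d\theta
  \approx \underbrace{\log p(\phi \mid \mathcal{D}, \theta^\star)}_{\text{adaptation}}
    + \underbrace{\log p(\theta^\star \mid \mathcal{D}_{\text{meta-train}})}_{\text{meta-training}}

\theta^\star = \arg\max_{\theta} \log p(\theta \mid \mathcal{D}_{\text{meta-train}})
```

The first equality uses the conditional independence of $\phi$ and $\mathcal{D}_{\text{meta-train}}$ given $\theta$. The approximation replaces the integral with the point estimate $\theta^\star$.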
Uh, let's first go over,
how we can go about adaptation.
So we can, kind of, look at a very simple example,
and we'll revisit this example,
um, at the end.
Uh, so say, we wanna be able to make predictions, about test data points.
Um, the parameters corresponding to this neural network will correspond to Phi star.
They'll be the parameters for that task.
Uh, and we want to be able to infer these parameters from data.
So [NOISE] one very simple example,
for how we might put- uh,
infer these parameters from data,
is by training a neural network to take in,
the dataset and output Phi star for that task.
[NOISE] Uh,
where each of these examples corresponds to our dataset.
Uh, and so essentially,
what this would correspond to is,
um, we would have, kind of,
these different meta-training datasets [NOISE] for,
for this dataset D, we would pass them to the neural network.
We would then give it a, uh, a new data point,
and we wanna be able to have it make a prediction for that data point,
uh, based off of the computed task specific parameters.
Um, so you add this to your test input and then your inpu- uh, your test label.
Now, uh, one big question is okay,
we have this- this is, kind of, how we're gonna do the adaptation process.
How do we actually, go about meta-training this, uh, this model?
So, um, how do we actually basically,
learn the parameters of our recurrent neural network theta star,
such that it produces the right task specific parameters Phi star?
Um, and so the key idea here is we can, uh,
basically, match test-time to training time.
So at test-time, we're gonna be predicting Phi star,
in order to make predictions about test data points.
We want to be able to do the same thing in meta-training and basically,
train this recurrent neural network,
such that it produces parameters that lead to effective predictions or,
or, or accurate predictions.
So essentially, the key idea is that,
uh, train and test conditions should match.
Um, that's basically like the,
the principal rule of meta-learning basically.
Uh, if you wanna be able to learn at test-time,
you should basically train it to learn during meta-training.
Essentially be learning how to learn.
Um, so and then you can,
kind of, view the learning process, uh,
the adaptation process, as
sort of, meta-test time,
uh, if, if the former process is meta-train time.
Okay. Um, right, so the process of optimizing for theta is meta-training,
the process of producing Phi is meta-testing.
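To make "train and test conditions should match" concrete, here is a minimal toy sketch. It is not the recurrent network from the lecture. As a stand-in, adaptation is a closed-form ridge regression pulled toward theta, and the tasks, `LAM`, and all function names are hypothetical choices for illustration. Meta-training runs the same adaptation step used at test time and updates theta by the gradient of the held-out test-set error.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5          # data points per train set and per test set (K-shot)
LAM = 1.0      # fixed strength of the prior around theta (hypothetical choice)

def sample_task():
    """One toy task: y = w * x + noise, with the slope w drawn near 3.0."""
    w = rng.normal(3.0, 0.1)
    x_tr, x_ts = rng.uniform(-1, 1, K), rng.uniform(-1, 1, K)
    y_tr = w * x_tr + rng.normal(0, 0.05, K)
    y_ts = w * x_ts + rng.normal(0, 0.05, K)
    return (x_tr, y_tr), (x_ts, y_ts)

def adapt(theta, x_tr, y_tr):
    """phi = argmin_phi sum((phi*x - y)^2) + LAM*(phi - theta)^2, in closed form."""
    return (np.dot(x_tr, y_tr) + LAM * theta) / (np.dot(x_tr, x_tr) + LAM)

def meta_train(steps=2000, lr=0.5):
    theta = 0.0
    for _ in range(steps):
        (x_tr, y_tr), (x_ts, y_ts) = sample_task()
        phi = adapt(theta, x_tr, y_tr)            # same procedure as at test time
        dphi_dtheta = LAM / (np.dot(x_tr, x_tr) + LAM)
        # Gradient of the held-out (test-set) squared error with respect to theta.
        grad = np.mean(2.0 * (phi * x_ts - y_ts) * x_ts) * dphi_dtheta
        theta -= lr * grad
    return theta

theta_star = meta_train()
```

In this toy setup theta ends up close to the slope shared across tasks, so adapting to a new task starts from a useful prior rather than from scratch.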
Um, now, one of the things
that's pretty challenging if at meta-training time you're actually training it,
to make predictions about test data points,
um, where do these data points come from?
Uh, how do we actually,
optimize for it to make predictions on new data points.
Uh, and the important thing here is that, uh,
we need to be able to essentially reserve a test set
for each task that we're gonna be meta-training over.
Uh, so if we have, uh, a set of tasks,
then we have basically train data for each task,
we also need corresponding test sets for each of those tasks.
Uh, and these test sets, uh,
are going to be held out images,
from those same image categories.
Now, you might say, okay now you're gonna be training on the test set.
This doesn't seem right? Uh, [LAUGHTER] uh,
[NOISE] and that is- that's a good intuition to have.
[NOISE] But, uh, in this case,
we are now moving from training sets to meta-train sets,
and test sets to meta-test sets.
So each of these tasks,
the training set and the test sets for the task correspond to the meta-training dataset.
And then at meta-test time we're given new tasks.
Uh, and we don't wanna train on the meta-test set.
Okay. Um, so we have our meta-learning data,
this corresponds to training and test sets for every task, uh,
where each of the training datasets corresponds to K data points.
Each of the test datasets corresponds to a new set of K data points.
Um, yeah.
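One way to build such per-task train and test sets from a labeled dataset is sketched below. The function name, the "N-way" argument, and the toy label array are assumptions for illustration and are not from the lecture.

```python
import numpy as np

def sample_task(labels, n_way, k_shot, rng):
    """Return (train_idx, test_idx, classes) for one N-way, K-shot task.

    Each chosen class contributes K indices to the train set and a disjoint
    set of K indices to the test set.
    """
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    train_idx, test_idx = [], []
    for c in classes:
        # Shuffle this class's examples and keep 2K of them.
        idx = rng.permutation(np.flatnonzero(labels == c))[: 2 * k_shot]
        train_idx.extend(idx[:k_shot])   # K examples for the task's train set
        test_idx.extend(idx[k_shot:])    # K held-out examples for its test set
    return np.array(train_idx), np.array(test_idx), classes

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 20)    # toy labels: 10 classes, 20 examples each
train_idx, test_idx, classes = sample_task(labels, n_way=5, k_shot=1, rng=rng)
```

Calling `sample_task` repeatedly on the meta-training classes gives the tasks for meta-training. Calling it on classes that were never used during meta-training gives the meta-test tasks.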
Okay. So the complete, kind of,
optimization problem is that at
test-time we're gonna be inferring a set of task specific parameters.
Which can be some, some function that takes as input
the training dataset and outputs the task specific parameters.
Where the parameters of that function or the meta parameters are theta star.
Um, and we essentially wanna learn a set of meta parameters such that,
this function is good for held-out data points,
after getting the training dataset as input.
Okay. Um, so essentially you can view theta star as optimizing this,
uh, objective [NOISE] where we want to optimize
the probability of the parameters,
um, being effective at new data points.
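As a sketch, the objective described here can be written as follows. The superscripts "tr" and "ts" for each task's train and test sets, and the name $f_\theta$ for the function that maps a training set to task parameters, are assumed notation.

```latex
\theta^\star = \arg\max_{\theta} \sum_{i=1}^{n} \log p(\phi_i \mid \mathcal{D}_i^{\text{ts}}),
\qquad
\phi_i = f_\theta(\mathcal{D}_i^{\text{tr}})
```

At meta-test time the same function is applied to a new task's training set: $\phi^\star = f_{\theta^\star}(\mathcal{D}^{\text{tr}})$.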
Okay. Um, cool.
So you could also,
look at this as a graphical model,
where theta is essentially your prior,
Phi i is your task specific parameters for a given task,
and you have a training set and a test set for,
um, for each task.
So Phi is, um,
Phi leads to, uh,
the data and theta acts as a prior on the task specific parameters for all the tasks.
So theta is essentially the information that's shared across tasks.
Okay. Um, and then this dashed line means that we don't know the,
the kind of test labels for new tasks at meta-test time.
But we do know the training dataset.
Okay. Um, some terminology
before I get into some of the more mechanistic view of meta-learning.
So, um, I talked about this.
We call- I call this the meta-train dataset.
Um, each of these rows is a meta-training task.
Uh, and each of these tasks has a corresponding train set and test set.
Uh, the, the left side of these can be viewed as the training data,
and the right side can be viewed as the test data.
Uh, but another common terminology for this,
is support set and query set where you have some set of
support images or training images and some query images or test images.
Uh, and if you have a K-shot learning problem or a few-shot learning problem,
K is the number of data points in your training dataset.
So if you're doing one-shot learning you're doing- you're learning from one example,
if you're doing K-shot learning you're learning from K examples.
Okay. Um, how does this relate to other problem settings?
So multi-task learning like we talked about at the beginning of this lecture,
you learn a model with,
um, with a single set of parameters that solves multiple tasks.
Uh, you don't care about generalizing to new tasks.
[NOISE] Uh, and this essentially can be
seen as somewhat of a special case of
meta-learning where your task specific parameters and your, um,
your prior parameters are the same, uh,
such that you just, um,
kind of, have a single network that is representing all of the tasks.
Uh, and in many ways multi-task learning is a prerequisite
for meta-learning because if you can't solve the training tasks,
then you have no chance of being able to learn new tasks, uh,
more efficiently than, um,
than like learning from scratch for example.
Um, it's also related to things like hyperparameter optimization and, and auto-ML.
So, uh,
in hyperparameter optimization you can essentially view
theta as the hyperparameters and Phi is the network weights.
Uh, and in architecture search you could view theta as the architecture,
Phi is the network weights, uh,
and both of these are,
kind of, very active, uh,
areas of research and outside of the scope of this class, of this tutorial.
Okay, um, I think we're mostly out of time,
and so I will cover the next two parts of the lecture in the next lecture,
and we'll probably push back the,
um, the due date for the homework assignment, uh,
probably like a couple days, such that we,
um, you guys have enough time to complete it.
Um, are there any questions about what I covered with
meta-learning before we wrap up? Okay [NOISE].
A couple quick reminders before you leave.
um, Homework 1 is posted today and is not due on Monday, October 7th.
Uh, fill out paper preferences by tomorrow, uh,
and the TensorFlow review session will also be tomorrow at 4:30 PM.
Bye. 
