Hey everyone, let's get started.
So the plan for today is that we'll go over the rest of ICA, independent component analysis,
and we'll talk a bit about CDFs, cumulative distribution functions,
and then derive the ICA model.
In the second half of today,
we'll start on the last of the four major topics of the class, which is reinforcement learning.
We'll talk about MDPs, or Markov decision processes, okay?
So to recap briefly:
you remember the overlapping voices demo.
In the ICA problem, the independent component analysis problem,
we assume we have sources $s \in \mathbb{R}^n$ if you have n speakers.
So, for example, if this is speaker one's audio,
then at time t, $s_1^{(t)}$ (s superscript (t), subscript 1)
is the sound emitted by speaker one at time t.
Sorry, that's interesting.
All right, let me go back over this a little bit.
We're using i to index training examples,
and the training examples sweep over time.
Usually I use i; sometimes I use t,
in cases where the different examples
come from different points in time in your recording.
What your microphones record is $x^{(i)} = As^{(i)}$.
So just for now,
let's say you have two speakers and two microphones, in which case
A will be a 2 by 2 matrix. In the homework problem,
we have five speakers and five microphones,
in which case A will be a 5 by 5 matrix.
We'll talk later about what happens
if the number of speakers and microphones is not the same.
The goal is to find the matrix W,
which should hopefully be $A^{-1}$,
so that $s^{(i)} = Wx^{(i)}$ recovers the original sources.
And we're going to use $w_1$ up through
$w_n$ to represent the rows of this matrix W. Yeah?
[inaudible].
Oh, yes, you're right.
Thank you. Okay.
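A minimal NumPy sketch of this setup, not from the lecture: the mixing matrix values are made up for illustration, and the sources are uniform on [-1, 1] as in the picture from last time. Column i of `S` is $s^{(i)}$, the microphones record $x^{(i)} = As^{(i)}$, and if we knew $W = A^{-1}$, multiplying by it would recover the sources, with row $w_j$ of W producing source j.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 2, 1000                          # n speakers = n microphones, m time steps
S = rng.uniform(-1, 1, size=(n, m))     # column i is s^(i)
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])              # hypothetical mixing matrix
X = A @ S                               # column i is x^(i) = A s^(i)

W = np.linalg.inv(A)                    # ideal unmixing matrix, W = A^{-1}
S_rec = W @ X                           # s^(i) = W x^(i)

# Row j of W recovers source j: s_j^(i) = w_j^T x^(i).
s1_from_row = W[0] @ X
print(np.allclose(S_rec, S), np.allclose(s1_from_row, S[0]))
```

In real ICA, A is unknown, so W has to be estimated from the recordings `X` alone; that estimation is what the rest of the lecture derives.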
Okay. So, last time:
remember, this is a picture of the cocktail party problem.
Last time I showed these pictures about
why ICA is even possible, right?
Given two overlapping voices,
how is it even possible to separate them out?
How is there enough information to know
what the two overlapping voices are?
One picture we saw was this one,
where if $s_1$ and $s_2$ are uniform between -1 and +1,
then the distribution of the data will look like this, a square.
If you pass this data through the mixing matrix A, then your observations,
now on axes $x_1$ and $x_2$, may look like this,
and your job is to find an unmixing matrix W that
maps this data back to the square, okay?
Now, this example works because the
sources $s_1$ and $s_2$
were distributed uniformly between -1 and +1.
It turns out human voices,
the recorded values at each moment in time, are not
distributed uniformly between -1 and +1.
And it turns out that
if the data were Gaussian,
then ICA would actually not be possible.
Here's what I mean.
The uniform distribution is a highly non-Gaussian distribution, right?
Uniform on [-1, +1]
is non-Gaussian, and that
is what makes ICA possible.
What if $s_1$ and $s_2$ came from Gaussian densities?
If that were the case,
then the distribution of $(s_1, s_2)$ would be rotationally symmetric,
and so there would be a rotational ambiguity:
any rotated pair of axes could be $s_1$ and $s_2$.
Compare that with the uniform case: looking at this parallelogram,
you can read off that maybe one axis should look like that
(I'm drawing with a mouse, so not very well),
and the second axis should maybe look like that,
and by inverting that mapping you get the data back to the square.
But if the data look like this,
then you actually don't know,
because maybe this should be $s_1$ and that should be $s_2$.
So this is rotational ambiguity.
Because the Gaussian distribution is
rotationally symmetric (if $s_1$ and $s_2$ are standard Gaussians,
their joint distribution is rotationally symmetric),
you don't have enough information to recover
the directions that correspond to the original sources, okay?
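A quick numerical check of this argument, as a sketch: it uses sample excess kurtosis, a statistic the lecture doesn't use, purely as a convenient way to see whether rotating the sources changes what each coordinate looks like. The sample size, seed, and 45° angle are arbitrary choices. Rotating unit-variance uniform sources changes the shape of each coordinate's distribution, so the rotation is detectable; rotating standard Gaussian sources leaves the joint distribution exactly the same, so no statistic can detect it.

```python
import numpy as np

def excess_kurtosis(z):
    """Sample excess kurtosis E[(z - mean)^4] / var^2 - 3 (zero for a Gaussian)."""
    z = z - z.mean()
    return np.mean(z ** 4) / np.mean(z ** 2) ** 2 - 3.0

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s, c]])

rng = np.random.default_rng(0)
n = 200_000
# Unit-variance sources: uniform on [-sqrt(3), sqrt(3)], and standard Gaussian.
S_uniform = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, n))
S_gauss = rng.standard_normal(size=(2, n))

R = rotation(np.pi / 4)                 # rotate the data by 45 degrees
for name, S in [("uniform", S_uniform), ("gaussian", S_gauss)]:
    X = R @ S
    print(f"{name}: first coordinate before {excess_kurtosis(S[0]):+.2f}, "
          f"after {excess_kurtosis(X[0]):+.2f}")
```

For reference, in population terms a unit-variance uniform variable has excess kurtosis -1.2, a 45° rotation of two independent ones gives -0.6, and a Gaussian gives 0 at every angle; the printed sample values should be close to these.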
So it turns out that
there is some ambiguity in the output of ICA.
In particular, last time we talked about
two sources of ambiguity.
First, you don't know which is speaker one and which is speaker two:
you don't know which one to label speaker one and which one to label speaker two.
Second, you might take this data and flip it horizontally,
reflecting it across the vertical axis so that $s_1$ goes to $-s_1$,
or reflect it across the horizontal axis:
we don't know if it's $+s_2$ or $-s_2$.
In this example,
where $s_1$ is uniform on [-1, +1],
those are the only sources of ambiguity.
But if the data were Gaussian, there would be an additional rotational ambiguity,
which actually makes it
impossible to separate out the sources, okay?
So it turns out that
the Gaussian density is the only distribution with independent components
that is rotationally symmetric.
If $s_1$ and $s_2$ are
independent and their joint distribution is rotationally symmetric,
meaning that the distribution has circular contours,
then it must be a Gaussian density.
And so there is a theorem,
which I'll just state informally,
that ICA is possible only if your data are non-Gaussian,
but so long as your data are non-Gaussian,
it is possible to recover the independent sources, okay?
(More precisely, at most one of the sources may be Gaussian.)
So what I'd like to do is
develop the ICA algorithm assuming that the data are non-Gaussian, okay?
Now, in order to
develop the ICA model,
we need to figure out the density of s.
I'm going to use $p_s$
to represent the density of the random variable s.
An equivalent way to represent
the density of a continuous random variable is via its CDF,
which stands for cumulative
distribution function.
The cumulative distribution function F of a
random variable S
is defined as the chance that the random variable is less than or equal to a given value: $F(s) = P(S \le s)$.
Sorry, my notation has been inconsistent:
here I'm using capital S to denote the random variable,
and lowercase s is some constant,
the same constant as the argument of F, okay?
So, for example,
if this is the PDF
of a random variable S,
maybe a Gaussian,
then the CDF is a function that
increases from 0 to 1, where
the height of the CDF at a certain point is a probability.
If you look at both curves at the same point,
the height of the CDF at a point lowercase s
is the probability that the random variable
takes on a value equal to s or lower,
which means that the height of this function is equal to
the probability mass, the area under the curve of the PDF,
to the left of that point, okay?
Some probability and statistics courses teach this concept and some don't, I guess,
but there's a mapping between the PDF and the CDF
of a continuous random variable.
The relation between the PDF and the CDF is that
the density is equal to the first derivative of the CDF,
$p_s(s) = F'(s)$. So if you take the derivative of the CDF,
you recover the PDF, okay?
So in order to specify
a random variable, we could either specify the PDF,
the probability density function,
or we can specify the CDF, which just
tells you the chance of the random variable taking on
a value less than or equal to any particular value s. By taking the derivative of the CDF,
you can always recover the PDF,
and by integrating the PDF you can always get back to the CDF, okay?
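A small numerical illustration of the CDF/PDF relationship, assuming the sigmoid as the CDF (the choice the lecture makes a little later). Differentiating the CDF numerically reproduces the closed-form density $g(s)(1-g(s))$, and integrating that density from the left recovers the CDF. The grid range and spacing are arbitrary choices for the sketch.

```python
import numpy as np

def sigmoid(s):
    """Logistic CDF F(s) = 1 / (1 + exp(-s)), computed via tanh to avoid overflow."""
    return 0.5 * (1.0 + np.tanh(0.5 * s))

def sigmoid_pdf(s):
    """Closed-form derivative F'(s) = g(s) * (1 - g(s))."""
    g = sigmoid(s)
    return g * (1.0 - g)

s = np.linspace(-10.0, 10.0, 20_001)
ds = s[1] - s[0]

# CDF -> PDF: differentiate the CDF numerically.
pdf_numeric = np.gradient(sigmoid(s), ds)

# PDF -> CDF: integrate the PDF with the trapezoid rule, starting from F(-10)
# so that the (tiny) probability mass to the left of the grid is accounted for.
increments = 0.5 * (sigmoid_pdf(s[1:]) + sigmoid_pdf(s[:-1])) * ds
cdf_numeric = sigmoid(s[0]) + np.concatenate(([0.0], np.cumsum(increments)))

print("max |numeric PDF - g(1-g)|:", np.max(np.abs(pdf_numeric - sigmoid_pdf(s))))
print("max |integrated PDF - CDF|:", np.max(np.abs(cdf_numeric - sigmoid(s))))
```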
So what we're going to do in
ICA is, instead of specifying a PDF for how speakers' voices sound,
we're going to specify a CDF, and
we'll have to choose a CDF that is not the Gaussian CDF,
because we have to assume that the data are non-Gaussian.
And the CDF
is a function that always goes from 0 to 1, okay?
So in a little bit, we'll specify some CDF
for the density of the sources, for what human voices sound like, let's say.
And if you differentiate it,
you get the PDF, the density of s:
$p_s(s) = F'(s)$. Now, we're
going to derive a maximum likelihood estimation algorithm in a minute.
Our model is that $x = As$,
which is equal to $W^{-1}s$,
and $s = Wx$.
So that's the model.
In order to derive a maximum likelihood estimate for the parameters,
we need the density of x.
This is the relationship between x and s: $x = As = W^{-1}s$ and $s = Wx$.
What I'd like to do is this:
let's say you know the density of s.
What is the density of x if x is computed as the matrix A times s?
One step that's tempting to take is to just say,
well, s is equal to W times x,
so the probability of x is just equal to
the probability of s taking on the corresponding value, $p_x(x) = p_s(Wx)$.
The reasoning would be that,
assuming W is an invertible matrix, the mapping is a bijection:
there's a one-to-one mapping between x and s. So to find the probability of x,
you'd just find the corresponding value of s and compute its probability.
It turns out this is incorrect.
It works for probability mass functions, for discrete probability distributions
that take on discrete values,
but it's incorrect for continuous probability densities.
So let me show an illustration,
and then we'll go back and derive the correct way of
computing the density of x. We want the
density of x because
when you get the training set,
you only get to observe x, and so for
finding the maximum likelihood estimate of the parameters,
you need to know
the density of x, so that you can
choose the parameters W that maximize the likelihood, okay?
That's why we want to compute the density of x.
So let's use a simple example.
Let's say the density of s is an indicator,
$p_s(s) = 1\{0 \le s \le 1\}$.
So s is distributed uniformly from 0 to 1.
And let's say $x = 2s$.
So in our notation, A = 2 and
W = 1/2;
this is an n = 1, one-dimensional, example.
So this is the density of s,
a uniform distribution from 0 to 1.
If x = 2s,
then x should be
distributed uniformly from 0 to 2,
because if s is uniform from 0 to 1 and
you multiply it by 2,
x is distributed uniformly from 0 to 2.
And so the density of x looks like
this, spread over 0 to 2,
and it's now half as tall because
a probability density function has to integrate to 1.
So this is the uniform-from-0-to-2 probability density function,
and the correct formula is
$p_x(x) = \frac{1}{2} \cdot 1\{0 \le x \le 2\}$.
Okay? And more generally,
the correct formula is
$p_x(x) = p_s(Wx) \cdot |W|$, where $|W|$ is the determinant of the matrix W.
In the one-dimensional case W is a real number,
and we take the absolute value of its determinant, which is just the absolute value of W; that's why
the density of x has the factor 1/2.
That's the absolute value of the determinant of W,
times the indicator of whether $s = \frac{1}{2}x$ lies between 0 and 1:
$p_x(x) = \frac{1}{2} \cdot 1\{0 \le \frac{1}{2}x \le 1\}$.
So this illustration shows why multiplying by
the determinant of W is the right way
to compute the density of x.
For those of you familiar with
determinants (and the determinant is
a function you can call in NumPy to compute),
the intuition is that the determinant measures how
much a linear map stretches out a local volume,
and so you need to
divide by the absolute value of the determinant of A, or equivalently multiply by that of W,
in order to make sure your distribution still normalizes to 1.
So that's where that comes from.
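A sketch of the one-dimensional example above: $s \sim \text{Uniform}[0,1]$, $x = 2s$, so $W = 1/2$. The naive formula $p_s(Wx)$ integrates to about 2, while $p_s(Wx)\,|W|$ integrates to 1 and matches a histogram of simulated values of x. The grid and sample sizes are arbitrary; in n dimensions the factor becomes the absolute value of `np.linalg.det(W)`.

```python
import numpy as np

A, W = 2.0, 0.5                          # x = A s and s = W x (1-D, so W = 1/A)

def p_s(s):
    """Uniform density on [0, 1]: p_s(s) = 1{0 <= s <= 1}."""
    return ((s >= 0) & (s <= 1)).astype(float)

def p_x_naive(x):
    return p_s(W * x)                    # tempting but wrong: no |W| factor

def p_x(x):
    return abs(W) * p_s(W * x)           # p_x(x) = p_s(W x) |W|

x = np.linspace(-1.0, 3.0, 400_001)
dx = x[1] - x[0]
naive_total = np.sum(p_x_naive(x)) * dx  # about 2: not a valid density
total = np.sum(p_x(x)) * dx              # about 1
print(naive_total, total)

# Monte Carlo check: sample s, set x = 2 s, and compare a histogram with p_x.
rng = np.random.default_rng(0)
samples = A * rng.uniform(0.0, 1.0, size=1_000_000)
hist, _ = np.histogram(samples, bins=20, range=(0.0, 2.0), density=True)
print(hist.round(2))                     # every bin should be close to p_x = 0.5
```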
So we're nearly done.
There's just one more decision,
and then we can derive the maximum likelihood estimate
of the parameters.
The last thing we need to do is
choose the density for what your speakers' voices sound like.
And as I said just now,
what we're going to do
is choose a non-Gaussian distribution.
So $F(s) = P(S \le s)$ is the chance that this person's voice,
the random variable S, is less than a certain value.
We need a smooth function that goes between
0 and 1,
a smooth function that has vaguely that shape.
So what functions do we know that have vaguely that shape?
Let's pick the sigmoid function, $F(s) = g(s) = 1/(1 + e^{-s})$,
and it turns out this works.
There are many choices that actually work fine.
It turns out that if you choose the sigmoid function to be the CDF,
and you look at the PDF this induces
by taking its derivative,
$p_s(s) = F'(s) = g'(s)$,
then compared with the Gaussian,
the PDF this choice induces
has fatter tails.
By which I mean: the Gaussian density
goes to 0 very quickly,
like $e^{-s^2/2}$;
the Gaussian has a square in the exponent of its density.
This particular density,
the derivative of the sigmoid, goes to 0 more slowly (roughly like $e^{-|s|}$), and this captures
human voices and many natural phenomena better than
a Gaussian density, because they have a larger number of extreme outliers
that are more than one or two standard deviations away.
But there are actually multiple distributions that work.
You could use a double exponential distribution:
if you take a symmetric, two-sided exponential density for $p_s$,
it will also work quite well for ICA.
In the early history of ICA,
I think it might have been Terry Sejnowski,
down at the Salk Institute, who
just needed a function with these properties.
He picked the sigmoid, plugged it in, and it worked just fine.
It's been a good enough default that
it's still widely used.
But I've also used the
double-sided exponential, sometimes called the Laplacian distribution,
and that works fine as well as a choice of $p_s$, okay?
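To make "fatter tails" concrete, here is a sketch comparing $P(|S| > t)$ for the Gaussian, the logistic density (the derivative of the sigmoid), and the Laplace density. The scale parameters are an assumption of the sketch: each density is rescaled to unit variance so that t is measured in standard deviations (the lecture's $g'(s)$ itself has variance $\pi^2/3$).

```python
import math

def gaussian_tail(t):
    """P(|S| > t) for a standard (unit-variance) Gaussian."""
    return math.erfc(t / math.sqrt(2.0))

def logistic_tail(t):
    """P(|S| > t) for the logistic density rescaled to unit variance.

    The logistic with scale b has variance b^2 * pi^2 / 3, so b = sqrt(3) / pi.
    """
    b = math.sqrt(3.0) / math.pi
    return 2.0 / (1.0 + math.exp(t / b))

def laplace_tail(t):
    """P(|S| > t) for the Laplace (double-exponential) density at unit variance.

    The Laplace with scale b has variance 2 b^2, so b = 1 / sqrt(2).
    """
    b = 1.0 / math.sqrt(2.0)
    return math.exp(-t / b)

for t in (2, 4, 6):
    print(f"t={t}: gaussian {gaussian_tail(t):.2e}, "
          f"logistic {logistic_tail(t):.2e}, laplace {laplace_tail(t):.2e}")
```

From 2 standard deviations outward, the Gaussian tail is the smallest of the three and the Laplace tail the largest; at t = 4 the logistic tail is already more than 20 times the Gaussian one.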
So, the final step.
The density of s is
$$p(s) = \prod_{j=1}^{n} p_s(s_j),$$
the product, from j = 1 through
your n sources, of the probability of each speaker emitting that sound,
because the n speakers
are speaking independently. Yes?
[inaudible].
Say that again?
[inaudible] .
Oh, yes, you're right. Sorry about that.
This should have been $p_s$.
Thank you. Right:
we go from the CDF to the PDF by taking the derivative.
All right, cool. So
s is the vector of all
two speakers' or all five speakers' voices at one moment in time,
so the density of s,
with $s \in \mathbb{R}^n$, is the product of the individual speakers' probabilities.
This is the key assumption of ICA:
your two speakers or your five speakers are having independent conversations,
and so at every moment in time
they choose independently of each other what sound to emit.
All right.
So, using the formula we worked out just now,
the density of x is
the density of s evaluated at Wx, times the determinant of W:
$$p(x) = \prod_{j=1}^{n} p_s(w_j^T x) \cdot |W|.$$
In this notation,
$w_j$ is the j-th row of the matrix W, and so
$s_j = w_j^T x$:
take the corresponding row and multiply it by x to get the corresponding source.
(I'm using j here rather than i,
to make this clearer, since i indexes training examples.)
So this shows
the density of x expressed as a function of
$p_s$, which we've taken to be the derivative of the sigmoid,
and of the parameter W.
So this is a model that,
given a setting of the parameter W, which is a square matrix,
lets us write down the density of x.
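A sketch of this density in code, assuming $p_s = g'$ (the derivative of the sigmoid): `log_px` evaluates $\sum_j \log g'(w_j^T x) + \log|\det W|$ for a batch of points. The unmixing matrix and grid are made up for illustration. Summing $p(x)$ over a fine grid gives approximately 1, which is exactly what the $|W|$ factor is for, and swapping the rows of W or flipping the sign of a row leaves the density unchanged, which is the permutation and sign ambiguity discussed earlier.

```python
import numpy as np

def log_g_prime(z):
    """log g'(z) for the logistic density g'(z) = e^{-|z|} / (1 + e^{-|z|})^2.

    Written in terms of |z| (g' is symmetric) so it never overflows.
    """
    a = np.abs(z)
    return -a - 2.0 * np.log1p(np.exp(-a))

def log_px(X, W):
    """log p(x) = sum_j log g'(w_j^T x) + log|det W| for each row x of X (shape m x n)."""
    S = X @ W.T                              # row i holds s^(i) = W x^(i)
    _, logabsdet = np.linalg.slogdet(W)      # log|det W|, computed stably
    return log_g_prime(S).sum(axis=1) + logabsdet

W = np.array([[2.0, 0.5],
              [0.5, 1.0]])                   # hypothetical unmixing matrix, det W = 1.75

# The density should integrate to 1; check with a Riemann sum on a wide grid.
h = 0.1
grid = np.arange(-50.0, 50.0, h)
x1, x2 = np.meshgrid(grid, grid)
X = np.column_stack([x1.ravel(), x2.ravel()])
total = np.exp(log_px(X, W)).sum() * h * h
print("integral of p(x):", total)

# Permutation and sign ambiguity: reordering the rows of W, or flipping the
# sign of a row, gives the same density.
Xs = X[::997]
W_perm = W[::-1]
W_flip = W * np.array([[1.0], [-1.0]])
print(np.allclose(log_px(Xs, W_perm), log_px(Xs, W)),
      np.allclose(log_px(Xs, W_flip), log_px(Xs, W)))
```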
So the final step is:
we use maximum likelihood estimation to estimate the parameters W.
The log-likelihood of W is a sum over
the training examples:
$$\ell(W) = \sum_{i=1}^{m} \left( \sum_{j=1}^{n} \log g'(w_j^T x^{(i)}) + \log|W| \right).$$
And you can use stochastic gradient ascent:
take the derivative of the log-likelihood with respect to W.
This is derived in the lecture notes;
I'll just write out the resulting update for one training example here:
$$W := W + \alpha \left( \begin{bmatrix} 1 - 2g(w_1^T x^{(i)}) \\ 1 - 2g(w_2^T x^{(i)}) \\ \vdots \\ 1 - 2g(w_n^T x^{(i)}) \end{bmatrix} x^{(i)T} + (W^T)^{-1} \right).$$
I hope I got that right. Yeah, okay.
Don't worry about the details of
the derivatives; there are full derivations in the lecture notes.
But it turns out that
if you take the derivative of the log-likelihood with respect to the parameter
matrix W and use stochastic gradient ascent to maximize the log-likelihood,
and run this for a while, then you can get
ICA to find a pretty good matrix W
for unmixing the sources, okay?
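A sketch of one stochastic gradient ascent step, implementing the update written above for a single example $x^{(i)}$; `sga_step` and `log_lik_one` are hypothetical helper names, and the matrix size and seed are arbitrary. The finite-difference check compares the update direction with a numerical derivative of the per-example log-likelihood, a cheap way to catch sign or transpose mistakes in the formula.

```python
import numpy as np

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(0.5 * z))   # logistic g(z), overflow-free

def log_lik_one(W, x):
    """Per-example log-likelihood: sum_j log g'(w_j^T x) + log|det W|."""
    g = sigmoid(W @ x)
    return np.sum(np.log(g * (1.0 - g))) + np.log(abs(np.linalg.det(W)))

def sga_step(W, x, alpha):
    """One stochastic gradient ascent step on a single example x (shape (n,)).

    W := W + alpha * ((1 - 2 g(W x)) x^T + (W^T)^{-1})
    """
    grad = np.outer(1.0 - 2.0 * sigmoid(W @ x), x) + np.linalg.inv(W.T)
    return W + alpha * grad

# Finite-difference check of the gradient used in the update.
rng = np.random.default_rng(0)
n = 3
W = np.eye(n) + 0.1 * rng.standard_normal((n, n))
x = rng.standard_normal(n)
eps = 1e-6
numeric = np.zeros((n, n))
for j in range(n):
    for k in range(n):
        E = np.zeros((n, n))
        E[j, k] = eps
        numeric[j, k] = (log_lik_one(W + E, x) - log_lik_one(W - E, x)) / (2 * eps)
analytic = sga_step(W, x, alpha=1.0) - W    # with alpha = 1 the step equals the gradient
print("max gradient error:", np.max(np.abs(numeric - analytic)))
```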
So just to recap the whole algorithm:
you have a training set $x^{(1)}$ up through $x^{(m)}$,
where each training example is the
microphone recordings at one moment in time,
and so time goes from 1 through m.
You initialize the matrix W, say
randomly, and use stochastic gradient ascent with
this formula for the derivative in order to maximize the log-likelihood of the data.
After gradient ascent converges,
you have a matrix W, and you can recover the sources as $s^{(i)} = Wx^{(i)}$.
Now that we have the sources, you can take, say,
$s_1^{(1)}$ through $s_1^{(m)}$ and play that through
your laptop speaker in order to hear what source one sounds like.
So that's how you would take
overlapping voices and try to unmix them.
Okay.
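An end-to-end sketch of this recap on synthetic data. Several choices here are assumptions, not from the lecture: Laplace-distributed sources stand in for voices, the mixing matrix is made up, W starts near the identity rather than fully at random (a purely random start can be close to singular, which makes $(W^T)^{-1}$ blow up), and it uses full-batch gradient ascent (the per-example gradient averaged over all m examples) instead of per-example stochastic updates, purely to keep it short and deterministic. Because of the ambiguities discussed earlier, the recovered sources are compared with the true ones only up to order, sign, and scale, via absolute correlation.

```python
import numpy as np

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(0.5 * z))   # logistic g(z), overflow-free

def ica(X, alpha=0.1, iters=10_000, seed=0):
    """Estimate an unmixing matrix W from X (shape n x m, column i is x^(i)).

    Full-batch gradient ascent on the average log-likelihood; the gradient is
    the per-example term (1 - 2 g(W x)) x^T averaged over examples, plus (W^T)^{-1}.
    """
    n, m = X.shape
    rng = np.random.default_rng(seed)
    W = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # start near the identity
    for _ in range(iters):
        grad = (1.0 - 2.0 * sigmoid(W @ X)) @ X.T / m + np.linalg.inv(W.T)
        W = W + alpha * grad
    return W

rng = np.random.default_rng(1)
m = 5000
S = rng.laplace(size=(2, m))             # hypothetical heavy-tailed "voices"
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])               # hypothetical mixing matrix
X = A @ S                                # microphone recordings, one column per time step

W = ica(X)
S_hat = W @ X                            # recovered sources s^(i) = W x^(i)

# |correlation| between each recovered source (rows) and each true source (columns);
# because of the order/sign/scale ambiguity, expect one entry near 1 per row and column.
C = np.abs(np.corrcoef(S_hat, S)[:2, 2:])
print(C.round(3))
```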
[inaudible]
Oh, why is this choice of density not rotationally symmetric?
Boy, how to visualize that?
Try plotting it in NumPy and matplotlib, I guess.
If this is $s_1$ and $s_2$,
what you do not want is a density whose contours look like that, like circles.
I haven't done this for a while,
but I believe if you take this distribution,
the contours will look like that.
It's been a while since I looked at this,
but I think it'll look like that.
So this is not rotationally symmetric.
Yes, for the Laplace distribution the contours definitely look like that,
and I think the sigmoid's density looks a bit like that too.
Plot it and see if I'm right, or post it on Piazza if one of you plots it,
so everyone can see it; I haven't done that for a long time.
Yeah, at the back.
[inaudible]
Oh, the derivative of the log? Actually,
yes, the log should be like this, I think. Yes.
[BACKGROUND]
Oh, sorry, g is the sigmoid function,
so g(z). Yeah, thank you. Right, more questions?
[inaudible]
Sure. What's the
closest nonlinear extension of this?
Frankly, we don't have a great answer to that right now.
A bunch of people, including
my former students and me,
have done research to try to extend this to
nonlinear versions, and there's some stuff that kind of works,
but I don't think there's a
tried-and-true algorithm that I'm ready to say is the right way to do it.
Actually, maybe I could
say a little more about that, if you're interested.
Let me try.
All right, let's see.
Starting several years ago, and still ongoing,
there's been research,
some done by my collaborators and me,
some done by others, on trying to build nonlinear versions of ICA.
Some of you might have seen the slightly infamous
Google cat result.
This was in the Google Brain project, one of the first projects we did,
a few years ago now, where
we trained a neural network
on many, many hours of YouTube videos, and
eventually it learned to
detect cats, because apparently there are a lot of cats in YouTube videos.
It turns out that the algorithm we used was
sparse coding, which is actually very closely related to ICA.
This rough algorithm was attempting to build a nonlinear version of ICA,
where you train one layer of sparse coding, let's say,
to extract low-level features, and then recursively apply it on top
to learn not just edge detectors
but object-part detectors,
and then eventually
the somewhat infamous Google cat.
But I think this is actually still ongoing research.
Some of the most interesting research has been on hierarchical versions of sparse coding.
Sparse coding is a different algorithm that turns out to be very closely related to ICA;
you can show that they're optimizing very similar things.
People tried to turn hierarchical versions of this into a multilayer neural network, and it kind of worked:
there was work showing it can learn interesting features.
But what happened was that
supervised learning then really took off, and the whole world shifted a lot of
its attention to supervised learning and
building deeper supervised neural networks.
And so hierarchical sparse coding, running
ICA over and over to learn nonlinear versions,
has received much less
attention from researchers than it really deserves.
So maybe you or someone in this class could go back and do more research on that.
I still think it's a promising area. All right.
So let me wrap up with some ICA examples.
This is actually from a former TA of the class, Catie Chang.
It turns out that
ICA is routinely used to clean up EEG data today.
So what's an EEG?
You place many electrodes on your scalp
to measure weak electrical signals
on the surface of your scalp.
So what does the human brain do?
The neurons in your brain right now
fire, generating little pulses of electricity,
and if you place electrodes on your scalp,
you can get very weak measurements of the
voltage of the electrical activity
at a certain point on your scalp.
So the analogy to... oh, excuse me.
What's wrong here? All right.
So the analogy to the cocktail party problem, with its
overlapping speakers' voices, is that
your brain does a lot of things at the same time.
Your brain helps regulate your heartbeat;
part of your brain does that.
Another part of your brain
makes your eyes blink every now and then,
another part of your brain is
responsible for making sure that you breathe,
and another part of your brain is responsible for
thinking about machine learning and stuff like that.
[LAUGHTER] So your brain actually handles many,
many tasks at the same time.
And as your brain...
sorry, not sure what's wrong with this.
Okay. As your brain
carries out these different tasks in parallel,
different parts of your brain generate different electrical impulses.
So imagine that you have a
cocktail party in your head:
many overlapping voices,
so these are now voices in your head.
One part of your brain is saying,
"All right, heart, go and beat; heart, go and beat;
heart, go and beat," and another part of the brain is saying,
"Breathe in and breathe out, breathe in and breathe out,"
and another part of the brain is saying, "Ooh,
what's wrong with this PowerPoint?"
[LAUGHTER] That's what my brain is saying, right?
What each electrode on the surface of your scalp does is
measure an overlapping combination of all of these voices.
Because different parts of your brain are sending these electrical impulses,
they add up, and so any one point on the surface of your scalp
reflects a sum, or mixture,
of these different voices,
of these different things your brain is doing.
So if you zoom in to the EEG plot,
each line is the voltage measured at a single electrode
on your scalp, and these signals are quite correlated.
You see that when there's a massive voice in your brain shouting
something like "Beat your heart" or "Blink your eyes,"
that signal goes through to all of the different electrodes,
which is why you can see these artifacts reflected in all of
these electrodes. Sorry.
All right. It turns out a pretty good way to clean up
this data is to take all of these time series,
pretty much exactly as we learned with
the ICA algorithm, and separate them out into the independent components.
It turns out in this example
there are two components corresponding to the heartbeat,
and that one is actually the eye-blink component.
And so one way to clean up this data... sorry,
I really wonder what's wrong with this.
All right. Let me try something; maybe if I...
oh, that's interesting. All right.
Okay, well, all right.
So there's the heartbeat, there's the eye blink,
and so on.
If you run ICA and then have a person look at the components and say,
"Oh, this is the heartbeat, this is the eye blink," you can remove them:
subtract all those components out,
and then you end up with a
much cleaner EEG signal,
which you can then use for downstream processing.
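One straightforward way to do the removal step, sketched under assumptions: unmix with W, zero out the components a person has flagged as artifacts, and map back to electrode space with $W^{-1}$. The helper name `remove_components` and the small synthetic check are hypothetical; with the true W, the result equals the mixture of only the remaining sources.

```python
import numpy as np

def remove_components(X, W, drop):
    """Unmix X with W, zero out the components in `drop`, and remix.

    X: (n_electrodes, m) recordings; W: (n, n) unmixing matrix; drop: indices of
    components a person has judged to be artifacts (e.g. heartbeat, eye blinks).
    """
    S = W @ X                        # independent components, one row per component
    S[list(drop), :] = 0.0           # discard the flagged components
    return np.linalg.inv(W) @ S      # back to electrode space: x = A s with A = W^{-1}

# Synthetic check with a known mixing matrix (all values hypothetical).
rng = np.random.default_rng(0)
S_true = rng.laplace(size=(3, 1000))      # component 2 plays the role of an artifact
A = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.4],
              [0.2, 0.1, 1.0]])
X = A @ S_true
X_clean = remove_components(X, np.linalg.inv(A), drop=[2])
print(np.allclose(X_clean, A[:, :2] @ S_true[:2]))
```

In practice W comes from running ICA on the recordings, so the cleaned signal is only as good as that estimate and the person's choice of which components to drop.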
There's been a lot of research on
taking an EEG reading and trying to guess, at a high level, what you're thinking.
For example, you can train a
supervised learning algorithm
to try to decide: are you thinking of a noun or a verb? Are you thinking of
something edible, or are you thinking of
something inedible?
There's been very interesting research
trying to use an EEG to figure out, at a very coarse level,
not quite reading every thought you're thinking,
but
categorizing very coarse-level thoughts:
are you thinking of a person,
or are you thinking of an object?
And you can actually do that to some extent using EEG readings.
Cleaning up the data to get rid of the eye-blink and
heartbeat artifacts is a very useful
preprocessing step to get cleaner data
to feed into the learning algorithm
that tries to categorize
some coarse category of what you're thinking.
Okay. And then, more research here:
I mentioned the Google cat work just now.
It turns out that if you train ICA...
oh, the font is messed up.
If you train ICA on natural images,
ICA will say that the independent components of natural images are these edges.
That is,
when you look
somewhere in the world and look at just a tiny little piece of the image,
say 10 pixels by 10 pixels,
and you take that data and model it with ICA,
ICA will say that
the world is made up of edges, or of patches like these, and that
the way you end up with images in the world is by each of these patches
independently saying: is there a vertical edge?
Is there a horizontal edge?
Is there this type of
light on the left, dark on the right?
Is there this type of lighter on top,
darker on the bottom? And so on.
And just by adding up all of these voices, you get a typical image patch of the world.
So there are interesting theories in neuroscience about whether this is how
the human brain learns to see as well.
So there's been work on
ICA and sparse coding that tries to use these mechanisms to explain how
the human brain
learns to perceive images, for example.
Okay? So, all right.
That's it for
the algorithms of ICA;
just some final comments.
I think on Monday someone asked,
"Do the number of speakers and the number of microphones need to be equal?"
It turns out that if the number of
microphones is larger than the number of speakers,
that's actually fine.
If you run ICA, or a slightly modified version of it,
you'll find that some of the speakers are just silent speakers.
So, for example,
if you have 10 microphones and five speakers
and you run this algorithm on the 10 microphones, you may find that
five of the sources are just silent; or there are
ways to just not model those five sources,
if you think they're just sources of silence.
So a slightly modified version of this works quite well
if the number of microphones is larger than the number of speakers.
If the number of microphones is smaller than the number of speakers,
then that's still
very much a cutting-edge research problem.
So, for example,
if you have two speakers and one microphone,
it turns out that if you have one male and one female speaker,
so one relatively higher-pitched voice and one much lower-pitched voice,
then there are some algorithms that can sometimes
separate out the two voices with one microphone.
But it doesn't work that reliably;
it's a little bit finicky, but there have been
research papers published showing that, you know,
you can make a reasonable attempt at separating out
two voices with one microphone if
the pitches are quite different, such as one male and one female voice.
But separating out two male voices or two female voices is still very hard,
and there's ongoing research in those settings.
Right? So that's ICA,
and you get to play with it more in your
homework problem as well.
Okay? Any last questions about ICA?
[inaudible]
Oh sorry, say it again?
[inaudible]
Wait, sorry, is it because a-
So I'm just wondering why it is that hard. [inaudible]
Oh, yeah. So
I think if you go through the math, it
just breaks down.
You can have two independent sources,
but W is no longer a square matrix, right?
Remember the model:
x is equal to A s, right?
And so if
x is a real number and s is two-dimensional,
then A would be 1 by 2,
s would be 2 by 1,
and x is 1 by 1. So,
you know, A inverse doesn't exist, right?
So you would need to come up with a different way to form the maximum likelihood model.
And when you have one microphone,
the question is just how do you separate out two overlapping voices, right?
It takes much higher-level knowledge
to separate out two voices.
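A minimal NumPy sketch of that shape argument, assuming for illustration that we already knew the mixing matrix A (in real ICA we don't, and the numbers here are made up): a 1 by 2 A has no inverse, and its pseudo-inverse cannot undo the mixing, whereas a tall 10 by 5 A with full column rank can be undone by its pseudo-inverse.

```python
import numpy as np

# One microphone, two speakers: A is 1 x 2 (made-up mixing weights).
A_wide = np.array([[0.6, 0.4]])

# A non-square matrix has no ordinary inverse; NumPy raises LinAlgError.
try:
    np.linalg.inv(A_wide)
    wide_invertible = True
except np.linalg.LinAlgError:
    wide_invertible = False

# The pseudo-inverse exists (shape 2 x 1), but W @ A is only rank 1,
# so it cannot be the 2 x 2 identity: both sources can't be recovered.
W_wide = np.linalg.pinv(A_wide)
recovers_wide = np.allclose(W_wide @ A_wide, np.eye(2))

# Ten microphones, five speakers: A is 10 x 5. A random Gaussian matrix
# has full column rank (with probability 1), so pinv(A) @ A = I_5.
rng = np.random.default_rng(0)
A_tall = rng.normal(size=(10, 5))
W_tall = np.linalg.pinv(A_tall)  # shape 5 x 10
recovers_tall = np.allclose(W_tall @ A_tall, np.eye(5))

print(wide_invertible, recovers_wide, recovers_tall)  # False False True
```

The more-microphones case is the easy one precisely because a left inverse exists; with fewer microphones than speakers, no linear W can separate the sources.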
Does this make sense? Go ahead.
[inaudible]
Oh I see, right, let's see.
So if you don't know how many speakers there are:
you have all these microphones, or
the number of electrodes you have is fixed,
so that's just your data set.
And it turns out that
if you run ICA with a large number of speakers,
you find that many of the speakers are silent.
There are also some versions of ICA where, if you think that there is
a relatively small number of speakers,
you don't need to explicitly model all the speakers.
Instead, what you would model, so again,
suppose it's a maximum likelihood estimation problem.
Let's say that x is in R10, right?
So you have 10 recordings,
but you suspect that you only have five speakers.
Then in this case,
the matrix A would be, what is it,
10 by 5, right?
To mix the 5 sources into 10 microphone recordings.
And you could
form the maximum likelihood estimation problem assuming the existence of
only five speakers, without modeling
a lot of speakers and then finding later that they're all silent.
Does that make sense? So if you
parameterize the model like this, using A instead of W,
then you can form the maximum likelihood estimation problem
where you just assume that there are
five speakers and x is generated by
those five speakers mixing through a linear map, plus noise.
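A sketch of that generative model under stated assumptions: five Laplace-distributed sources (a common heavy-tailed, non-Gaussian choice, not necessarily the source density used in the homework), a random 10 by 5 mixing matrix, and small Gaussian noise with an assumed standard deviation. The noiseless part A s only spans a five-dimensional subspace of R10.

```python
import numpy as np

rng = np.random.default_rng(0)
n_mics, n_speakers, n_samples = 10, 5, 1000
noise_std = 0.01  # assumed noise level

# Each row of S is one speaker; each column is one time step.
S = rng.laplace(size=(n_speakers, n_samples))
A = rng.normal(size=(n_mics, n_speakers))  # 10 x 5 mixing matrix
X = A @ S + noise_std * rng.normal(size=(n_mics, n_samples))

print(A.shape, S.shape, X.shape)      # (10, 5) (5, 1000) (10, 1000)
print(np.linalg.matrix_rank(A @ S))   # 5: only five independent directions
```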
But if you don't know how many speakers you
have, or even which speakers you're working with,
how would you know whether you have enough microphones?
Oh I see, sure, right.
How do you know how many speakers you have?
So I think it's one of those things that's a little bit like k-means,
I guess, where you try it and see what works.
And if you find that
the first few, you know,
speakers capture most of the variance,
and the additional speakers are quite silent and quite small,
you could just cut it off at that point.
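One way to make "cut it off" concrete, offered as a hypothetical heuristic rather than a standard part of ICA: sort the recovered sources by variance and keep just enough of them to cover, say, 99% of the total. The function name `count_active_sources` and the 0.99 threshold are illustrative choices, and since ICA only recovers sources up to scale, this assumes the recovered variances still reflect how loud each source is.

```python
import numpy as np


def count_active_sources(S_hat, keep_fraction=0.99):
    """Hypothetical heuristic: how many recovered sources (rows of S_hat) are
    needed to cover `keep_fraction` of the total variance. The remaining,
    much quieter sources are treated as silent."""
    variances = np.sort(S_hat.var(axis=1))[::-1]  # loudest first
    cumulative = np.cumsum(variances) / variances.sum()
    k = int(np.searchsorted(cumulative, keep_fraction)) + 1
    return min(k, len(variances))


# Made-up example: five loud sources and five near-silent ones.
rng = np.random.default_rng(0)
loud = rng.laplace(size=(5, 2000))
quiet = 1e-3 * rng.laplace(size=(5, 2000))
S_hat = np.vstack([loud, quiet])
print(count_active_sources(S_hat))  # 5
```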
I don't want to go too much into ICA algorithms for
different numbers of speakers and microphones,
but let me just take a couple of last questions
and move on. You have a question? Yeah.
Do you ever put a prior on W?
Say it again?
Do you ever put a prior on W?
Oh, do you ever put a prior on W?
I'm sure you can.
It's not usually done in this version of the algorithm,
but I would not be surprised if there are some other versions where you do.
I've not seen that a lot myself, actually.
All right, cool.
Um, let's see.
All right, good, we're far enough along.
Okay, good. Um, so-
All right. So that wraps up
our chapter on unsupervised learning.
So you learned about, I
guess, k-means clustering,
the EM algorithm for mixture of Gaussians,
or really the mixture of Gaussians model,
the factor analysis model, and also PCA.
And then, you know,
today, the ICA or independent components analysis algorithm.
And all of these are algorithms that take as input an unlabeled training set,
just the xi's and no labels,
and find various interesting structures in the data, such as
clusters or subspaces, or in the case of ICA,
the voices of the independent speakers.
And you'll implement ICA and play with it yourself in the homework problem,
where you get to separate out five overlapping voices.
Now for the last of the four major topics I want to cover in this class.
We've talked about supervised learning,
advice for applying machine learning, and unsupervised learning,
and the fourth and final major topic we'll cover in this course will be
reinforcement learning.
Okay. So to motivate reinforcement learning,
let's say you want to have a computer
learn to fly a helicopter, right?
I think I showed you some of the videos in the first lecture,
so I'll skip that here.
But it turns out that
if at every point in time you are given the position of a helicopter,
called the state of the helicopter,
and you're asked to take an action on how to move the control sticks
to make the helicopter fly a certain trajectory,
it's very difficult to know what's
the one right answer for how to move the control sticks of a helicopter.
Right. So if you don't have a mapping from X to Y, because
you can't quite specify the one true way to fly a helicopter,
it's hard to use supervised learning for that, right?
What reinforcement learning offers is
an algorithm that doesn't ask you to tell it the right answer at every step;
it doesn't ask you to tell it exactly what's
the one true way to move the controls of a helicopter at any moment in time.
Instead, your responsibility as a designer,
or machine learning engineer or AI engineer, is to
specify a reward function that just tells
the helicopter when it's flying well and when it's flying poorly.
So your job as a designer is to write the cost function, or
reward function, that gives the helicopter a high reward whenever it's doing well,
flying accurately, flying the trajectory you want it to,
and gives the helicopter a large negative reward
whenever it crashes or does something bad, right?
I think of it as like training a dog, right?
When do you say good dog, and when do you say bad dog?
And the dog figures out how to do more of the good-dog things.
Your job is not to tell the dog what to do;
you can't actually talk to the dog
and tell it what to do. I guess that doesn't work.
But you can tell it good dog and bad dog,
and hopefully it learns from those positive and negative rewards
how to do more of the good things.
Okay. Another example.
Let's say you want to write a program to play chess or, I guess most famously,
and arguably somewhat overhyped,
Go, as in AlphaGo, right?
It's very difficult to know,
given a certain chess, checkers, or Go board position,
what is the one true move,
what's the one best move.
So it's very difficult to formulate
playing chess
as a supervised learning problem.
Instead, the mechanisms used to play
chess are much more like reinforcement learning,
where you can
let your program play chess or Go or whatever,
and whenever it wins you go, "Oh, good computer,"
and when it loses you go, "Oh, bad computer."
So that's a reward function.
And the learning algorithm's job is to figure out by
itself how to get more of the positive rewards, right?
And actually a common reward for learning to play
chess or checkers or Othello or Go is
a reward of plus 1 for a win,
minus 1 for a loss, and 0 for a tie, right?
So if you write a chess-playing program, that's a common choice of reward,
written R(s), where R is the reward function and s is the state.
Okay. And I'll go into the notation in a little bit.
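In code, that terminal reward is just a lookup. The string encoding of game outcomes below is a hypothetical choice for illustration.

```python
def chess_reward(outcome):
    """Terminal reward for a finished game: +1 win, -1 loss, 0 tie."""
    rewards = {"win": 1.0, "loss": -1.0, "tie": 0.0}  # hypothetical encoding
    return rewards[outcome]


print(chess_reward("win"), chess_reward("loss"), chess_reward("tie"))  # 1.0 -1.0 0.0
```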
And so as you can imagine,
given only this type of information, say for a chess-playing program,
it places much more burden on the program to figure out what to do.
Right. In fact, one of the challenges of reinforcement learning is-
well, first: this is called a reward,
and that's called the state.
And the state means
the status of the chessboard:
where are the pieces on the chessboard?
Or the status of the helicopter:
where exactly is the helicopter,
is it right-side up or upside down,
and where is it, right?
And it turns out one of the challenges,
one of the things that makes
reinforcement learning hard, is
the credit assignment problem.
That means that if
your program is playing a game of chess,
and let's say it loses on move 50,
you know, so it plays a game,
and then on move 50 it's checkmated and loses to its opponent,
so it gets a reward of negative 1,
how can the program actually figure out
what it did well and what it did poorly, right?
If you lose a game on move 50,
it might be that the program made a really bad move,
a blunder, at move 20,
and then, you know,
it just took another 30 moves before its fate was sealed, right?
So in a game of chess, if you make a bad mistake early on,
it can still take many, many moves before
the final outcome of
winning or losing is reached.
Or, to take another example,
if you are trying to build a self-driving car
and the car ever crashes, right,
chances are the thing the car was doing right before it crashed was braking,
but it's not braking that caused the crash.
It's probably something else, many
seconds earlier, that led to the bad outcome.
So there's a bad outcome.
Of all the things the algorithm did before,
how does it know what it did well,
what it should do more of, and what it did poorly,
what it should do less of?
And conversely, if there's a good outcome,
you know, like it wins a game of chess,
well, how do you know what you did well, right?
So that's called the credit assignment problem,
which is: when your algorithm gets some reward,
how do you actually figure out what you did well and what you did poorly,
so you know what to do more of and what to do less of, right?
So as we develop reinforcement learning algorithms,
we'll see that the algorithms we use have to, at least indirectly,
try to solve the credit assignment problem.
Okay. So
reinforcement learning problems like playing chess or flying helicopters or,
you know, building these various robots are modeled using the
MDP, or Markov decision process, formalism.
This is a notation and a formalism for modeling how the world works,
and then reinforcement learning algorithms solve problems posed using this formalism.
So what's an MDP?
An MDP is a five-tuple, (S, A, {P_sa}, gamma, R).
Let me explain what each of these is.
S is the set of states.
So for example,
in chess this would be the set of all possible chess positions, or in
flying a helicopter,
this would be the set of all the possible positions,
orientations, and velocities of your helicopter.
A is the set of actions, where
for the helicopter this would be all the positions you could move
your control sticks to, or in chess this would be all the moves you can make, you know, in a
game of chess.
P subscript sa are the state transition probabilities, and so
we'll see later that these state transition probabilities tell you,
if you take a certain action a in a certain state s,
what is the chance of you ending up at a particular different state s prime?
Gamma is
a discount factor, a number between 0 and 1.
Don't worry about this for now;
we'll come back to it in a minute.
And R is the all-important reward function.
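One way to hold the five-tuple in code, as a minimal sketch: the class and field names are assumptions, P_sa is represented as a function returning a dictionary of next-state probabilities, and the check on gamma only enforces the "between 0 and 1" description.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Hashable


@dataclass(frozen=True)
class MDP:
    states: FrozenSet[Hashable]   # S
    actions: FrozenSet[Hashable]  # A
    # P_sa as a function: (s, a) -> {s': probability of landing in s'}
    transitions: Callable[[Hashable, Hashable], Dict[Hashable, float]]
    gamma: float                  # discount factor
    reward: Callable[[Hashable], float]  # R(s)

    def __post_init__(self):
        # The lecture describes gamma as a number between 0 and 1.
        if not 0.0 <= self.gamma <= 1.0:
            raise ValueError("gamma must lie between 0 and 1")


# Toy two-state example (hypothetical): "move" flips the state, "stay" keeps it.
toy = MDP(
    states=frozenset({"a", "b"}),
    actions=frozenset({"stay", "move"}),
    transitions=lambda s, a: {s: 1.0} if a == "stay" else {("b" if s == "a" else "a"): 1.0},
    gamma=0.99,
    reward=lambda s: 1.0 if s == "b" else 0.0,
)

try:
    MDP(toy.states, toy.actions, toy.transitions, 1.5, toy.reward)
    bad_gamma_rejected = False
except ValueError:
    bad_gamma_rejected = True

print(toy.transitions("a", "move"), bad_gamma_rejected)  # {'b': 1.0} True
```

Representing P_sa as a function rather than a big table keeps small examples compact; for large state spaces you would typically store it more efficiently.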
Okay, so
in order to develop a reinforcement learning algorithm,
I'm going to use, as a running example,
a simplified MDP that we can draw on the whiteboard.
Right, so helicopters and chess and Go and so on are really complicated MDPs.
So just to illustrate the algorithms,
I want to use a simpler MDP, and this is
an example drawn from the Russell and Norvig textbook.
I'm going to use a
simplified MDP in which you have a robot navigating this simple maze,
and there's an obstacle.
So this is a grid world, right?
So a robot, you know, an R2-D2-like robot,
is navigating this very simple maze,
and this is a pillar, or a wall,
so you can't walk into that wall.
And let me just index the states as follows,
by (column, row), with 1, 1 in the bottom-left corner.
So for this MDP, let's go through the five-tuple and talk about what
each of the five things is.
This MDP has 11 states,
corresponding to the 11 possible positions that the robot could be in,
right, each of these blank squares.
So there are 11 possible states,
and the actions
are North, South, East, and West, right?
You can command your robot to move in any of these directions.
And I don't know if you've worked with robots before, but you may know that
when you command a robot
to head straight,
it doesn't always go exactly straight.
Sometimes the wheel slips and it veers off at a slight angle.
And so, just to simplify the example,
we're going to model it as follows:
if you command the robot to go North from a certain state,
there's a 0.8 chance that it successfully goes the way you told it
to, and a 0.1 chance each that it will
accidentally veer off to the left or accidentally veer off to the right, okay?
If you're working on real robots,
it is actually important to model the noisy dynamics of a robot wheel slipping slightly,
or the orientation being slightly off.
Now, in a real robot,
you have a much bigger state space than 11 states,
right, so this is simplified.
This is not a realistic model for how
robots actually slip, but because it uses such a small state space,
just for illustration purposes,
we'll use this.
And so, for example,
the state transition probabilities would specify these.
You'd say that if you're in the state 3, 1,
so this state here,
and you command it to go North,
then the chance of getting to the state 3, 2 is
0.8, the chance
of getting to the state 4, 1 is 0.1,
the chance of getting to 2, 1 is 0.1,
and the chance of getting to any other state, like 3, 3, is 0, okay?
So the state transition probabilities capture that
if you're here and decide to go North,
there is a 0.8 chance you end up here,
a 0.1 chance you end up here,
a 0.1 chance you end up here, and, you know,
a 0 chance of, right, hopping two steps.
Okay. And again, just to simplify the MDP example,
we'll assume that if the robot, you know,
hits a wall, it just bounces off the wall and stays where it is.
So if you tell it to go East and it runs into a wall,
it just bounces off the wall and stays exactly where it is.
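The grid-world dynamics just described can be written down directly. This sketch assumes (column, row) coordinates with 1, 1 at the bottom left and, as in the Russell and Norvig version of this example, the wall at 2, 2; the lecture doesn't state the wall's position, so that part is an assumption. Bumping into the wall or the edge of the grid leaves the robot where it is.

```python
# Grid world: columns x = 1..4, rows y = 1..3, wall assumed at (2, 2).
WALL = (2, 2)
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != WALL]
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
# The two directions perpendicular to each commanded direction.
PERPENDICULAR = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}


def step(state, direction):
    """Deterministic move; bumping into the wall or grid edge leaves you in place."""
    x, y = state
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    return nxt if nxt in STATES else state


def transition_probs(state, action):
    """P_sa(s'): 0.8 in the commanded direction, 0.1 for each perpendicular slip."""
    left, right = PERPENDICULAR[action]
    probs = {}
    for direction, p in [(action, 0.8), (left, 0.1), (right, 0.1)]:
        nxt = step(state, direction)
        probs[nxt] = probs.get(nxt, 0.0) + p  # merge slips that land in the same cell
    return probs


print(len(STATES))                    # 11
print(transition_probs((3, 1), "N"))  # {(3, 2): 0.8, (2, 1): 0.1, (4, 1): 0.1}
```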
Now, let's specify the reward function;
we'll come back to the discount factor later.
Let's say you want the robot to navigate
to this cell in the upper right-hand corner,
and so to incentivize the robot to get to this square,
you know, that's the prize, or the goal,
let's put a plus 1 reward there. And
let's say you really don't want the robot to go to this cell;
you could put a negative 1 reward there.
All right. So the way you specify
the task for a robot is by designing the reward function.
So in our example,
well, let me just copy that again, plus 1, minus 1,
we have that the reward at the cell 4, 3 is plus 1,
and the reward at the cell 4, 2 is negative 1.
And then, you know,
if you want the robot to get to the plus 1 reward cell as quickly as possible,
then, again, there are many ways of designing reward functions,
but one common choice would be to
put a negative penalty,
a very small negative penalty,
right, such as setting the reward to negative 0.02 for all other states.
And the effect of a small negative reward like this is to charge it
for every step it spends loitering around,
a little bit for using up electricity and wandering around,
because this incentivizes the robot to hurry up and get to the plus 1 reward.
So you give a small penalty, you know,
for loitering and wasting electricity.
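The reward function for this example, as a small sketch using (column, row) coordinates: plus 1 at 4, 3, minus 1 at 4, 2, and the -0.02 step penalty everywhere else.

```python
def reward(state):
    """R(s): +1 at (4, 3), -1 at (4, 2), and a small -0.02 step penalty elsewhere."""
    if state == (4, 3):
        return 1.0
    if state == (4, 2):
        return -1.0
    return -0.02


print(reward((4, 3)), reward((4, 2)), reward((1, 1)))  # 1.0 -1.0 -0.02
```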
So this is how an MDP works.
Your robot wakes up at some state s_0
at time 0, you know,
as you turn on the robot and the robot says,
"Oh, I'm in that state."
And based on what state it is in,
it will get to choose some action
a_0, so it decides whether to go North,
South, East, or West.
Based on the action,
the consequence of the choice is that it will get to some state s_1,
the state at the next time step,
which is distributed according to
the state transition probabilities governed by
the previous state and the action it chose, right?
So depending on what action it chooses,
there are different chances of moving North, South, East, or West.
Now that it's in s_1,
it then has to choose a new action a_1,
and as a consequence of the action a_1,
it will get to some new state s_2,
which is governed by
the state transition probabilities P_{s_1 a_1}, and so on, okay?
And the robot just keeps on running.
And so the robot will go through a sequence of states s_0,
s_1, s_2, and so on,
depending on the actions it chooses.
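To make the sequence s_0, a_0, s_1, a_1, ... concrete, here is a minimal simulation sketch. It repeats the assumed grid layout (wall at 2, 2) so it runs on its own, and the action-choosing rule (uniformly random here) is only a placeholder, not anything the lecture prescribes; each next state is sampled from P_{s_t a_t} with the 0.8 / 0.1 / 0.1 slip model.

```python
import random

WALL = (2, 2)  # assumed wall position
STATES = {(x, y) for x in range(1, 5) for y in range(1, 4)} - {WALL}
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
SLIPS = {"N": "WE", "S": "EW", "E": "NS", "W": "SN"}  # perpendicular slips


def step(state, direction):
    nxt = (state[0] + MOVES[direction][0], state[1] + MOVES[direction][1])
    return nxt if nxt in STATES else state  # bounce off walls and grid edges


def sample_next_state(state, action, rng):
    # Draw s_{t+1} ~ P_{s_t a_t}: commanded direction w.p. 0.8, each slip w.p. 0.1.
    direction = rng.choices([action, *SLIPS[action]], weights=[0.8, 0.1, 0.1])[0]
    return step(state, direction)


def rollout(s0, choose_action, num_steps, seed=0):
    """Return the visited states s_0..s_T and the chosen actions a_0..a_{T-1}."""
    rng = random.Random(seed)
    states, actions = [s0], []
    for _ in range(num_steps):
        a = choose_action(states[-1], rng)
        actions.append(a)
        states.append(sample_next_state(states[-1], a, rng))
    return states, actions


# Placeholder behavior: pick North, South, East, or West uniformly at random.
states, actions = rollout((1, 1), lambda s, rng: rng.choice("NSEW"), num_steps=5)
print(states)
print(actions)
```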
And the total payoff
is written as follows, with one more detail,
which is that term gamma:
total payoff = R(s_0) + gamma R(s_1) + gamma^2 R(s_2) + ...
So think of gamma as a number like 0.99.
Gamma is usually chosen to be just slightly less than one,
and so the total payoff
is the sum of rewards, or more technically a sum of discounted rewards.
What this does is add up all the rewards that the robot receives over time,
but the further a reward is into the future,
the higher the power of gamma it is multiplied by, and so the less it counts.
Okay. So any reward you get at time 0,
you get all of it.
A reward you get at time 1 is multiplied by 0.99,
and rewards at the following steps are multiplied by 0.99 squared, 0.99 cubed, and so on.
And so what the discount factor does
is give a smaller weight to rewards in the distant future,
and this encourages the robot to get the positive rewards faster,
or postpone the negative rewards, right?
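The total payoff R(s_0) + gamma R(s_1) + gamma^2 R(s_2) + ... is a one-line sum. A sketch with made-up reward sequences, showing that reaching the plus 1 sooner gives a larger payoff:

```python
def discounted_payoff(rewards, gamma=0.99):
    """R(s_0) + gamma * R(s_1) + gamma^2 * R(s_2) + ... for a finite reward list."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


fast = discounted_payoff([-0.02, -0.02, 1.0])        # +1 reached at time 2
slow = discounted_payoff([-0.02] * 10 + [1.0])       # +1 reached at time 10
print(round(fast, 4))  # 0.9403
print(slow < fast)     # True
```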
And so in financial applications,
the discount factor has a natural interpretation
as the time value of money,
because if you have a dollar today,
you know, you're better off having
a dollar today than a dollar a year from now.
Right? Because you can put the dollar in the bank and earn interest
for a year on it, and so a dollar
today is strictly better than a dollar in the future.
And conversely,
having to pay a dollar a year from now is also
better than having to pay a dollar today, right?
Because you could, you know,
save your money and earn interest and then issue
a payment to someone else a year from now rather than now,
and then you're actually slightly wealthier. And so
gamma in financial applications has the interpretation of the time value of money,
or of the interest rate, I guess.
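The time-value-of-money reading can be made precise with the standard present-value formula: at an interest rate r per period, a dollar one period from now is worth 1 / (1 + r) dollars today, so gamma = 1 / (1 + r). The 5% rate below is an assumed example.

```python
def discount_from_interest(rate):
    """A dollar one period from now is worth 1 / (1 + rate) dollars today."""
    return 1.0 / (1.0 + rate)


gamma = discount_from_interest(0.05)   # assumed 5% annual interest
print(round(gamma, 4))                 # 0.9524
# Present value today of $100 received (or paid) three years from now:
print(round(100 * gamma ** 3, 2))      # 86.38
```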
But more generally, even for non-financial applications
(there are some financial applications of reinforcement learning,
but there are lots of non-financial applications as well),
this mechanism of using a discount factor has the effect of
encouraging the system to get to the positive rewards as quickly as possible,
but also, conversely, to try to push
the negative rewards as far into the future as possible, right?
And I think, to be pragmatic,
there are two reasons why people use gamma.
The story I just told, time value of money,
get the positive rewards early, postpone the negative rewards:
that's the story you tend to
hear people give for why we have a discount factor.
The other reason we have the discount factor is
actually a much more pragmatic one, which is that
all the reinforcement learning algorithms you'll see
converge much faster, or work
much better, if you're willing to have a discount factor.
Whereas it turns out that if gamma is
equal to 1, if
gamma is not strictly less than 1,
there are many reinforcement learning algorithms that may not converge,
or it's much harder to prove that they converge.
So just as a pragmatic thing,
this makes the job much easier for your algorithm.
So I see some of you shaking your heads in disapproval. [LAUGHTER]
[inaudible]
Yeah.
Yes, yeah, that's a good point.
So one of the things is that if there's no gamma,
the sum of the rewards
can increase or decrease without bound.
So having gamma guarantees, as long as the rewards themselves are bounded,
that the total payoff is a finite, bounded value.
So that's one of the pieces that go into
the proofs, or some of the reasoning behind,
why reinforcement learning algorithms converge.
So, cool, that's a good insight.
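The boundedness argument is a geometric series: if every reward satisfies |R(s)| <= R_max, then the total payoff is at most R_max (1 + gamma + gamma^2 + ...) = R_max / (1 - gamma) in absolute value. A quick numerical check, with gamma = 0.99 and R_max = 1 chosen for illustration:

```python
def payoff_bound(r_max, gamma):
    """Geometric-series bound: |sum_t gamma^t R(s_t)| <= r_max / (1 - gamma)."""
    return r_max / (1.0 - gamma)


gamma, r_max = 0.99, 1.0
bound = payoff_bound(r_max, gamma)  # 100, up to floating-point rounding
# Collecting the maximum reward at every one of 10,000 steps still stays under the bound.
partial = sum(gamma ** t * r_max for t in range(10_000))
print(bound, partial)
```

With gamma = 1 the same sum would just be 10,000 and keeps growing with the horizon, which is the unboundedness described above.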
Okay. So the goal of reinforcement learning is to
choose actions over time
to maximize the expected total payoff.
Okay. And in particular,
what most reinforcement learning algorithms will come up with
is a policy that maps from states to actions.
Right? So the output of most reinforcement learning algorithms will be a policy, or
controller. In the RL world we
tend to use the term policy, but policy just means a controller
that maps states to actions.
So it turns out that
for the MDP that we have, right,
this is the optimal policy.
So for example,
if you take this cell here,
this policy is saying pi applied to the state 3, 1
is equal to West.
I worked out separately
what the optimal policy is,
and this turns out to be the optimal policy in the sense that
if you execute this policy
(and to execute the policy means that whenever you're in the state s,
you take the action given by pi of s;
that's what it means to execute a certain policy),
then, and I worked this out separately, offline,
on my laptop,
this is the optimal policy for this MDP.
It turns out that if you execute this policy,
meaning whenever you're in a certain state, you know,
you take the action indicated by the arrow,
then this is the policy that will maximize the expected total payoff.
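The lecture doesn't say how the policy was worked out offline. One standard way, which the course covers later, is value iteration: repeat V(s) := R(s) + gamma max_a sum_{s'} P_sa(s') V(s') until it converges, then pick the maximizing action in each state. This sketch assumes gamma = 0.99, the (column, row) layout with the wall at 2, 2, and that 4, 3 and 4, 2 are terminal states where the episode ends, as in the Russell and Norvig version of this example; under those assumptions it recovers West at 3, 1.

```python
# Value iteration sketch for the 4 x 3 grid world (assumptions in the lead-in).
WALL, GOAL, PIT = (2, 2), (4, 3), (4, 2)
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != WALL]
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
SLIPS = {"N": "WE", "S": "EW", "E": "NS", "W": "SN"}  # perpendicular slips
GAMMA = 0.99


def reward(s):
    return 1.0 if s == GOAL else -1.0 if s == PIT else -0.02


def step(s, d):
    nxt = (s[0] + MOVES[d][0], s[1] + MOVES[d][1])
    return nxt if nxt in STATES else s  # bounce off walls and grid edges


def transition_probs(s, a):
    probs = {}
    for d, p in [(a, 0.8), (SLIPS[a][0], 0.1), (SLIPS[a][1], 0.1)]:
        nxt = step(s, d)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs


def expected_next_value(V, s, a):
    """sum over s' of P_sa(s') * V(s')."""
    return sum(p * V[nxt] for nxt, p in transition_probs(s, a).items())


def value_iteration(tol=1e-10):
    V = {s: 0.0 for s in STATES}
    while True:
        new_V = {}
        for s in STATES:
            if s in (GOAL, PIT):  # assumed terminal: no reward after this
                new_V[s] = reward(s)
            else:
                new_V[s] = reward(s) + GAMMA * max(
                    expected_next_value(V, s, a) for a in MOVES)
        if max(abs(new_V[s] - V[s]) for s in STATES) < tol:
            return new_V
        V = new_V


V = value_iteration()
# Executing the policy means: in state s, take the action policy[s].
policy = {s: max(MOVES, key=lambda a: expected_next_value(V, s, a))
          for s in STATES if s not in (GOAL, PIT)}
print(policy[(3, 1)])  # W
```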
Okay. And the problem in reinforcement learning is:
given a problem, pose
the problem as an MDP,
that is, figure out what the set of states is,
what the set of actions is,
what the state transition probabilities are,
and specify the discount factor and the reward function.
And then have a reinforcement learning algorithm
find the policy pi that maximizes the expected payoff.
And then when you want your robot to act, or when you want your chess-playing program to act,
whenever you're in some state s,
take the action given by pi of s,
and hopefully this will result in a robot that,
you know, efficiently navigates to the plus 1 state.
Okay? So it turns out that MDPs are quite good at making fine distinctions.
So one example:
it's actually not totally obvious whether here
you're better off going North or going West.
Right? And it turns out that there is a trade-off.
If you go West here, then, you know,
you're going to take a longer route to get to the plus 1.
So you take longer,
the plus 1 is discounted more heavily,
and you're paying these small penalties along the way.
But on the flip side,
if you were to try to go North,
you could try to get there faster,
but on this step
there's a 0.1 chance that you accidentally slip off to the minus 1 state.
So what is the optimal action?
Right? It's actually quite hard to just look at it with your eyes and make a decision.
But it turns out that if you solve for the optimal set of actions in
this MDP, in this example the answer is to take the longer and safer route. Question?
[inaudible]
Does this allow policies that go in cycles?
So if the optimal set of actions is to cycle around,
then it should find that.
I mean, for example, if there were only penalties
everywhere and the best thing to do was just run around in a circle,
you know, then the algorithm would actually choose to do that.
But in this case,
you want to get to the plus 1 as quickly as possible.
Right? And so what we'll see...
there's one more question. Go ahead.
[inaudible]
So chess and checkers and Go and so on: one
complication there is that after you take a move, your opponent moves too.
So actually, all right,
to refine the description for chess:
what happens in playing chess is that the state is the status of
your board, right? So it's your move.
You see a board; that's a state,
and so you make a move, and then the opponent makes a move, and then that's the new state.
So the next state is when you and your opponent
have both taken turns and it comes back to you.
Right? And because you don't know exactly what your opponent will do,
there is a probability distribution over
what the other person is going to do when I make a move.
I guess one last question.
Yeah. Go ahead.
[inaudible]
Yeah, right. The probabilities assigned per node,
the 0.8, 0.1, 0.1: where do they come from?
So we'll talk about that later,
but in some applications you learn them.
So if you build a robot,
you might not know whether it's 0.8, 0.1, 0.1 or, you know, 0.7, 0.15, 0.15.
So it's quite common to use data to learn those state transition probabilities as well.
We'll see a specific example of that.
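When the slip probabilities aren't known, a natural estimate is to count: P_sa(s') is the number of times taking action a in state s led to s', divided by the number of times a was taken in s. A minimal sketch with made-up observations of the robot in state 3, 1 being commanded North:

```python
from collections import Counter, defaultdict


def estimate_transitions(experience):
    """Count-based estimate: P_sa(s') = #(s, a -> s') / #(s, a).

    `experience` is a list of (state, action, next_state) triples observed
    while running the robot."""
    counts = defaultdict(Counter)
    for s, a, s_next in experience:
        counts[(s, a)][s_next] += 1
    return {
        sa: {s_next: n / sum(c.values()) for s_next, n in c.items()}
        for sa, c in counts.items()
    }


# Made-up data: 10 attempts to go North from (3, 1).
data = [((3, 1), "N", (3, 2))] * 7 + [((3, 1), "N", (2, 1))] * 2 + [((3, 1), "N", (4, 1))]
P_hat = estimate_transitions(data)
print(P_hat[((3, 1), "N")])  # {(3, 2): 0.7, (2, 1): 0.2, (4, 1): 0.1}
```

With only ten samples the estimate is noisy; more data from running the robot would move it toward the true slip probabilities.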
Okay. So all right.
So where we are, just to summarize:
this is how you formulate the problem as an MDP,
and then the job of the reinforcement learning algorithm is to go from the MDP
to telling you what a good policy is.
Okay. So let's break,
and have a good Thanksgiving, everyone.
I won't see you for like a week and a half, so
enjoy yourselves, and we'll reconvene after Thanksgiving with this.
