The following content is
provided under a Creative
Commons license.
Your support will help MIT
OpenCourseWare continue to offer
high quality educational
resources for free.
To make a donation or to view
additional materials from
hundreds of MIT courses,
visit MIT OpenCourseWare at
ocw.mit.edu.
Today we are going to see how
to use what we saw last time
about partial derivatives to
handle minimization or
maximization problems involving
functions of several variables.
Remember last time we said that
when we have a function,
say, of two variables, x and y,
then we have actually two
different derivatives,
partial f, partial x,
also called f sub x,
the derivative with respect to
x keeping y constant.
And we have partial f,
partial y, also called f sub y,
where we vary y and we keep x
as a constant.
And now, one thing I didn't
have time to tell you about but
hopefully you thought about in
recitation yesterday,
is the approximation formula
that tells you what happens if
you vary both x and y.
f sub x tells us what happens
if we change x a little bit,
by some small amount delta x.
f sub y tells us how f changes,
if you change y by a small
amount delta y.
If we do both at the same time
then the two effects will add up
with each other,
because you can imagine that
first you will change x and then
you will change y.
Or the other way around.
It doesn't really matter.
If we change x by a certain
amount delta x,
and if we change y by the
amount delta y,
and let's say that we have z=
f(x, y) then that changes by an
amount which is approximately f
sub x times delta x plus f sub y
times delta y.
And that is one of the most
important formulas about partial
derivatives.
The intuition for this,
again, is just the two effects
of if I change x by a small
amount and then I change y.
Well, first changing x will
modify f, how much does it
modify f?
The answer is the rate of change
is f sub x.
And if I change y then the rate
of change of f when I change y
is f sub y.
So all together I get this
change in the value of f.
And, of course,
that is only an approximation
formula.
Actually, there would be higher
order terms involving second and
third derivatives and so on.
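As a rough numerical check of the approximation formula, here is a minimal sketch. The function f(x, y) = x^2 y and the base point (1, 2) are assumptions made up for illustration; they are not from the lecture. The gap between the estimate and the true change should shrink roughly like the square of the step, which is the higher order term showing up.

```python
# Hypothetical example function (not from the lecture): f(x, y) = x^2 * y.
def f(x, y):
    return x**2 * y


def f_x(x, y):
    # Partial derivative with respect to x, keeping y constant.
    return 2 * x * y


def f_y(x, y):
    # Partial derivative with respect to y, keeping x constant.
    return x**2


def approx_change(x, y, dx, dy):
    # Approximation formula: delta z ~ f_x * delta x + f_y * delta y.
    return f_x(x, y) * dx + f_y(x, y) * dy


def actual_change(x, y, dx, dy):
    # True change in f when both x and y are varied.
    return f(x + dx, y + dy) - f(x, y)


if __name__ == "__main__":
    for step in (0.1, 0.01, 0.001):
        estimate = approx_change(1.0, 2.0, step, step)
        exact = actual_change(1.0, 2.0, step, step)
        print(step, estimate, exact, abs(exact - estimate))
```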
One way to justify this --
Sorry.
I was distracted by the
microphone.
OK.
How do we justify this formula?
Well, one way to think about it
is in terms of tangent plane
approximation.
Let's think about the tangent
plane with regard to a function
f.
We have some pictures to show
you.
It will be easier if I show you
pictures.
Remember, partial f,
partial x was obtained by
looking at the situation where y
is held constant.
That means I am slicing the
graph of f by a plane that is
parallel to the x,
z plane.
And when I change x,
z changes, and the slope of
that is going to be the
derivative with respect to x.
Now, if I do the same in the
other direction then I will have
similarly the slope in a slice
now parallel to the y,
z plane that will be partial f,
partial y.
In fact, in each case,
I have a line.
And that line is tangent to the
surface.
Now, if I have two lines
tangent to the surface,
well, then together they
determine for me the tangent
plane to the surface.
Let's try to see how that works.
We know that f sub x and f sub
y are the slopes of two tangent
lines to this plane,
two tangent lines to the graph.
And let's write down the
equations of these lines.
I am not going to write
parametric equations.
I am going to write them in
terms of x, y,
z coordinates.
Let's say that partial f,
partial x at the given point is
equal to a.
That means that we have a line
given by the following
conditions.
I am going to keep y constant
equal to y0.
And I am going to change x.
And, as I change x,
z will change at the rate that
is equal to a.
That would be z = z0 + a(x - x0).
That is how you would describe
the line, I guess,
the one that is plotted in
green here, the intersection with
the slice parallel to the x,
z plane.
I hold y constant equal to y0.
And z is a function of x that
varies with a rate of a.
And now if I look similarly at
the other slice,
let's say that the partial with
respect to y is equal to b,
then I get another line which
is obtained by the fact that z
now will depend on y.
And the rate of change with
respect to y will be b.
While x is held constant equal
to x0.
These two lines are both going
to be in the tangent plane to
the surface.
They are both tangent to the
graph of f and together they
determine the plane.
And that plane is just given by
the formula z = z0 + a(x - x0) +
b(y - y0).
If you look at what happens --
This is the equation of a plane.
z equals constant times x plus
constant times y plus constant.
And if you look at what happens
if I hold y constant and vary x,
I will get the first line.
If I hold x constant and vary
y, I get the second line.
Another way to do it,
of course,
would be to actually write
parametric equations of these
lines,
get vectors along them and then
take the cross-product to get
the normal vector to the plane.
And then get this equation for
the plane using the normal
vector.
That also works and it gives
you the same formula.
If you are curious, as an
exercise, do it again using
parametric equations and the
cross-product to get the plane
equation.
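For reference, here is one way the cross-product version of this computation could go. It is a sketch, with a and b the two slopes and (x0, y0, z0) the point on the graph as above; the vector names v1, v2 and N are introduced here only for illustration.

```latex
% Direction vectors along the two tangent lines:
%   y = y_0,\ z = z_0 + a(x - x_0)  and  x = x_0,\ z = z_0 + b(y - y_0).
\vec{v}_1 = \langle 1, 0, a \rangle, \qquad \vec{v}_2 = \langle 0, 1, b \rangle

% Normal vector to the tangent plane:
\vec{N} = \vec{v}_1 \times \vec{v}_2
        = \langle 0 \cdot b - a \cdot 1,\; a \cdot 0 - 1 \cdot b,\; 1 \cdot 1 - 0 \cdot 0 \rangle
        = \langle -a, -b, 1 \rangle

% Plane through (x_0, y_0, z_0) with normal vector N:
-a(x - x_0) - b(y - y_0) + (z - z_0) = 0
\quad\Longleftrightarrow\quad
z = z_0 + a(x - x_0) + b(y - y_0)
```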
That is how we get the tangent
plane.
And now what this approximation
formula here says is that,
in fact, the graph of a
function is close to the tangent
plane.
If we were moving on the
tangent plane,
this would be an actual
equality.
Delta z would be a linear
function of delta x and delta y.
And the graph of a function is
near the tangent plane,
but is not quite the same,
so it is only an approximation
for small delta x and small
delta y.
The approximation formula says
the graph of f is close to its
tangent plane.
And we can use that formula
over here now to estimate how
the value of f changes if I
change x and y at the same time.
Questions about that?
Now that we have caught up with
what we were supposed to see on
Tuesday, I can tell you now
about max and min problems.
That is going to be an
application of partial
derivatives to look at
optimization problems.
Maybe ten years from now,
when you have a real job,
your job might be to actually
minimize the cost of something
or maximize the profit of
something or whatever.
But typically the function that
you will have to strive to
minimize or maximize will depend
on several variables.
If you have a function of one
variable, you know that to find
its minimum or its maximum you
look at the derivative and set
that equal to zero.
And you try to then look at
what happens to the function.
Here it is going to be kind of
similar, except,
of course, we have several
derivatives.
For today we will think about a
function of two variables,
but it works exactly the same
if you have three variables,
ten variables,
a million variables.
The first observation is that
if we have a local minimum or a
local maximum then both partial
derivatives,
so partial f partial x and
partial f partial y,
are both zero at the same time.
Why is that?
Well, let's say that f sub x is
zero.
That means when I vary x to
first order the function doesn't
change.
Maybe that is because it is
going through...
If I look only at the slice
parallel to the x-axis then
maybe I am going through the
minimum.
But if partial f,
partial y is not 0 then
actually, by changing y,
I could still make the value
larger or smaller.
That wouldn't be an actual
maximum or minimum.
It would only be a maximum or
minimum if I stay in the slice.
But if I allow myself to change
y that doesn't work.
I need actually to know that if
I change y the value will not
change either to first order.
That is why you also need
partial f, partial y to be zero.
Now, let's say that they are
both zero.
Well, why is that enough?
It is essentially enough
because of this formula telling
me that if both of these guys
are zero then to first order the
function doesn't change.
Then, of course,
there will be maybe quadratic
terms that will actually turn
that, you know,
this won't really say that your
function is actually constant.
It will just tell you that
maybe it will actually be
quadratic or higher order in
delta x and delta y.
That is what you expect to have
at a maximum or a minimum.
The condition is the same thing
as saying that the tangent plane
to the graph is actually going
to be horizontal.
And that is what you want to
have.
Say you have a minimum,
well, the tangent plane at this
point, at the bottom of the
graph is going to be horizontal.
And you can see that on this
equation of a tangent plane,
when both these coefficients
are 0 that is when the equation
becomes z equals constant:
the horizontal plane.
Does that make sense?
We will have a name for this
kind of point because,
actually,
what we will see very soon is
that these conditions are
necessary but are not
sufficient.
There are actually other kinds
of points where the partial
derivatives are zero.
Let's give a name to this.
We say the definition is (x0,
y0) is a critical point of f --
-- if the partial derivative,
with respect to x,
and partial derivative with
respect to y are both zero.
Generally, you would want all
the partial derivatives,
no matter how many variables
you have, to be zero at the same
time.
Let's see an example.
Let's say I give you the
function f(x, y) = x^2 - 2xy + 3y^2
+ 2x - 2y.
And let's try to figure out
whether we can minimize or
maximize this.
What we would start doing
immediately is taking the
partial derivatives.
What is f sub x?
It starts with 2x - 2y + 0 + 2.
Remember that y is a constant
so the 3y^2 term differentiates to zero.
Now, if we do f sub y,
that is going to be 0 - 2x + 6y - 2.
And what we want to do is set
these things to zero.
And we want to solve these two
equations at the same time.
An important thing to remember,
and maybe I should have told
you a couple of weeks ago
already,
if you have two equations to
solve, well,
it is very good to try to
simplify them by adding them
together or whatever,
but you must keep two equations.
If you have two equations,
you shouldn't end up with just
one equation out of nowhere.
For example here,
we can certainly simplify
things by summing them together.
If we add them together,
well, the x's cancel and the
constants cancel.
In fact, we are just left with
4y = 0.
That is pretty good.
That tells us y should be zero.
But then we should,
of course, go back to these and
see what else we know.
Well, now it tells us,
if you put y = 0 it tells you
2x + 2 = 0.
That tells you x = - 1.
We have one critical point that
is (x, y) = (-1, 0).
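The two equations f sub x = 0 and f sub y = 0 form a two by two linear system, so the critical point can be checked mechanically. The sketch below uses Cramer's rule with exact fractions; the helper name solve_2x2 is made up for this illustration.

```python
from fractions import Fraction


def f_x(x, y):
    # Partial derivative of f(x, y) = x^2 - 2xy + 3y^2 + 2x - 2y in x.
    return 2 * x - 2 * y + 2


def f_y(x, y):
    # Partial derivative of the same f in y.
    return -2 * x + 6 * y - 2


def solve_2x2(a11, a12, a21, a22, r1, r2):
    # Solve a11*x + a12*y = r1, a21*x + a22*y = r2 by Cramer's rule.
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("the system has no unique solution")
    x = Fraction(r1 * a22 - a12 * r2, det)
    y = Fraction(a11 * r2 - a21 * r1, det)
    return x, y


# f_x = 0 reads 2x - 2y = -2, and f_y = 0 reads -2x + 6y = 2.
critical_x, critical_y = solve_2x2(2, -2, -2, 6, -2, 2)

if __name__ == "__main__":
    print(critical_x, critical_y)
    print(f_x(critical_x, critical_y), f_y(critical_x, critical_y))
```

Note that both equations are kept all the way through: the solver uses each of them, in line with the remark about not ending up with a single equation.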
Any questions so far?
No.
Well, you should have a
question.
The question should be how do
we know if it is a maximum or a
minimum?
Yeah.
If we had a function of one
variable, we would decide things
based on the second derivative.
And, in fact,
we will see tomorrow how to do
things based on the second
derivative.
But that is kind of tricky
because there are a lot of
second derivatives.
I mean we already have two
first derivatives.
You can imagine that if you
keep taking partials you may end
up with more and more,
so we will have to figure out
carefully what the condition
should be.
We will do that tomorrow.
For now, let's just try to look
a bit at how do we understand
these things by hand?
In fact, let me point out to
you immediately that there is
more than maxima and minima.
Remember, we saw the example of
x^2 + y^2.
That has a critical point.
That critical point is
obviously a minimum.
And, of course,
it could be a local minimum
because it could be that if you
have a more complicated function
there is indeed a minimum here,
but then elsewhere the function
drops to a lower value.
We call that just a local
minimum to say that it is a
minimum if you stick to values
that are close enough to that
point.
Of course, you also have local
maximum, which I didn't plot,
but it is easy to plot.
That is a local maximum.
But there is a third example of
critical point,
and that is a saddle point.
The saddle point,
it is a new phenomenon that you
don't really see in single
variable calculus.
It is a critical point that is
neither a minimum nor a maximum
because, depending on which
direction you look in,
it's either one or the other.
See the point in the middle,
at the origin,
is a saddle point.
If you look at the tangent
plane to this graph,
you will see that it is
actually horizontal at the
origin.
You have this mountain pass
where the ground is horizontal.
But, depending on which
direction you go,
you go up or down.
So, we say that a critical point
is a saddle point if it is neither
a minimum nor a maximum.
Possibilities could be a local
min, a local max or a saddle.
Tomorrow we will see how to
decide which one it is,
in general, using second
derivatives.
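To see the saddle behavior numerically, here is a small sketch using y^2 - x^2, one of the examples mentioned in this lecture. Both partial derivatives vanish at the origin, yet the function goes down along the x direction and up along the y direction. The step size 0.1 is an arbitrary choice.

```python
def f(x, y):
    # Saddle example: a difference of squares.
    return y**2 - x**2


def f_x(x, y):
    # Partial derivative with respect to x.
    return -2 * x


def f_y(x, y):
    # Partial derivative with respect to y.
    return 2 * y


if __name__ == "__main__":
    step = 0.1
    print("partials at origin:", f_x(0, 0), f_y(0, 0))
    print("moving in x:", f(step, 0), f(-step, 0))  # both below f(0, 0)
    print("moving in y:", f(0, step), f(0, -step))  # both above f(0, 0)
```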
For this time,
let's just try to do it by
hand.
I just want to observe,
in fact, I can try to,
you know,
these examples that I have
here,
they are x^2 + y^2, y^2 - x^2,
they are sums or differences of
squares.
And, if we know that we can put
things as sum of squares for
example, we will be done.
Let's try to express this maybe
as the square of something.
The main problem is this 2xy.
Observe we know something that
starts with x^2 - 2xy but is
actually a square of something
else.
It would be x^2 - 2xy + y^2,
not plus 3y^2.
Let's try that.
So, we are going to complete
the square.
I am going to say it is x minus
y squared, so it gives me the
first two terms and also the y^2.
Well, I still need to add two
more y^2, and I also need to
add, of course,
the 2x and - 2y.
It is still not simple enough
for my taste.
I can actually do better.
These guys look like a sum of
squares, but here I have this
extra stuff, 2x - 2y.
Well, that is 2 (x - y).
It looks like maybe we can
modify this and make this into
another square.
So, in fact,
I can simplify this further to
(x - y + 1)^2.
That would be (x - y)^2 + 2(x -
y), and then there is a plus
one.
Well, we don't have a plus one
so let's remove it by
subtracting one.
And I still have my 2y^2.
Do you see why this is the same
function?
Yeah.
Again, if I expand x minus y
plus one squared,
I get (x - y)^2 + 2(x - y) + 1.
But I will have minus one that
will cancel out and then I have
a plus 2y^2.
Now, what I have is a sum of
two squares minus one.
And this critical point,
(x, y) = (-1, 0),
that is actually when this is
zero and that is zero,
so that is the smallest value.
This is always greater than or
equal to zero, the same with that one,
so that is always at least
minus one.
And minus one happens to be the
value at the critical point.
So, it is a minimum.
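The completed-square form can be sanity-checked by comparing it with the original expression on a grid of integer points. This is only a spot check under that assumption (a finite grid, chosen arbitrarily), not a proof.

```python
def f(x, y):
    # Original form of the example function.
    return x**2 - 2 * x * y + 3 * y**2 + 2 * x - 2 * y


def f_completed(x, y):
    # Sum of two squares minus one, after completing the square.
    return (x - y + 1) ** 2 + 2 * y**2 - 1


# Arbitrary integer grid around the critical point (-1, 0).
grid = [(x, y) for x in range(-5, 6) for y in range(-5, 6)]

# The two forms agree at every grid point (exact integer arithmetic).
forms_agree = all(f(x, y) == f_completed(x, y) for x, y in grid)

# Smallest value on the grid and where it occurs.
smallest_value, smallest_point = min((f(x, y), (x, y)) for x, y in grid)

if __name__ == "__main__":
    print(forms_agree, smallest_value, smallest_point)
```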
Now, of course here I was very
lucky.
I mean, generally,
I couldn't expect things to
simplify that much.
In fact, I cheated.
I started from that,
I expanded, and then that is
how I got my example.
The general method will be a
bit different,
but you will see it will
actually also involve completing
squares.
Just there is more to it than
what we have seen.
We will come back to this
tomorrow.
Sorry?
How do I know that this equals
-- How do I know that the whole
function is greater or equal to
negative one?
Well, I wrote f of x,
y as something squared plus
2y^2 - 1.
This square is never a
negative number.
It is a square.
The square of something is
always non-negative.
Similarly, y^2 is also always
non-negative.
So if you add something that is
at least zero plus something
that is at least zero and you
subtract one,
you get always at least minus
one.
And, in fact,
the only way you can get minus
one is if both of these guys are
zero at the same time.
That is how I get my minimum.
More about this tomorrow.
In fact,
what I would like to tell you
about now instead is a nice
application of min,
max problems that maybe you
don't think of as a min,
max problem that you will see.
I mean you won't think of it
that way because probably your
calculator can do it for you or,
if not, your computer can do it
for you.
But it is actually something
where the theory is based on
minimization in two variables.
Very often in experimental
sciences you have to do
something called least-squares
interpolation.
And what is that about?
Well, it is the idea that maybe
you do some experiments and you
record some data.
You have some data x and some
data y.
And, I don't know,
maybe, for example,
x is -- Maybe you're measuring
frogs and you're trying to
measure how big the frog leg is
compared to the eyes of the
frog,
or you're trying to measure
something.
And if you are doing chemistry
then it could be how much you
put of some reactant and how
much of the output product that
you wanted to synthesize was
generated.
All sorts of things.
Make up your own example.
You measure basically,
for various values of x,
what the value of y ends up
being.
And then you would like to claim
these points are kind of
aligned.
And, of course,
to a mathematician they are not
aligned.
But, to an experimental
scientist, that is evidence that
there is a relation between the
two.
And so you want to claim -- And
in your paper you will actually
draw a nice little line like
that.
The two quantities depend linearly
on each other.
The question is how do we come
up with that nice line that
passes smack in the middle of
the points?
The question is,
given experimental data xi,
yi -- Maybe I should actually
be more precise.
You are given some experimental
data.
You have data points x1,
y1, x2, y2 and so on,
xn, yn,
the question would be find the
"best fit"
line of the form y = ax + b
that somehow approximates very
well this data.
You can also use that right
away to predict various things.
For example,
if you look at your new
homework,
actually the first problem asks
you to predict how many iPods
will be on this planet in ten
years looking at past sales and
how they behave.
One thing, right away,
before you lose all the money
that you don't have yet,
you cannot use that to predict
the stock market.
So, don't try to use that to
make money.
It doesn't work.
One tricky thing here that I
want to draw your attention to
is what are the unknowns here?
The natural answer would be to
say that the unknowns are x and
y.
That is not actually the case.
We are not going to solve for
some x and y.
I mean we have some values
given to us.
And, when we are looking for
that line, we don't really care
about the perfect value of x.
What we care about is actually
these coefficients a and b that
will tell us what the relation
is between x and y.
In fact, we are trying to solve
for a and b that will give us
the nicest possible line for
these points.
The unknowns,
in our equations,
will have to be a and b,
not x and y.
The question really is find the
"best"
a and b.
And, of course,
we have to decide what we mean
by best.
Best will mean that we minimize
some function of a and b that
measures the total errors that
we are making when we are
choosing this line compared to
the experimental data.
Maybe, roughly speaking,
it should measure how far these
points are from the line.
But now there are various ways
to do it.
And a lot of them are valid,
but they give you different answers.
You have to decide what it is
that you prefer.
For example,
you could measure the distance
to the line by projecting
perpendicularly.
Or you could measure instead,
for a given value of x,
the difference between the
experimental value of y and the
predicted one.
And that is often more relevant
because these guys actually may
be expressed in different units.
They are not the same type of
quantity.
You cannot actually combine
them arbitrarily.
Anyway, the convention is
usually we measure distance in
this way.
Next, you could try to minimize
the largest distance.
Say we look at who has the
largest error and we make that
the smallest possible.
The drawback of doing that is
experimentally very often you
have one data point that is not
good because maybe you fell
asleep in front of the
experiment.
And so you didn't measure the
right thing.
You tend to want to not give
too much importance to some data
point that is far away from the
others.
Maybe instead you want to
measure the average distance or
maybe you want to actually give
more weight to things that are
further away.
And then you want to use not
the distance but the square of
the distance.
There are various possible
answers, but one of them gives
us actually a particularly nice
formula for a and b.
And so that is why it is the
universally used one.
Here it says least squares.
That's because we will measure,
actually, the sum of the
squares of the errors.
And why do we do that?
Well, part of it is because it
looks good.
When you see this plot in
scientific papers they really
look like the line is indeed the
ideal line.
And the second reason is
because actually the
minimization problem that we
will get is particularly simple,
well-posed and easy to solve.
So we will have a nice formula
for the best a and the best b.
If you have a method that is
simple and gives you a good
answer then that is probably
good.
We have to define best.
Here it is in the sense of
minimizing the total square
error.
Or maybe I should say total
square deviation instead.
What do I mean by this?
The deviation for each data
point is the difference between
what you have measured and what
you are predicting by your
model.
That is the difference between
yi and a xi + b.
Now, what we will do is try to
minimize the function capital D,
which is just the sum for all
the data points of the square of
the deviation:
D(a, b) = sum over i of (yi - (a xi + b))^2.
Let me go over this again.
This is a function of a and b.
Of course there are a lot of
letters in here,
but xi and yi in real life
there will be numbers given to
you.
There will be numbers that you
have measured.
You have measured all of this
data.
They are just going to be
numbers.
You put them in there and you
get a function of a and b.
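As a concrete sketch of what this function of a and b looks like once the data are plugged in, here is the total square deviation written out. The sample data in the usage lines are made up for illustration and are not from the lecture.

```python
def total_square_deviation(a, b, xs, ys):
    # D(a, b): sum over all data points of (yi - (a*xi + b))^2.
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


if __name__ == "__main__":
    # Hypothetical measurements lying exactly on the line y = 2x + 1.
    xs = [0, 1, 2]
    ys = [1, 3, 5]
    print(total_square_deviation(2, 1, xs, ys))  # the line through the points
    print(total_square_deviation(0, 0, xs, ys))  # a much worse line
```

The data are fixed numbers; only a and b vary, which is why D is a function of two variables.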
Any questions?
How do we minimize this
function of a and b?
Well, let's use your knowledge.
Let's actually look for a
critical point.
We want to solve for partial D
over partial a = 0,
partial D over partial b = 0.
That is how we look for
critical points.
Let's take the derivative of
this with respect to a.
Well, the derivative of a sum
is sum of the derivatives.
And now we have to take the
derivative of this quantity
squared.
Remember, we take the
derivative of the square.
We take twice this quantity
times the derivative of what we
are squaring.
We will get 2(yi - (a xi + b)) times
the derivative of this with
respect to a.
What is the derivative of this
with respect to a?
Negative xi, exactly.
And so we will want this to be
0.
And partial D over partial b,
we do the same thing,
but differentiating with
respect to b instead of with
respect to a.
Again, the sum of twice
(yi - (a xi + b)) times the
derivative of this with respect
to b, which is, I think,
negative one.
Those are the equations we have
to solve.
Well, let's reorganize this a
little bit.
The first equation.
See, there are a's and there
are b's in these equations.
I am going to just look at the
coefficients of a and b.
If you have good eyes,
you can see probably that these
are actually linear equations in
a and b.
There is a lot of clutter with
all these x's and y's all over
the place.
Let's actually try to expand
things and make that more
apparent.
The first thing I will do is
actually get rid of these
factors of two.
They are just not very
important.
I can simplify things.
Next, I am going to look at the
coefficient of a.
I will get basically a times xi
squared.
Let me just do it and should be
clear.
I claim when we simplify this
we get xi^2 times a plus
xi times b minus xi yi.
And we set this equal to zero.
Do you agree that this is what
we get when we expand that
product?
Yeah. Kind of?
OK. Let's do the other one.
We just multiply by minus one,
so we take the opposite of that
which would be a xi plus b.
I will write that as xi a plus b
minus yi.
Sorry. I forgot the n here.
And let me just reorganize that
by actually putting all the a's
together.
That means I will have sum of
all the xi^2 times a plus sum of
xi b minus sum of xi yi equal to
zero.
If I rewrite this,
it becomes sum of xi^2 times a
plus sum of the xi's times b,
and let me move the other guys
to the other side,
equals sum of xi yi.
And that one becomes sum of xi
times a.
Plus how many b's do I get on
this one?
I get one for each data point.
When I sum them together,
I will get n.
Very good.
N times b equals sum of yi.
Now, these quantities look
scary, but they are actually
just numbers.
For example,
this one, you look at all your
data points.
For each of them you take the
value of x and you just sum all
these numbers together.
What you get,
actually, is a linear system in
a and b, a two by two linear
system.
And so now we can solve this
for a and b.
In practice,
of course, first you plug in
the numbers for xi and yi and
then you solve the system that
you get.
And we know how to solve two by
two linear systems,
I hope.
That's how we find the best fit
line.
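The two by two system can be solved directly once the sums are computed. Here is a minimal sketch, assuming the data come as two lists of equal length; the function name fit_line and the sample data are made up for this illustration.

```python
def fit_line(xs, ys):
    # Solve the 2x2 system from setting both partials of D to zero:
    #   (sum xi^2) a + (sum xi) b = sum xi*yi
    #   (sum xi)   a + n        b = sum yi
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    det = sum_xx * n - sum_x * sum_x
    if det == 0:
        raise ValueError("need at least two distinct x values")
    a = (sum_xy * n - sum_x * sum_y) / det
    b = (sum_xx * sum_y - sum_x * sum_xy) / det
    return a, b


if __name__ == "__main__":
    # Hypothetical data scattered around y = 2x + 1.
    xs = [0, 1, 2, 3]
    ys = [1.1, 2.9, 5.2, 6.8]
    print(fit_line(xs, ys))
```

The determinant is zero only when all the x values coincide, in which case no single best line is determined.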
Now, why is that going to be
the best one instead of the
worst one?
We just solved for a critical
point.
That could actually be a
maximum of this error function
D.
We will have the answer to that
next time, but trust me.
If you really want to go over
the second derivative test that
we will see tomorrow and apply
it in this case,
it is quite hard to check,
but you can see it is actually
a minimum.
I will just say -- -- we can
show that it is a minimum.
Now, even though the linear
case is the one that we are the
most familiar with,
least-squares interpolation
actually works in much more
general settings.
Because instead of fitting for
the best line,
if you think it has a different
kind of relation then maybe you
can fit in using a different
kind of formula.
Let me actually illustrate that
with an example.
I don't know if you are
familiar with Moore's law.
It is something that is
supposed to tell you how quickly
computer chips become
smarter and faster all
the time.
It's a law that says things
about the number of transistors
that you can fit onto a computer
chip.
Here I have some data about --
Here is data about the number of
transistors on a standard PC
processor as a function of time.
And if you try to do a
best-line fit,
well, it doesn't seem to follow
a linear trend.
On the other hand,
if you plot the diagram in
log scale,
the log of the number of
transistors as a function of
time,
then you get a much better line.
And so, in fact,
that means that you had an
exponential relation between the
number of transistors and time.
And so, actually that's what
Moore's law says.
It says that the number of
transistors in the chip doubles
every 18 months or every two
years.
They keep changing the
statement.
How do we find the best
exponential fit?
Well, an exponential fit would
be something of the form y equals
a constant c times the exponential of
a times x.
That is what we want to look at.
Well, we could try to minimize
a square error like we did
before.
That doesn't work well at all.
The equations that you get are
very complicated.
You cannot solve them.
But remember what I showed you
on this log plot.
If you plot the log of y as a
function of x then suddenly it
becomes a linear relation.
Observe, this is the same as ln
of y equals ln of c plus ax.
And that is the linear best fit.
What you do is you just look
for the best straight line fit
for the log of y.
That is something we already
know.
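Here is a sketch of that log trick in code: fit a straight line to (x, ln y), then exponentiate the intercept to recover the constant c. The names fit_line and fit_exponential and the sample data are assumptions for illustration. Note that this minimizes the square deviation of ln y, which is not the same as minimizing the square deviation of y itself, and it requires every y to be positive.

```python
import math


def fit_line(xs, ys):
    # Least-squares line y = a*x + b from the 2x2 linear system.
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    det = sum_xx * n - sum_x * sum_x
    if det == 0:
        raise ValueError("need at least two distinct x values")
    a = (sum_xy * n - sum_x * sum_y) / det
    b = (sum_xx * sum_y - sum_x * sum_xy) / det
    return a, b


def fit_exponential(xs, ys):
    # Model y = c * exp(a*x), i.e. ln y = ln c + a*x.
    # Fit a line to (x, ln y); the slope is a and the intercept is ln c.
    log_ys = [math.log(y) for y in ys]
    a, ln_c = fit_line(xs, log_ys)
    return math.exp(ln_c), a


if __name__ == "__main__":
    # Hypothetical data generated from y = 3 * exp(0.5 * x).
    xs = [0, 1, 2, 3]
    ys = [3 * math.exp(0.5 * x) for x in xs]
    print(fit_exponential(xs, ys))
```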
But you can also do,
for example,
let's say that we have
something more complicated.
Let's say that we have actually
a quadratic law.
For example,
y is of the form ax^2 + bx + c.
And, of course,
you are trying to find somehow
the best.
That would mean here fitting
the best parabola for your data
points.
Well, to do that,
you would need to find a,
b and c.
And now you will have actually
a function of a,
b and c, which would be the sum
over all the data points of the
square deviation.
And, if you try to solve for
critical points,
now you will have three
equations involving a,
b and c,
in fact, you will find a three
by three linear system.
And it works the same way.
Just you have a little bit more
data.
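For the parabola, setting the three partial derivatives of the square deviation to zero gives a three by three linear system in a, b and c. The sketch below builds that system and solves it by Gaussian elimination; the helper names and the sample data are made up for illustration, and the elimination routine is a generic textbook one rather than anything specific to the lecture.

```python
def solve_linear_system(matrix, rhs):
    # Gaussian elimination with partial pivoting on an augmented copy.
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("the system has no unique solution")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution.
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_parabola(xs, ys):
    # Critical point of D(a, b, c) = sum (yi - (a xi^2 + b xi + c))^2:
    #   (sum x^4) a + (sum x^3) b + (sum x^2) c = sum x^2 y
    #   (sum x^3) a + (sum x^2) b + (sum x)   c = sum x y
    #   (sum x^2) a + (sum x)   b + n         c = sum y
    n = len(xs)
    s1 = sum(xs)
    s2 = sum(x**2 for x in xs)
    s3 = sum(x**3 for x in xs)
    s4 = sum(x**4 for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2y = sum(x**2 * y for x, y in zip(xs, ys))
    matrix = [[s4, s3, s2], [s3, s2, s1], [s2, s1, n]]
    return tuple(solve_linear_system(matrix, [sx2y, sxy, sy]))


if __name__ == "__main__":
    # Hypothetical data lying exactly on y = x^2 - 2x + 3.
    print(fit_parabola([-1, 0, 1, 2, 3], [6, 3, 2, 3, 6]))
```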
Basically, you see that these
best-fit problems are an example
of a minimization problem in a
place where maybe you didn't
expect minimization problems to come in.
But that is really the way to
handle these questions.
Tomorrow we will go back to the
question of how do we decide
whether it is a minimum or a
maximum.
And we will continue exploring
functions of several variables.
