In this video we'll define something called the cost function. 
This will let us figure out 
how to fit the best possible straight line to our data. 
In linear regression we have a training set like that shown here. Remember, the notation m is the number of training examples, so maybe m equals 47, and the form of our hypothesis, which we use to make predictions, is this linear function.
To use a little bit more terminology, these theta zero and theta one, these theta i's, are what I call the parameters of the model, and what
we're going to do in this 
video is talk about how 
to go about choosing these two 
parameter values, theta 0, and theta 1. 
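Written out, that hypothesis with its two parameters is:

h_\theta(x) = \theta_0 + \theta_1 x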
With different choices of 
the parameters theta 0, and 
theta 1 we get different hypotheses, different hypothesis functions.
I know some of you will 
probably be already familiar with 
what I'm going to do on 
the slide, but just to review here are a few examples. 
If theta zero is 1.5 
and theta one is zero, then 
the hypothesis function will look 
like this because your 
hypothesis function will be h of x equals one point five plus, you know, zero times x, which is this constant value function which is flat at 1.5.
If theta zero equals zero and theta one equals 0.5, then the hypothesis will look like this, and it should pass through this point (2, 1), since you now have h of x, really h subscript theta of x, but sometimes I'll just omit theta for brevity. So h of x would be equal to just 0.5 times x, which looks like that,
and finally, if theta zero equals 1 and theta one equals 0.5, then we end up with a hypothesis that looks like this. And it should pass through the point (2, 2), like so, and this is my new h of x, or my new h subscript theta of x. And, whatever, you remember I said that this is h subscript theta of x, but as a shorthand sometimes I'll just write this as h of x.
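In equations, the three example hypotheses are:

h_\theta(x) = 1.5 + 0 \cdot x = 1.5
h_\theta(x) = 0 + 0.5\,x = 0.5\,x
h_\theta(x) = 1 + 0.5\,x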
In linear regression, we have a training set, like maybe the one I've plotted here.
What we want to do is come up with values for the parameters theta zero and theta one, so that the straight line we get out of this corresponds to a straight line that somehow fits the data well.
It may be that line over there. 
So how do we come up with values theta zero, theta one that correspond to a good fit to the data?
The idea is we get to choose our parameters theta zero, theta one so that h of x, meaning the value we predict on input x, is at least close to the values y for the examples in our training set. So, in our training set we're given a number of examples where we know x, the size of the house, and we know the actual price it was sold for.
So let's try to choose values 
for the parameters so that at 
least in the training set, given the x's in the training set, we make reasonably accurate predictions for the y values.
Let's formalize this. 
So in linear regression, what we're
going to do is I'm going 
to want to solve 
a minimization problem. 
I'm going to minimize over theta zero, theta one. 
And I wanted this 
to be small, right? 
I want the difference between h of x and y to be small. 
And one thing I might do 
is try to minimize the square 
difference between the output of the hypothesis and the actual price of the house, okay?
So, let's fill in some details. 
You remember that I was using the notation (x(i), y(i)) to represent the ith training example. So what I really want is to sum over my training set, a sum from i equals 1 through m, of the squared difference between the prediction of my hypothesis when it is given as input the size of house number i, and the actual price that house number i was sold for. So I want to minimize, over my training set, this sum from i equals 1 through m of the squared error, the squared difference between the predicted price of the house and the price that it was actually sold for.
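Written out, the minimization problem so far is:

\min_{\theta_0, \theta_1} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2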
And just to remind you, the 
notation m here was 
the size of my training set. So this m is my number of training examples, and that hash sign is the abbreviation for "number" of training examples.
Okay? 
And to make some of 
our later math a little 
bit easier, I'm actually 
going to look at, 
you know, 1 over m, times that. 
So we'll try to minimize the average, or rather, we're going to minimize one over 2m times that sum. Putting the 2, the constant one half, in front of it just makes some of the math a little bit easier. So minimizing one half of something should give you the same values of theta zero and theta one as minimizing the original function.
And just make sure that 
this equation is clear, right, 
this expression in here, h 
subscript theta of x. 
This is my, this is our usual hypothesis, right? That's equal to theta zero plus theta one x i. And this notation, minimize over theta zero, theta one, just means: find me the values of theta zero and theta one that cause this expression to be minimized. And this expression depends on theta zero and theta one.
Okay? 
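That is, for each training example i:

h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}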
So, just to recap, we're 
posing this problem as, "Find 
me the values of theta zero 
and theta one so that the 
average, or really 1 
over 2m times the sum 
of square errors between my 
predictions on the training set 
minus the actual values of 
houses on the training set, is minimized." 
So this is going to 
be my overall objective function for linear regression. 
And just to rewrite this out a little more cleanly. 
What I am going to do is, 
by convention we usually define 
a cost function, which is 
going to be exactly this, that 
formula that I have up here. 
And what I want 
to do is minimize over 
theta zero and theta 
one my function J of theta zero comma theta one, which I've just written out here.
This is my cost function. 
So this cost function 
is also called the squared error function. 
I sometimes call it 
the squared error cost function. 
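Putting it all together, the squared error cost function and the problem we are posing are:

J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2, \qquad \min_{\theta_0, \theta_1} J(\theta_0, \theta_1)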
Now, why do we, you know, take the squares of the errors?
It turns out that the 
squared error cost function is a 
reasonable choice and will work 
well for most problems, for most regression problems. 
There are a lot of other cost functions 
that will work pretty well, but 
the squared error cost function is 
probably the most commonly used one for regression problems. 
We can talk about alternative cost 
functions as well, but this 
choice that we just had should be a pretty reasonable thing to try for most linear regression problems.
Okay. 
So that's the cost function. 
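As a minimal sketch (not part of the lecture; the function name and the tiny made-up training set below are just for illustration), here is how one might compute this squared error cost in Python:

```python
def compute_cost(x, y, theta0, theta1):
    """Squared error cost J(theta0, theta1) = (1 / 2m) * sum((h(x_i) - y_i)^2)."""
    m = len(x)
    predictions = [theta0 + theta1 * xi for xi in x]            # h_theta(x_i) for each example
    squared_errors = [(h - yi) ** 2 for h, yi in zip(predictions, y)]
    return sum(squared_errors) / (2 * m)

# Tiny made-up training set: sizes x and prices y
x = [1.0, 2.0, 3.0]
y = [1.0, 2.0, 3.0]
print(compute_cost(x, y, 0.0, 1.0))  # perfect fit: cost is 0
print(compute_cost(x, y, 0.0, 0.5))  # worse fit: cost is positive
```

Parameter values that fit the training data well give a small cost, and poorly fitting values give a larger cost, which is exactly what the minimization is exploiting.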
So far we've just seen a mathematical definition of, you know, this cost function.
And in case this function 
j of theta zero theta one, 
in case this function seems a 
little bit abstract and you 
still don't have a good 
sense of what it's doing, in 
the next video, in the 
next couple videos, I'm actually 
going to go a little 
bit deeper into what the 
cost function j is 
doing and try to give 
you better intuition about what 
it's computing and why we want to use it. 
