- [Brandon] Now as I'm sure
you know from experience,
the world is a very complex place.
So when we're looking to predict the value
of a variable, oftentimes we
can get better predictions
if we use more than one other variable
to make that prediction and that leads us
to multiple regression.
Now I am going to assume
you have some familiarity
and some comfort with
simple linear regression
which we covered in the previous series.
So if you're still a
bit shaky on just simple
linear regression, I
would go back, review that
and then come back to this series.
So without further ado, let's go ahead
and get to learning.
So as always, let's
start out with a problem
and a dataset that we're gonna use
for the next several videos.
And this one is called the
Regional Delivery Service problem.
So let's assume that you are
the owner of a small business,
Regional Delivery Service,
Incorporated, or RDS for short,
which offers same-day delivery for letters,
packages, and other small cargo.
You are able to use Google Maps
to group individual
deliveries into one trip
to reduce time and fuel costs,
just like UPS, FedEx,
or the Postal Service would.
Therefore some trips will
have more than one delivery.
Now as the owner, you would
like to be able to estimate
how long a delivery will
take based on two factors,
one, the total distance
of the trip in miles
and two, the number of
deliveries that must be made
during that trip.
So we're looking to estimate
how long a delivery trip
will take based on the
distance and the number
of deliveries during that
trip, so two factors.
So to conduct your analysis
you take a random sample
of 10 past trips and record
three pieces of information
for each trip, one, the
total miles traveled,
two, the number of
deliveries during that trip
and three, the total travel time in hours,
which is what we're trying to predict.
So you make a table that looks like this.
So we have miles traveled, num deliveries,
which is the number of deliveries,
and then travel time
in hours along the top.
We have labeled them X1, X2 and Y.
Now X1 and X2 are a special
type of variable,
and Y is a distinct type;
we'll discuss both here in a minute.
So we can see the first
trip we traveled 89 miles.
We had four deliveries on that trip,
and the total time was seven hours.
The second trip was 66 miles.
We only had one delivery and
the travel time was 5.4 hours.
So here are our 10 trips.
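Now if you wanted to follow along in a stats package like R, entering this table might look something like the sketch below. Only the first two trips, the ones we just read off, are shown; the other eight rows would be filled in from the table the same way.

```r
# First two trips from the table; the remaining eight rows
# would be entered from the table in the same way.
trips <- data.frame(
  miles      = c(89, 66),   # X1: total miles traveled
  deliveries = c(4, 1),     # X2: number of deliveries on the trip
  hours      = c(7.0, 5.4)  # Y: total travel time in hours
)
```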
Now remember that in this
case, you would like to be able
to predict the total travel time,
that's the right column
there in the orange-brown color,
using both the miles traveled,
which is X1, the first column,
and the number of deliveries,
which is X2, the second column,
for each trip.
So the question is, in what
way does travel time depend
on the first two measures, miles traveled
and number of deliveries?
So travel time is the dependent variable
and miles traveled and
number of deliveries
are independent variables.
Now one note here: some prefer
the terms predictor variables
and response variable
instead of independent
and dependent variables, respectively.
Since most stats textbooks
use independent variable
and dependent variable,
I am going to stick with those,
though I'm partial to predictor
and response myself.
Just keep in mind that
depending on your textbook
and your professor,
you may hear one set of terms
or the other, or both.
So what about multiple regression?
So multiple regression
is just an extension
of simple linear regression,
again which we talked about
in the last series.
So remember in simple linear
regression, we have a one
to one relationship.
So we have a dependent
variable and we're going
to utilize an independent
variable to explain
the variation in that dependent variable
or make predictions about
that dependent variable.
Now in multiple regression, we have a many
to one relationship.
So we still have one dependent variable,
but we can have two or more
independent variables.
In this case we have four on the screen,
but we could have just two or three,
or more than four,
all being utilized
to explain the variation in,
or predict the value of,
the dependent variable.
So we go from a one-to-one relationship,
one independent to one dependent,
to a many-to-one relationship,
two or more independent variables
and one dependent variable.
Now having more independent
variables complicates
things a bit,
so there are some
new things to consider.
The first is that adding
more independent variables
to a multiple regression
procedure does not necessarily mean
the regression will be better
or offer better predictions.
In fact, doing so can
actually make things worse.
This is called overfitting.
So let's say we conduct a
multiple regression procedure
and our model explains
65% of the variation
in the dependent variable.
Well for some reason we don't like that.
We think well, we can do better than that.
So we start adding in more
independent variables.
Well adding more independent
variables will explain
more of the variation in
the dependent variable
but it can do so under false pretenses.
So adding more variables
will always explain
more variation, but it can
open up a whole Pandora's box
of other problems that we
definitely want to avoid.
So we'll talk about that more as we go,
but I just wanna float it out there
that dumping more variables
into a multiple regression
procedure is not the way to go.
The idea is to pick the best
variables for the model.
We'll talk about how to
do that in future videos.
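Just to make that concrete, here's a little sketch in R using simulated data, not our delivery data, that shows R squared never goes down when you add a variable, even one that's pure noise.

```r
set.seed(42)
n     <- 100
x1    <- rnorm(n)
y     <- 2 + 3 * x1 + rnorm(n)  # y truly depends only on x1
noise <- rnorm(n)               # a variable with no relationship to y

summary(lm(y ~ x1))$r.squared          # R-squared with the real predictor
summary(lm(y ~ x1 + noise))$r.squared  # slightly higher, but spuriously so
```

The second R squared comes out a bit higher, but the noise variable adds nothing real; that's overfitting in miniature.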
The other concept is that the addition
of more independent variables
(see a pattern here?)
creates more relationships among them.
So not only are the independent
variables potentially
related to the dependent
variable, they are also
potentially related to each other.
Now when this happens, it
is called multicollinearity.
Now it's a mouthful of a word to say
and I stumble over it
sometimes, but hopefully
we'll get better at it as we go.
So it's called multicollinearity when the
independent variables are
correlated with each other.
So the ideal, the perfect
world is for all the
independent variables to
be correlated with the
dependent variable but
not with each other.
And again, we'll talk about
overfitting more as we go.
We'll talk about multicollinearity
more as we go forward
and just keep in mind that the ideal is
for the independent
variables to be correlated
with the dependent variable
but not with each other.
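When we get to the numbers, one quick way to check that in R, assuming the full ten-row trips table from earlier has been entered, is to correlate the two independent variables with each other:

```r
# Correlation between the independent variables only;
# a value near zero is what we're hoping for.
cor(trips$miles, trips$deliveries)
```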
Now because of multicollinearity
and overfitting,
there is a fair amount of prep work to do
before conducting multiple
regression analysis
if one is to do it properly.
And in a future video,
we will walk through
all those things step
by step so that you form
the best model you can.
So things like
correlations, scatter plots,
and some simple regressions
between each of the independent variables
and the dependent variable,
just to see how they're related.
So to do multiple regression
properly, really running
the multiple regression
is the very last step.
There's a lot of prep work
to do before doing that
and again we'll talk about it as we go.
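As a preview, that prep work might look something like this sketch in R, again assuming the full ten-row trips table:

```r
cor(trips)   # correlations among all three variables
pairs(trips) # scatter plot for every pair of variables

# Simple regressions of the dependent variable on each
# independent variable, one at a time:
summary(lm(hours ~ miles, data = trips))
summary(lm(hours ~ deliveries, data = trips))
```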
So as we talked about before,
adding more independent
variables creates more relationships
among all the variables.
So we have this many-to-one relationship.
Now in our problem we
have a dependent variable
that is the travel time.
We are trying to predict the travel time
of these trips.
Now we are utilizing two
independent variables
that we selected.
We have miles traveled, that's our X1,
and then we have the number of deliveries,
or num deliveries, which is our X2.
Now we'd like to utilize those
two independent variables
to make predictions about
the dependent variable.
Now by setting it up
this way, we also create
a third relationship, and
that is the relationship
between the two independent
variables themselves.
So we don't have just the two
independent-to-dependent relationships;
we now have a relationship
between the independents as well.
And having that relationship
sets up the potential
multicollinearity risk.
So we're gonna have to see
when we do this problem,
whether or not these two
independent variables
are correlated with each other.
And the easy way to
think about this is that
if these two independent
variables are related
to each other, we're
really not sure which one
is explaining the variation
in the dependent variable.
So if I put sea salt and
table salt in my dinner,
all I know is that it tastes salty.
But I can't tell the
difference necessarily
between the two because they're both salt.
They have the same relationship
to my now salty dinner.
So we want a distinction between the
independent variables, so that each one
explains something different
and has its own relationship
with the dependent variable
over here on the right.
So we will walk through that
analysis as we go forward.
Now let's look at this situation.
So here we have one dependent variable
and four independent variables.
So we know we have the four relationships
with each independent variable
and the dependent variable.
So right there we already have
four variable relationships.
But we're not done.
We have to account for
all the relationships
between the independent variables.
And that's six more.
So now with four independent variables
and one dependent variable,
we have 10 relationships
we have to consider.
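If you like a formula for that count, with k independent variables plus the one dependent variable, the number of pairwise relationships is just the number of ways to choose two variables out of k + 1:

```latex
\binom{k+1}{2} = \frac{(k+1)\,k}{2},
\qquad \binom{5}{2} = \frac{5 \cdot 4}{2} = 10 \ \text{ for } k = 4
```

That 10 breaks down as the 4 independent-to-dependent relationships plus the 6 relationships among the independents themselves.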
Now you can see as each
independent variable is added,
these relationships become very numerous.
So part of the art of multiple
regression is deciding
which independent variables make the cut
and which do not.
And we'll talk about
that as we go of course.
So the bottom line is that
some independent variables
or sets of independent
variables are better
at predicting the dependent
variable than others.
And some independent
variables contribute nothing.
So we'll have to decide
which independent variables
to include in our model
and which ones to exclude.
So again, the ideal is for all
of the independent variables
to be correlated with
the dependent variable,
so the orange lines,
but not with each other,
so the colored dotted lines here.
So this slide is not something
you have to really commit
to memory, but I just wanna
show you sort of where
the multiple regression model comes from.
So we have our multiple regression model
which is Y equals beta sub zero,
plus beta one X1, plus beta two X2,
et cetera, up through beta sub p Xp,
where p just means the number of
independent variables we have,
plus epsilon.
Now over here on the left,
what we have is a sum of linear terms.
So we have beta sub zero,
which is our intercept,
then we have beta one X1,
which is one variable
and its weight, then we have beta two X2,
which is another variable and its weight,
et cetera, et cetera.
So it's just a sum
of linear terms.
But over here on the right
we have our error term.
So we've seen this before
in simple linear regression.
So we have an intercept plus
a set of linear parameters
plus an error term.
Now for the multiple regression equation,
we have the expected value
of Y equals everything
we see up at the top but
there is no error term.
Well why is that?
That's because in the
multiple regression equation,
the expected value of the error term
is assumed to be zero,
and therefore it drops off the end
of that equation.
The one we're gonna be
familiar with is the estimated
multiple regression equation.
So again when we're using sample data,
it's never gonna be perfect.
We're estimating, so we
have to use the estimated
multiple regression equation.
So Y hat, the predicted value of Y,
equals B sub zero plus B1 X1
plus B2 X2, et cetera.
As you can see it follows the same form
as the multiple regression
equation above it
and everything that you
see at the bottom is
just an estimate of what is above it.
So B zero, B one, B two
are all the estimates
of beta zero, beta one,
beta two, et cetera
and then Y hat is the predicted value
of the dependent variable.
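Written out, the three forms we just walked through look like this:

```latex
\begin{aligned}
Y       &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon
        && \text{(multiple regression model)} \\
E(Y)    &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p
        && \text{(multiple regression equation)} \\
\hat{Y} &= b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p
        && \text{(estimated multiple regression equation)}
\end{aligned}
```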
So again, this is not
something you need to really
commit to memory, but I just
want you to see the pattern
of how these multiple regression equations
are gonna look and we'll
talk about in the next slide
sort of what they mean.
So let's go ahead and look at an example.
So this is a multiple regression equation
you may generate based on
some analysis you conduct.
So we have Y hat equals 6.211,
plus 0.014 X1, plus
0.383 X2, minus 0.607 X3.
This is a standard form of a
multiple regression equation
you may generate.
Now if we look at our estimated
multiple regression equation
so we have Y hat equals B zero plus B1, X1
plus B2, X2, et cetera,
if you look at that
and compare it to the equation at the top,
you can see that they're very similar.
We just have some stand-in numbers
that we have to interpret.
So we have our variables, so X1, X2 and X3
are our variables and we can
see that they're in place
there at the top and of
course in the equation
at the bottom.
Then we have some
coefficients and an intercept.
So 6.211 is our intercept
which corresponds
with B sub zero in the equation below.
Then we have 0.014 there in
the blue that corresponds
to the first coefficient on the bottom,
et cetera, et cetera.
So we follow the same basic form.
Intercept plus some coefficients
paired with a variable.
So a coefficient with our first variable,
a coefficient with the second variable,
in this case a coefficient
with the third variable.
And again, these are all estimates
of the multiple regression model.
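Just to see the arithmetic, here's a sketch in R that plugs some made-up input values, purely hypothetical, into those estimated coefficients:

```r
b0 <- 6.211; b1 <- 0.014; b2 <- 0.383; b3 <- -0.607

x1 <- 10; x2 <- 3; x3 <- 2  # hypothetical inputs, not from any real dataset

y_hat <- b0 + b1 * x1 + b2 * x2 + b3 * x3
y_hat  # 6.211 + 0.140 + 1.149 - 1.214 = 6.286
```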
So how do we interpret the coefficients
in multiple regression?
They're interpreted a bit differently
than they are in simple linear regression.
Let's take this example.
So we have Y hat equals
27 plus 9X1 plus 12X2.
Everything we see here is
in thousands of dollars.
So X1, our first variable,
stands for capital investment
in thousands of dollars.
And X2 stands for the
marketing expenditures
in thousands of dollars;
that's there in the blue.
And of course Y hat is
gonna be our predicted sales
in thousands of dollars.
So everything is in thousands of dollars.
We have to keep that
in mind as we go about
interpreting it.
So in multiple regression
each coefficient,
so we have our nine and our 12 up there,
is interpreted as the
estimated change in Y
corresponding to a one
unit change in a variable
when all other variables
are held constant.
So what does that mean in this problem?
So in this example, $9,000 is an estimate
of the expected increase
in sales, which is Y,
corresponding to a $1,000
increase in capital investment
which is our X1 up there.
So remember, everything's
in thousands of dollars,
so I've written it out
in actual dollars
here at the bottom.
So $9,000 is an estimate
of the expected increase
in sales corresponding to a $1,000 increase
in capital investment, which is X1.
Well, why is that?
Because the coefficient
on our X1 variable up there
is nine.
So if we increase X1 by one,
we have nine times one,
and since the units are thousands
of dollars, that's nine times $1,000.
That's $9,000, assuming
we hold the X2 over here
on the right constant.
And that's how we
interpret the coefficients
in multiple regression.
We could flip that and say
well, $12,000 is an estimate
of the expected increase
in sales Y, corresponding
to a $1,000 increase in
marketing expenditures
when capital investment is held constant.
So a one unit increase
when everything else,
all the other variables are held constant.
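Here's that same idea as a quick sketch in R; the input values are hypothetical and chosen only to show the one-unit change:

```r
# Predicted sales, in thousands of dollars, from the equation above
predicted_sales <- function(capital, marketing) 27 + 9 * capital + 12 * marketing

# Raise capital investment by one unit ($1,000), hold marketing constant:
predicted_sales(6, 10) - predicted_sales(5, 10)  # 9, i.e. $9,000 more in sales

# Raise marketing by one unit, hold capital investment constant:
predicted_sales(5, 11) - predicted_sales(5, 10)  # 12, i.e. $12,000 more in sales
```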
And again, we'll be doing
this more in future videos
so we'll get some practice at it.
But that's the basic idea of
how we interpret coefficients
in a multiple regression equation.
Let's go ahead and do a quick review
and then we'll be done
with this first video.
So multiple regression is an extension
of simple linear regression.
Two or more independent
variables are used to predict
or explain the variance
in one dependent variable.
Two problems may arise, however:
overfitting and multicollinearity.
Overfitting is caused
by adding too many
independent variables;
they account for more
variance but really add
nothing more to the model.
Multicollinearity happens when some
or all of the independent
variables are correlated
with each other, and it
becomes hard to tell
which one is actually predicting
or explaining the variance
in the dependent variable
'cause they're so similar.
In multiple regression, each
coefficient is interpreted
as the estimated change in
Y, the dependent variable,
corresponding to a one
unit change in a variable
when all other variables
are held constant.
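In symbols, if you like, that rule is:

```latex
b_j \;=\; \frac{\partial \hat{Y}}{\partial X_j}
\qquad \text{(estimated change in } Y \text{ per one-unit change in } X_j,
\text{ all other variables held constant)}
```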
And again, we'll practice
that more as we go.
So this was our first
video, just the very basics
of multiple regression.
We'll be doing more videos in the future
and walking through the
process by which we examine
our variables.
We look at relationships among
the variables before we ever
get into conducting
the multiple regression
using a statistics package like SPSS or R
or Excel or whatever.
There's a lot of prep work
to do, and that's what
we're gonna cover in the next video.
So I hope you found this
first video helpful.
If you like the video,
please give it a thumbs up,
subscribe, share it or
whatever you wanna do,
spread the word.
I just do these to help people learn.
So I hope you enjoyed it
and I'll see you again
in the next video.
(light music)
