&gt;&gt; Exercise 6 summarizes
different concepts
about correlation
and regression.
Here we have a dataset
that relates cigarettes smoked
a day with life expectancy.
Here we have the scatter plot.
Question a, what
are the explanatory
and response variables?
This is easy.
The explanatory variable
is number
of cigarettes smoked a day.
The response variable
is life expectancy.
Question b, describe the
direction, shape, and strength
of the relationship
between cigarettes a day
and life expectancy.
Remember that you cannot answer
this question by just looking
at the table or the
correlation coefficient.
You need to look at the
scatter plot.
In the scatter plot, we see
that this relationship is
negative, linear, and strong.
Question c, use statistical
software to find the equation
of the regression line
between cigarettes a day
and life expectancy.
The regression line is the line
that best fits the dataset.
We could call it
the optimal line.
I have created tutorials
where I teach how
to use statistical software
to obtain results like this.
Watch my tutorials if you need
to learn how to use
the software.
With the software, you
will see that the equation
of the regression line is:
predicted life expectancy =
84.76 − 0.696 × cigarettes a day.
Usually, books and software
don't show this word "predicted."
I like including it because
the regression line is just a
mathematical model.
It's not the reality.
What we get with the
regression line are predictions.
It's always a good
idea to sketch the line
in the scatter plot
like I did here.
Question d, interpret the slope.
A silly mistake that I see here
very often is writing only the
number, negative 0.696.
This is the slope, yes, but
you are asked to interpret it.
So how do you put this
number into a sentence?
The general algebraic
interpretation
of a slope is the change in
the y per each unit of increase
in the x. Since the
x here is number
of cigarettes smoked a day,
and the y is life expectancy,
the interpretation would be:
for each additional cigarette
smoked a day, life expectancy
is expected or predicted
to decrease by 0.696 years.
Again, I am including the
expression "is expected"
or "is predicted" because this is
not the slope of the reality.
It's only the slope of
our mathematical model,
the slope of our predictions.
It's a sort of average of what
we observed in the sample.
Notice that here, the word
decrease replaces the negative
sign, and we then write the
absolute value of the slope.
Question e, interpret
the y-intercept.
The general interpretation
of a y-intercept is the value
of the y when the x is zero,
that's to say in the origin.
Very often, the interpretation
does not make sense
for a regression line
because the zero is too far
from the dataset
and it might be absurd
to consider x equals zero.
But in this case, it does
because a person can smoke
zero cigarettes a day.
The y-intercept here
is 84.76 years.
So, the interpretation would
be, the life expectancy
of a nonsmoker is expected or
predicted to be 84.76 years.
Question f, find the correlation
coefficient and interpret it.
You do this with software.
It is negative 0.83.
The scatter plot shows that
the relationship is linear.
So we can interpret the
coefficient in this way.
The relationship between
cigarettes smoked a day
and life expectancy is
negative, linear, and strong.
Question g, find the
coefficient of determination.
Interpret it.
The coefficient of determination
is simply the square
of the correlation coefficient.
In this case, it is
(−0.83)² ≈ 0.69.
Here's a warning: usually
it is completely different
from the slope.
In this particular
exercise, coincidentally,
they are very similar.
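The squaring step takes one line, and it also shows why the near-match with the slope is pure coincidence:

```python
# The coefficient of determination is the square of r:
r = -0.83                   # correlation coefficient from question f
r_squared = r ** 2
print(round(r_squared, 2))  # 0.69
```

Squaring −0.83 gives 0.6889, which happens to land near the slope's absolute value, 0.696, in this particular exercise.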
Many authors interpret the
coefficient of determination
as the proportion of the
variation in the response
that can be explained or
predicted or determined
by the explanatory variable
and our mathematical model.
In this case, we can
say, 69% of the variation
in life expectancy can
be predicted or explained
or determined with the number
of cigarettes smoked a
day and our linear model.
In other words, it's clear
that life expectancy varies.
For some people, it's 78 years.
For some other people,
it is 60 years.
For others, it is 55 years.
What are the factors
involved in that variation?
There is one factor that is so
involved as to let us predict
or account for 69%
of that variation,
that's to say a large
part of the variation.
And this factor is the number
of cigarettes smoked a day.
What happens with the
other 31% of the variation?
There are other factors
or confounding variables
involved in life expectancy.
Maybe we could predict or
explain or determine more
of the life expectancy's
variation
if we considered
these other factors.
What I don't like about the
expression "explained by" is
that it seems to
imply causation.
And the coefficient
of determination has
nothing to do with causation.
That's why I might suggest
using instead the word
"predicted" or "determined."
By the way, using the word
"determined" makes sense
of the name of the coefficient.
Question h, if a friend of yours
smokes 29 cigarettes a day,
what's the prediction of
their life expectancy?
This is the kind of
question that we answer
with the regression line.
The value of the
explanatory variable is given.
We plug it in the equation
of the regression line,
and it gives us the value of
the response, 64.58 years.
We were anticipating
a low life expectancy
because this friend
smokes so much.
Interestingly, the 29 was one
of the values in our sample.
So we can compare
this prediction
with the actual observed value.
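Plugging x = 29 into the equation from question c can be sketched in a single line:

```python
# Plug x = 29 into the regression equation from question c:
# predicted life expectancy = 84.76 - 0.696 * cigarettes
prediction = 84.76 - 0.696 * 29
print(round(prediction, 2))  # 64.58
```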
This brings us to question i
and the concept of residual.
Notice that the life expectancy
observed in the sample
for 29 cigarettes a day is 58,
even lower than our prediction.
The definition of residual is
the observed value minus the
predicted value.
The observed value
for 29 cigarettes a day was
58; 58 minus the prediction given
by the regression line, 64.58,
equals negative 6.58.
We can graph this residual
in the scatter plot.
For 29 cigarettes a day,
the point in the regression line
is the prediction, and the point
in the cloud is the observation.
This difference or
distance is the residual.
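The residual computation follows directly from its definition, observed minus predicted:

```python
# residual = observed - predicted, for x = 29 cigarettes a day
observed = 58
predicted = 84.76 - 0.696 * 29   # about 64.58
residual = observed - predicted
print(round(residual, 2))  # -6.58
```

A negative residual means the observation sits below the regression line, as the scatter plot shows here.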
It's important to speak
of these residuals
because minimizing their
squares is the criterion
by which we select the optimal
line, the regression line.
In other words, you could
think that maybe a line
like this is a better fit
because it crosses
more observations.
Certainly, the residual
for these observations
would be close to zero.
But if you consider all
the other squared residuals,
you will accumulate more
error with this line.
In this criterion, we
square the residuals
to prevent the negative ones
from canceling the
positive ones, just as we do
in the standard deviation, as
we saw in Chapter 3 Part C.
So you can bet that this
line, the regression line,
is the optimal line
for this dataset.
Optimal according to what?
According to minimizing
the sum of squared residuals.
As a final idea, this sum of
squared residuals is also used
as a measure of appropriateness
of the model.
That's to say, if
you use that criterion
to find the optimal
exponential model, you will see
that that optimal exponential
model generates a greater sum
of squared residuals
than this linear model.
It seems clear in
the scatter plot
that this is more
linear than exponential.
But the sum of the squared
residuals provides a numerical
measure of that.
Question j, could we predict
the life expectancy of a person
who smokes 65 cigarettes a day?
You can always plug the
value in the equation
of the regression line.
The problem here is
that the highest value
of the explanatory variable
in our dataset is 43.
If we predict beyond this
point, we are assuming
that the pattern
continues the same way.
But we don't know that.
So we cannot trust
this prediction much.
This is called extrapolation,
predicting beyond the
range of our data.
And it is always less
reliable than interpolation,
which means predicting
within the range of our data.
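Mechanically, nothing stops us from plugging in 65; the equation returns a number either way:

```python
# The equation accepts any x, but 65 is far beyond the largest
# observed value (43), so this prediction is an extrapolation:
extrapolated = 84.76 - 0.696 * 65
print(round(extrapolated, 2))  # 39.52
```

The arithmetic is fine; the problem is that we have no data telling us the linear pattern still holds at 65 cigarettes a day.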
The classical example
is the relationship
between age and height.
It is positive, linear,
and very strong
from three to 16 years old.
But we cannot use
that regression line
to predict the height
of a person who is 40
because the pattern changes
drastically after 17.
People stop growing.
The prediction for the
40-year-old guy could be 10
or 12 feet, or three
or four meters.
This is another reason why the
y-intercept usually doesn't make
sense, which we mentioned in
question e. When the zero is far
from the dataset,
it usually happens
that the relationship is very
different close to the zero.
Question k, does
this study prove
that smoking reduces
life expectancy?
What lurking variables
could explain the
observed relationship?
The first answer
is obviously no,
correlation never
proves causation.
This reminds you of what
we saw in Chapter 1,
that an observational study
can never prove causation.
The second question introduces
the concept of lurking variable.
For many authors, it's just a
synonym for confounding variable,
which we also saw in Chapter 1.
Interestingly enough, they tend
to introduce them at
different moments.
They usually speak of
confounding variables
in the context of experiments
and lurking variables
in the context of correlation.
It seems to me that
it is more common
to use the terms confounding
variables or confounders
for those variables
that operate on the response
together with the explanatory.
This is the diagram that
I used in Chapter 1.
Then I defined a confounding
variable as any variable
that could affect the response
and is not the explanatory.
But the confounding
variables don't need
to affect the explanatory.
They are just there with it.
A different structure is
this, when we have a variable
that affects both at the same
time, the explanatory
and the response variables.
In cases like this,
the explanatory
and response variables
change together.
Not because one is the
cause of the other,
but because there is a
third one causing both.
A nice example that a
colleague of mine uses is
when his boy was born,
he planted a tree
in their backyard.
During the first eight
years, the correlation
between the height of
the tree and the height
of the boy was almost
perfectly linear,
with a correlation
coefficient of almost one.
But neither was the tree
responsible for the child,
nor was the child
responsible for the tree.
So what variables could
explain the strong correlation?
Probably the most
important is time.
Since time passed for both
and both were living
beings, they grew together.
We could also mention
the father,
the father was taking care
of the tree and the son.
It seems to me that the concept
of lurking variable is most
often used in situations
like this, where one
variable is hidden,
affecting both the explanatory
and the response,
like a puppet master.
Anyway, since these two
terms are commonly used
interchangeably, you are
free to use the one you prefer.
Just in passing, this
example of the tree
and the boy is a good moment
to bring back the concept
of coefficient of determination,
which we introduced
in question g. It
was also very high
in this case, let's say, 0.95.
The question that this
coefficient answers is how much
of the growth of the
boy can be determined
if we know the growth
of the tree?
The answer is a lot.
If you tell me how tall the
tree is, I will put this height
in the regression line,
and I will tell you almost
exactly how tall the boy is.
So the interpretation
of the coefficient
of determination makes a lot of
sense here: 95% of the variation
in the boy's height can be
determined with the help of the
tree and the regression line.
Again, this does not imply
that the tree is the
cause of the boy's growth.
Going back to question
k, some lurking variables
that could explain
the correlation
between cigarettes smoked a day
and life expectancy
are lifestyle,
mindset, or personality.
Let's say that John is committed
to a healthy lifestyle.
Because of this, he does not
smoke, but also because of this,
he takes care of his diet,
he exercises regularly,
and all these may also
increase his life expectancy.
One final thought.
Here, we are not saying
that smoking cigarettes does
not reduce life expectancy.
What we are saying is that
this cannot be proved
with just a correlation.
