Hi, and welcome back.
This is the main section of this course.
It builds on the knowledge that you acquired
previously, so if you haven’t been through
the earlier material, you may have a hard time keeping up.
Make sure you have seen all the videos about
confidence intervals, distributions, z-tables
and t-tables, and have done all the exercises.
If you’ve completed them already, you are
good to go.
Confidence intervals provide us with an estimate
of where the population parameters are located.
However, when you are making a decision, you
need a yes/no answer.
The correct approach in this case is to use
a test.
In this section, we will learn how to perform
one of the fundamental tasks in statistics
- hypothesis testing!
Okay.
There are four steps in data-driven decision-making.
First, you must formulate a hypothesis.
Second, once you have formulated a hypothesis,
you will have to find the right test for your
hypothesis.
Third, you execute the test.
And fourth, you make a decision based on the
result.
Let’s start from the beginning.
What is a hypothesis?
Though there are many ways to define it, the
most intuitive I’ve seen is:
“A hypothesis is an idea that can be tested.”
This is not the formal definition, but it
explains the point very well.
So, if I tell you that apples in New York
are expensive, this is an idea, or a statement,
but it is not testable until I have something
to compare it with.
For instance, if I define expensive as any
price higher than $1.75 per pound,
then it immediately becomes a hypothesis.
Alright, what’s something that cannot be
a hypothesis?
An example may be: would the USA do better
or worse under a Clinton administration, compared
to a Trump administration?
Statistically speaking, this is an idea, but
there is no data to test it, so it
cannot be the hypothesis of a statistical test.
Actually, it is more likely to be a topic
of another discipline.
Conversely, in statistics, we may compare
different US presidencies that have already
been completed, such as the Obama administration
and the Bush administration, as we have data
on both.
Alright, let’s get out of politics and get
into hypotheses.
Here’s a simple topic that can be tested.
According to Glassdoor (the popular salary
information website), the mean data scientist
salary in the US is 113,000 dollars.
So, we want to test if their estimate is correct.
Two hypotheses are made: the
null hypothesis, denoted H zero, and the alternative
hypothesis, denoted H one or H A. The null
hypothesis is the one to be tested and the
alternative is everything else.
In our example, the null hypothesis would be: the mean data
scientist salary is 113,000 dollars,
while the alternative would be: the mean data scientist
salary is not 113,000 dollars.
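In symbols, the two hypotheses can be written as below. The symbol μ for the population mean salary is our own notation here; the lecture only states the hypotheses in words.

```latex
H_0:\ \mu = 113{,}000
\qquad
H_1:\ \mu \neq 113{,}000
```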
Now, you would want to check if 113,000 is
close enough to the true mean, as estimated from
our sample.
In case it is, you would accept the null hypothesis or, more precisely, fail to reject it.
Otherwise, you would reject the null hypothesis.
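To make the four steps concrete, here is a minimal Python sketch of the two-sided salary test. It is an illustration under stated assumptions, not the course's own code: the function name `two_sided_z_test`, the 5% significance level and the salary figures are all made up for the example. It also uses a large-sample z critical value; with a small sample you would take the critical value from the t-table instead.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def two_sided_z_test(sample, hypothesized_mean, alpha=0.05):
    """Large-sample two-sided test of H0: population mean == hypothesized_mean.

    Returns (z_score, critical_value, reject_null).
    alpha is an assumed significance level, not a value from the lecture.
    """
    n = len(sample)
    # Standard error of the sample mean, using the sample standard deviation.
    standard_error = stdev(sample) / sqrt(n)
    # How many standard errors the sample mean lies from the hypothesized mean.
    z_score = (mean(sample) - hypothesized_mean) / standard_error
    # Two-sided test: split alpha between the two tails.
    critical_value = NormalDist().inv_cdf(1 - alpha / 2)
    reject_null = abs(z_score) > critical_value
    return z_score, critical_value, reject_null


# Step 1: formulate the hypothesis (H0: the mean salary is 113,000 dollars).
glassdoor_mean = 113_000

# Made-up salaries for illustration only; this is NOT real Glassdoor data.
hypothetical_salaries = [100_000 + 1_000 * i for i in range(30)]

# Steps 2 and 3: choose the test and execute it.
z_score, critical_value, reject_null = two_sided_z_test(
    hypothetical_salaries, glassdoor_mean
)

# Step 4: make a decision based on the result.
decision = "reject H0" if reject_null else "do not reject H0"
print(f"z = {z_score:.3f}, critical value = {critical_value:.3f}: {decision}")
```

The decision rule compares the absolute value of the z-score with the critical value, which is what makes the test two-sided: a sample mean far above or far below 113,000 dollars both count as evidence against the null hypothesis.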
The concept of the null hypothesis is similar
to: innocent until proven guilty.
We assume that the mean salary is 113,000
dollars and we try to prove otherwise.
Alright.
This was an example of a two-sided or a two-tailed
test.
You can also form one-sided or one-tailed
tests.
Say your friend, Paul, told you that he thinks
data scientists earn more than 125,000 dollars
per year.
You doubt him, so you design a test to see
who’s right.
The null hypothesis of this test would be:
The mean data scientist salary is greater than
or equal to 125,000 dollars.
The alternative will cover everything else,
thus: The mean data scientist salary is less
than 125,000 dollars.
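Written in symbols, with μ standing for the population mean salary (our own notation, not the lecture's), a one-sided test like this is conventionally stated with the equality sign in the null hypothesis:

```latex
H_0:\ \mu \geq 125{,}000
\qquad
H_1:\ \mu < 125{,}000
```

Only sample means well below 125,000 dollars count as evidence against the null hypothesis here, which is why the test has a single tail.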
It is important to note that outcomes of tests
refer to the population parameter rather than
the sample statistic!
As such, the result that we get is for the
population.
Another crucial consideration is that, generally,
the researcher is trying to reject the null
hypothesis.
Think about the null hypothesis as the status
quo and the alternative as the change or innovation
that challenges that status quo.
In our example, Paul was representing the
status quo, which we were challenging.
Alright.
That’s all for now.
In the next lectures, we will see some examples
and learn how to make data-driven decisions.
