Welcome back to Quantitative Reasoning!
A common assumption in statistics is
that quantitative data are consistent with a so-called "normal" model.
A normal model is a mathematical idealisation,
which fits many, but not all, real-world data sets.
In this tutorial, we learn how to draw the bell-shaped curve
that represents the probability density of a normal model,
find the area under the bell curve to the left of a given x-value,
find the x-value that corresponds to a given area,
and generate random numbers that are consistent with a normal model.
In R, the mathematical equation for the bell curve
is implemented by the function `dnorm()`.
For example, here is the y-value that corresponds to an x-value of 600
on the bell curve with mean 500 and standard deviation 100:
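```r
# Height of the bell curve at x = 600, for mean 500 and sd 100
dnorm(600, mean = 500, sd = 100)
#> [1] 0.002419707
```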
As our textbook points out,
it's often a good idea to make a picture when working with normal models.
We can draw graphs of mathematical functions with the R function `curve()`,
which we call here with three arguments:
the expression we want to draw,
written in terms of a variable `x` that ranges along the x-axis,
the argument `from`, which specifies the minimum x-coordinate,
and the argument `to`, which specifies the maximum x-coordinate.
For example, here is the bell curve for the normal model with mean 500
and standard deviation 100,
plotted between the x-coordinates 200 and 800:
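```r
# Bell curve of the normal model with mean 500 and sd 100,
# drawn between x = 200 and x = 800
curve(dnorm(x, mean = 500, sd = 100), from = 200, to = 800)
```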
If the normal model is appropriate for our data,
we can use the function `pnorm()`
to find the fraction of data points that are smaller than a given value x.
Let's look at the example on page 138 of our textbook.
Each part of the SAT Reasoning Test has a distribution
that is roughly unimodal and symmetric
and is designed to have an overall mean of about 500
and a standard deviation of 100 for all test takers.
Suppose you earned a 600 on one part of your SAT.
How do you stand among all the students who took the test?
Let's assume that the normal model is suitable for the SAT scores.
We wouldn't have to use R to solve this problem.
The numbers follow easily from the 68-95-99.7 rule
stated on page 136 of our textbook.
Still, it's good to know that `pnorm()` can solve problems like these.
Here is the command and its output:
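```r
pnorm(600, mean = 500, sd = 100)
#> [1] 0.8413447
```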
The output confirms the textbook's solution:
"My score of 600 is higher than about 84% of all scores on this test."
We can think of the value returned by `pnorm()`
as the red area under the bell curve, to the left of the given x-value.
Note that the total area under the curve
(from x=-Infinity to x=+Infinity)
is "normalised" to be equal to 1.
With this picture in mind, it's easy to solve the next problem in the textbook.
What proportion of SAT scores falls between 450 and 600?
The answer is equal to the red area in the upper plot
*minus* the blue area in the lower plot.
Here is how we calculate the answer with `pnorm()`.
From our previous command, we subtract
`pnorm(450, mean = 500, sd = 100)`:
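```r
pnorm(600, mean = 500, sd = 100) - pnorm(450, mean = 500, sd = 100)
#> [1] 0.5328072
```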
Our answer agrees with the textbook.
"The normal model estimates
that about 53.3% of SAT scores fall between 450 and 600."
Thanks to `pnorm()`, we can find the fraction of data points below a given score,
but sometimes we want to do the opposite:
given a fraction of the data, we want to know the minimum score we need to place above that fraction.
We can find the answer with `qnorm()`.
Consider this example from our textbook (page 141):
Suppose a college says it admits only people
with SAT Verbal test scores among the top 10%.
How high a score does it take to be eligible?
Here is R's answer.
We call the `qnorm()` function with 0.9 as its first argument,
which is the fraction of test takers who score below the cutoff we're looking for:
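```r
qnorm(0.9, mean = 500, sd = 100)
#> [1] 628.1552
```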
We conclude that we need a score of about 628 points.
Sometimes it's useful to generate random numbers that match the normal model.
For example, we may want to confirm a mathematical result with simulations.
In R, we generate normally distributed random numbers with `rnorm()`.
Here is a vector containing 10 random numbers
generated by a normal model with mean 500
and standard deviation 100:
`rnorm(10, mean = 500, sd = 100)`.
Because these numbers are random,
your numbers are very likely to be different from mine.
You will get yet another set of numbers
when you run the same command a second time.
The mean of these numbers isn't exactly 500,
and their standard deviation isn't exactly 100.
However, if we generate a much larger sample of random numbers
(e.g. 10,000 instead of 10),
the mean and standard deviation of the sample
come closer to the specified parameters 500 and 100.
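Here is a quick check along those lines
(a sketch of my own; the variable name `x` is arbitrary):

```r
x <- rnorm(10000, mean = 500, sd = 100)
mean(x)  # close to, but not exactly, 500
sd(x)    # close to, but not exactly, 100
```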
So far, we've always specified the arguments `mean` and `sd`
when calling `dnorm()`, `pnorm()`, `qnorm()` or `rnorm()`.
Under certain circumstances, it's possible to leave out these arguments.
If the argument `mean` is missing, R assumes that the mean equals 0.
If the argument `sd` is missing, R assumes that the standard deviation equals 1.
For example, `pnorm(-1.5)`
returns the same value as `pnorm(-1.5, mean = 0, sd = 1)`:
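```r
pnorm(-1.5)
#> [1] 0.0668072
pnorm(-1.5, mean = 0, sd = 1)
#> [1] 0.0668072
```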
The normal model with mean 0 and standard deviation 1
is called the "standard normal model".
It's an important special case
because the standard normal model describes the distribution of z-scores
for any normally distributed data,
irrespective of the mean and standard deviation of the data
before taking the z-scores.
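As an illustration (my own, not from the textbook),
converting our SAT score of 600 to a z-score
and handing it to the standard normal model
reproduces the area we found earlier:

```r
z <- (600 - 500) / 100   # z-score: (value - mean) / standard deviation
pnorm(z)                 # same as pnorm(600, mean = 500, sd = 100)
#> [1] 0.8413447
```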
Let's summarise what we learned in this tutorial.
The R function `dnorm()` implements the mathematical bell curve function
that characterises the probability density of a normal model.
For a given x-value, `dnorm()` returns the y-value on the bell curve.
We can plot mathematical functions (e.g. the bell curve of a normal model)
with the R function `curve()`.
We compute the area under the bell curve to the left of a given x-value
with `pnorm()`.
Conversely, if we're given the area under the bell curve,
we can compute the corresponding x-value with `qnorm()`.
We generate normally distributed random numbers with `rnorm()`.
If we leave out the arguments `mean` and `sd`,
R assumes that we want to find values for the standard normal model.
As an exercise, confirm the results of the textbook's worked example
on pages 142-143 with R.
Next time, we learn how to assess
whether a normal model is appropriate for a given data set.
See you soon.
