Welcome back to Quantitative Reasoning!
In this tutorial, we calculate z-scores with R.
Our textbook uses heptathlon scores in the 2012 Olympics
as a motivating example to introduce z-scores.
You can find the data at the URL linked below this video.
Let's import the data and shorten the name of the data frame to `hept`.
The data frame has three columns.
The first column contains the names of the athletes.
The second column shows their performance in the 200-metre run in seconds.
The third column stores the long-jump results in metres.
Our textbook compares the results of two athletes:
Jessica Ennis and Tatyana Chernova.
By the way, Chernova was later disqualified because of doping,
so her results shouldn't be viewed as signs of athletic achievement,
but let's suppose we could treat the numbers at face value.
In this parallel universe,
we would conclude that Ennis is the better runner,
and Chernova is the better long-jumper.
Which performance is more remarkable?
The z-score is a way to standardize the performance.
It's defined as z = (y - y_bar) / s,
where y is a data value,
y_bar is the mean of all data values,
and s is their standard deviation.
I'll keep this definition visible in the bottom right.
The translation of this equation into R is straightforward.
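
Before we apply it to the heptathlon data, here is a minimal sketch of that translation on a tiny vector. The values in `y` are made up so that the arithmetic is easy to check by hand; they aren't taken from the data set.

```r
# Hypothetical data values, chosen so the result can be verified by hand
y <- c(2, 4, 6)

y_bar <- mean(y)  # mean of all data values
s <- sd(y)        # sample standard deviation

# z-score: distance of each value from the mean, in units of standard deviations
z <- (y - y_bar) / s
z
```

Here the mean is 4 and the standard deviation is 2, so the z-scores are -1, 0 and 1.
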
Let's add a column `z_run200` with the z-scores of the 200-metre run.
When we look at the spreadsheet, we notice that all z-scores are `NA`.
The problem is
that some athletes either didn't start or didn't finish the 200-metre run.
Their performance appears as `NA`.
We can confirm the existence of `NA`s with
`any(is.na(hept$run200))`.
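
To see the effect in isolation, here is a small sketch with a toy stand-in for `hept`. The athlete labels and times are invented for illustration, and the column name `athlete` is an assumption; only `run200` and `z_run200` follow the names used in this tutorial.

```r
# Toy stand-in for the heptathlon data; labels and times are made up
hept <- data.frame(
  athlete = c("A", "B", "C", "D"),
  run200 = c(23.1, 24.0, NA, 23.6)
)

# Without na.rm, mean() and sd() both return NA,
# so every entry of the new column becomes NA
hept$z_run200 <- (hept$run200 - mean(hept$run200)) / sd(hept$run200)
hept$z_run200

# Confirm that the column contains at least one missing value
any(is.na(hept$run200))
```

A single missing time is enough to turn the whole column of z-scores into `NA`.
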
On the one hand, R is correct to point out
that we can't determine a z-score if we don't know all values.
On the other hand, in this application,
it's sensible to exclude missing values from the analysis
and calculate the z-score for all athletes who started and finished the race.
From tutorial 18,
we know that we can remove missing values with `na.rm = TRUE`.
We can calculate the z-score for the long jump similarly.
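
As a sketch of what these two commands might look like, here is a version that runs on a toy data frame. The values are invented, and the column names `athlete`, `long_jump` and `z_long_jump` are assumptions made for this example; the real data frame may use different names.

```r
# Toy stand-in for the heptathlon data; labels and results are made up
hept <- data.frame(
  athlete = c("A", "B", "C", "D"),
  run200 = c(23.1, 24.0, NA, 23.6),
  long_jump = c(6.4, 6.1, 5.9, NA)
)

# na.rm = TRUE makes mean() and sd() ignore the missing values
hept$z_run200 <- (hept$run200 - mean(hept$run200, na.rm = TRUE)) /
  sd(hept$run200, na.rm = TRUE)
hept$z_long_jump <- (hept$long_jump - mean(hept$long_jump, na.rm = TRUE)) /
  sd(hept$long_jump, na.rm = TRUE)

hept
```

Athletes with a missing result still get `NA` as their z-score in that event, but everyone else now gets a number.
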
The commands for calculating z-scores are quite long,
so we might wonder whether R has a built-in function for z-scores.
Unfortunately, the answer is: not quite.
R's base installation contains a function called `scale()`,
which returns the z-scores, but as a one-column matrix rather than as a vector,
so it isn't directly useful for us.
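
For the curious, here is a short sketch of what `scale()` returns, using a made-up vector `v`. Converting the result with `as.vector()` is one possible workaround, not something this tutorial relies on.

```r
# Hypothetical values, not taken from the heptathlon data
v <- c(2, 4, 6)

z_scaled <- scale(v)
is.matrix(z_scaled)  # the result is a one-column matrix, not a plain vector
dim(z_scaled)

# as.vector() drops the matrix structure and the extra attributes
z_vec <- as.vector(z_scaled)

# The values agree with the explicit formula
all.equal(z_vec, (v - mean(v)) / sd(v))
```
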
In later tutorials, we'll learn how to write our own functions,
so we'll be able to write a z-score function ourselves.
Until we reach that level,
I recommend using the explicit method shown here.
Let's take a look at the z-scores for Ennis and Chernova.
We conclude that Ennis ran the 200 metres
2.1 standard deviations faster than the mean.
Chernova jumped 1.1 standard deviations better than the mean.
These numbers are consistent with the results stated in our textbook.
In summary, we compute the z-score of a numeric vector `v` with
`(v - mean(v)) / sd(v)`.
If `v` contains missing values
and if it's appropriate to remove them from the analysis,
we include the argument `na.rm = TRUE` in the `mean()` and `sd()` functions.
Next time, we'll learn how to work with the so-called normal model in R.
See you soon.
