So I'm going to go
through and talk
a bit about the kinds of summary
statistics we like to use now.
Frequency, counts, mean,
and standard deviation.
So summary statistics
are numbers
that summarize properties of
the data, exactly what they
sound like.
Most can be calculated pretty
quickly in a single pass
through the data,
which is very nice.
Most of them can be
calculated in just
about any language you care
to do them in, whether you're
doing it in SQL,
or R, or Python,
or anything else that
you care to use.
Summary statistics are
pretty easy to calculate.
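To make the single-pass idea concrete,
here is a minimal sketch in Python.
It uses Welford's running-update method,
which is one common way to do this and is
my choice for illustration, not something
named in the lecture. The function name
`one_pass_summary` and the height values
are made up for the example.

```python
import math


def one_pass_summary(values):
    """Return (count, mean, sample standard deviation) in one pass.

    Uses Welford's running update, so each value is visited once
    and nothing is stored beyond three running numbers.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    # Sample standard deviation (divides by n - 1); 0.0 if fewer than 2 values.
    std = math.sqrt(m2 / (n - 1)) if n > 1 else 0.0
    return n, mean, std


# Hypothetical heights in meters, made up for illustration.
heights = [1.60, 1.70, 1.80, 1.90]
count, mean, std = one_pass_summary(heights)
print(count, round(mean, 2), round(std, 4))
```

The same running update could be written
in SQL or R; Python is just used here
as one of the languages mentioned.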
So for categorical data, our two
most common summary statistics
are frequency and mode.
So the frequency of an
attribute value is the percentage
of the time that value
occurs in the data set.
So for example, if the
attribute is gender,
then the value female will
occur a bit less than 50%
of the time.
The value male will occur a
bit less than 50% of the time.
And something else will
occur some small percentage
of the time.
So we can think of those
numbers as being percentages.
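As a rough sketch of that calculation,
here is one way it could look in Python.
The helper name `frequencies` and the
gender counts are assumptions made up
for illustration, not real data.

```python
from collections import Counter


def frequencies(values):
    """Map each distinct value to the percentage of rows it accounts for."""
    counts = Counter(values)
    total = len(values)
    return {value: 100.0 * c / total for value, c in counts.items()}


# Hypothetical column of 100 records, made up for illustration.
genders = ["female"] * 49 + ["male"] * 48 + ["other"] * 3
freq = frequencies(genders)
print(freq)
```

The percentages across all values
should add up to 100.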
On the other hand, the
mode of an attribute
is the most frequent
attribute value.
So, for example, we might have
an attribute like marital status:
single, married, divorced.
Depending on our
data set, we may
want to know what the
most common value is.
Do we have mostly single
people, mostly married people,
or mostly divorced
people in our data set?
That will change the
way we look at the data.
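Finding the mode can be sketched in
a couple of lines of Python. The
function name `mode_of` and the list
of statuses are made up for
illustration. Note one assumption:
if two values tie for most frequent,
this sketch just returns whichever one
`Counter` lists first.

```python
from collections import Counter


def mode_of(values):
    """Return the most frequent value (ties broken by first occurrence)."""
    return Counter(values).most_common(1)[0][0]


# Hypothetical marital-status column, made up for illustration.
statuses = ["single", "married", "married", "divorced", "married", "single"]
most_common_status = mode_of(statuses)
print(most_common_status)
```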
Frequency and mode are typically
used with categorical data.
Sometimes, when
you have continuous data,
they're useful too.
But more often, when we've
got continuous attributes,
we think more in
terms of percentiles.
So percentiles are more useful than
direct frequency or the concept
of mode, for the most part.
So percentiles are
pretty simply defined.
I have a formal definition here.
But the easier way to understand
it is by looking at an example.
So for a percentile, you count
the number of people who
have a smaller value than you.
Then you express that count
as a percentage
of the total group.
And you are thus
at that percentile.
So if you are the
fourth tallest person
in a group of 20,
that means 16 people, or 80%,
are shorter than you.
And it means that you are
at the 80th percentile.
And so if your height
is 1.85 meters,
then 1.85 meters is the 80th
percentile height in this group
that we care about.
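Here is a minimal Python sketch of that
counting definition. The function name
`percentile_rank` and the heights
(in centimeters, to keep the arithmetic
exact) are made up for illustration.
Statistical libraries often use slightly
different conventions, such as
interpolating between values, so their
results may not match this exactly.

```python
def percentile_rank(values, x):
    """Percentage of values strictly smaller than x."""
    smaller = sum(1 for v in values if v < x)
    return 100.0 * smaller / len(values)


# Hypothetical group of 20 heights in centimeters, made up for illustration:
# 16 people shorter than 185 cm, one person at 185 cm, and 3 people taller.
heights_cm = list(range(153, 185, 2)) + [185, 188, 190, 193]

rank = percentile_rank(heights_cm, 185)
print(rank)  # 16 of 20 are shorter, so this is the 80th percentile
```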
