>> Hello, welcome to Chapter 2: Analyzing
and Representing Data with Graphs.
Let's take a look at where we are.
In Chapter 1, we studied different concepts
related to how to collect the data:
sampling plans, the difference between
an observational study and an experiment,
and the elements of a well-designed experiment.
From now on, we will already have
the data in hand, and we will see how
to analyze, represent, and interpret it.
Chapter 2 focuses on some
graphs that are commonly used
to represent numerical and
categorical variables.
Some other graphs will be
introduced in the next chapters.
Here I have the ages of the 30 students
in an Introductory Statistics class.
This can be called the raw data set.
Very often we prefer to have it
in the form of a Frequency Table.
This is the Age and this is the Frequency
or Absolute Frequency, the count.
One student is 17 years old, three students
are 18 years old, six students are 19,
three students are 20, and so on.
If you add all these frequencies, you must
get 30, the total number of students.
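To make the counting concrete, here is a minimal Python sketch that builds a frequency table from raw data. The `ages` list is hypothetical: only the counts for ages 17 to 20 and the total of 30 come from the lecture, and the remaining values are invented for illustration. The function name `frequency_table` is also just an illustrative choice.

```python
from collections import Counter

# Hypothetical raw data. Only the counts for ages 17, 18, 19, 20 and the
# total of 30 students come from the lecture; the other values are invented.
ages = (
    [17] + [18] * 3 + [19] * 6 + [20] * 3
    + [21] * 4 + [22] * 3 + [23] * 2 + [24] * 2
    + [25, 26, 27, 29, 35, 43]
)


def frequency_table(values):
    """Return a dict mapping each distinct value to its count, sorted by value."""
    return dict(sorted(Counter(values).items()))


table = frequency_table(ages)
for age, count in table.items():
    print(f"Age {age}: {count}")

# The absolute frequencies must add up to the number of students.
print("Total:", sum(table.values()))
```

The final sum is a quick sanity check: if it does not equal the number of observations, the table was built incorrectly.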
Sometimes we have more rows for other types
of frequencies, like the relative frequency.
The relative frequency is the absolute
frequency divided by the total.
For example, the relative frequency
of the age 17 is 1 divided by 30.
The relative frequency of the age 18 is 3
divided by 30, that's to say 0.1, and so on.
So the relative frequency is a proportion;
multiplied by 100, it gives the percentage.
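As a small worked example, the division can be sketched in Python. The counts below are only the four stated in the lecture (ages 17 to 20), and the total of 30 is the class size; the helper name `relative_frequencies` is an illustrative assumption.

```python
from fractions import Fraction

# Absolute frequencies stated in the lecture for the first four ages.
frequencies = {17: 1, 18: 3, 19: 6, 20: 3}
total = 30  # number of students in the class


def relative_frequencies(freqs, total):
    """Divide each absolute frequency by the total number of observations."""
    return {value: Fraction(count, total) for value, count in freqs.items()}


rel = relative_frequencies(frequencies, total)
for age, r in rel.items():
    # Show the exact fraction, the decimal proportion, and the percentage.
    print(f"Age {age}: {r} = {float(r):.3f} = {float(r) * 100:.1f}%")
```

Using `Fraction` keeps the values exact, so 3 divided by 30 is stored as 1/10 rather than as a rounded decimal.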
Here I wrote these tables horizontally.
Sometimes you may find them written vertically.
The visual representation of this
frequency table is what we call a Dotplot.
Here you see one dot at 17, three dots
at 18, six dots at 19, et cetera.
A classic question when you are given a
dotplot is: what does each dot represent?
In this case, each dot represents a student,
or if you prefer, the age of a student.
Sometimes if there are tons of values,
you may see a clarification saying
that each dot represents 10 individuals,
100 individuals, or something like that.
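The link between a frequency table and a dotplot can be sketched with a simple text rendering. This is only an illustration: it draws the dots in rows rather than stacking them above an axis, the function name `text_dotplot` and its `per_dot` parameter are hypothetical, and the counts are the four stated in the lecture.

```python
def text_dotplot(freqs, per_dot=1, dot="*"):
    """Return one text line per value, with one dot per `per_dot` individuals.

    Counts that are not a multiple of `per_dot` are rounded down.
    """
    lines = []
    for value in sorted(freqs):
        dots = dot * (freqs[value] // per_dot)
        lines.append(f"{value:>3} | {dots}")
    return lines


# Absolute frequencies stated in the lecture for ages 17 to 20.
frequencies = {17: 1, 18: 3, 19: 6, 20: 3}
for line in text_dotplot(frequencies):
    print(line)
```

With `per_dot=10`, each dot would stand for 10 individuals, which is the kind of clarification you may see on a dotplot of a large data set.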
Notice that the variable of
interest is always on the X axis.
It's a numerical variable.
Notice that the raw data, the frequency table,
and the dotplot contain the same information.
The difference is that the Dotplot is a
graph and displays it in a more visual way.
Looking at the Dotplot, we can more easily
get an idea of what the whole data set looks like.
It is clearly right-skewed.
The center must be around 21, and
there are one or maybe two outliers.
These are the three things that you are expected
to be able to describe when looking at a graph:
the center, the shape, and possible outliers.
A common mistake when asked for the
center is saying that the center is 30.
Thirty is the center of the visible part
of the axis, but it is not the center
of the distribution, because the data is
not uniformly distributed along the axis.
There are many more values
below 30 than above 30.
The center should be around
21, because approximately half
of the values are below 21,
and the other half above 21.
Now we are speaking about the measure
of center that is called the median,
the value that is in the
middle of the distribution,
not necessarily in the middle of the axis.
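The difference between the middle of the axis and the middle of the distribution can be checked with a short Python sketch. The `ages` list is hypothetical (only the counts for ages 17 to 20, the total of 30, and the values 35 and 43 come from the lecture), and the axis range of 15 to 45 is an assumption based on the intervals used for the histogram.

```python
from statistics import median

# Hypothetical raw data, built to agree with the figures quoted in the lecture.
ages = (
    [17] + [18] * 3 + [19] * 6 + [20] * 3
    + [21] * 4 + [22] * 3 + [23] * 2 + [24] * 2
    + [25, 26, 27, 29, 35, 43]
)

# Assumed visible range of the axis.
axis_min, axis_max = 15, 45
axis_center = (axis_min + axis_max) / 2

# The median splits the sorted data into two halves of equal size.
data_median = median(ages)

below_30 = sum(age < 30 for age in ages)
above_30 = sum(age > 30 for age in ages)

print("Center of the axis:", axis_center)
print("Median of the data:", data_median)
print("Values below 30:", below_30, "| values above 30:", above_30)
```

In this made-up data set almost every value lies below 30, so 30 cannot be the center of the distribution even though it is the center of the axis.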
In regard to shape, we call this distribution
right-skewed, or skewed to the right,
because the values are mostly on the left.
It may sound counterintuitive that we say
"right" when the values are on the left.
What happens is that
skewness refers to the tail.
It's right-skewed when the
tail goes towards the right.
For now, the concept of an outlier is simple:
a value that steps out of the pattern.
In this distribution, individuals
are around here,
but this 43 is stepping out;
and maybe the 35 too.
Like I said before, this dotplot contains
the same information as the frequency table.
When we use a histogram, we lose
information, because in a histogram
we don't work with individual values of the
variable of interest, but with intervals.
In this case the length or
width of each interval is five.
From 15 to 20, 20 to 25, 25 to 30,
30 to 35, 35 to 40, and 40 to 45.
Each one of these intervals is
commonly called a bin or class.
And the point in the center is called the midpoint.
The midpoint of this first class would be 17.5.
Sometimes it's used as a
representative of the whole class.
On the Y axis, we have frequencies.
Looking here, we see that there are
10 students between 15 and 20 years old.
But we don't know whether all 10 are
17 years old, or 19 years old,
or four are 18 and six are 19, or what.
That's why I said that when we work
with a histogram, we lose information.
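The grouping into classes can be sketched in Python as follows. The `ages` list is hypothetical apart from the figures quoted in the lecture, and the convention that each class includes its lower limit but not its upper limit is an assumption; it is the one that puts the 20-year-olds in the second class and leaves 10 students in the first. The name `histogram_counts` is illustrative.

```python
# Hypothetical raw data, built to agree with the figures quoted in the lecture.
ages = (
    [17] + [18] * 3 + [19] * 6 + [20] * 3
    + [21] * 4 + [22] * 3 + [23] * 2 + [24] * 2
    + [25, 26, 27, 29, 35, 43]
)


def histogram_counts(values, start, width, n_bins):
    """Count how many values fall in each class [low, low + width)."""
    counts = [0] * n_bins
    for v in values:
        index = (v - start) // width
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


start, width, n_bins = 15, 5, 6
counts = histogram_counts(ages, start, width, n_bins)

for i, count in enumerate(counts):
    low = start + i * width
    high = low + width
    midpoint = (low + high) / 2
    print(f"{low}-{high} (midpoint {midpoint}): {count}")
```

Once only `counts` is kept, the individual ages inside each class can no longer be recovered, which is the loss of information described above.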
This is usually not a problem,
because we use a histogram precisely
when we don't care too much
about those little differences.
If we have a college with 3,000 students
and we want to see the distribution
of the variable age, a histogram
is usually the better choice.
Notice that in the histogram the
shape is still clearly visible.
However, you must be careful because you
would be surprised how much the shape
of a histogram can change,
depending on the length of the bins.
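To see how much the picture depends on the bins, here is a rough Python sketch that draws the same data with three different bin widths as text bars. The `ages` list is hypothetical apart from the figures quoted in the lecture, and the function `bin_counts`, the axis range of 15 to 45, and the chosen widths are assumptions made for illustration.

```python
import math

# Hypothetical raw data, built to agree with the figures quoted in the lecture.
ages = (
    [17] + [18] * 3 + [19] * 6 + [20] * 3
    + [21] * 4 + [22] * 3 + [23] * 2 + [24] * 2
    + [25, 26, 27, 29, 35, 43]
)


def bin_counts(values, start, stop, width):
    """Count values in consecutive bins [low, low + width) covering start..stop."""
    n_bins = math.ceil((stop - start) / width)
    counts = [0] * n_bins
    for v in values:
        index = (v - start) // width
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


for width in (2, 5, 10):
    print(f"Bin width {width}:")
    for i, count in enumerate(bin_counts(ages, 15, 45, width)):
        low = 15 + i * width
        print(f"  {low:>2}-{low + width:<2} {'#' * count}")
```

With wide bins most of the data collapses into the first bar, while narrow bins show more detail but also more irregular bars.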
Also, the center is easy to locate.
It must be somewhere in this
second class, between 20 and 25.
Outliers are not so easy to locate,
because we don't see the individuals.
Sometimes on the Y axis we have the relative
frequency instead of the absolute frequency.
In this way, this 10 becomes 10
over 30, that is, about 0.333.
Notice that the relative heights
of the columns don't change.
Only the Y axis is relabeled.
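A quick check in Python shows why the picture keeps its shape. The bin counts below are hypothetical except for the first one (10) and the total (30), which are the figures quoted in the lecture.

```python
from fractions import Fraction

# Hypothetical counts for the six bins; only the 10 and the total of 30
# are taken from the lecture.
counts = [10, 14, 4, 0, 1, 1]
total = sum(counts)

# Relative frequency of each bin: absolute frequency divided by the total.
relative = [Fraction(c, total) for c in counts]

# Compare the second column with the first, before and after rescaling.
ratio_absolute = Fraction(counts[1], counts[0])
ratio_relative = relative[1] / relative[0]

print("First bin, relative frequency:", float(relative[0]))
print("Ratio with absolute frequencies:", ratio_absolute)
print("Ratio with relative frequencies:", ratio_relative)
```

Every height is divided by the same total, so the ratio between any two columns is unchanged.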
If we work with relative frequencies, or
percentages, we may want to use a Pie Chart.
The idea of using a Pie Chart
is that there is a single whole,
and its different segments are
represented with slices of the pie.
We would never use a pie chart to represent
the population of a city in different years,
because in this case, we don't have
a single whole divided into parts.
But if we have a city, we could use a pie chart
to represent the proportions or percentages
of the different ethnicities in the city,
because each ethnicity is
a part of the same whole.
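The idea of a single whole divided into parts can be sketched numerically: each slice takes a share of the 360 degrees equal to its relative frequency. The group names and counts below are entirely hypothetical, and `pie_slices` is an illustrative helper, not a plotting function.

```python
from fractions import Fraction

# Hypothetical counts for four groups that together make up one whole.
group_counts = {"Group A": 120, "Group B": 60, "Group C": 45, "Group D": 15}


def pie_slices(counts):
    """Return {label: (share, angle in degrees)} for each part of the whole."""
    total = sum(counts.values())
    slices = {}
    for label, count in counts.items():
        share = Fraction(count, total)
        slices[label] = (share, share * 360)
    return slices


slices = pie_slices(group_counts)
for label, (share, angle) in slices.items():
    print(f"{label}: {float(share) * 100:.2f}% -> {float(angle):.1f} degrees")

# The shares add up to the whole, and the angles to a full circle.
print("Total share:", sum(share for share, _ in slices.values()))
print("Total angle:", sum(angle for _, angle in slices.values()))
```

If the shares did not add up to 1, the categories would not be parts of a single whole, and a pie chart would not be appropriate.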
Next we have the Boxplot.
I don't want to talk about it too much here,
because we will work with it in detail in Chapter 3.
But here, we can see the median clearly.
It is 21, as we guessed from the dotplot.
We also confirm that the 35
and 43 are both outliers.
Finally, we see that the
distribution is right-skewed.
Here, this tail is much longer.
All these were graphical
representations of the variable age
in the Introductory Statistics class.
If the variable is not numerical
but categorical,
we can still use a Pie Chart, like the
example of the ethnicities that I gave before.
We cannot use a histogram, but we can
use something similar called a Bar Diagram
or Bar Chart.
The idea is the same: on the X axis, we have
the categories of the variable of interest
and on the Y axis, we have
absolute or relative frequencies.
When we sort the categories
from the most frequent
to the least frequent, we
call this a Pareto Chart.
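The ordering used in a Pareto Chart can be sketched with Python's `Counter`. The categorical data below (how students get to class) is entirely hypothetical and only serves to show the sorting step.

```python
from collections import Counter

# Hypothetical categorical data: one answer per student.
responses = [
    "bus", "car", "bike", "car", "bus",
    "car", "walk", "car", "bus", "bike",
]

# most_common() returns (category, count) pairs sorted from the
# most frequent category to the least frequent one.
pareto_order = Counter(responses).most_common()

for category, count in pareto_order:
    print(f"{category:<5} {'#' * count} ({count})")
```

A plain Bar Chart would show the same bars in any order; only the sorting by frequency makes it a Pareto Chart.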
Let's introduce more vocabulary
with the help of Exercise 1.
Link each of the following terms with
all the graphs to which it applies.
The terms are about shape.
Left-skewed.
As you can imagine, this goes here,
because there is a tail on the left.
Right-skewed.
Here, the tail is to the right.
Symmetric.
There are three fairly symmetric
distributions here.
This, and this, and this.
Uniform. This one.
Uniform means that all the values
of the variable are equally likely
or appear with similar frequency.
Here you see that along the X axis,
the frequencies are very similar.
Multimodal.
It's this.
Here we have something like two distributions combined,
each one with its own mode and shape.
Bell-shaped is this.
Here you see the bell.
Notice that both Bell-shaped
and Uniform imply Symmetric,
but Symmetric implies neither
Bell-shaped nor Uniform.
