This is a video on statistics.
What I'll be talking about today
are all the different ways
of describing data.
Often we call this descriptive statistics.
We will look at statistics which are the numbers you can get
from data.  We will also be looking at charts and graphs.
We'll start out with measures of the center.
That will be the mean, the median and the mode
Then we will look at the measures of the spread.  The main one we'll look at is the standard deviation.
Then we will look at measures of relative standing
That will be the first quartile and the third quartile and the interquartile range and percentiles.
Then I will look at histograms.
Then we will look at other charts such as the
stem-and-leaf plot or the box and whiskers plot,
Pareto diagrams and a bunch of other charts.
Let's go right to measures of the center.  We start out with the big one.  That's the mean.
The mean, you all know, at least you know it as the average
and the average means add them all up and divide by the total.
A big part of this class
is having the population and then saying we can't get all the values.
Instead we will take a sample.
For the population mean we will use this Greek letter mu.
The formula for that is add them all up: the summation sign means add them up, x is all the data values,
and then one over N means divide by N,
where N is the total number in the population.
We typically use a capital N for that total number.
On the other hand, for the sample mean
we use this right here
That is x with a bar over it, a fancy way of saying that is xBar.
It has the same formula.
We add up all the data values
but in this case instead of dividing by capital N we divide by little n,
because we know the sample size
instead of the population size for the data.
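Written out in symbols, the two formulas I just described look like this. This is only my spoken description put into standard notation; the subscript i, which labels the individual data values, is notation added here and is not on the slide as far as the transcript shows.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
```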
That's the mean.  You have worked with the average before so I will not spend a lot of time
explaining what the average means.  I will assume you know that.
Similarly the median and the mode you had that in grade school.  The median, just a reminder, is the middle number from the data
or if there's an even number
of values you take the mean of the middle two numbers.
So when should you use the median and when should you use the mean?
The median is often used for home prices or income or birth weight.
One of the things that these have in common, so for example in my town of Tahoe
most of the houses tend to be between $250,000 and
$600,000.  Then you have those beautiful mansions that have lakefront property.  They go for $15,000,000.
If you took the mean,
you end up with a mean that is nowhere near any of the houses in value.
It might be $1,000,000 for the mean and that doesn't represent anything really.
It doesn't represent the typical house which is like where I live in that
250 to 600 thousand dollar range.
It also doesn't represent the mansion.  It's not that useful when you want to represent the typical house.
On the other hand the median will give you what a typical house is worth, and that's more like where I
live, in a neighborhood not on the lake, not the big mansion.
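If you want to see the pull of one mansion for yourself, here is a minimal Python sketch. The ten prices are made up for illustration (nine ordinary houses in the range I mentioned plus one lakefront mansion); they are not real Tahoe sales data.

```python
import statistics

# Hypothetical home prices in dollars (made up for illustration):
# nine ordinary houses in the $250,000-$600,000 range and one lakefront mansion.
prices = [250_000, 300_000, 350_000, 400_000, 450_000,
          500_000, 550_000, 600_000, 600_000, 15_000_000]

mean_price = statistics.mean(prices)      # pulled upward by the mansion
median_price = statistics.median(prices)  # middle of the sorted prices

print(f"mean:   ${mean_price:,.0f}")
print(f"median: ${median_price:,.0f}")
```

With these made-up numbers the mean lands far above every ordinary house, while the median stays inside the typical range.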
Income is the same idea.
If you look at
a standard business, for example,
most people make somewhere between, I don't know, $10 and $30 an hour but then there's the CEO,
the owner of the company who might be making millions of dollars a year.
If you take the average,
that will not represent the regular person who works at the company,
because the CEO of the company is making so much that the CEO is an outlier, which skews the data, or skews the mean,
upward.
The median, on the other hand, will represent the typical worker at the company.
We can also talk about the mode.   The mode is just the value or the values that occur the most frequently.
Let's move on to measures of the spread.  The big one is called the standard deviation.
The standard deviation is a measure of the spread of the data.
It's kind of an average of how far the data values are from the mean.  We can talk of the sample standard deviation
which we will use the letter s for, lower case s.
I don't expect you to memorize these formulas,
but I do want you to understand kind of where they are coming from and what they do.
Similarly for the population standard deviation, I use the Greek letter sigma.
Again it has a messy formula.
Why does it measure an average or the spread of the data?  Let's look at this x minus mu or x minus xBar.
x minus mu, you can think about as the difference between the data value
and the mean.
The square and square root makes it positive, so we really are talking about the distance from the mean
and adding them up and dividing by the total, that's an average.
That's why they can say this is
in a way an average distance
from the mean that the data values
lie.
Notice that the sample standard deviation has an n  - 1 in the denominator
and the population standard deviation has an n in the denominator.
I don't care if you completely understand why you would do that,
but I do want you to know that they are different.
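For reference, here are the two formulas in standard notation. I did not read them out loud, so take this as the usual textbook form rather than a copy of the slide; the subscript i is just a label for the individual data values.

```latex
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
\qquad\qquad
\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}
```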
In a calculator, for example, it's going to have to give us sigma,
the population standard deviation,
and it will have to give us s,
the sample standard deviation, because they will be different numbers.
The mean, on the other hand, has the same formula whether you are talking about a population or a sample.
The calculator will only give us one number and it will usually just say xBar which will mean the sample mean,
but also if you are talking about the population mean, it will mean the population mean even if it doesn't say mu.
The variance is the square of the standard deviation.  I will not get into what the variance is all about
and how to interpret it.  It is just a good number to know and that's all I will say about it.
Here's an example.  Let's suppose that a biologist measures the height in meters
of six giraffes.
The measurements were 5.2,
4.9, 5.4,
5.0, 5.2 and 5.5
in meters by the way.
Let's use our calculator
to summarize the stats, to find all of the statistics.
I do that.
Here's the calculator.
I first need to enter the data.
So I go to STAT.  That's this button right here: STAT.
I want to edit the data values into the editor.  I  hit enter on EDIT.
Often the calculator will have other numbers in there. I need to clear them out and here's how I do that.
I go up to L1.
and I hit the CLEAR button.
It is very important to never ever hit the delete button.
You will mess up your calculator and erase L1 forever and unless you are really good with your calculator, you
won't be able to figure out how to get it back and your calculator is ruined.
You'll be coming to me crying. So  don't ever ever hit the delete button.  Instead use the clear button when you're in the editing mode.
Then hit ENTER.
Now I just type in my data
values and hit ENTER after each one. My first is 5.2
Hit ENTER.
Then 4.9.
ENTER
And then 5.4.
ENTER
Then 5.0.
ENTER
Then 5.2
ENTER
and then finally 5.5.  ENTER.
Now I have all of my data values entered into the editor.
I hit STAT again.
I hit the right arrow.
I go to Calc which is Calculate.
I only have one variable
and I want statistics so I hit ENTER on 1-Var Stats, or I can hit the number one.
Then I go and hit 2nd
1.
This will type in L1.
Then I hit ENTER
And I have my statistics.
Notice it tells us that xBar
is 5.2.  The calculator does not know whether we typed in a population set of data or a sample set of data.
We have to understand that for ourselves.  So for example
if these six giraffes
were six randomly selected giraffes from
Africa then this would be a sample, and yes,  xBar would be 5.2.
If these 6 giraffes were all six giraffes at our zoo, and we want to find out the population mean
height of giraffes at our zoo,  then it would be mu is 5.2.
Sigma x and Sigma X squared, I will not do anything with in this whole class, so I will ignore those numbers.
So then again, whether you're talking about a population or a sample, it's read differently.
If we are thinking about this as a sample of giraffes in Africa then we use Sx,
which is s, the sample standard deviation, or about 0.228 or so.
If we are thinking about these giraffes as
the population of all the giraffes at our zoo,
then we use sigma x which is the population standard deviation and that's about 0.208
n was our sample size or our population size
and that was 6.
I hit the down arrow.
The minimum or the shortest giraffe was 4.9 m.
Q1 is 5.  That's the first quartile.  I will get into that later in this lecture.
But for now just know that in your calculator
you can get the quartiles by using 1-Var Stats.
I will only be doing examples with the calculator once in this lecture.
Just remember for later on
this is how you get the quartiles.
Med
does not stand for Club Med.
This is the median, and the median was 5.2.
Q3 is what we call the third quartile.  That is 5.4.
And the maximum
was 5.5.  So the very tallest of all the giraffes was 5.5 m tall.
If I wanted to get the variance I would have to go back to the standard deviation
and if I round to one decimal place, I can get away with using either of these.  It is about 0.2.
and remember the variance is the square of the standard deviation.
I have to type that in as 0.2
and I hit x squared and hit ENTER.
The variance is 0.04.
Those are the statistics or the parameters depending on whether we're thinking of this as a sample or a population.
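If you don't have the calculator handy, here is a minimal Python sketch that reproduces the same 1-Var Stats numbers for the six giraffes. It uses only the standard library. The quartile step is an assumption on my part: it takes Q1 and Q3 as the medians of the lower and upper halves of the sorted data, which matches the values the calculator showed here, but other software may use a different quartile rule.

```python
import statistics

# Giraffe heights in meters, from the example above.
heights = [5.2, 4.9, 5.4, 5.0, 5.2, 5.5]

n = len(heights)
mean = statistics.mean(heights)      # xBar or mu, depending on context
s = statistics.stdev(heights)        # sample SD: divides by n - 1
sigma = statistics.pstdev(heights)   # population SD: divides by n

# Assumed quartile rule: medians of the lower and upper halves of the
# sorted data (the overall median is left out of both halves when n is odd).
ordered = sorted(heights)
half = n // 2
q1 = statistics.median(ordered[:half])
med = statistics.median(ordered)
q3 = statistics.median(ordered[half + n % 2:])

print(f"mean={mean:.1f}  s={s:.3f}  sigma={sigma:.3f}  n={n}")
print(f"min={ordered[0]}  Q1={q1}  Med={med}  Q3={q3}  max={ordered[-1]}")
```

Squaring s or sigma gives the variance without rounding first, so it will come out a little different from the 0.04 I got from the rounded 0.2.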
Let's go back to the PowerPoint.
Here's the PowerPoint.
Next I will remind you again about what the standard deviation is.
I will also go over something called the standard error.
The standard deviation, I have already talked about, but
I can't talk about it too many times.  The standard deviation I would say is in the top three of the most important ideas of this entire statistics
course.
The standard deviation is a kind of average distance that the data values are from the mean.
The standard error, on the other hand, is also very important
but it is something that we will spend a lot of time later on
talking about, but today I will briefly say it just to say that you've seen it once.
For a sample of size n, the standard error, which measures the sampling variability of the statistic, is given by sigma sub xBar
is equal to sigma divided by the square root of n.  Again, you have seen it once and when we get to talk about this three weeks from now
you'll have seen it and then we'll talk about it in detail and really go over what it means.
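In symbols, with the square root of n in the denominator, the standard error of the sample mean is written as follows. This is the standard notation for what I just read out, nothing more.

```latex
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
```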
The next topic is z-scores which are actually very important and what they do is combine the
value, the mean and the standard deviation all in one formula
that will help us really understand how unusual or how far
the value is from the mean.
The z-score is defined by this formula.
z is equal to
x minus mu
divided by the standard deviation.
It tells us the number of standard deviations from the mean.
Let me give you an example.
Last quarter's statistics final had a mean score of 76%
and a standard deviation of 5%.
The psychology final had a mean score of 84%
and a standard deviation of 6%.
If Juan scored 86%
on his statistics final and 93% on the psychology final,
which final did he do better
on
compared to the rest of his class?
So clearly Juan scored higher on the psychology final.
But the psychology final was easier because it had a mean of 84
and the statistics final had a mean of 76.
So you really can't say, well 93 is higher than 86, so Juan did better on the psychology final.
The z-score
is a way of comparing two different scores or two different values
that come from different sets of data.
So let's use z-scores to compare.
For statistics
the mean was 76,
the standard deviation was 5,
and Juan's score was 86.
Let's calculate the z-score
which is the observed value 86 minus the mean 76
divided by the standard deviation 5.
I put that into my calculator and I got 2.
For Juan's psychology final,
the mean was 84,
the standard deviation was 6,
and Juan's score was 93.
Let's put that into the z-score formula.
z was (93 - 84) divided by 6
and I got 1.5.
Notice that Juan's z-score for the statistics final
was higher than Juan's z-score
for
the psychology final.
We can conclude that compared to the rest of his class
Juan did better on his statistics final than he did on the psychology final.  So for example if they graded on a curve Juan's grade would be higher
in statistics than
in psychology even though
his score on the statistics final
was lower than the score on the psychology final.
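The same comparison can be sketched in a few lines of Python. The function name z_score is my own label for the formula above, not something from a library.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

z_stats = z_score(86, 76, 5)   # statistics final
z_psych = z_score(93, 84, 6)   # psychology final

print(z_stats, z_psych)
better = "statistics" if z_stats > z_psych else "psychology"
print(f"Relative to the class, Juan did better on the {better} final.")
```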
That's how we can use z-scores.  Other examples, for example,
I got 1140 on my SAT exam.
If you know anything about SAT exams today
that would be a horrible score today.
but
my score on my SAT exam
my z-score
was about 3.5.
which is actually a very very high z-score.
so compared to the rest of the people
that took the SAT when I took it
I did really really really well.
Had that score been today, it would be terrible.  But the z-score is what matters.  That is really good.  Let's move on
and talk about percentiles.
Percentiles are another way
of dealing with the fact that even if you don't know anything about the raw score, you can still talk about how high or how low it is.
The percentile
is defined by taking all of the data
and dividing it into 100 groups with about 1% of the data for each group.
The formula is that the percentile is the number of values less than
the score
divided by the total number of values, times 100.
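Here is a minimal Python sketch of that percentile formula: count the values less than the score, divide by the total number of values, and multiply by 100. The function name percentile_rank is my own, and I am reusing the six giraffe heights from earlier just to have some numbers to try it on.

```python
def percentile_rank(score, values):
    """Percent of the values that are strictly less than the given score."""
    below = sum(1 for v in values if v < score)
    return below / len(values) * 100

# Giraffe heights in meters, reused from the earlier example.
heights = [5.2, 4.9, 5.4, 5.0, 5.2, 5.5]
print(round(percentile_rank(5.4, heights), 1))
```

Four of the six giraffes are shorter than 5.4 m, so that giraffe sits at about the 67th percentile of this small group.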
There are a couple important percentiles.  The first is the first quartile
The first quartile which we will call Q1, we saw on the calculator, is the 25th percentile.
The third quartile Q3 is the 75th percentile.
Notice that there is no Q2, because instead of Q2, that would be the 50th percentile,
but that's the median, and we just use the median for the 50th percentile.
Finally the interquartile range which is denoted as IQR
is the difference between the first and third
quartiles.
The IQR is used if you are presenting to a sophisticated audience
that understands this stuff.  It really tells you what that middle 50 percent of the data is.
Percentiles are great for things like SAT scores.
Percentiles are much better than raw scores because they tell you where you stand compared to the rest of the class.
My percentile for my SAT score was
the 99th percentile.  I did really well on my SAT.
Especially, as you might guess, I did really really really well in math.  English was just ok.  It was good but in math I had like a 99.99 percentile.
So you know what that means.  You will also see birth weight given in percentiles.
For example if
a child was born in the first percentile then that tells you that child is very very small.  It probably will have to go to the ICU ward and get
really taken care of because the child is in danger at the first percentile.
We will see percentiles given in other places too. Anytime when you want to compare
that value to the rest of the group.
A boxplot is a
graph
that describes
all of the five point summary values.  The five point summary values are the minimum,
the first quartile Q1,
the median,
Q3, the third quartile and the maximum.  You can get all these numbers with a calculator.
using the 1-Var Stats that we saw before.
To create a boxplot we will put horizontal bars at each of these values.
We will connect Q1 and Q3
to make a box, and we will draw lines from the minimum bar to the Q1 bar and from the Q3 bar to the maximum bar to make the whiskers.
It is pretty hard to read this and get the idea.
Let's do one.
Suppose we have the five point summary with a minimum of 4,
the first quartile at 7, the median at 9, the third quartile at 15 and the maximum of 20.  Start by putting all these on a number line.
Here is the number line with 4, 7, 9, 15 and 20 drawn on a number line.
Then I will draw vertical line segments above each.
Above the 4, above the 7,
above the 9, above the 15,
and above the 20.  I draw a vertical line segment.
Then I draw my box.  The box connects the first quartile to the third quartile.  I connect the bottom
of the box by connecting 7 to 15
at the bottom.
I connect the top of the box
by connecting 7 to 15 at the top of those bars.
Then I draw what are called the whiskers.  I draw the left whisker by connecting 4 to 7 in the middle of the bars
and I connect 15 to 20 in the middle of those two bars.
And that's my boxplot.  Sometimes we also call this a box and whiskers plot,
because we have our box which is the interquartile range box and we have the whiskers that go from
the minimum to Q1 and Q3 to the maximum.
The standout of the boxplot is the box.  That's why we call it a boxplot.
Very easily, we can see where the middle of the data values is.  The middle 50% of the data lies between 7 and 15.
The median is 9.  Sometimes we put a dot at the mean because the mean is important.  Sometimes you don't.
Sometimes we will take the outliers
and we will put dots at the outliers
and we will draw a boxplot for the rest of the data.
But here I just drew a simple box and whiskers plot.
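To make the drawing steps concrete, here is a small Python sketch that draws the same boxplot as a single line of text. It is a toy: the function name ascii_boxplot is my own, it assumes all five summary values are whole numbers, and it uses one character per unit on the number line.

```python
def ascii_boxplot(minimum, q1, median, q3, maximum):
    """Draw a one-line text boxplot; all five values must be whole numbers."""
    row = [" "] * (maximum - minimum + 1)
    for v in range(minimum, q1):          # left whisker
        row[v - minimum] = "-"
    for v in range(q1, q3 + 1):           # the box (the IQR)
        row[v - minimum] = "="
    for v in range(q3 + 1, maximum + 1):  # right whisker
        row[v - minimum] = "-"
    for v in (minimum, q1, median, q3, maximum):  # bars at the five values
        row[v - minimum] = "|"
    return "".join(row)

plot = ascii_boxplot(4, 7, 9, 15, 20)
iqr = 15 - 7
print(plot)
print("IQR =", iqr)
```

The bars sit at 4, 7, 9, 15 and 20, the stretch of equals signs is the box, and the dashes are the whiskers.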
The next sketch I want to talk about is the histogram.
A histogram is a bar chart
that depicts the frequency distribution table that we talked about in the last chapter.
Here's an example.  The number of units someone might be taking
vs. how many people took that number of units.
A relative frequency histogram
is a bar chart that depicts the relative frequency distribution table.
The relative frequency numbers
you find by taking the frequency values so, for example, 130 and divide by the total which was
500 for the sample size.  If you take 130 and divide by 500 you get 0.260.
You can do that for each of the frequency numbers and you get the relative frequency numbers.
I can draw a histogram of the relative frequency values.
Similarly we can talk about the cumulative frequency histogram.  The cumulative frequencies
are the frequencies
that occur at or below that value.  So for example
there were 463 values at or below 179.
To find the cumulative frequency we start out with the first frequency
and then we add up
the first two frequencies to get the cumulative frequency of the second value
And we keep going.   Notice that the last cumulative frequency value will always be the sample size
because the number at or below that value is just the total number of numbers.
Then finally the relative cumulative frequency histogram is a bar chart that depicts the relative cumulative frequency distribution table.
To get the relative cumulative
frequency
we take the cumulative frequencies and divide by the sample size
so for example 284 divided by 500 is 0.568.
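Here is a minimal Python sketch of all four columns: frequency, relative frequency, cumulative frequency and relative cumulative frequency. The frequency list is hypothetical. Only the first count of 130 and the total of 500 were read out above; I picked the other counts so that the running totals pass through the 284 and 463 I mentioned.

```python
from itertools import accumulate

# Hypothetical class frequencies. Only 130 and the total of 500 come from the
# lecture; the remaining counts are assumed so the running totals hit 284 and 463.
frequencies = [130, 154, 100, 79, 37]
total = sum(frequencies)

relative = [f / total for f in frequencies]
cumulative = list(accumulate(frequencies))       # running totals
relative_cumulative = [c / total for c in cumulative]

for row in zip(frequencies, relative, cumulative, relative_cumulative):
    print("{:>5} {:>7.3f} {:>5} {:>7.3f}".format(*row))
```

The last cumulative frequency is the sample size and the last relative cumulative frequency is 1, whatever the individual counts are.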
Here's an example of a histogram.
This is actually a histogram of z-scores.  I put this in to remind you about z-scores because they are so important.
This is the histogram that shows you the difference in height between fathers and sons
and it is a histogram of the z-scores of those differences.
Notice a few things.
One is that we can look at the histogram and we can see that it looks pretty symmetric.  It looks like
there are a lot of z-scores near zero, not so many far away.
That's typical when you're talking about z-scores.
Also, if you want to find out
what the lowest z-score was or what the closest z-score was to zero, you actually can't find that out.
What you can say is that there were a lot of z-scores between zero and say this looks like
0.25.
But you can't say there was a z-score at exactly 0.2.
You lose the data values when you draw a histogram.  You get the shape of the data really well, you can identify outliers pretty well,
and
you can also identify where the highest frequency occurs.
What you don't get so easily is what the standard deviation is.  You can make a guess.  We can say the standard deviation was probably
somewhere around maybe 0.8 or so, because most of the data
is between -0.8 and 0.8.
So that would be the standard deviation.
Notice that -4 would be a strong outlier.  It's really far away from the rest of the data values.
Probably -3
would be an outlier.
We've seen that before.
I want to talk about looking at the shape of distributions.
If you have a histogram
where you have the highest frequency on the right
and we have a tail moving to the left,
we will call that negatively skewed or skewed left.
When you have such an instance, that tail is going to pull the mean towards the tail,
away from the median.  The median is over here, I will call it x-tilde,
and the mean xBar
will be over to the left when we have a skewed left distribution.
Similarly a skewed right distribution
means that the high frequency is on the left and the tail is to the right.
That right tail here will pull the mean
to the right of the median.
On the other hand if you have a symmetric distribution
and that means that
we're talking about the left and the right hand side looking the same
the mean and the median will be the same.
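You can check the direction of the pull with a few lines of Python. The three small data sets below are made up only to illustrate the idea; they do not come from any chart in this lecture.

```python
import statistics

# Made-up data sets, chosen only to show the direction of the pull.
skewed_left = [2, 7, 8, 9, 9, 10, 10, 10]    # tail toward the small values
skewed_right = [1, 1, 1, 2, 2, 3, 4, 9]      # tail toward the large values
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]

for name, data in [("skewed left", skewed_left),
                   ("skewed right", skewed_right),
                   ("symmetric", symmetric)]:
    print(name, statistics.mean(data), statistics.median(data))
```

In the skewed left set the mean comes out below the median, in the skewed right set it comes out above, and in the symmetric set the two agree.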
Let's next talk about modality.
Remember mode
is the highest frequency value.
If there is just one mode like in this example where there is just one high bar and then low bars to left
and right then we call that a unimodal distribution.
By the way we saw that skewed left and skewed right
those were also unimodal, and so was that symmetric one we saw.
Bimodal
means that it has two modes.  They don't have to be exactly the same height,
but you have two tall
frequency bars that are separated from each other.
That's a bimodal distribution.  You'll see that for example if you talk about gender, so often if you are
looking at the heights of men and women you have the
men's typical height and the women's typical height.  That will be very different from each other.  There will be two modes.
Whereas unimodal means that you're not really taking from two different types of people or two
different types of things.  You are taking typically from one
type of population and you have that one big mode in the center or the mode can be at the left or right.
If you happen to have more than two modes we will just call it multimodal.
An important distribution I want to talk about is called the uniform distribution.
A uniform distribution means
that
whether you are far left, far right or in the middle, the frequencies are basically the same.  So this occurs for example, if you look at the number pi, which
you remember was about 3.14.  You might remember that.  It is actually 3.14159265… It goes
on and on.  If you looked at the first three billion digits of the number pi,
and you drew a histogram of those digits it's going to look very uniform.  The number of zeros and ones and twos, all the way up to nine
all occur at
about the same frequency.  This, by the way, is a relative frequency
histogram because I'm looking at the relative frequency, about 10%
for each of the digits
occurs for the number pi.
The next distribution, which is the most important distribution
of this entire course is called the normal distribution.
The Normal distribution has some very important attributes.
The first is that it's symmetric about its mean.  Here's the mean and we see the highest frequency
is right here at the mean and the left hand side and the right hand side are mirror images of each other.
Secondly,
it is a unimodal distribution.  There is only one hump and that hump does occur right at the center.
Then finally
it is bell shaped.
It looks like a bell if you kind of draw what this histogram is.  It looks like the bell.
This is an example of a histogram that is approximately normal.  We will get into this in much much more detail
later on.  We will have an entire chapter
just to talk about the Normal distribution.
We'll find out later that Normal distributions occur especially when we're looking at what are called sampling distributions.
If we are looking at all possible sample means
from a population, they follow a Normal distribution.  If you don't quite know what I just said, that's OK.  It won't be OK in a few chapters
when that's all we talk about.
If we happen to have a Normal distribution
we will have what is called the Empirical Rule
which says the following
68% of the data values will have a z-score between -1 and 1.
95% will have a z-score
between -2 and 2.
And 99.7%
will have a z-score between -3 and 3.
If you have an approximately Normal distribution then these values are approximately true.  About 68% of the data have a z-score
between -1 and 1.  Here's an example.
Male black bears average 300 pounds.  They are big by the way.  If you see a male black bear
in the forest respect that black bear.  Keep your distance.
Make a lot of noise so that black bear leaves because they are big and they are fierce.
Their standard deviation is about 75 pounds.  I am not sure about that.  I kind of guessed, but it sounded about right.
Let's assume that these black bears
are approximately normal.
What is the range for the middle 95% of black bears?
To find that what we do
is we take 300, that was the mean,
then we subtract two standard deviations 
  300 - 2(75)
and that's 150.
We also can take 300 and we can add two standard deviations
 300 + 2(75)
and I get 450.
I can conclude
that the middle 95% of all black bears weigh between
150
and 450 lbs.
They are big guys!
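The black bear calculation can be sketched in Python like this. The function name empirical_range and the little lookup table are my own; the 300 and 75 pound figures are the ones from the example, including my guess at the standard deviation.

```python
def empirical_range(mean, sd, k):
    """Interval within k standard deviations of the mean."""
    return mean - k * sd, mean + k * sd

# Approximate share of data within k standard deviations (Empirical Rule).
# Only meaningful when the data are approximately Normal.
EMPIRICAL_SHARE = {1: 0.68, 2: 0.95, 3: 0.997}

low, high = empirical_range(300, 75, 2)
print(f"About {EMPIRICAL_SHARE[2]:.0%} of black bears weigh between {low} and {high} lbs")
```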
If you don't have a Normal distribution you cannot use the Empirical Rule.  That's really important to remember.
In this class the only way we will be able to decide whether we have a Normal distribution
is by looking at the histogram and saying does it look pretty bell shaped or does it not look pretty bell shaped?
We will have one other way, but that is much later and I don't want to talk about it now.  In the advanced courses there are very scientific ways
of deciding whether the distribution looks normal or not.
If you don't have a Normal distribution you can still do something that's similar.  It is called Chebyshev's theorem.
It's much weaker than the Empirical Rule, but at least it gives us something.
It says the following
At least 75%
have a z-score
between -2 and 2.
Notice that is weaker than saying approximately 95% have a z-score between -2 and 2.
Knowing what the percent is vs. saying that it's at least a number is very different.
At least isn't as good as knowing the number.
At least 89% have a z-score between -3 and 3
and at least 95%
have a z-score between -4.5 and 4.5.
Notice that with at least 75%, it can be 99%, which isn't even close to 75%, and that still would be true.
Here's an example.
The average miles per gallon
of cars in the United States is 23.8
and the standard deviation is 6.1.
At least what percent get between 11.6 miles per gallon
and 36.0 miles per gallon?
Here we are not going to assume that the distribution is normal, and because of that we cannot use the Empirical Rule.
Instead we use Chebyshev's Theorem.
We can say that 11.6 is two standard deviations below 23.8.
Notice that twice 6.1 is 12.2
and 23.8 - 12.2
is 11.6.
and 23.8 plus 12.2
is
36.0.
That tells us
that we are
within two standard deviations of the mean.
And Chebyshev's Theorem
tells us that at least 75%
of all U.S. cars get between 11.6
and 36.0 miles per gallon because 11.6 and 36.0
are the low side and the high side of two standard deviations from the mean.
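Here is a minimal Python sketch of the miles per gallon example. The general formula 1 - 1/k squared is the standard statement of Chebyshev's theorem; I only quoted three specific values of it above, so the formula itself is something I am adding. The function name chebyshev_bound is my own.

```python
def chebyshev_bound(k):
    """Minimum fraction of data within k standard deviations of the mean.

    Only informative for k > 1; works for any distribution.
    """
    return 1 - 1 / k**2

mean_mpg, sd_mpg = 23.8, 6.1
k = (36.0 - mean_mpg) / sd_mpg   # how many standard deviations out 36.0 is

print(round(k, 2), chebyshev_bound(round(k, 2)))
for k_value in (2, 3, 4.5):
    print(k_value, round(chebyshev_bound(k_value), 3))
```

The loop reproduces the three bounds from the lecture: at least 75%, at least about 89%, and at least about 95%.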
Next let's look at one more
graph that we can use.  This is called the stem-and-leaf plot.
The stem-and-leaf plot is a chart in two parts:
the large units and the small units.  The large units are on the left and the small units
are digits
that are placed horizontally
on the right to form a number bar.
This probably doesn't mean a lot to you until we do an example.
Let's do an example.
Here are the exam scores that happened in my last class.
You can see they range between this person who probably didn't study got 24%, not very good on the exam,
and this person got 99% on the exam.  A couple of them did actually.
Here's how we draw a stem-and-leaf plot for this data.
First we draw a T-Table.
On the T-Table I will separate this into its two digits.  We have
ones digit and tens digit.  So, I put the tens digit on the left
and the one's digit on the right.
Then I notice the tens digit goes between two
and nine.
It does skip three but I will still put that in.  I put 2, 3, 4, 5, 6, 7, 8, and 9 under the tens digit.
Then I look at the data values.  I have a 24.
That has a tens digit of 2
and ones digit of 4, so I put a 4
for the ones digit.
Then I have a 42.
I don't have any 3's for the tens digit, so that gets left blank.  For the 42, I have a tens digit of 4 and a ones digit of 2.
For the 50's,
notice I have a tens digit of 5, each of them have a tens digit of 5 and the ones are 5 and 9.  So under the ones I put a 5 and a 9.
For the 60's I have a 3 and an 8.
For the 70's, I have 2, 5, 6, 7, 7, 7.
So I put:  256777
under the ones digits next to the tens digit of a 7.
I do the same for the 80's and the 90's.
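The T-table steps can be sketched in Python as follows. I am only using the scores I actually read out, from 24 up through the 70's; the 80's and 90's are left out because I did not list those values. The function name stem_and_leaf is my own, and it assumes whole-number scores with the tens digit as the stem.

```python
def stem_and_leaf(scores):
    """Return stem-and-leaf rows for whole-number scores (tens | ones)."""
    leaves = {}
    for score in sorted(scores):
        stem, leaf = divmod(score, 10)   # tens digit, ones digit
        leaves.setdefault(stem, []).append(leaf)
    rows = []
    # Include every stem from lowest to highest, even the empty ones.
    for stem in range(min(leaves), max(leaves) + 1):
        digits = "".join(str(d) for d in leaves.get(stem, []))
        rows.append(f"{stem} | {digits}")
    return rows

# Scores read out in the lecture, up through the 70's only.
scores = [24, 42, 55, 59, 63, 68, 72, 75, 76, 77, 77, 77]
rows = stem_and_leaf(scores)
print("\n".join(rows))
```

Notice the row for 3 is printed with nothing after the bar, just like leaving it blank on the T-table.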
Notice that the stem-and-leaf plot gives the shape of the distribution.  We can tell right away that this is a skewed left
distribution because of the high mode on the right or at the high value
and it has a tail for the small values.
So it is skewed left.  It's also unimodal.
But we could do more.  With a histogram you lose the exact values.  With the stem-and-leaf diagram
I could tell you for example that there were three scores
of 85 because under the tens of
8, I have three 5's.
So I don't lose my data values.
Sometimes I actually draw the histogram
along with the stem-and-leaf plot.  So here's my histogram
drawn on the stem-and-leaf plot.  Then I have it all.  I have a histogram and the stem-and-leaf plot, so I don't lose my data values
and I get the shape very well shown.
Here are the issues with the stem-and-leaf plot.  The first issue is that if you are showing this to an audience
that doesn't like numbers, that has that fear of numbers,
then this will make them very fearful because there are numbers everywhere.  So for a lot of people they will consider this ugly
and scary.   So stem-and-leaf plots are given to audiences that don't mind numbers.  Scientists love stem-and-leaf plots.
They can scare off the math phobic group.
Another problem with the stem-and-leaf plot: this worked great for a class of size 20 or 30, but imagine
doing a stem-and-leaf plot for the SAT exam scores of all students who took the SAT in America.
That would not work because you have maybe 1,000,000 people taking the SAT and you can't draw 1,000,000 numbers on a piece of paper.
It just isn't going to work.  This only works when
we have a tame sample size, a sample size that is maybe less than a couple hundred.
When you do have a nice tame sample size and you have an audience that doesn't mind numbers,
it is actually a very powerful chart.
Let's look at a few other charts that I will not spend a lot of time on, but I do want you to at least be exposed to them.
This chart is called a Pareto chart.  A Pareto chart is a bar graph for qualitative data
that is arranged in decreasing order.
Here's an example.
This is an example where the survey question is
why did you break your diet?  You have been on a diet and not any more.  What happened?  Notice it is a qualitative survey question,
because the answer is
it was just too hard to diet or I was bored
and I just watch TV and I eat when I watch TV.
Whatever it might be.
This is a bar chart but not a histogram, because histograms
represent quantitative data while Pareto charts are for qualitative data.
We can see right away that cravings:  I just
felt like having that big half gallon of ice cream.  That was the number one reason
why people broke their diet.  On the other hand, being hungry wasn't very important.
That was not what broke their diet.
With a Pareto chart you can see the ranking from highest to lowest really easily.
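To make the ordering idea concrete, here is a minimal sketch that sorts categories from highest to lowest count and prints a text version of the bars. The counts are hypothetical, made up only for illustration (they are not the survey's real numbers), and the function names `pareto_order` and `text_pareto` are just names chosen for this sketch.

```python
# Hypothetical counts, made up for illustration; NOT the real survey results.
reasons = {
    "Hungry": 4,
    "Cravings": 22,
    "Bored": 11,
    "Too hard": 15,
}


def pareto_order(counts):
    """Return (category, count) pairs sorted from highest to lowest count."""
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)


def text_pareto(counts):
    """Build one text bar per category, with the tallest bar first."""
    return [f"{name:<10} {'#' * n} ({n})" for name, n in pareto_order(counts)]


for line in text_pareto(reasons):
    print(line)
```

The only step that makes this a Pareto chart rather than an ordinary bar chart is the sort in decreasing order.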
The next chart is a pie chart and you've all seen a pie chart before.
I want to show you how you construct a pie chart if you want to do this by hand.  You calculate the angle
by taking the relative frequency and multiplying it by 360°.
A pie chart is just a circle graph such that each slice's angle is proportional to the category's frequency.  Here's an example.  This is a pie chart
of the air.
Most of the air is not oxygen.  It is nitrogen: about 79% of the air in our atmosphere is nitrogen.
On the other hand about 20% is oxygen
and the rest is other gases like carbon dioxide.
Everyone talks about all that carbon dioxide in the atmosphere.  Actually there is not very much, even with global warming.
There is not very much carbon dioxide in the atmosphere, but even a little bit will affect the atmosphere tremendously.
That 1%, although it is little, is actually very important.
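The angle rule for a pie chart (relative frequency times 360°) can be sketched in a few lines. This sketch uses the approximate percentages quoted for the air example (79% nitrogen, 20% oxygen, and the remaining 1% as other gases); the function name `pie_angles` is just a name chosen for this illustration.

```python
def pie_angles(amounts):
    """Return each category's slice angle in degrees.

    The relative frequency is amount / total, and the angle is
    that relative frequency multiplied by 360.
    """
    total = sum(amounts.values())
    return {name: value / total * 360 for name, value in amounts.items()}


# Approximate percentages from the air example in the lecture.
air = {"Nitrogen": 79, "Oxygen": 20, "Other gases": 1}

angles = pie_angles(air)
for name, angle in angles.items():
    print(f"{name}: {angle:.1f} degrees")
```

With these numbers the nitrogen slice comes out to about 284.4°, the oxygen slice to 72°, and the other gases to about 3.6°, which together make the full 360° circle.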
The next chart I want to show you is called a frequency polygon.
That's very much like a histogram but instead of drawing
rectangles we just draw the dots at those frequencies and connect the dots.
And that's it.
This shows the number of sweets people eat against how many people ate that number of sweets.
We can see that this is unimodal,
not quite symmetric.  It is slightly
skewed to the right but close to symmetric.
When you have real data, you are not going to get perfectly anything.
You won't get perfectly unimodal or
perfectly symmetric or perfectly skewed left or perfectly skewed right,
so saying "somewhat symmetric, but slightly skewed" is a typical way of describing data.
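The "draw the dots and connect the dots" idea behind a frequency polygon can be sketched as follows. The frequency table is hypothetical, invented only to give a unimodal, slightly right-skewed shape like the one described; it is not the data from the chart. The names `polygon_points` and `segments` are just names chosen for this sketch.

```python
# Hypothetical frequency table (sweets eaten -> number of people),
# made up for illustration; NOT the data behind the lecture's chart.
sweets_counts = {0: 2, 1: 5, 2: 9, 3: 7, 4: 4, 5: 2, 6: 1}


def polygon_points(freq_table):
    """Return the (value, frequency) dots in left-to-right order."""
    return sorted(freq_table.items())


def segments(points):
    """Pair each dot with the next one; these are the lines to draw."""
    return list(zip(points, points[1:]))


points = polygon_points(sweets_counts)
for (x1, y1), (x2, y2) in segments(points):
    print(f"line from ({x1}, {y1}) to ({x2}, {y2})")
```

A histogram of the same table would draw a rectangle of height equal to each frequency; the polygon keeps only the top point of each rectangle and joins neighboring points with straight lines.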
The final graph I want to talk about is called a time series graph.
A time series graph is drawn like a frequency polygon,
but the horizontal axis represents time.
Here's an example.
This is a time series graph of violent crimes per 100,000
U.S. inhabitants
from 1994
to 2013.  We can see very clearly
that as time has moved forward
the number of violent crimes per 100,000 inhabitants
has really gone down by a lot.
It may seem like things have gotten less safe, but that's actually not true at all.
So be careful about having that feeling that it's gotten more dangerous.
What you really need to do is look at the statistics.  And only after looking at statistics can you see what's really going on.  Our world has gotten safer
and safer and safer as time has gone on.  At least our country has
really gotten safer.  So you should feel good about yourself.  Things are good.
And if you look at 2014 and 2015 it is even getting better.  So let's hope for a wonderful world.
This is the end of the lecture.  It is always good to end on a positive note that things are getting better.
If you had any trouble with any of the topics, there are a lot of topics in this particular lecture.  I don't think
any of it was that intense.  Make sure you fully understand the standard deviation,
because that's the most important topic of the day.
Then if you have trouble especially with the standard deviation, please ask me.  Or if you're in another class,
ask your instructor or ask your friends or ask a tutor.  Just get some help.  Look at this video again
and make sure you understand all of the topics especially the standard deviation.
Just to let you know, the next topic, the next chapter, is on probability,
which is one of the most difficult chapters of this entire course.  So be ready for a tough one.  There are a lot fewer
topics, but each topic can get confusing.  So I hope you have a wonderful day or week or
time.  Whatever you will be doing between now and probability.
Brace yourself for some difficult times.
Have a great one.  Goodbye.
