Welcome to project 4 part 2. In this section 
we'll look at the lengths of hospitalization of 
children with infectious fever. We're looking at 
probability as it relates to quantitative 
variables.
0:00:13.066,0:00:13.000
probability as it relates to quantitative 
variables.
Our research question in this case is "what 
was the probable length of hospitalization for a 
child 
admitted with a diagnosis of infectious fever in 
1861?"
We're going to start with a standard 5 number 
descriptive statistics summary. We'll create a 
histogram, 
We'll calculate the probability for each range of 
the histogram, the probability that a term
of hospitalization will be less than a given 
value that we'll choose from our dataset,
and then calculate and interpret the z- score of 
a given number of days of hospitalization.
Again we're using the data from the hospital 
records from the children's hospitals in 
Victorian London.
We've got a selection of data specifically from 
1861,
We've got a selection of data specifically from 
1861,
and here I've
and here I've
and here I've
arranged to sort the data
by the the length of stay in days.
And the way I did that was by selecting all of 
my data
going to the data [tab]
and choosing sort.
I say yes "my data has headers,"
and choose "length of stay in days."
I could also have sorted by any of the other 
criteria here.
I could also have sorted by any of the other 
criteria here.
I also want to filter
0:01:25.033,0:01:25.000
I also want to filter
I also want to filter
by disease group
by disease group
and Excel
2010 has an interesting feature here that will 
allow me to type in the value I want
"infectious fever"
and this will select all the items that include 
"infectious fever," some of which
will also include another value
will also include another value.
Once I've done this,
I can see that I have a subset of the data 
showing,
and I can select
just the data that matches my criteria
and copy.
And this is a noncontiguous selection. You 
can see that in places where it's skipping 
data,
those double "marching ants" lines show that 
it has skipped the data that is not visible.
So when I go to a new worksheet
So when I go to a new worksheet
So when I go to a new worksheet
and I paste
and I paste
I can "paste values,"
So now I have a list of just the infectious fever
entries
and I can see that I have
125 of them. One less than 126.
because my first row is my headers. This is 
the data that I'm going to use to actually run 
my various
calculations on because this is now a 
contiguous set of data that matches the 
criteria that I want to use
So I'm going to start by doing some simple 
descriptive statistics
and all of these formulae are going to reference
the previous spreadsheet so that's named 
"infectious-fever." Because that has a hyphen 
in it,
which would be interpreted as a minus sign, I 
need to put single quotes
around the name of
the datasheet that want to use
and then an exclamation point,
And then the range that I want to draw values 
from
so here I am doing a COUNT
and then I'll do
an average to calculate the mean
an average to calculate the mean,
median, mode, the minimum, and the 
maximum, and the standard deviation.
And I'll use these for different purposes. The 
mean is fairly obvious, we've talked about that
several times in class already.
The mean plus and minus the standard 
deviation
will give me the range that will contain 68% of 
the data
and the minimum and maximum are useful for 
determining what "bins" I want to use when I'm 
creating my histogram. So here my 
lowest value is zero, my maximum value is 69, 
so I'm going to use
ranges of 10 each
up to 70 and that will enclose all of my values. 
So I will have 10-- which will be--
 all of my values that are less than or equal to 
10 will be included in this bin, 
and then 20 all the values and not included in 
this bin
but up to 20 will go in this bin, so that will be 
11 through 20, 21 through 30, and so forth.
but up to 20 will go in this bin, so that will be 
11 through 20, 21 through 30, and so forth.
In order to create the histogram, I am again 
going to use
the special features under data
Data Analysis,
select "Histogram,"
I'll start by saying "What will my input range 
be?"
Now, this is going to come from the infectious 
fever data.
So again, I will select length of days,
0:05:00.066,0:05:00.000
So again, I will select length of days,
and that can include blanks, for example in 
this case where Ann Rose doesn't actually 
have a number of length of days, but must not 
include the label at the top of the column,
and then bin range
0:05:16.033,0:05:16.000
and then bin range
and that is going to be
just this group of cells here again
not the label at the top
just the numbers that I want to use.
And then I can specify an output range.
That's where my
frequency table will be produced.
I could also have a generated it on a new 
worksheet or even in a new workbook.
And it can be helpful at this point
to generate chart output as well.
Click OK.
If you have any data already in your output 
range it will be overwritten so Excel warns you 
about that. I've done this calculation before.
So it creates the calculation for me and then 
automatically creates
a histogram for me with the appropriate labels.
a histogram for me with the appropriate labels.
Now what I've done
is I've calculated probability
by taking the frequency value that was 
calculated
and dividing it by the total.
And then I've also, on top of that histogram 
chart that was created for me,
And then I've also, on top of that histogram 
chart that was created for me,
I've added some lines. And in order to do this
go to Insert
and choose shapes,
and there is a line tool, and with the line tool 
you can actually draw anywhere
on your spreadsheet, but in this case I'm 
drawing on top of
the chart itself. So in this case I'm estimating.
I'm going to estimate
where I want my lines to be based on the value 
that I've calculated for my mean at about 22 
so little bit more than the base. Remember, 
this column represents
20 to 30. So just a bit in from 30, and then my 
mean minus standard deviation
is a bit more than six
so we see that here in the 0 to 10 range. And 
then my mean plus standard deviation
is
just about 38
so I've put it over here in the 30 to 40 range
most of the way towards the edge.
So this gives me a nice picture
of
what I can expect to see
in length of hospitalization in days. Most of my 
values are in the 10 to 20 range even 
though the mean itself is a bit higher than that. 
So this is reflected in the median value
of 19. That's in the 10 to 20 range.
And then I can sort of see the shape of the 
data, and there's a spike over here. So this is
an asymmetrical curve.
I wanted to calculate the probability of a stay 
of 20 days or less,
and the way I calculate that:
I6 + I7, here's my I6, I7 and what I've done here 
is just said 20 days or less will be 
the first two bins. So all I need to do is just 
add these two probabilities
and that tells me what the probability is
that a patient's stay will be 20 days or less. If I 
wanted to say what's the probability that the 
length of stay will be
more than 60 days,
that would be 3%.
More than 50 days that would be 6%: 3% plus 
3%,
and so forth.
and so forth.
One of the things that you can do as well
is calculate and create a graph that gives you 
a kind of
curve.
Now, this is an approximation of what a normal 
curve would look like.
I made a chart
that is a scatter point chart
So here we have a number of data points
and for each bin,
the 10-20, the 20-30 and so forth, for the 
number of days
I have a frequency count
for the number of patients who stayed for that 
number of days
and I can see how the slope drops off here. 
Now this is the length of stay in days for my 
entire dataset, not just infectious fever. I 
needed more data to get a smooth kind of line. 
And then to get that line,
for my chart type,
move across in "XY (Scatter)," and there is a 
"Scatter with smooth lines,"
and this will connect the dots for me. This 
creates that curve. So you can see that with 
the 
overall data set for hospitalization the curve is 
even more skewed
than what we saw with
the infectious fever.
the infectious fever.
The last thing that I want to do is calculate the 
Z Score. Now the Z Score is a special kind of 
measurement that we use. It's called a 
standardized measurement, and the idea is 
that when you have different kinds of data and 
you're trying to consider how likely or unlikely
an event is-- a specific event out of your 
dataset-- you want a standard way of 
measuring that. So the standard
measuring stick that we're going to use
is the standard deviation. So what we're going 
to do is we've got our count, we've got our 
mean,
we've got our standard deviation,
we've got our standard deviation,
mean plus and minus.
I'm going to look at this particular entry.
So here we have George Ryan. He had 
measles
at the age of 5 years and 0 months. He was 
there for 18 days.
how common, how probable is it that someone 
would be staying for 18 days
out of all the infectious fever
out of all the infectious fever
patients?
So the way we calculate is simply
we take our
number of days that he has stayed
minus
the mean
divided by the standard deviation.
0:11:24.066,0:11:24.000
Let's think about how that's going to work. 
We've got 18 so
Let's think about how that's going to work. 
We've got 18 so
George has stayed 18 days.
The mean
is over here.
We take 18 and subtract the mean, and that 
gives us the difference
between George's value and the mean. But we 
want to know how we're going measure that 
difference with reference to
the standard deviation which is a bit larger.
So here again, the standard deviation is 15, 
the difference between George's value and the 
mean
is about 4.
is about 4.
is about 4.
And then we're going to divide that by
the standard deviation,
this is about 16
4 out of 16 is about 1/4.
And in fact we see, since this is going in the 
negative direction it's below the mean,
it's 0.24%, about 1/4 of a standard deviation
it's 0.24%, about 1/4 of a standard deviation
it's 0.24%, about 1/4 of a standard deviation
below the mean.
That's it for this week. This was of a fairly
lengthy set of instructions, but I hope at this 
point you can understand that this is not a 
very complicated topic. It's all about
calculating probability
by looking at a whole set and looking at a 
subset and comparing the two
in terms of proportion of the incident
and then comparing the two in terms of
standard deviations and means when we're 
using quantitative variables.
Thanks a lot, see you next week.
