Welcome to the statistics lesson for week 4. 
In this lesson we're going to go over probability 
and Z scores by looking at hospital data from 
children
with various kinds of ailments including 
infectious fever.
 Hospital records of the time showed that 
many children were admitted with measles 
and other infectious fevers. 
I became interested in this by reading that Ada 
Lovelace was admitted to the hospital and was 
diagnosed with
measles, and was even paralyzed by it and I 
wondered how common that was for children 
at the time.
When I investigated, I found that there were 
not enough records of measles
to make a very good project, so I enlarged my 
scope a little bit too infectious fevers, which 
measles is one of-- other examples include 
scarlet fever, typhoid, whooping cough
that sort of thing
and this could be relevant because the 
probability of a given type of diagnosis could 
affect how a hospital is administered
It might be important to have smaller wards 
because
 infectious fevers of course are infectious 
unlike broken bones or nutritional problems 
that sort of thing or it might affect the kinds of 
medicines that need to be ordered based on 
how many patients the hospital expected to 
have based on probability
So my research question becomes what is the 
probability that a child admitted to the hospital
in 1861 was diagnosed with infectious fever?
What we're going to do is count the
 number of patients admitted to the 
separate children's hospitals that have 
contributed records to our database
from 1861,
build a list of the different disease groups for 
patient diagnosis, count the number of 
patients diagnosed in each disease group
and then find the proportion of the whole for 
each disease group.
And as we'll see, that gives us the overall 
probability for that disease group as well.
So
let's take a look at the data.
In this spreadsheet, I've imported
data
from all the hospitals that we have access to.
0:02:14.066,0:02:14.000
I don't need all of these data columns for what 
I'm doing.
something else that I've done it is out of the 
original data
where we have standard disease, and this can 
contain more than one entry, 
and we also have disease group, and this can 
contain more than one entry
I've done a little bit of processing on this data 
and put disease 1 and disease 2 in separate 
columns
where appropriate. And disease group 1 and 
disease group 2 as well.
0:02:42.066,0:02:42.033
the "disease in register" is what was actually 
written down in the record at that time
for the purposes of developing a list, disease 
group one will be fine
 I'm reasonably confident that there isn't 
anything in disease group 2 that won't also 
appear in disease group 1.
So I'm going to go ahead and select this 
column
and as I've done in previous projects
 I'm going to use the advanced filter command
under the Data tab.
And the advanced filter command lets me 
select
a particular range
I'm going to copy that data to another location
and I want to take unique records only and this 
is what's going to extract only one record for 
each, so I will only have 
one row that says tubercular disease, even 
though it appears multiple times in the record
and then I will copy that to
a location that I choose over here and just the 
first cell is sufficient it will fill out the rest of the
column for me.
click OK. Now this is my list of unique disease 
groups. 
I can use a number of interesting functions to 
work with this data.
What I'll do is copy this
and go to a new sheet
to work with this because it'll just be a little 
easier than trying to work on the same sheet.
so here's where I've pasted the list of unique 
disease groups, and I have a couple of different 
functions here that will help me to 
analyze this data. Now, there are specific 
statistics packages that will do a lot of this for 
you
but you can do it in Excel as well, and these 
are some of the ways to do it.
The "COUNTIF" function
lets you count everything in
a particular range
that matches
a particular criterion.
So here, the particular range
goes back to
the data worksheet
column Y
and the criteria
are located
in the same row but in column A on this 
spreadsheet.
So let's just pop over to the data sheet again 
and take a look. Column "Y" is the entire 
disease group.
So let's just pop over to the data sheet again 
and take a look. Column "Y" is the entire 
disease group.
so this is a matter of
so this is a matter of
very very simple
0:05:12.033,0:05:12.000
very very simple
infectious disease analysis.
infectious disease analysis.
I fill in the formula once
drag it down
and because of the way Excel works, for each 
row
it's copying and saying OK here I'm going to 
refer to "A17."
it's copying and saying OK here I'm going to 
refer to A17.
here I'm referring to A9. These are relative 
references and they always look
in the column next door at the same row so 
this gives me an accurate count
for each of these terms
from this column in my other database
having done this I can then sum all of these 
numbers, add them up
and I can do that very easily with the sum 
function
and then select the range, and you can see 
that I've got all of my numbers included here
and then I want to use this to calculate
the proportion and therefore the probability so 
here's my probability calculation
I'm going to take
the value immediately to the left
from the B column
and divide it by the sum from the bottom. So 
here you can see B4 will always just refer to
the immediate left. Since I'm going to copy 
this down, I don't want it to start looking further 
down here for the division value so
this is where I put the dollar signs in and this 
makes this a fixed reference not a relative 
reference. It's always going to go
exactly to this cell.
So simple division,
and I end up with my probability value here, 
and by copying that down
and I think this is fairly helpful.
Now,
as you can see,
the row with the largest number of entries is in 
fact infectious fever
and something that can be very useful to do at 
this point
is to select the range of data (not the whole 
sheet, but just this range, including the labels)
and do a "Sort." So, under data
choose Sort,
and it's going to ask me, does my data have 
headers, and it does, so I want to say sort 
by...
and I want to sort by
the count. I could sort by the probability, 
because that's based on the same
proportion,
but just to simplify things sort by the count,
and I'm going to sort on the values, not on 
something to do with color or what have you
I'm going to sort largest to smallest. That'll give 
me the most significant disease 
group at the top, and the least significant 
group at the bottom.
click OK
Alright. Now it has kept-- because I've selected 
all of this range together-- it's kept
infectious fever with the data that belongs with 
it. I can see that's the top
diagnosis category, followed by tubercular 
disease, and then joints bones and muscles 
(that would include things like broken bones).
So, something else that I can do at this point
is make a chart
go to insert and I'm going to make a column 
chart here
and this is because I've sorted my data before 
making the chart, my data are nicely sorted in 
my chart, and I can really see how steeply
this curve drops
and these are just very unlikely
problems compared to
the ones up here, and you can really see that  
infectious fever 
is just a runaway lead compared to the other 
kinds of
reasons that somebody might be admitted
to a hospital.
So I'm going to just move that down a little bit 
so that it's not
blocking my information
Now, something else that we sometimes 
wonder is, is this data independent of another
variable and in this case I'm curious to know
is any of this data tied specifically to the 
gender of the patient?
In other words, are there more girls or boys 
that have infectious fever? Does it matter if 
you're 
male or female for any of these diagnoses? 
And to calculate that
I'll take
this block of data
in fact, I can take the sum, because I'm going 
to need that as well
and copy and paste that to a new worksheet.
And here, we see I've got the disease group, 
I've got the count,
and now I have a new formula
and this is a subtle variation on "COUNTIF." 
it's called "COUNTIFS." And this lets me 
combine more than one condition 
so previously I went and counted
everything in
the data worksheet
and the Y column. Here I'm specifying a 
particular range in the data worksheet
and I'm saying
I want to count the things that equal
the value in A4-- that's again
2 columns to the left
and
I have a second criterion.
I want to count everything in column D that 
equals F
where F indicates that we have
a female patient
So here's column D, F = female, M = male. So 
what I'm looking for here is
gender indicated here just by female yes or 
no... the number of female patients. Is that 
independent from
the particular disease? And again, there are 
more sophisticated methods that we can use 
to do this kind of analysis, but this a fairly 
simple method
(other than remembering how to do this 
formula) that you can do straight in Excel. So 
here again
I use COUNTIFS -- it's got this extra letter on 
the end-- I look at the Y column 
and the D column, and if something matches 
my criteria in both of those
then they land in this column. So this is a 
female patient with tubercular disease, a 
female 
patient with growth, nutrition and decay, and 
so forth.
Then what I want to do is calculate the 
proportion or the percentage
of patients who were diagnosed with tubercular 
disease, in this case
who are female and this is a very simple 
calculation
this is just the number of females
divided by
the number of patients with this particular 
disease.
So you can think of this as "how many 
females?" out of
all the patients with tubercular diseases.
and in this case 43.7%
Now similarly
we've got our sum
we've got the sum of all of our female patients-- 
this is a fairly easy--
we could've done this by summing off of the 
other page as well.
We've got our percentages but
with our sum and our number of female 
patients this gives us
the proportion or the likelihood
that a patient is female.
So there's a 46.3% chance
that a child admitted to the hospital that we 
have the records from
is female.
A little bit lower than 50%...
that's not too unusual when we have such a 
large data set.
There could be some
reason for this but we're not going to worry 
about that for right now. What we are going to 
do is compare this
to the value that we got just on infectious fever
so we're comparing 45.2%, 46.3%. These 
numbers are not very different. I would say in 
general
that we can say that the
chance of a female patient
and the chance of having infectious fever, as 
two variables,
are pretty independent.
These are not related to one another.
And in cases of ailments of the nervous 
system, (whatever that meant in the day) the 
odds are also quite low and this is quite a bit 
lower
than the 46.3% so fewer girls were admitted 
with this kind of problem
"not classified"-- we can't really do anything in 
particular with that either.
Circulatory system:
far more than the usual number
of patients who were admitted with circulatory 
system problems were female. 65%
compared to 46%.
So, this is how we can test for independence 
of these two variables
So that's where I'm going to end it as far as 
categorical variables go.
I'll have another
video that will show you how to make
probability calculations for quantitative 
variables, which is a slightly different 
procedure.
See you there!
