Welcome to Data Science Methodology 101: From Understanding to Preparation, Data Preparation Case Study!
In a sense, data preparation is similar to
washing freshly picked vegetables insofar
as unwanted elements, such as dirt or imperfections,
are removed.
So now, let's look at the case study related
to applying Data Preparation concepts.
In the Case Study, an important first step
in the data preparation stage was to actually
define congestive heart failure.
This sounded easy at first, but defining it precisely was not straightforward.
First, the set of diagnosis-related group
codes needed to be identified, as congestive
heart failure implies certain kinds of fluid
buildup.
We also needed to consider that congestive heart failure is only one type of heart failure.
Clinical guidance was needed to get the right
codes for congestive heart failure.
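To make that step concrete, here is a minimal sketch of how claims records might be filtered down to congestive heart failure admissions by diagnosis-related group code. The column name 'drg_code' and the specific codes listed are illustrative placeholders; the actual code list came from clinical guidance, as noted above.

```python
import pandas as pd

# Placeholder codes for illustration only -- the real list of
# diagnosis-related group codes was defined with clinical guidance.
CHF_DRG_CODES = {"291", "292", "293"}

def filter_chf_admissions(claims: pd.DataFrame) -> pd.DataFrame:
    """Keep only claims whose DRG code marks a congestive heart failure admission."""
    return claims[claims["drg_code"].isin(CHF_DRG_CODES)]
```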
The next step involved defining the re-admission
criteria for the same condition.
The timing of events needed to be evaluated
in order to define whether a particular congestive
heart failure admission was an initial event,
which is called an index admission, or a congestive
heart failure-related re-admission.
Based on clinical expertise, a time period
of 30 days was set as the window for readmission
relevant for congestive heart failure patients,
following the discharge from the initial admission.
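The timing logic just described can be sketched in a few lines of pandas. This is an illustrative sketch, not the study's actual implementation; it assumes one row per congestive heart failure admission with 'patient_id', 'admit_date', and 'discharge_date' columns already in datetime form.

```python
import pandas as pd

def label_admissions(admissions: pd.DataFrame) -> pd.DataFrame:
    """Flag each CHF admission as an index admission or a 30-day readmission."""
    admissions = admissions.sort_values(["patient_id", "admit_date"]).copy()

    # Discharge date of the same patient's previous CHF admission, if any.
    prev_discharge = admissions.groupby("patient_id")["discharge_date"].shift(1)
    days_since_discharge = (admissions["admit_date"] - prev_discharge).dt.days

    # Within 30 days of the prior discharge -> readmission; otherwise a new index admission.
    admissions["is_readmission"] = days_since_discharge <= 30
    admissions["admission_type"] = admissions["is_readmission"].map(
        {True: "readmission", False: "index"}
    )
    return admissions
```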
Next, the records, which were in transactional format, had to be aggregated; transactional here means that the data included multiple records for each patient.
Transactional records included professional
provider facility claims submitted for physician,
laboratory, hospital, and clinical services.
Also included were records describing all
the diagnoses, procedures, prescriptions,
and other information about in-patients and
out-patients.
A given patient could easily have hundreds
or even thousands of these records, depending
on their clinical history.
Then, all the transactional records were aggregated
to the patient level, yielding a single record
for each patient, as required for the decision-tree
classification method that would be used for
modeling.
As part of the aggregation process, many new
columns were created representing the information
in the transactions.
For example, the frequency and most recent dates of visits to doctors, clinics, and hospitals, along with diagnoses, procedures, prescriptions, and so forth.
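A sketch of that aggregation step, using a pandas groupby, might look like the following. The column names and the handful of derived features are assumptions for illustration; the case study created many more columns than are shown here.

```python
import pandas as pd

def aggregate_to_patient_level(transactions: pd.DataFrame) -> pd.DataFrame:
    """Collapse many transactional records into a single row per patient.

    Assumes columns 'patient_id', 'visit_date', 'visit_type', 'diagnosis_code'.
    """
    grouped = transactions.groupby("patient_id")
    summary = grouped.agg(
        visit_count=("visit_date", "count"),       # frequency of visits
        most_recent_visit=("visit_date", "max"),   # date of the most recent visit
        distinct_diagnoses=("diagnosis_code", "nunique"),
    )

    # One count column per visit type (doctor, clinic, hospital, ...).
    visit_counts = pd.crosstab(transactions["patient_id"], transactions["visit_type"])

    return summary.join(visit_counts).reset_index()
```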
Co-morbidities with congestive heart failure
were also considered, such as diabetes, hypertension,
and many other diseases and chronic conditions
that could impact the risk of re-admission
for congestive heart failure.
During discussions around data preparation, a literature review on congestive heart failure was also undertaken to see whether any important data elements had been overlooked, such as co-morbidities that had not yet been accounted for.
The literature review involved looping back to the data collection stage to add a few more indicators for conditions and procedures.
Aggregating the transactional data to the patient level also meant merging it with the other patient data, including demographic information such as age, gender, type of insurance, and so forth.
The result was one table containing a single record per patient, with many columns representing attributes of the patient and his or her clinical history.
These columns would be used as variables in
the predictive modeling.
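As an illustration of that merge, a sketch along these lines would produce the single-record-per-patient table. The demographic column names are assumptions for the example, not the study's actual schema.

```python
import pandas as pd

def build_modeling_table(patient_summary: pd.DataFrame,
                         demographics: pd.DataFrame) -> pd.DataFrame:
    """Merge per-patient aggregates with demographic attributes."""
    return patient_summary.merge(
        demographics[["patient_id", "age", "gender", "insurance_type"]],
        on="patient_id",
        how="left",  # keep every patient in the cohort even if demographics are missing
    )
```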
Here is a list of the variables that were
ultimately used in building the model.
The dependent variable, or target, was congestive
heart failure readmission within 30 days following
discharge from a hospitalization for congestive
heart failure, with an outcome of either yes
or no.
The data preparation stage resulted in a cohort
of 2,343 patients meeting all of the criteria
for this case study.
The cohort was then split into training and
testing sets for building and validating the
model, respectively.
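Here is a minimal sketch of that split, assuming scikit-learn and a yes/no target column named 'readmitted_30d'. The 70/30 ratio and the stratification are illustrative choices, since the case study does not specify them.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_cohort(cohort: pd.DataFrame):
    """Split the one-row-per-patient cohort into training and testing sets."""
    X = cohort.drop(columns=["patient_id", "readmitted_30d"], errors="ignore")
    y = cohort["readmitted_30d"]
    # Stratify so the yes/no readmission ratio is similar in both sets.
    return train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
```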
This ends the Data Preparation section of
the course, in which we applied the key concepts
to the case study.
Thanks for watching!
