Dr. Jack Van Horn: Good morning, everyone.
I'm wondering if we should
just get started here.
Thank you everybody for
clicking over from the GoToMeeting site.
We realized quite quickly
that the GoToMeeting site
wasn't going to be adequate
to hold everybody
who is coming in.
And I'll go over, and I'll make
an announcement on that site
here in just a moment.
But I wanted to not
delay anything further,
because we want to certainly
hear from Dr. Musen.
We want to be able to hear
from Dr. Bourne and others.
So we want to get
things underway.
Without further ado, I'm
curious if Michelle Dunn
would like to make any sort of
announcements before we begin.
Dr. Michelle Dunn: No, I think we're
ready to begin.
Dr. Jack Van Horn: Fantastic.
So I think we'll
just turn things over
to Dr. Mark Musen from
Stanford University, who's
going to be speaking
on the Introduction
to Big Data and the
Data Life Cycle.
If Dr. Bourne is available,
though, beforehand,
perhaps that's how
we want to begin.
Dr. Philip Bourne: Yeah, I'm here.
Thanks, Jack, let me just
say a couple of things.
So welcome to the first lecture
in the Big Data to Knowledge (BD2K)
Fundamentals of Data Science series.
It's clearly very exciting.
We've got a large
virtual turnout.
I guess I would say-- I'd have
to say good day, because I
don't know what
time zone you're in
and where you're coming from.
But thank you for joining us.
I think this is--
already, we see
this is going to be a very
successful initiative.
So I'm the Associate Director
of Data Science at the National
Institutes of Health.
And I coordinate the Big
Data to Knowledge Initiative.
We put together this program
of a series of lectures
that sort of collectively
cover data science.
And, clearly, there's a
lot of interest in this.
And they're going to occur
once a week for the next nine
months.
So, hopefully, if
you're on this,
I presume you have
a schedule already.
Before I introduce
Mark, I just wanted
to thank Jack and the
Training Coordinating Center
for taking the lead in
organizing this series.
And Michelle Dunn
in our own office
is stellar at putting these
things together, and getting
things going.
So it's a collaboration
between the Training
Coordinating Center, the
Centers Coordinating
Center, and the NIH.
So let me now quickly
introduce Mark.
I won't give him a
long introduction,
although he deserves one.
But he's the kind of guy who
will be happy with a short one.
And thank you to Mark
for setting the stage
for the whole series,
and also of course
being involved in the
planning of the series.
That's really much appreciated.
I've known Mark for a long time.
And he's done some
really stellar work.
And I'm not going to go
into all of the detail.
But let me just
quickly say, he's
currently a Professor of
Biomedical Informatics
at Stanford, as well as the
Director of the Stanford
Center for Biomedical
Informatics Research.
His research covers a lot of
areas, notably intelligent
systems, reusable ontologies,
metadata for data publication,
and biomedical decision support.
He's been recognized
in various ways,
including as the recipient
of the Donald Lindberg
Award for Innovation in Informatics
from the American Medical
Informatics Association,
and election to the American
College of Medical Informatics
and the Association
of American Physicians.
And he's a founding co-editor
in chief of the Journal
of Applied Ontology.
So, Mark, thank you for
agreeing to kick this off.
We're all very much
looking forward
to what you have to say.
Thanks, again.
Dr. Mark Musen: And thank you,
I'm really grateful.
This is rather an
awesome responsibility
to give the first
lecture in this series.
And I'm actually
very excited to do this,
because I think
the lectures that
will follow me will
be even, perhaps,
more interesting than
this general overview.
But I think it's very
important to sort of frame
what we'll be doing
over the next few weeks,
and to give you a general
introduction to data science,
and where we are heading, and
where the field is heading.
So as you know, this
is the first lecture
of a whole series
that's been put together
by the BD2K Training
Coordinating Center,
with help from the Centers
Coordinating Center,
with the NIH staff,
and with a little bit of help
from me.
And we're very,
very excited to be
able to give you a whole series
of lectures in data science
that will, obviously, be
very exciting this quarter,
and then will continue
through the rest of the year.
What I want to do
this morning is
to foreshadow a
lot of the things
that you'll hear about
from the other speakers,
and to give you a sense for
why there's so much excitement
about data science, and why
this field is so important.
And it, clearly, is important,
because the world is changing.
We think in terms of data
in ways which we probably
never did before.
We used to think of
data as sort of the side
effects of scientific research.
And now, we recognize that
data are first class entities
of themselves that also
require careful consideration,
careful thought,
careful planning.
And what I want
to do this morning
is to give you a sense for
what some of that planning
needs to be, in order
to get the most out
of the kinds of data that
are acquired as a consequence
of scientific investigation.
A lot of stuff is
floating around the web
about how the world is
changing because of data.
There was a recent
piece on PBS that
made claims such as,
all the data processing
we did in the last two years
is more than all the data
processing we did in
the last 3,000 years.
Or we are now being exposed
to as much information
in a single day as our 15th
century ancestors were exposed
to in their entire lifetime.
And every two days,
the human race
is now generating
as much data as were
generated from the dawn of
humanity through the year 2003.
These are, obviously,
breathtaking sentences.
Frankly, no one
quite knows where
the data are that support
these particular statements.
But it gives you a sense for
why the world is recognizing
that the onslaught
of information around us
clearly is changing the
way we live qualitatively.
And in biomedicine, we have
the onset
of the [INAUDIBLE] revolution,
which has led to more data
than anyone could ever imagine.
In clinical medicine, we had
the advent of electronic health
records, whose data
are now being mined
in all kinds of ways, and where
we recognize that we may not
always have the skills
that we need in order
to appreciate the
kinds of information
that are coming at
us in biomedicine.
The term big data, of course,
is what's on everyone's lips.
And at the same
time, it's a term
that is heavily overloaded.
It has a lot of
associations attached to it.
I'm intentionally not going
to try to define big data.
I think that's
kind of hard to do.
But it's very easy
to describe it
in terms of its characteristics.
And the three that
are often mentioned
are volume-- just a lot
of data; velocity--
the way in which data
come at us very quickly;
and variety-- the
heterogeneity in the data
that we're acquiring.
And to those, people
often add the idea
that there's a
question of veracity.
We get lots of data.
We don't always know the
truth value of the data
that we've received.
And volume is not what
I meant to say here.
But, yes, I will remember
the fifth V later.
This is just a slide from IBM
trying to sort of summarize
the way in which
they view big data.
And I think it's important
not to necessarily
look at all of the examples
that they give on their slide,
but to make it clear how much
the big data revolution is
affecting not only what
we do in science and also
what's happening in industry.
And we see that throughout
society, there's
this increasing desire to be
able to identify where data are
coming from, to deal with
data, and to be able to prepare
for the fact that, in the future,
the amount of data that we deal
with will be even greater.
And although it's easy
to think about big data
and emphasize the
word big, and think
about quantitative
differences, I
think what makes things
important in data science
is the recognition that
data-- because of the way they
are changing our world-- are
affecting things qualitatively.
Suddenly, we're in a situation
where our servers no longer
can store all the
data that we used
to store in a single place.
And we have to think
about new solutions, both
in terms of hardware and
the kinds of data systems
that that hardware supports.
We have to think
about new algorithms,
because we suddenly can't
process data in one sitting.
We suddenly have to recognize
that, because of the volume,
and velocity, and everything
else about the data,
we need to have algorithms
that can keep up with them.
And probably most important,
although not something
that gets a lot of press, is
that we're moving into an era
where we can't get our
arms around the data.
We can't look at the data.
We can't really
understand, intuitively,
what the data all mean.
And so there's a need to be
able to deal with the fact
that our cognitive
ability is strained,
because without
the ability to deal
with the visualization of
data, and to get an intuitive
understanding of
data, we often have
to go purely on the
basis of our algorithms,
which, obviously,
limits our ability
to develop intuition about
the work that we're doing.
This is a slide
that was generated
by Forrester Research
trying to identify
all the various
technologies that are required
to manage the data revolution.
I'm not going to go through
this slide in detail,
because, frankly, a lot of
the stuff that's on the slide
is going to be
handled in the talks
that you're going to be hearing
in this series coming up.
But I think what's
important to emphasize
from this slide is the fact
that Forrester believes
we've had significant success
in a lot of the infrastructure
needed to manage big data.
And what's interesting is that
all of these big successes
happen very rapidly.
And the time to reach
the next phase, whatever
that next phase might be
in each one of these areas,
is not that long.
And so as far as the
infrastructure is concerned,
Forrester would argue that
we're making progress.
And, certainly, when it
comes to biomedicine,
NIH recognized a couple
years ago the importance
of addressing the big
data challenges that
are facing us in biomedicine.
And as you know, the creation
of the Big Data To Knowledge
Initiative, and
this seminar series
which is part of
that initiative,
has been a major undertaking
on the part of the NIH
to be able to move biology and
medicine into the next era.
And so I think there's
a lot of excitement.
There's a lot of hype.
But there's also
a lot of progress
that's being made in this area.
And that's, obviously,
the important stuff
that we'll be talking
about in this seminar.
What I want to do this morning
is to begin with a story.
And my story, basically, is
that of a colleague of mine,
Purvesh Khatri, who is a
faculty member at Stanford,
and who describes himself
as a data parasite.
And I think it's
important to give you
a sense for the kind of
work that Purvesh does,
because I think it motivates a
lot of our concerns about data
and the data life
cycle, which I'll be
getting to later in the hour.
So Purvesh is mainly concerned
about functional genomics data.
He knows that out there are
all kinds of microarray data
that can be very useful in
understanding how genes get
turned on and turned
off in response
to various biological signals.
And he knows that
all of the data that
correspond to those
kinds of experiments that
are at least performed
in the United States
are available through the NCBI's
Gene Expression Omnibus, or GEO.
Go out to GEO, and in
this large database,
you will be able
to find information
about basically every
microarray experiment that's
ever been done.
And Purvesh goes to GEO
with the express desire
to understand what he can
learn from the data that
had been collected by
other investigators
and deposited in the
GEO data repository.
So Khatri, basically,
is building his career
around studying
other people's data.
And, basically,
what that means is
looking at the publicly
available data that
are out there on
the web in order
to understand what are
the different kinds
of experimental conditions that
investigators are considering,
what kind of real world
heterogeneity might enter
into their data
sets, and recognizing
that he can do all
this without having
to perform his own experiments.
So there's really no
experimental cost, at least
in terms of the initial
data collection.
He can look at human samples.
And he doesn't have to
worry about IRB approval,
because that's
already been done.
And what Khatri has done is
to create a pipeline that
allows him to search
for datasets in GEO,
to look for genomic
signals-- rather, to
[INAUDIBLE] in a test set--
and then to confirm those
signals in validation data
sets.
And what this
allows him to do is
to see if he can infer
information from existing
microarray studies that may
not have been considered
by the investigators
who did those studies
in the first place, but
where he can make discoveries
on the existing data
sets without ever
having to perform an
additional experiment.
And so the outline
of that flow of work
is shown on the right
hand side of the slide.
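To make the first step of such a workflow concrete, here is a minimal
sketch, assuming NCBI's public E-utilities interface, of how one might
search GEO programmatically for candidate datasets. The query string and
retmax value are illustrative assumptions, not Khatri's actual pipeline.

```python
import json
import urllib.parse
import urllib.request

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def search_geo(term, retmax=20):
    """Return GEO DataSets IDs matching a free-text query via ESearch."""
    params = urllib.parse.urlencode({
        "db": "gds",          # the GEO DataSets database
        "term": term,
        "retmax": retmax,
        "retmode": "json",
    })
    with urllib.request.urlopen(f"{EUTILS}/esearch.fcgi?{params}") as resp:
        result = json.load(resp)
    return result["esearchresult"]["idlist"]

# e.g., gather candidate expression datasets that mention sepsis
ids = search_geo("sepsis AND Homo sapiens[Organism] AND expression profiling")
print(f"Found {len(ids)} candidate GEO entries:", ids)
```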
And he's demonstrated the
value of this approach
in a wide range of areas.
One that's gotten a
lot of press lately
is the diagnosis of sepsis.
So sepsis is,
basically, a response
to infection in the
body that can be very
difficult to detect at first.
It can be confused with
lots of other common kinds
of abnormalities that occur
in hospitalized patients.
Usually, it's a response to
infection in the blood.
But there can be lots
of non-infectious causes
of inflammation that can look
a lot like sepsis as well.
And so coming up
with this diagnosis
is, obviously, critical, because
unless you make this diagnosis
and treat the underlying
infection, patients die.
And what Purvesh has
done is to go out
to GEO, the Gene
Expression Omnibus,
looking at functional genomics
data with a clinical question.
Are there changes
in gene expression
that can predict the
advent of sepsis?
And Purvesh has gone out
there, found some sample
data and some validation data,
looking at his initial samples.
He looked at nine
cohorts with 663 samples,
identified 82 candidate genes
that might be suggestive
of the onset of sepsis,
and, ultimately,
validated 11 genes which
were shown to be important,
both in his test
set and his validation set.
And when he looks
at those genes,
he shows really
phenomenal ROC curves
that show that these genes
are very, very sensitive,
and get turned on when
sepsis is imminent.
And he can show an effect,
where with the advent of sepsis,
with each passing day
those genes get turned on
to even greater degrees.
I'm not going to go into the
details of Purvesh's numbers.
But they're just
absolutely exciting,
because this is a way
of trying to identify
a new way of diagnosing
a medical condition
purely on the basis
of other people's data
stored in a public repository.
And Purvesh has
taken this approach,
and applied it to not only the
diagnosis of incipient sepsis,
but also the diagnosis
of tuberculosis,
and distinguishing active
TB from disease that
is burned out, which is
actually very hard to do
clinically, distinguishing
viral respiratory
infection from bacterial
respiratory infection--
another very important clinical
challenge that usually results
in a lot of over-medication
of patients with antibiotics that
can be avoided if one
knows that something is viral
from the beginning.
He's used this same
approach to identify
rejection of organ transplants
and other difficult diagnostic
problems in clinical medicine.
All of these diagnoses are
really tough to make clinically,
but can be addressed
easily by looking
at changes in gene expression.
And the most exciting thing
about this whole adventure
that Purvesh has been
undertaking is that he's never
touched a pipette.
He's never had to kill a mouse.
He's never actually had
to do the experiments that
lead to those data,
because the people who
have done the experiments have
been very cognizant of what's
needed in order
to make their data
not only valuable
for the experiments
that they are planning, but
also valuable for the future
of scientific work.
So you can be a data
parasite like Purvesh Khatri
only if the data that
you might want to find
are in some public
repository, are
findable through some
sort of a search facility,
are in some standard format
that you can parse and
compute with, and, most
important, if the data
themselves are self-describing
so that you know how to
make sense of the data.
They're not just a bunch
of columns of information;
rather, they actually
describe themselves
so that you can understand
what the columns mean
and make sense of those columns.
And that's the product of
careful planning, organization,
and stewardship.
And they're basically created in
a way where the data themselves
are intended to be reused
and to outlive the experiment
from which they were collected.
And I guess that's
the key thing,
because in the old
school of thought,
we might think of
data as being the side
effects of our experiment.
We publish our paper.
And the paper that we publish
is a result of our research.
And I think, currently,
the tide is shifting.
And we're thinking about
the data themselves
as being the output
of our investigation.
And we're thinking about data
as having a lifecycle that
starts with the idea
of planning what we're
going to do with the
data, collecting the data,
performing quality
assurance on the data,
describing those data so that
we understand what the data are
about, preserving
them in an archive,
making sure that it's
possible for other people
to discover those
data, to integrate
those data with other data
that they might have available,
to perform new kinds
of analyses that
might lead to new insights
that might cause us to do
new kinds of experiments, which
themselves will require us
to begin this life cycle,
to plan experiments,
to collect data, and to
perform QA, and so on.
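As a purely illustrative aside, the looping character of that life cycle
can be sketched in a few lines of Python; the stage names below simply
paraphrase the talk.

```python
from itertools import cycle

LIFE_CYCLE_STAGES = [
    "plan data management",
    "collect data",
    "perform quality assurance",
    "describe (add metadata)",
    "preserve in an archive",
    "discover and integrate",
    "analyze and gain insight",  # ...which motivates the next experiment
]

stages = cycle(LIFE_CYCLE_STAGES)  # the cycle repeats; it never "finishes"
for _ in range(9):
    print(next(stages))
```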
We've moved into a
world, basically,
where the experiment begins
not with the definition
of the protocol that we're
going to follow, as much
as the definition of the
plan for data management
that will be necessary in
order to perform the protocol,
and make sure that
the data themselves
outlive the experiment
that we're planning.
And so what I want
to do is allow
us to follow this
data life cycle
and get a sense of what these
various steps are, emphasizing
that each one of
these steps is going
to be the subject
of a talk you'll
hear about later in
this colloquium series.
Let me start by saying
that one begins with plans.
And the data management plans
that guide our work, basically,
are required now by almost
every funding agency, because
of executive orders.
At the NIH, if you
have a project that
involves more than $500,000
of direct costs every year,
you must have a data
management plan.
If you deal with genomic
or sequence data,
then you must have a
data management plan.
And, frankly, if
you're smart, you're
going to have a data management
plan, because the data
management plan saves lots of
time and effort down the road
by making sure we know what
we're going to do with the data
when we get them.
Most important, they
enhance the durability
and the long-term
value of the data
so that investigators
like Purvesh
can find these data
later, and identify
even new kinds of inferences
that can come from those data.
As I said, they're
required by sponsors.
Unfortunately, although the
data plans are required,
they're almost never
reviewed as part
of peer review, at
least at this point.
That may change.
But right now,
the requirement is
that the data plan exists--
not necessarily that it satisfy
the criteria of peer review.
But it's there.
And, frankly, in
the best projects,
the data plan is not
created once at the start,
but rather is revisited and
updated throughout the project
so that we can make
sure that we're dealing
with data appropriately.
That data plan requires
us to be thinking about,
what is the project that
we're actually doing?
What are the existing data
that may already be available
that we want to be able to
consider in our experiments?
What are the new data that
we're going to be generating?
What are the methods
and the instruments
for the data collection?
If we collect the data,
how do we organize them?
What is the data storage format?
What are the structures?
What is the mechanism by
which we store the data?
What permissions
do we need to allow
people to access the data?
Are we going to back up the data?
Hopefully.
How often will we
back up, and how?
What kind of data scrubbing,
or quality assurance,
do we anticipate?
What are the administrative,
legal, and ethical concerns?
Are we dealing with
human subjects data?
How are we going to
protect human subjects?
How do we archive
and publish our data?
What are the
responsibilities and duties
of the various project
members who are going
to be contributing to the data?
And what are the
costs and resources
that we need to do
this experiment?
Lots of stuff-- it's
a little bit daunting.
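To make that list a little less daunting, here is a hypothetical sketch of
what a machine-readable data management plan skeleton covering those
questions might look like; every field name and value is illustrative, not
any funder's required schema.

```python
# A minimal, hypothetical DMP skeleton; fields mirror the questions above.
data_management_plan = {
    "project": "Hypothetical gene expression study",
    "existing_data": ["public GEO series to be reused (accessions TBD)"],
    "new_data": {"type": "expression matrix", "format": "TSV", "est_volume_gb": 50},
    "collection": {"instrument": "microarray scanner (model recorded per run)"},
    "storage": {"location": "institutional server", "permissions": "lab group"},
    "backup": {"frequency": "nightly", "method": "offsite snapshot"},
    "quality_assurance": ["range checks", "duplicate detection"],
    "ethics": {"human_subjects": True, "irb_protocol": "TBD"},
    "archiving": {"repository": "GEO", "license": "public domain"},
    "roles": {"data_steward": "TBD", "depositor": "TBD"},
    "costs": {"storage_usd_per_year": 1200},
}

# Revisit and update the plan as the project evolves, as the talk recommends.
data_management_plan["backup"]["frequency"] = "hourly for active datasets"
```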
The good news is that there
are open source tools that
are readily available.
There's one called the DMPTool
that almost all research
universities seem to be using.
And there are lots of
easy ways for, basically,
using templates to create
data management plans,
and to make sure that they're
available with the proposals
that we submit, even before we
begin to do experimentation.
And at the same time, because
these are available and easily
editable, they allow
us to make sure
that our data
management plans evolve,
as we recognize
additional complications
as experiments get performed.
So then we collect the data.
And collecting the
data can be done
in a gazillion different
ways, because, frankly, there
are a gazillion kinds of data.
And so our data management
plan provides the guidance
that allows us to understand
how to deal with the data
that we're collecting.
Are those data
going to be coming
from laboratory observations?
Are they going to be coming
directly from instruments?
Are they coming from
surveys that we may
administer to human subjects?
Are they coming from
continuous electronic signals?
Are they coming from
discrete values,
such as microarray chips?
All kinds of data
are used in biomedicine.
And we have to understand
how to get them in one place
so that we can begin
to analyze them.
Data may also not be coming
from de novo experimentation.
But as I showed in the
case of the kind of work
that Purvesh Khatri does,
they may come from
online repositories.
So our collection
of data may not
come from direct
experimentation,
but from gathering data from
the publicly available data
sets that other investigators
have put online.
And sometimes, if we're
dealing with meta analyses
or systematic reviews,
we may be getting data
not from actual data sets,
but from other publications.
And it's a matter of amassing
the publications themselves
in order to perform our work.
Now, on September
30th, we're going
to have a talk on data curation
by Pascale Gaudet.
On November 4th, we're going
to hear about data warehousing
from Chaitan Baru.
So be prepared for
all kinds of talks
coming up that will deal
with these kinds of aspects
of the data lifecycle.
Quality assurance is sort of
the next step in that lifecycle.
Sometimes, people call
it data scrubbing.
And it is actually a
science in and of itself.
[INAUDIBLE] will give a talk
on this on December 2nd.
We'll talk about how to double
check the values of data that
are edited by hand, how to
identify data that may have
quality problems that
need to get flagged,
how we can use statistics to
understand where we might have
data outliers, how we can
identify where data might be
impossible, how we can
identify missing data,
and maybe correct the
values for missing data
or impute replacements, how
to deal with those outliers.
Basically, there's a whole
science, if you will,
to dealing with the fact
that our data are usually
full of errors.
And before we actually make
data publicly available
and apply our own investigative
techniques for those data,
there are things that
we need to do in order
to make sure that the data are
going to meet our expectations,
and be reliable for our own use.
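As a small illustration of the kinds of checks just described, here is a
hedged sketch using pandas on a made-up table; the column names and
plausibility thresholds are assumptions for demonstration only.

```python
import pandas as pd

df = pd.DataFrame({
    "subject_id": [1, 2, 3, 4],
    "age_years":  [34, 29, -5, 41],      # -5 is an impossible value
    "heart_rate": [72, None, 180, 68],   # None is missing; 180 is implausible at rest
})

# Flag values that are impossible on their face
impossible_age = df["age_years"] < 0

# Flag values outside a plausible range (thresholds are illustrative)
hr = df["heart_rate"]
out_of_range = (hr < 40) | (hr > 150)

# Identify missing values and impute a replacement (median, as one simple choice)
was_missing = hr.isna()
df["heart_rate"] = hr.fillna(hr.median())

print(df.assign(impossible_age=impossible_age,
                hr_out_of_range=out_of_range,
                hr_was_missing=was_missing))
```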
So we plan our data overall,
our data management overall.
We collect the data.
We perform quality
assurance and scrubbing.
And then, we're in
a situation where
we need to describe the data.
So we can't just have data
and stick them in a drawer.
We have to recognize
that we need metadata.
And as Jason Scott
says, metadata
is a love note to the future.
It's our ability to describe
data so that down the road,
we can figure out what we did.
And we can make sure
that other people who
may want to think about data,
or think about our data,
know what is in
the data, and how
to make sense of those data.
So metadata describe lots
and lots of different things.
They describe the
digital content.
So what are the
actual properties
of the files themselves
that store the data?
They provide a
mechanism for clarifying
who are the investigators?
And who are the stakeholders
who care about this data?
What is the scientific context?
So, basically, why was
the experiment done?
What were the data that
were actually collected?
What were the methods
that were applied?
What instruments were used to
acquire the data-- not only
the kind of instruments,
but actually
the physical instruments?
What was the serial number
of the instrument used?
When and where were
the data collected?
And, basically, what are
the parameters of the data?
Lots of stuff.
I should mention that this
list of metadata attributes
comes from the DataONE primer
on data management, which
I think is great.
I'll give that as a reference
at the end of this talk,
and also let you know
that on October 21st,
Susanna Sansone
from Oxford is going
to give a talk on metadata
and metadata standards.
And I think that talk will
really be very, very helpful.
Now, I mentioned
that not only do
we want to understand the
experiment in the metadata;
we also want to understand
the parameters of the data.
That may seem pretty trivial.
But, in fact, there's
lots of evidence
that shows that when
we don't do that,
we get into really big trouble.
We certainly have good examples
of the trouble you can have
making sense of available data.
I think one of
the best anecdotes
is what happened in 1998
when we sent the Mars Climate
Orbiter to Mars.
It's a great anecdote.
It's actually an
unfortunate anecdote,
because it really was a
$328 million disaster.
As those of you who may
remember this misadventure know,
the manufacturers
of the satellite
did their work correctly, and
followed the specifications,
and managed all their data
using the metric system.
But when NASA sent data up
to the Mars Climate Orbiter,
those data were sent
in the English system.
And suddenly getting
data in foot pounds
was something that really
threw off the satellite.
And it crashed into
the Martian surface.
And the idea that NASA
would not recognize
that there needed to be unit
conversion was, obviously,
rather embarrassing,
and probably provides
one of the best examples of
why our metadata not only
need to describe what
experiments we did,
but also need to provide the
data-specific information that
allows us to
understand how our data are
actually represented, and
what those various units are.
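A tiny sketch of the unit discipline whose absence doomed the orbiter:
carry units alongside values and normalize explicitly, rather than assuming
them. The reading below is made up; only the pound-force-second to
newton-second conversion factor is real.

```python
LBF_S_TO_N_S = 4.44822  # pound-force seconds to newton-seconds

def impulse_in_si(value, units):
    """Normalize an impulse reading to newton-seconds, or fail loudly."""
    if units == "N*s":
        return value
    if units == "lbf*s":
        return value * LBF_S_TO_N_S
    raise ValueError(f"unknown units: {units!r}")

# Ground software reporting in English units, flight software expecting SI:
reading = {"value": 10.0, "units": "lbf*s"}
print(impulse_in_si(reading["value"], reading["units"]))  # 44.4822 N*s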
So the metadata
about the parameters
will include how each parameter
is measured or produced,
what its units of measure
are, the formats used
in the dataset, the
precision, accuracy,
and uncertainty with which the
data values are represented,
definitions of any
codes that are used,
what kinds of quality assurance
was performed on the data,
known problems that
can limit the data use,
and how to cite the dataset.
These are real things,
of course, that we
would like our metadata to say.
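For illustration, a single parameter described to roughly this level of
detail might look like the hypothetical record below; the field names
follow the checklist above but are not any repository's official schema.

```python
# A hypothetical, minimal metadata record for one measured parameter.
parameter_metadata = {
    "parameter": "heart_rate",
    "how_measured": "bedside monitor, averaged over 60 seconds",
    "units": "beats per minute",
    "format": "integer",
    "precision": 1,
    "uncertainty": "plus or minus 2 bpm (per device specification)",
    "codes": {"-999": "value not recorded"},
    "quality_assurance": "range-checked (40-250 bpm) at ingest",
    "known_problems": "motion artifact possible during transport",
    "how_to_cite": "doi:10.0000/example-dataset (placeholder)",
}
```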
And I guess one thing that I
should be making very clear
is that we often do--
actually, we usually
don't have metadata that
describe information
at anything like this level of detail.
And, of course, one of the goals
of the research in data science
is to make it easier
and better for us
to be able to make sure
that our data are described
at the right level of detail.
So if we go to the Gene
Expression Omnibus--
the place where we
store all of the microarray data
that, for example, Purvesh
Khatri accesses-- one
of the things that
is very clear is
that there is a
template for how data in GEO
need to be represented
in the database.
And this isn't just any old
template created by NCBI.
Basically, the framework
for representing
gene expression data in GEO
stems from a description that
is now almost 16 years old
called the Minimum Information
About a Microarray
Experiment, or MIAME.
And MIAME basically
represents the work
of a community of investigators
who got together and recognized
that people were storing
microarray data online.
But they were doing it in
a rather ad hoc fashion.
And it was often impossible
to look at the online data,
and understand how to make
sense of the data that
were stored there.
And the community
said, look, we're
not waiting for some
publisher, or some sponsor,
to force this on us.
We want to get
together and come up
with a minimal set of
information components
that need to be
described in order for us
to understand what someone else
may have done with the gene
microarray.
And that means,
basically, understanding
what the raw data are, how
the processing was done,
how the sample was
annotated, and all the kinds
of features that are
critical in order to make
sense of these kinds of data.
And what's really very
exciting is not only
has the microarray community
created this kind of standard,
but lots of other groups have.
If you go to biosharing.org,
which Susanna Sansone will
be talking about
on October 21st,
you can find the MIAME
description online,
and a way of getting a good
reference for what MIAME looks
like, and knowing
how to use MIAME
to describe microarray information.
But what's great is that
biosharing has information
on hundreds of different kinds
of biomedical experiments.
And so going to
biosharing will allow
you to find out information
about how to represent metadata
in a whole variety of
community-blessed approaches.
And it's really very
exciting that, I think,
the biomedical community
is getting together
and articulating these
kinds of frameworks.
And you'll hear a lot
more about this in October
when Susanna gives her talk.
And just to give a little
bit of a plug for my own work,
I should say that in
the CEDAR project
that Phil mentioned,
what we're doing in CEDAR
is developing
computational methods
for taking these kinds of
standardized metadata templates
that are already being
promulgated by the community,
and putting them in a
computational form that
makes it really easy, we hope,
for investigators to fill out
these checklists and
templates, and to create
the kinds of robust and
complete metadata that we think
are going to be really, really
important in order to ensure
that the information that is
stored in public repositories
provides enough description so
that people can find what they
need and put it to use.
Now, you'll notice when we look
at metadata templates like this
cell line template
from the LINCS project
that's in CEDAR, there are lots
of blanks to be filled in.
And one nominally could
fill out these blanks
just by typing in
text of any kind.
But one of the
things that, I think,
is really important
for creating metadata,
and something that the data
science community recognizes
increasingly, is that
what we really want to do
is fill in those
blanks with terms
from standardized
terminologies or ontologies.
I don't have time
this morning to talk
about ontologies in any detail.
I know that on October
7, Michel Dumontier
is going to give a whole
talk on ontologies.
On this slide, we show
you a little snippet
from the National Cancer
Institute Thesaurus,
one ontology that's
become very important
for cancer biology.
And all these
hundreds of ontologies
that are available
in biomedicine
provide controlled
terms that make
it possible for investigators to
fill in blanks using terms that
are precise, that have
agreed upon meanings,
and basically will be
recognized for those meanings
when people process the kinds
of metadata descriptions
that we need in order to
access metadata online.
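A hedged sketch of the difference this makes: compare a free-text metadata
field with one backed by an ontology term. The NCIT identifier shown is a
placeholder, not the real code; in practice one would look the term up in a
service such as BioPortal.

```python
# Free text is ambiguous and hard to compute over:
free_text_field = {"disease": "sepsis"}

# An ontology-backed field carries a precise, agreed-upon meaning.
ontology_backed_field = {
    "disease": {
        "label": "Sepsis",
        "ontology": "NCIT",  # National Cancer Institute Thesaurus
        "iri": "http://purl.obolibrary.org/obo/NCIT_C0000000",  # placeholder ID
    }
}

print(free_text_field)
print(ontology_backed_field)
```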
So we've planned
out data management.
We've collected our data.
We've performed
quality assurance.
We've described our data using
templates and ontologies.
Now, we have to
preserve the data.
We have to put the data on
our own server, so to speak,
in order to be able to use
those data for our own analyses.
And, ultimately, we want to
be able to preserve those data
for the future so that
others can access our data,
make sense of those data, and
do the kinds of experiments
that Khatri does.
You would think putting data
on a disk is not all that hard.
You would think that knowing
how to put data in a form
where people could access it
in a few days, in a few months,
or a few years is
not that difficult.
But, actually, it
requires a lot of planning
and a lot of thought.
And perhaps we can go back
to the NASA example,
since it sometimes is
fun to pick on NASA.
But, certainly,
as someone who was
myself very captivated
by the space program,
and got so excited by what
NASA was doing in the 1960s
and 1970s, one of the
more famous stories
to come out of that
era concerns the fact
that much of what came
out of the Apollo program
was captured in a
variety of data formats,
ranging from numerical
data sets to video.
Probably the video that
we all remember the most
is the video of Neil
Armstrong when he first
set foot on the moon.
Many of us watched these videos
with just absolute amazement
to think that people were
standing on the lunar surface.
What was interesting is that
around 2006, there was suddenly
a recognition that when you look
at the still photos of people
in Mission Control watching
the astronauts on the lunar
surface, at least in
the Apollo 11 mission,
the videos weren't
nearly this grainy
and weren't nearly this bad.
And, in fact, when you
look at the stills,
you see that in real
time, mission control
was seeing phenomenal
photos that
were taken from the video
available from the lunar
surface, whereas those
of us watching on TV
saw these kinds
of blurry images.
And there was a great revelation
that took place in Australia,
where the original
signals from the moon
were collected in the
Apollo 11 mission.
And this article, which appeared
in the Sydney Morning Herald
on October 7, 2006,
describes what
was presented as "one
giant blunder for mankind,"
because, basically,
what happened in 1969
was that the really high
density video image that
came from the lunar
surface was stored
in a particular format that
got saved on magnetic tape.
What the rest of us saw, because
that particular format was not
something that could be
broadcast internationally,
was a monitor that was
displaying the high quality
image coming from the
moon that was being filmed
by a standard video camera
so that the signal could
be sent out in NTSC formats
to the rest of the broadcast
world.
And it is actually
a video of that NTSC broadcast
that we have preserved.
And the original
high quality video
of Neil Armstrong
stepping on to the moon
was stored on a videotape.
That sounds great until you
hear that sometime in the 1970s,
NASA realized that it
needed some more video tape,
and recorded over the
original videotape that
had the high quality
transmission stored on it.
And it's sort of
a really sad story
of failing to have a
data management plan,
failing to recognize that
this video might have
lasting value to humanity,
and that an expedient need
to have a new tape to
copy some new data on led
to the loss of something which
I think most people would view
as extremely important.
And so it's sort of a
way of making us all
remember that
media are critical,
that the media that we use
today to store our data
may not be the media that
we're using in 10 or 15 years.
For all of us who've been
around long enough to remember
what kind of data storage
we had in the 1970s and 1980s,
that's pretty clear.
And there's also the issue of
just what people have called
bit rot-- that over
time, sometimes
the way in which we
store data even digitally
leads to loss of information.
And we have to recognize
that we need copies.
And we need backups.
And the good news is that on
September 30th Pascale Gaudet
is going to be talking
about data curation.
And she'll talk a lot more about
some of the physical problems
of storing data, and
how to avoid some
of these really ugly things.
So the life cycle involves planning,
collection, QA, description,
preservation.
And then, ultimately,
our goal is to discover.
We want to be able to find
the data that we need--
not only for our own
experiments, but also,
as we've discussed already,
for other investigators
to find our data, and
discover new kinds of things
from those data.
And so the good news
is, lots of work
is taking place in the
biomedical community
in this area.
We are no longer stuck in
the era of card catalogs.
We have all kinds of technology,
like the DataMed system coming
out of the bioCADDIE community.
And so the good news is in this
seminar series on September 16,
Bill Hersh will talk
about general problems
of information retrieval.
And on September 23rd,
Lucila Ohno-Machado from UCSD,
I'm sure will talk about
DataMed, the tool coming out
of bioCADDIE, which
allows us to access data
in a variety of repositories
in a generic way,
and to search for data so that
we can access the data we need,
and put data together
in ways which, I think,
will lead to new kinds
of experimental insights.
And so the data life cycle
continues to circulate.
We want to plan.
We want to collect.
We want to describe, and
preserve, and discover,
and integrate.
And, frankly, we're
in a stage where
we are recognizing that this
life cycle just circulates
over, and over, and over again.
As data become
available online, we
can experiment with
those data-- and, again,
without even picking
up a pipette,
we can learn new things
about biomedicine.
And I think the buzzword that
is now permeating the community
that you'll hear a lot
about in upcoming talks
is this notion of making data
FAIR-- the idea that data
should be findable,
accessible,
interoperable, and reusable.
These terms came
out of a workshop
that took place in the
Netherlands a couple years ago.
And now they're being adopted
widely by the ELIXIR
community in Europe and the BD2K
community in the United States.
And this notion of
FAIR data means
it's possible to find data
that may have been created
by other investigators;
the data are
accessible because
they're stored in formats
that we understand,
that we can parse,
and that we can
interpret; and they're
interoperable because we can
relate our data to the data
that we have available
online, and make sense
of a variety of data sets.
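As one hedged illustration of the "findable" part, dataset metadata can be
published as schema.org/Dataset JSON-LD so that generic search tools can
index it; all of the values below are placeholders.

```python
import json

dataset_landing_page = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Hypothetical sepsis gene expression compendium",
    "description": "Pooled public microarray cohorts (illustrative example).",
    "identifier": "https://doi.org/10.0000/example",   # placeholder DOI
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/tab-separated-values",
        "contentUrl": "https://example.org/data.tsv",  # placeholder URL
    },
}
print(json.dumps(dataset_landing_page, indent=2))
```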
And we use those data to create
new insights, to identify,
as Khatri did, a genomic
signature for sepsis,
or a genomic signature
for active TB.
These are the kinds of
experiments and the kinds
of insights that are going
to come from the data
that we have online, and
frankly are going to increase
the effectiveness of the
biomedical enterprise
by orders of magnitude,
because not only will we
have the insights that come
from the initial experiments,
but we're also going to be able to
have tons of investigators who
will be looking at
existing data sets
and coming up with insights that
were just impossible to make
at the time that the data
were originally collected,
either because the original
investigators never thought
of it, or because in
retrospect we understand more
about the world than we may
have understood previously,
allowing us to think
about new kinds of analyses
we can perform, and new
kinds of investigations.
That's why now,
40, 50 years later,
NASA is working really
hard to figure out
how it can try to recover
many of the datasets that
were created during
the 1960s, and stored
on media that can
no longer be read
because the kinds of machines
that read them no longer exist,
because now that we understand
all that we've learned
over the past 40 years, we
want to go back to those data
and reinterpret them.
And that's exactly what's going
to be happening in biomedicine,
as we try to go to
old datasets that
are available in
the repositories
that we manage throughout
the biomedical enterprise,
and start interpreting
those data anew,
in terms of all the things that
we'll learn in the interim.
That's the very exciting part.
And that's one of the
reasons that having fair data
is so important to us.
There's also, if you will,
a dark side that makes
FAIR data just as important.
And that is the
increasing mistrust
that people have had in science.
People may remember the
famous Economist front-page
article that appeared about two
or three years ago, discussing
the problems that
scientists have
in being able to reproduce--
Was that a question?
OK, relating how Amgen tried
really hard to reproduce
the findings in 53 landmark
papers in cancer biology,
and claimed that they could only
do so in about six of them.
Bayer tried to reproduce
67 preclinical studies, and said
they were successful in
validating the results
in only 25%. And the
notion is that many
of these landmark studies
that ultimately may have been
difficult to reproduce spawned
entire fields of investigation,
with no one bothering to
confirm the initial results,
but taking them for
granted, and trying
to do new experiments that,
basically and allegedly, would
build on those
irreproducible results.
And, frankly, as
many of us know,
it is the non-reproducible
studies that
are more likely to get
published in the journals
that have the highest
impact factors,
because often those
non-reproducible studies,
frankly, have the
most exciting results,
even though they may
often be difficult
to reproduce.
And, of course, this
is a problem that's
gotten a lot of hand-wringing.
Lots of people talk about it.
It has lots of causes.
And some of them
are pretty trivial--
sometimes just not enough
statistical power in the study.
Sometimes, there's an art form
in conducting the experiments.
And the metadata
themselves don't
capture all that information.
There's often an eagerness to
publish early, or frankly
prematurely.
Rarely, there's fraud.
But, really, I think
for many of us,
we have a problem where we have
a system which traditionally
has not made it
easy or rewarding
for investigators to
replicate the results
of other researchers.
And we have a culture that
is only now recognizing
the importance of data
preservation, description,
and re-exploration.
But as we pay more attention
to the data lifecycle,
as we think about how we
can use other people's data,
how we can learn from
other people's data,
why it's important to
re-validate other people's
data, I honestly believe that
the world is going to change.
And at a minimum, we're
going to see science moving
to a situation where we have
even more open, online access
to experimental
datasets, where there's
an emphasis on annotation
of online datasets
with adequate metadata,
that not only will we
try to have the minimal
information about data
available online.
But I think,
ultimately, we're going
to see so much high
quality metadata online
that the online data may well
be just as useful as a journal
publication in understanding
what scientists have done,
and how to make
sense of their work.
There's increasing emphasis
on the use of ontologies
to be able to annotate
metadata in a reproducible way.
And, ultimately, as
systems like DataMed
lead to better search
for experimental results,
we're going to be able to
find the data that we need,
and analyze them better.
Fundamentally, what
we're moving toward
is a world where
there's much more
concern about data stewardship.
And we recognize that
the life cycle is really
where we need to be
focusing our attention.
So, ultimately, we're
thinking about a revolution
in the way we talk about
science and we talk about data.
We're going to be
thinking about a world
where, invariably,
data are going
to be coming at us with
increasing volume, velocity,
and variety.
And as a consequence, the kinds
of experiments that we do,
and the way that
we manage our data,
are going to require
lots of new computational
approaches-- new approaches
that people are working on
right now, and that you'll learn
about in this seminar series.
We're recognizing that, because
data need to be validated,
and re-interpreted, and
re-explored, investigators will
no longer think of data
as the side effect,
or the end product, that
allows them to come up
with the real result
of the research
with a scientific paper,
but rather to treat
the data as the actual
purpose of the work,
and the data as
the end point that
is complemented by
the scientific paper,
but that also is
going to provide
the basis for future
experimentation
by a whole cadre of
investigators, who may not even
have been born yet, and with
the recognition that data should
be FAIR: findable, accessible,
interoperable, and reusable.
Investigators will
begin to need to plan
for data management in much
more comprehensive ways
before they even begin
their experiments,
during their experiments
as conditions change,
and even after
their experiments,
as they recognize that
there may be better ways
to make the data more
publicly available so
that other investigators
can take advantage
of the kinds of data
that are being produced
in the biomedical enterprise.
So there's a life cycle
that I've mentioned.
And if you think
about the lifecycle,
not only does this
provide a good framework
for thinking about how
data are collected,
and managed, and
used, and discovered,
and used to form the basis
of new experimentation.
But it basically captures,
I think,
all of the main high
points of data science
that you'll hear about
in this lecture series.
And so the most important
thing I can leave you with
is that coming up are going to
be some really fantastic talks
by some really great people.
And I hope that
what I've said today
will inspire you to
tune in every week
and hear about what's going
on on the cutting edge of data
science and biomedicine.
And let me leave you
with a couple references
that I found really
useful in thinking about some
of the things I've talked
about this morning,
and hopefully
you'll find valuable
as you think about
how you're going
to be taking advantage of
data management planning,
thinking about the
data lifecycle,
and putting it to
use in your own work.
Let me stop there.
And let's see if we have some
time for some questions, perhaps.
Michelle Dunn: Thank you, Mark.
This is wonderful.
You gave a great overview of
what we will see coming up.
And I know I, for one, am
really looking forward to this.
I think if there
are any questions,
we need to have
them through the chat,
because everyone logged
on is in silent mode.
So if you do have a
question, please just
write it in the chat box on the
right hand side of the Control
panel, or on your control
panel, wherever it is.
And Dr. Musen might have time
to take a couple of questions.
So there was one that came up
earlier under the question box.
And it's about fair data
and interoperability.
And it says, given the
challenges with legacy data
interoperability-- for example,
electronic health record
limitations, as far
as exchanging data--
will this lecture or
future lecture discuss,
or can you discuss,
how to reconcile
the need for open fair data, and
the underlying interoperability
challenges?
Mark, are you still there?
Mark Musen: Did I get muted?
Michelle Dunn: I think you did get
muted for a bit.
Mark Musen: Am I back?
Michelle Dunn: You are back now, though, Mark.
Mark Musen: OK, I won't take it personally.
That was a great question.
And it's more than I can
answer in a couple minutes.
But I think there are
problems that you point out,
because the way that we do
experiments changes over time.
Sometimes, we talk about
microarray data being re-examined.
Obviously, the kinds
of chips that people
use change over time, and being
able to go back and make sense
of some of the very
oldest technology
is always challenging.
A lot of the challenges
occur because the way we
think about the world changes.
So I mentioned ontologies
as being an important way
to allow for interoperability
among heterogeneous data sets.
But our ontologies
themselves change over time.
We know, for example,
that data that
were analyzed with old
versions of the gene ontology
will give results that
are different from data
that are analyzed with new
versions of the gene ontology.
And so I don't
want to say that we
have a panacea for dealing
with change over time
in these experimental data sets.
But we do have real opportunity
for really exciting research
in how to reconcile ontologies
that change over time, how
to ensure that data have
sufficient annotation that we
can go back and reconstruct what
are all the assumptions that
may be relevant in trying
to integrate data that
were created at
different points in time
under different circumstances,
and recognizing that it's not
just a matter of mixing
and matching data sets.
But there really is
a science to being
able to enumerate what are
the assumptions behind data,
and to be able to reconcile
those assumptions as data
get integrated.
Michelle Dunn: Thanks.
There are quite a
few other questions
under the question box.
One of them that I
see on the very bottom
is, do you think
that this data life
cycle applies to
all kinds of data,
other than biomedical data?
Mark Musen: Yes, and I can say that the
first reference that you
see on the screen now is one
from the DataONE community,
which is concerned
with earth sciences.
And I think they represent a
different kind of community,
but have exactly the
same kinds of data needs,
and recognize exactly
the same problems.
I think the issues we're talking
about relate to all of science.
And I think they
also relate to data
that are not necessarily
created through
controlled experimentation.
I think they relate to
observational studies,
as much as they do
interventional studies.
I think they relate
to signal data that
are acquired opportunistically.
And I think to a
certain degree, they
relate to the kind of
data we get from EHRs.
So I think, in a sense, all
of these issues are relevant.
And I think we're going to be
seeing a persistence of data
in all these forms and
re-exploration of those data
from all those various
sources in the future.
Michelle Dunn: OK, I think we have time
for maybe one more question.
There is one here
about data sharing.
And it says that there
has been much progress
in sharing molecular data.
But there are challenges with
clinical data, particularly
challenges around privacy,
consent, and other issues,
and that investigators are
hesitant to share the data.
Do you foresee directions for
tackling these challenges?
Mark Musen: Well, I think one of the biggest
challenges in data sharing
is the WIIFM problem.
That's not a radio station.
It's, what's in it for me?
And I think investigators need
to be able to understand that
by making their data public,
not only will they learn more
because other people
may have insights
that they can contribute back.
But also, I think it's part
of the scientific enterprise.
And I think one of the
things that is changing
is a recognition that science
doesn't stop with publication
of a bunch of prose, but with
the opportunity of others
and the community at large to
be able to look at data sets.
That said, not only will people,
I hope, act altruistically,
but, clearly,
funders are also thinking
about doing what they can
to ensure that the data are
made available.
And the basic
sentiment is that data
that are collected as a consequence
of public investment
in research belong
to the public,
and therefore need
to be made available.
That is, obviously, going to
be different in the case of EHR
data, which have
a different value.
But at the same time,
I think patients
who recognize that the learning
health system as a model
is going to be making
medical care better
and more precise in the future,
also recognize that they may
want to make their
data available,
even if their providers may
not, because those data will
ensure that the care of future
patients may be more refined.
And so I think a
lot of this depends
on cultural change, and the
recognition among investigators
that data sharing is
important, and also
when it comes to clinical
data, recognition
on the part of patients
that they can help humanity
by making their data
available as well.
Michelle Dunn: Thank you so much, Mark.
I think if Jack would like
to say the last few words,
I am not sure they are able to--
I think their audio cut out.
But unless they want to jump
on, I think we should just say
thank you to our
speaker, which I
think everyone is giving
you a silent applause right
now, Mark.
But then also, thank you to
everyone who joined in.
Sorry about the slow
start this morning.
We will be sending out,
again, the new URL.
And I particularly want to thank
the Training Coordinating Center
for, at the last
minute, switching over
to a service that will
accommodate all of us.
So thank you very much.
And I hope that you all tune
in in the upcoming weeks,
because we have a great
program ahead of us.
Bye now.
