- Hi, good afternoon everyone.
Thanks for coming.
I'm excited about having
today's speaker here today.
John Ioannidis.
This is his second of three talks,
so he's doing the Michigan
Triple crown here.
So yesterday, he talked to us a little bit
about the world of bias.
Today he's gonna talk
about precision health,
and then tomorrow,
as part of the Department of
Internal Medicine Grand Rounds,
he's gonna be wrapping up that series
as well talking about reproducible
and useful clinical research.
So if you have the ability,
try to catch all three of the talks.
They're sure to be really remarkable.
I just wanted to start really
with just a brief introduction.
You could go on really all
afternoon talking about John.
He's the C.F. Rehnborg Chair
in Disease Prevention Department
at Stanford University,
where he's also Professor of Medicine
and Professor of Biomedical Data Science
and a Professor of Statistics,
and he also co-directs METRICS,
which is their institute on meta-research,
the study of how we
should be doing studies,
and John has had an interesting
kind of career and path.
He was born in New York City
but then grew up in Athens,
and he's actually benefited
from experiences
on both sides
of the Atlantic.
But currently he's at Stanford University,
and it goes really without saying
that he's probably one
of the most original
and influential physician-scientists
of his generation.
I could go on and on with
his specific awards,
but I'm gonna share a particular story
which I've found really compelling.
If you do know John, you know him
through his magnum opus, right?
His paper "Why Most Published
Research Findings Are False",
a paper published in
2005 in PLOS Medicine,
a paper that's been cited nearly
7,000 times at this point.
What's really fascinating about that story
is when you ask John well
how did you come up with that idea,
how did you write that paper?
It's a single-authored paper.
What I've learned over the last couple
of days is he did it,
as most of us would do,
on vacation on a Greek island.
He wrote that paper on Sikinos,
which is a tiny island off the
southeastern part of Greece.
It's very close to other islands
that some of you might
know, Santorini, etc.
As opposed to those kinds of
glamorous vacation
destinations, this is an island
that has about 250 people in it.
I just picture John Ioannidis
in 2005 over a period of 48 hours
typing out what has become
probably one of the most
seminal works in our scientific
and meta-research fields.
It's a real honor to have you here, John,
and we look forward to hearing your talk.
(audience applauding)
- Thank you.
Thank you for the very kind invitation
and the wonderful introduction.
I will try to share some
thoughts on precision health,
big data, evidence-based medicine,
these are all terms that
have very strong friends
and very strong skeptics also,
probably among the audience and beyond.
What is evidence based medicine?
We have to go back to the original coining
of the term by David Eddy, who in 1990 said
"Consciously anchoring a
policy, not to current practices
"or the beliefs of experts,
but to experimental evidence."
And this means that the pertinent evidence
must be identified,
described, and analyzed.
So you have experiment and
some effort at synthesizing
and putting together
information and analyzing it.
David Sackett probably has written
the classic definition of
evidence-based medicine.
He says "It's the conscientious, explicit,
"and judicious use of
current best evidence
"in making decisions about the
care of individual patients."
And I underline the word individual.
And that means integrating
individual clinical expertise,
again, individual is very prominent there,
with the best available
external clinical evidence
from systematic research.
That definition immediately
has two major components.
One is an individualized approach,
both at the patient level
and at the clinician level,
and therefore also at the level
of their interaction, their encounter.
And then a second component,
which is science, information, evidence.
The best possible evidence,
the best possible science,
put together in the most unbiased manner.
Seeds of wisdom and of
debate in these definitions.
Experiment, which practically means
randomized controlled trial.
Evidence-based medicine
became almost synonymous
with the need to get better,
larger, and more relevant
randomized controlled trials.
Systematic approach, systematic research:
that translated to systematic reviews
and meta-analyses being a tool
for promoting integrated evidence.
Individual patients, precisely so.
So here are the roots
that lead exactly to precision medicine
and individualized medicine.
And individual clinical expertise, again,
is extremely well aligned
with what we're talking about nowadays:
individualized and precision health.
Over the years, many hierarchies
of evidence have been proposed.
And the most popular
ones have meta-analysis
and systematic reviews at the top.
Then you have randomized
controlled trials following that.
And very close to the level
of no value or even negative
value, you have experts
and even below that, you
have tweets of experts
or powerful people tweeting all the time.
Then, you have multiple types
of evidence that have evolved.
Clearly, the traditional
evidence-based medicine
has been dealing with meta-analysis
and randomized trials, so
clinical types of information,
but there's also observational evidence,
there's mechanistic evidence,
there's other evidence,
and if you look across 180
million scholarly documents
that are floating around
with 20 million authors,
there's lots of sand in
that desert of information
that a clinician is trying to go through
on a daily basis, multiple
times, back and forth.
And a scientist is trying to make sense of
and survive and get to some
oasis of real discovery.
We have also learned that
evidence is less than optimal.
Usually, these pyramids are destroyed,
like this poor destroyed
pyramid in Abu Rawash.
We lack the type of evidence
that we want to have
in place in order to have actions
that we feel certain are going
to do more good than harm.
Also, pyramids can be bulldozed
by property developers,
like this in Peru, and we
have learned over the years
that evidence can be severely affected
by conflicts of interest of
stakeholders who support,
sponsor, and develop,
and disseminate evidence.
Most of that is financial
conflicts of interest,
but increasingly, we recognize that
there's other conflicts of interest
that could also be
important in some settings.
Some of them stemming from
just genuine human curiosity.
But still, you can get a lot
of strong allegiance bias,
even from people who
otherwise have good intentions.
There have been revisions to
that standard picture
of evidence-based medicine.
One of them that
flourished in the mid-1990s
was thinking that N-of-1 trials should be
at the top instead of having the composite
picture from all the studies,
all the trials that have been done,
put together, in meta-analysis,
in a systematic review,
we should just look at what happens
at the single-patient level.
So N-of-1 trials, Gordon Guyatt and others
proposed pyramids that
had them at the top,
and many people favored that,
but very soon we realized
that N-of-1 trials were not really doing
what they were supposed to do.
So N-of-1 trials, which
are currently proposed as
the new wave of understanding
and promoting precision
health and precision medicine,
actually are a design that
was introduced in the sixties.
We got its basic methods
correct in the 1980s,
the design flourished in the early 1990s,
and it was abandoned about 20 years ago.
Now, they're being resurrected.
Why were they abandoned to start with?
Because they have these caveats.
They are not good if the disease
does not have a steady, natural history.
They're not good if
there's carryover effect.
They're not good if there
are priming effects,
and if the effects depend on the sequence
of previous choices that have happened
or have been utilized in the same patient.
They are not good if the
disease has a fatal outcome
in a relatively short course
because then you have no
time to test multiple options
and choose which one's the best.
And they're not good if there's poor
or unpredictable compliance,
adherence, tolerability,
meaning if there's real life.
Therefore,
N-of-1 trials met with all
of these challenges and
didn't really move very far.
So here comes precision
medicine or precision health,
as I see that you prefer
this term at Michigan.
We use exactly the same term at Stanford.
I'm not sure which one is best.
So what is that?
I don't know, I'm supposed to be an expert
in precision health, but
I have no clue what it is,
so I went to Wikipedia to find out,
and here's the definition
that Wikipedia gives,
"It's a medical model that proposes
"the customization of healthcare
with medical decisions,
"practices, or products being tailored
"to the individual patient."
So very prominently, again,
the individual patient.
We're back to the definitions of
evidence-based medicine of the nineties.
Now the individual, if you simplify
in some sort of semi-mathematical terms,
is one, as opposed
to the whole population,
the other extreme of the spectrum.
By definition, precision medicine
is thus aiming to have the
most tiny and the most
negligible impacts possible
at the population level.
I mean, this is the starting point.
We're trying to have an
impact on individuals rather
than on populations, and therefore we get
the most negligible population-level impact.
Big data.
Again, I couldn't find a good definition.
I found probably about a hundred
different definitions on what they are,
and this is my attempt to define big data.
Big data,
it is data that carries
the least possible
information content per unit.
The more insignificant the content
of the information per unit,
the bigger the big data.
So why is that?
Because if we really had
information with a lot of content
that is meaningful and useful,
we wouldn't need big data.
Why should we waste our hard drives
and our time and our resources
and our computational time
if we could just measure one thing
and that would be the answer
to all of our problems?
The fact that we need to measure
all these information units to try
and build together
something that is useful
means that we are struggling
with data that has
minimal information per unit,
but hopefully with so many units
that if you multiply the per-unit
information by the amount,
it can still be useful.
It's the exact
opposite of Bradford Hill's
request of what we should believe.
So Bradford Hill, one of the
fathers of clinical epidemiology,
said that "I'm willing to
believe if something can be
"checked on the back of an envelope."
If it's two plus two,
or maybe an odds ratio
from a two-by-two table that can fit
on the back of an
envelope, that's okay.
If it's more than that, forget it.
It's too complex for an epidemiologist,
let alone for an everyday
clinical practitioner,
let alone for a poor patient
who tries to get some benefit.
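As a minimal sketch of that back-of-the-envelope test (in Python, with made-up counts for a hypothetical 2x2 table; the interval uses Woolf's standard approximation, not anything from the talk):

```python
import math

# Hypothetical 2x2 table (counts are invented for illustration):
#              event   no event
# exposed        20        80
# unexposed      10        90
a, b, c, d = 20, 80, 10, 90

# The envelope calculation: cross-product ratio of the table.
odds_ratio = (a * d) / (b * c)          # (20*90)/(80*10) = 2.25

# Woolf's approximate 95% confidence interval on the log-odds scale.
se = math.sqrt(1/a + 1/b + 1/c + 1/d)
lo = math.exp(math.log(odds_ratio) - 1.96 * se)
hi = math.exp(math.log(odds_ratio) + 1.96 * se)

print(odds_ratio)                 # 2.25
print(round(lo, 2), round(hi, 2)) # 0.99 5.09
```

Three multiplications, a square root, and an exponential: it genuinely fits on the back of an envelope.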
How can we have these two
opposites come together?
I would argue that we can
start from the premise
that precision medicine
or precision health
is the study of the most insignificant,
and then use one quote from
one of my favorite poets,
Odysseus Elytis, that "You'll
come to learn a great deal
"if you study the insignificant in depth."
So precision health
is a way to study the
insignificant in depth.
How deep can we go?
Before we decide how much in depth
we can go into the abyss,
let's try to see what we have
in place and whether we really
have the fathometer to reach
the abyssal depths of all this data.
In 2018, evidence-based medicine has
lots of data.
Not necessarily reliable, but we have lots
of data both at the individual level
and at the population level.
Sometimes we have both
ends of the spectrum
being heavily armed in information.
We have big and small, deep and shallow,
broad and narrow types of databases.
We also have the
patient-clinician interaction
that is still there, but
is probably suffering
much of the time because of limited time
and because most physicians have to deal
with a computer and with data rather than
find time to even talk with the patient.
And we believe and we have evidence,
and pretty good evidence, actually,
that shared decision-making
is a good thing.
So if that information can communicate
and can be shared meaningfully
between physicians
and patients, that would be really nice.
What kind of information are we going
to share with patients?
Unfortunately, despite the fact that
we have tons of information,
very little of that is clearly useful.
This is an analysis that
we did a couple years ago.
We took 1400 topics in
medicine that had been assessed
in the Cochrane Database
of Systematic Reviews.
And of those, less than half,
43% had GRADE summary
of findings assessments.
GRADE is the Grading of Recommendations
Assessment, Development,
and Evaluation tool
that is trying to assess
the quality
of the evidence when you have data
from one or multiple trials.
What was happening in the other 57%?
Well, mostly, there was no evidence,
and this is why there
was no GRADE assessment.
In these cases where we did have evidence,
looking at the first primary outcome,
only 13.5% of the time we
had high quality of evidence.
Even when looking at all
outcomes that had been assessed,
not just the primary ones,
only 19.1% had at least one outcome
with high quality of evidence.
If you limit your focus to the reviews
that had high quality
of evidence available,
and significant results,
nominally significant just
with the typical 0.05 threshold
and a favorable interpretation
of the intervention,
meaning someone concluded
so this is a treatment
that is good to use,
only 25
out of the almost 1400 topics
had this type of situation.
So less than 2% of medical topics
had high quality evidence,
significant results
and someone said yes, go ahead and do it.
98% of the time, we had
modest or very large uncertainty
about how exactly to
deal with populations,
let alone patients.
One confounding problem is that
we don't have many discriminatory tools
that can tell us which among this 98%
of topics where we have uncertainty,
we can lean more towards saying okay,
maybe we have some evidence to act
or maybe we don't have enough evidence
and maybe we need more or maybe we have
some evidence not to do anything.
The problem is that our typical tools
that have been used for discrimination
since the times of R.A.
Fisher have become obsolete.
Almost all scientific papers claim
that they have found statistically
and/or conceptually significant results.
Obviously, all of my grant proposals claim
that what I'm planning to do
will be highly significant
one way or another,
although I mostly submit
mediocre ideas for funding.
And if you look across the entire
PubMed worth of abstracts
and full text papers,
96% of them report statistically
significant results.
This is an analysis that
we published in JAMA.
About two years ago we looked at close
to 15 million abstracts
in PubMed from 1990
until 2015,
and close to one million
full text articles
from PubMed Central.
96% of that literature
claims significant results.
That is, whenever p-values
of some sort were listed,
they were significant. And practically
all papers claim to be novel.
It looks as if discovery
has become so commonplace
that it's a boring nuisance at this point.
So how can we tell what to use
out of that huge mass of information?
To make things worse, almost any result
can be obtained unless we pre-specify
what kind of analysis we're going to have
and the availability of big data
is making that challenge even bigger.
This is the Janus phenomenon,
after the two-faced Roman god
who can see in both directions.
And these are data from the National
Health and Nutrition Examination Survey.
What I'm plotting for you here,
for example this panel is
whether alpha-Tocopherol,
or vitamin E, is associated
with the risk of death.
And there's a cloud of one
million different results.
This is the hazard ratio, and this is
the minus log10 p-value that are obtained
for the very same question being addressed
in the very same database.
How do you get one
million different options?
Practically, death can be affected
by many other variables,
so for each variable
that you can adjust or not adjust for
in the regression, you have two options.
If you can adjust or
not adjust for 20 such variables,
two to the twentieth is about one million choices.
And 70% of the results suggest that
vitamin E decreases the risk of death.
30% of the results suggest that vitamin E
increases the risk of death.
So Janus is looking in both directions
and depending on what you enjoy the most,
you can report that vitamin E is great
or vitamin E is horrible.
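The combinatorics behind that cloud of results can be sketched in a few lines of Python. The covariate names below are made up; the point is only that every optional adjustment variable doubles the number of model specifications:

```python
from itertools import chain, combinations

# Hypothetical adjustment variables for a mortality regression;
# each can be included or left out, so k variables give 2**k models.
covariates = ["age", "sex", "smoking", "bmi", "diabetes",
              "hypertension", "alcohol", "exercise", "income", "education"]

def adjustment_sets(variables):
    """Enumerate every subset of variables (the power set):
    each subset defines one possible model specification."""
    return chain.from_iterable(
        combinations(variables, k) for k in range(len(variables) + 1))

models = list(adjustment_sets(covariates))
print(len(models))   # 1024 specifications for just 10 variables
print(2 ** 20)       # 1048576: with 20 variables, the "million choices"
```

Running the very same analysis over each of those specifications is what produces a Janus-like cloud of results from a single database.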
When results seem to be more credible,
unfortunately still we have the problem
that most of those are not
necessarily patient-relevant.
And one might argue that,
in an individualized precision setting,
we need to ask patients one at a time,
what exactly is relevant to them.
So that would be the ideal,
and then we have a different textbook
of medicine for each
patient one at a time.
However, there are some outcomes
that are important no matter what.
So I think that death, for example,
is a very important outcome regardless.
Some people may still say "I want to die.
"Please, get me to die."
Possible, but I think that still,
even for that patient, if we can make him
change his mind, and let him live longer
and live a good life, that
would not be really that bad.
So are there some outcomes that would be
so ubiquitously important that,
regardless of the particular patient,
it would be interesting to know
what we can do with
different interventions
at the population or individual level?
There's initiatives like
COMET that are trying
to put together such a list of outcomes
that seem to be essential to study
for different diseases.
And in the case of preterm infants,
where, for good or bad, the patients cannot
even express their wishes on what
the important outcomes are for them.
Things like chronic lung disease
are clearly very important.
You know, you can measure many, many
other things, but chronic lung disease
is a major problem for preterm infants.
So you want to know what
different interventions
would do in that regard.
When we checked more
than a thousand trials
on preterm infants, less
than a third of them
actually reported on chronic lung disease.
Two-thirds plus did not mention something
that was so ubiquitously
important to know about.
Another premise about precision
is that, since we're talking
about one patient at a time,
hopefully the effects associated with
that intervention should be really large.
If these individualized interventions
have tiny effects, who cares?
We've had so many tiny
effects floating around.
The big promise is that
now we'll put together
biology, lots of data,
informatics, complex analysis
black boxes, and we will
get you a huge benefit:
if you have that profile,
you will really do much, much better.
What is our previous
experience telling us
when we look at evidence across medicine
in terms of how often
do we see large effects?
This is an empirical
evaluation we published
about five years ago in JAMA where we took
every single meta-analysis
that we could track
in the Cochrane Database.
There were 85,000 meta-analyses
that we could track,
and we asked how often
do we see treatment effects that are large
and that are also replicable,
meaning they are seen
in more than two studies, and hopefully
they have good statistical support
and hopefully they also
have no clear bias that
is visible and that would
invalidate our trust in them.
What we found is that
large effects are very common,
but they're very common in very small,
early trials and very few
of them survive scrutiny.
So if you ask for
a mortality effect,
where you have a five-fold
reduction of mortality risk,
that has been seen in two trials
with a p-value of less than 0.001,
and with no florid evidence
of something being wrong
in the evidence, across
these 85,000 meta-analyses,
there was only one topic,
extracorporeal oxygenation,
that achieved that kind of gold standard
status of huge benefit
clearly.
Is that all?
Are there no other interventions
that are as effective
as wearing a parachute
when you jump from an airplane?
Yes, there are more in medicine.
Obviously, if you have someone
with diabetic ketoacidosis
and you don't give insulin,
it's like letting them fall
from the airplane without a parachute,
or well, depending on how serious it is,
maybe jumping from the 10th floor,
but yeah, a small chance for survival.
There are such, but
Paul Glasziou has created
a list of such huge effects
that have been thoroughly validated,
large benefits clearly so.
And the list includes
about 20 interventions,
maybe 25 at the most.
What is happening is that we do see
lots of large effects, like odds ratios
of five plus, sometimes even
for mortality,
but they're seen in very small studies,
on average with 12
participants' worth of data.
And whenever we perform yet another study,
the effect goes away.
It either completely evaporates
or it becomes much, much smaller,
so then it's questionable whether
it's really worth it or not.
Based on a lot of experience,
the data that we compiled here included
about a quarter of a million
randomized trials.
We know that very large
effects are not uncommon
to find when we're dealing
with small numbers,
which is the typical recipe
for most of the studies
that are currently being done
in precision medicine circles,
but most of those really
need to be validated
before we can be certain
that they're not flukes
and would not disappear.
Quality could also confound the picture.
Quality problems are prevalent
in clinical research,
much like in other disciplines.
Some of them may be unavoidable.
Sometimes masking, for
example, is not an option.
Randomization, very often, is
not as proper as it should be.
Allocation concealment,
sometimes, cannot be guaranteed,
and when it can be guaranteed,
it has not been adhered to.
We have lots of evidence,
and this is summarizing
the result of the BRANDO
project in which I was involved,
where we performed a meta-
meta-meta-analysis
including several thousands
of clinical trials
and several thousands of meta-analyses.
Practically, we concluded that
if you have problems with randomization,
if you have problems with
allocation concealment,
if you have problems with blinding,
you're likely to get, on
average, inflated effects.
However, the average
distortion is relatively small
compared to the heterogeneity
in the distortion that you can get.
Most of these distortion effects
are much larger compared to
the precision effects that
we're chasing currently.
So, unless we can fix
these quality problems,
we don't really know whether the effects
that emerge will be
true or spurious.
And, even worse, because
of the large heterogeneity
in the amount of distortion
that these biases introduce,
we will not be able to just correct
by saying that well, we
failed three aspects,
we didn't have randomization,
so let's correct by
dialing the effect size
by 10% and we will be fine.
It will be 10% on average,
but it could be 80% in some case.
It could be even in the opposite direction
in some other case, and this
is impossible to really know.
Another confounding factor:
much clinical research nowadays is done
outside the US and Europe,
meaning outside countries
that have a tradition of research,
in particular of clinical research.
A lot of research is done in China,
a lot of research is
done in Eastern Europe,
very often the price for running
these studies there is very low.
You can run a clinical
trial with one-tenth
or one-fiftieth sometimes of the cost
that you would need to do
in Michigan or at Stanford,
where the cost would be prohibitive.
Are these results unbiased?
Here's another empirical
evaluation where we looked,
again, across all the Cochrane Database
in situations where mortality outcomes
had been assessed in
European and American trials,
and also in trials done in countries
that don't necessarily have such
strong tradition of clinical research.
Here's one example.
If you look at calcium antagonists
in aneurysmal subarachnoid hemorrhage,
European and American studies show
practically no significant benefit.
It's very small, if
there's anything there.
And clearly nowhere close to
even a nominal level of significance.
If you look at a small study in China,
there's an 85% reduction in the odds,
suggesting that this is a
tremendously effective intervention.
Is it that this study is completely flawed
because it was done in China?
I think not.
If you look carefully, its quality scoring
may look pretty good, but
probably what's happening
is that there's many, many more studies
that have been done in China
or in Eastern Europe,
or in other countries.
And then, they are trying to get
through the bottleneck
of getting published.
And if you get a study that has
completely negative results in the US,
you know that it's not so easy
to get it published in the New
England Journal of Medicine.
Well, probably you can get it published
in some specialty journal.
If you get a negative result in China,
you will not be able to
publish just anything, number one.
Number two, you will not
get a financial bonus.
There's bonuses given by
the Chinese government
institutions that could be up to $100,000
for a paper in Nature,
and a few hundred bucks
for something that you can
publish in a more modest venue.
And three, you run the
risk of being flogged
with the whip because you
found a negative result.
So you have to take into account
what is the research environment
in different communities,
and how that might shape
the dissemination of
results from small studies
that may seem to have extravagant results.
On average, studies from
less developed countries
had an inflation of the odds ratio
for mortality of about 15%,
which is larger than the
average treatment effect
of the most effective treatments
that we have to curtail
the burden of death.
Here's another example
from the TOPCAT study
where investigators, and this is from
the New England Journal
of Medicine, realized that
spironolactone behaved very differently
at American and European sites
versus sites in Russia and Georgia.
And trying to go into more depth,
they realized that the experience
of the patients of Russia and Georgia
was entirely different.
Apparently, probably, these people
had not even been treated
with spironolactone
or with whatever
they were assigned to receive.
Further confounding by sponsor bias.
There's some types of clinical trials
that almost always favor the sponsor.
A couple years ago, we looked at
57 non-inferiority trials
with head-to-head comparisons.
We took the largest ones and we found
that 55 of them showed results that
were favorable for the sponsor.
The success rate was 96.5%.
I would argue if an experimental design
has a success rate of 96.5%,
is that anywhere close to equipoise?
Why do we need these studies?
We can get rid of them.
We can say well, this
study will be successful,
the drug will look very nice,
and move to the next step.
What is wrong here?
Is it that industry-sponsored trials,
and this is becoming
more and more prevalent
in the case of precision
interventions, are biased?
Is it that the industry has evil people
who are cooking up the data
and disseminating false results?
No, if you look at the quality
of these trials based on
our traditional checklist,
they look very nice.
The problem is that the design is such
that it's trying to optimize the chances
of getting a nice-looking result.
So choice of comparators,
choice of the non-inferiority margin,
choice of the setting, choice
of the exclusion criteria
as such that you're almost doomed
to get a nice-looking result.
This is very difficult to decipher.
It's a whole science behind the science
of how to always get nice-looking results.
Further problem: can we trust the data?
This is Study 329, and it's not a submarine,
although it sounds like a U-329.
It's a randomized trial that resurfaced
15 years after its submersion.
So when it had first appeared,
it was the pivotal trial that suggested
that in major depression in adolescents,
paroxetine and imipramine,
two antidepressants,
were very effective and very safe.
15 years later, independent investigators
got hold of the individual-level raw data
and they reanalyzed
the trial from scratch,
and they concluded that both
paroxetine and imipramine
are not effective and are not safe.
Entirely the opposite conclusion.
Can we trust the data?
We want to trust the data.
Unless we can trust what we read,
we are really into deep trouble,
because then how can we tell which cases
are like 329 and which cases we can trust?
Moving forward, here's a larger evaluation
of reanalysis of raw data
in individual trials.
We tried to unearth every single case
where a paper had been
published reanalyzing
an original trial for the
very same clinical question,
but it had been done in
a separate publication than the original.
We found 37 such cases in that paper
that we published in JAMA in 2014.
35% of the time, the conclusions
of the reanalysis were entirely opposite
or very different compared
to the conclusions
of the original analysis.
The original had claimed
that the drug is effective.
The reanalysis, that it is not effective.
Or vice versa.
The original had claimed that
this is the subgroup, the characteristics
of patients who need to be treated.
The reanalysis said nope,
it's a different set of characteristics
of the patients who would
benefit from treatment.
What is going on here?
Is it that we have rogue reanalysts
who are trying to make
a fame around their name
and are trying to put the
original investigators
into disgrace just by manipulating
lots of fancy reevaluation
of the analytical space?
Actually, these papers are
almost always published
by the same original investigators
who published the first paper,
but it is happening in
an environment where,
if you spend extra time to
reanalyze your own data,
you cannot publish a second paper.
You cannot say that I spent extra time
and had a second look,
and I get the same result,
and I get a second paper.
Conversely, if you can
get a different result
or if you can say that you
get a different result,
then you can get a second paper.
It sounds very confusing, but this
is the incentive
structure that facilitates
this type of non-reproducibility
in that setting.
Here's a very different environment,
where raw data are available routinely.
Two journals, PLOS Medicine and BMJ,
have routinely made
availability of raw data
a sine qua non, a
prerequisite to publication,
and they encourage that
all data need to be shared
for any trial that they have published
in the last several years.
So we communicated with the authors
of all these clinical trials,
and we asked them to
send us all the raw data
from all the trials
that they had published
in these two journals.
We got 46% of them.
Is that half full?
Is it half empty?
It's close to half, though.
And then we spent time to reanalyze
all these clinical trials from scratch
to see what results we got.
That was not bad.
It's a pretty complex graph here,
but all these points are
very close to the diagonal.
If they were exactly on the diagonal,
it would mean that we got
exactly the same results.
There were a few errors
here and there that we found.
I mean, we got in touch
with investigators.
They said yeah, thank
you for picking this up.
But none of these were such
that the conclusion of the
trial would be invalidated.
So in a different environment,
in a different culture,
where everything is supposed to be shared,
where you have journals that have
very strong enforced policies,
where you have the most
transparent investigators
who want to do that and
go through that policy,
and you have even the subset of those
who are saying here, go, take
my data and reanalyze them,
almost everything looks fine.
So is it that the culture is what matters?
Or is it that we just
deal with different types
of selection biases
and selection processes?
Another promise is that, since we have
all these problems with the
traditional study designs
that depend on availability
of randomization
and huge expenses, maybe
we can replace them
with routinely collected data.
So there are tons of routinely
collected data based analyses
that try to estimate treatment effects
and our methods are becoming
more and more sophisticated
to try to incorporate
patient characteristics
that define populations
that can be almost matched
to a randomized equivalent.
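The matching idea described here — estimate each patient's probability of treatment from their characteristics, then pair treated and untreated patients with similar probabilities — can be sketched in a few lines. Everything below (the synthetic data, the hand-rolled logistic fit, the effect sizes) is illustrative, not from any study discussed in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic observational data: age and severity influence both
# treatment assignment and outcome (classic confounding).
n = 2000
age = rng.normal(60, 10, n)
sev = rng.normal(0, 1, n)
logit_t = 0.05 * (age - 60) + 0.8 * sev
treated = rng.random(n) < 1 / (1 + np.exp(-logit_t))
outcome = 1.0 * sev + 0.02 * age - 0.5 * treated + rng.normal(0, 1, n)

# Fit a logistic regression for the propensity score by gradient descent.
X = np.column_stack([np.ones(n), (age - age.mean()) / age.std(), sev])
w = np.zeros(3)
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - treated) / n
ps = 1 / (1 + np.exp(-X @ w))

# Nearest-neighbour matching on the propensity score:
# each treated patient is paired with the closest control.
t_idx = np.where(treated)[0]
c_idx = np.where(~treated)[0]
matches = c_idx[np.argmin(np.abs(ps[t_idx][:, None] - ps[c_idx][None, :]), axis=1)]

naive = outcome[treated].mean() - outcome[~treated].mean()
matched = (outcome[t_idx] - outcome[matches]).mean()
print(f"naive difference:   {naive:+.2f}")
print(f"matched difference: {matched:+.2f}  (true effect is -0.50)")
```

The naive comparison is biased because sicker patients are more likely to be treated; matching on the estimated propensity score pulls the estimate back toward the true effect — which is exactly the correction that, per the talk, still fell well short of the later randomized trials.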
So these are results of
routinely collected data studies
using propensity matching or
propensity score adjustments
versus randomized trials
that were done afterwards.
So when the routinely collected
data study was released and published,
there was no randomized controlled trial
that one could compare
notes with, but then subsequently,
there were randomized trials published.
And these are mortality outcomes.
The average difference is a
31% difference in the odds.
It's huge.
It's like three times the
average effect for mortality
that we see even for very
effective interventions.
So if this really is that bad,
clearly that's not a panacea for replacing
the paradigm of randomized trials.
Another solution is to
build prediction tools,
prediction models, prediction algorithms
to map the
individuals or the subgroups
or the substrata that
would do better or worse
or have different magnitudes
of treatment effect.
I'm very fond of that.
I'm doing quite a lot of
research in that space.
I still hope that
something major will arise,
but I think most of my papers
are probably not worth much.
This is looking at the field of
cardiovascular disease risk prediction
and this is the number of articles
that present new predictive models
for cardiovascular disease.
There are about 400 different models
that have been presented in the literature
for cardiovascular disease
and as you well know,
when ACC/AHA decided to
release new guidelines,
they looked at available predictive models
and they said none of them is good enough,
so they developed yet another one
that everybody feels
is one of the worst models.
It's not calibrated, it's
completely misaligned
with risk levels and so forth.
So how can we really find predictive tools
that would be validated but also useful
and for which there
would be some consensus
that people would like to use them?
The alternative approach asks:
why should we have everybody
using the same model?
Maybe each hospital, each
center can use their own model.
So we have a new wave of lots of studies
that take electronic health records
and they develop a local model
that actually can be
reevaluated and updated
and upgraded and changed in real time
as we get more information and more data.
So this is an empirical
evaluation that we did
with Ben Goldstein and
his colleagues at Duke.
We looked at EHR-based predictive models
and this is the sample
size that they used.
As you see, some of them
go up to a million plus
of individuals, which is very good.
The problem of small
studies that we were facing
with randomized trials is clearly gone.
We are facing the exact opposite:
overpowered situations.
The number of events, again, very nice.
This is 10,000, sometimes
even more than that,
and the number of variables sometimes
we can include a thousand variables,
even 10,000 variables in the model.
Try to visualize and communicate
to a physician or to a patient,
this is a model with 10,000 variables
that we have included.
That would take probably
four years to explain.
So you just say it's
really good, trust me,
it's gonna be fine.
How well do these models do?
They don't really do that well.
I have to say that I was disappointed.
I was expecting to see some AUCs
or some reclassification rates that would
be much better than what we see.
On average, the AUCs were like 0.7.
Like the oldest models that we had
in the 1940s when we could
only measure CBC and creatinine.
Having lots of variables,
I'm not saying it's a bad thing,
but most of these models
don't really add much
in terms of predictive ability.
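For a concrete sense of what an AUC of 0.7 means, here is a minimal rank-based AUC on simulated scores (toy data, not from the Duke evaluation). With two equal-variance normal score distributions, a separation of about 0.74 standard deviations yields an AUC near 0.70 — modest discrimination, however many variables went into the model.

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen case
    scores higher than a randomly chosen non-case (ties count half)."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, bool)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

rng = np.random.default_rng(1)
neg = rng.normal(0.0, 1.0, 2000)   # scores of non-cases
pos = rng.normal(0.74, 1.0, 2000)  # cases shifted by ~0.74 SD
scores = np.concatenate([neg, pos])
labels = np.concatenate([np.zeros(2000, bool), np.ones(2000, bool)])
print(f"AUC = {auc(scores, labels):.2f}")
```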
I don't need to remind
you that subgroup analyses
have a tradition of leading
us down the wrong path.
There's a lot of discussion for many years
even preceding precision
medicine, about using
subgroups and effect modification
and stratified medicine.
And the classical example
probably is from that paper
in The Lancet where these
are the months of birth.
So this is the zodiac cycle, horoscope,
and so this is the absolute risk reduction
from carotid endarterectomy in people
with symptomatic stenosis.
Clearly, huge subgroup differences.
Heterogeneity p less than 0.0001,
based on the zodiac sign, but obviously
this is 0% likely to be true.
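One way to see how easily such subgroup "findings" arise is multiplicity: test enough subgroups and some will look significant by chance. A back-of-the-envelope calculation, assuming 12 independent subgroup tests with no true effect anywhere, makes the point:

```python
# With 12 independent subgroups (one per zodiac sign) and no true
# effect anywhere, the chance that at least one subgroup reaches
# p < 0.05 by luck alone is already close to a coin flip.
alpha, k = 0.05, 12
p_any = 1 - (1 - alpha) ** k
print(f"P(at least one 'significant' subgroup) = {p_any:.2f}")  # ~0.46
```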
If you look across the literature
of such subgroup claims,
which are like the first step to inform
some sort of stratification
towards personalization,
most of the claims that have been made
in the literature cannot be reproduced.
We published a couple of
papers where we looked at
all the sex based subgroup differences
across all Cochrane meta-analyses
that we could identify
in randomized controlled trials
and we found very little that
was reproducible across multiple trials.
There were lots of sex differences
if you looked at single trials.
They were just very common.
But something that was seen
again and again, very uncommon.
How about if we combine lots of databases?
Lots of randomized trials.
We perform meta-analyses of raw data
of individual-level data, and then try
to identify consistently validated
subgroup differences.
This is another empirical analysis
where we looked at all the meta-analyses
of individual-level
data that have been done
to date and these are the results
for subgrouping variables
that seem to discriminate
between different groups that have
different responses to treatment.
These are the p-values
for individual-level
subgrouping variables and for group level
where the entire group has the same value.
For example, the type of
dose that is being assigned.
P-values of
0.05 or close to that, and 0.01
or close to that, are really
not very nice looking.
If you translate them to Bayes factors,
they translate to Bayes factors
of something like 2.5 to 5.
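The p-value-to-Bayes-factor translation mentioned here can be approximated with the well-known -e·p·ln(p) upper bound (Sellke, Bayarri, and Berger); the exact calibration used in the paper may differ, but it lands in the same neighborhood:

```python
import math

def max_bayes_factor(p):
    """Sellke-Bayarri-Berger upper bound on the Bayes factor
    against the null hypothesis, valid for p < 1/e."""
    return 1 / (-math.e * p * math.log(p))

for p in (0.05, 0.01):
    print(f"p = {p}: BF against null is at most {max_bayes_factor(p):.1f}")
```

For p = 0.05 the bound is about 2.5 — even in the best case, such a p-value is only weak evidence against the null.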
From a Bayesian perspective,
this is okay to mention,
but is it likely to be true?
No, not really.
Probably a small minority
of those may be true.
If you look at the magnitude
of the effect modification,
so how much difference
in the treatment effect
do you get in people who have
different types of covariates?
The average magnitude is less than 0.2.
0.2, traditionally, on a
standardized difference scale,
is the threshold for a small effect.
So the effect modification,
on average, is less than small.
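The 0.2 threshold is Cohen's convention for a small standardized mean difference. A minimal computation on two hypothetical subgroups (the numbers below are made up for illustration):

```python
import math

def cohens_d(group_a, group_b):
    """Standardized mean difference with a pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    ma = sum(group_a) / na
    mb = sum(group_b) / nb
    va = sum((x - ma) ** 2 for x in group_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in group_b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled_sd

# Two hypothetical subgroups whose outcomes differ by 2 points on a
# scale with SD ~ 10: the standardized difference is d ~ 0.21, i.e.
# right at the conventional threshold for a "small" effect.
a = [55, 45, 60, 40, 50, 58, 42, 65, 35, 50]
b = [53, 43, 58, 38, 48, 56, 40, 63, 33, 48]
print(f"d = {cohens_d(a, b):.2f}")
```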
Eventually, we want to get
something that's useful.
This is what evidence-based
medicine was supposed to do.
This is what precision medicine
or precision health was supposed to do.
But getting useful clinical
research is not easy.
In that paper in PLOS Medicine two years ago,
I tried to come up with eight criteria
or features of what we really want.
And it's the same regardless of whether
you believe in evidence based medicine
or in precision medicine.
First, we need to have a problem.
We need to have a problem to fix.
If there's no problem to fix,
just creating problems, creating diseases
that don't exist, just
moving the definition
so that everybody becomes sick
and needs treatment,
that's not a good idea.
Second, we need context placement.
We need to know what we
already have available
in terms of information.
Maybe we have none, or
maybe we have 522 trials,
as we do in antidepressants.
And then why do we need another one?
We need information gain.
Is that new study small, big, randomized,
nonrandomized, going to tell us something?
If not, or if it's going
to tell us something
only if it gets a
particular type of result,
this is not a good way to decide to do it.
Pragmatism, does it reflect real life?
Or if it deviates, does it matter?
Patient centeredness, does it reflect
top patient priorities?
Have we asked patients
what do they really care
about with the background
disease that they have?
Value for money, is the
research worth the money?
There's formal ways to assess
that with proper tools,
but it's very rarely done.
Feasibility, can it be done?
About 35% of randomized trials in surgery
are abandoned because of futility.
The investigators assume that they can get
50 patients on board, but
after six months or one year,
they have only enrolled four or five.
And finally, transparency,
are the methods,
the data, the analysis
verifiable and unbiased?
And this is where reproducible research,
open science comes into play.
Where do you find these studies?
If you look across the literature,
most papers that you will see circulating
in respectable journals don't meet more
than a couple of these
criteria, sometimes even none.
If you decide just to read New
England Journal of Medicine,
Lancet, JAMA, like the top of the top,
again, most papers will meet very few
of these criteria and even though
they will be better, on
average, in fulfilling
some of these requirements,
most of the good studies
would not necessarily be there.
Do we need to speed up or
do we need to slow down?
That's another question.
Facing these challenges,
one option is to say,
well we have all these problems,
thank you for mentioning them again,
you're such a bad guy to mention them.
But the way to move forward is
to just get as many
interventions out there as possible.
Get them licensed and we'll sort it out.
We'll let the dust settle.
We'll do some studies after the fact
and then the real winners will emerge.
This is actually an idea that is not new.
It has been going on
for over 10 years now.
It's very prevalent in
cancer, prevalent in HIV.
I think HIV is a nice success story.
I was at NIH when we ran the
pivotal clinical trial ACTG320
that showed that you can have a treatment
that can completely change
the course of a disease,
a huge success.
And some other diseases have this kind
of accelerated approach.
What have we learned?
First, we have learned that the studies
that are supposed to
be done after the fact,
they are not done.
Once you have something licensed,
then it will go mostly into the mode
of being evaluated for yet new indications
beyond the ones for which it
has already been approved.
Nobody will go back, or
very few people will go back
and try to make sure that
what we approved it for is really valid.
Trials using that
new treatment as background,
meaning that they take it
for granted that it works,
but then build
additional new interventions
that are also eager to get
licensed, are happening very fast.
Within one year, we are shifting to
yet another new target being licensed.
If you compare the time of the trials
where the intervention is tested
for the initially approved indication
versus other indications,
there's hardly any difference.
People take it for granted that
since it got licensed for something,
it's good for everything,
which is exactly the opposite
compared to the acceleration happening
with the precision mentality in mind,
that we're proposing accelerated approval
because it can really
affect that particular type
of individual with these
specific characteristics.
The other mode that is
becoming more common
is to approve based on
no randomized trial data at all.
This is an analysis that we did with
all the European
Medicines Agency approvals
over the last 10 years
and we're doing the same
with all the FDA approvals.
Roughly 10% of approvals
of new medications
and biologics have absolutely
no randomized data.
If you look at the effect sizes,
some of them look pretty big.
This is absolute risk difference
and this is odds ratios,
and on average, you see some
fairly large odds ratios,
something like odds ratios
of 12, really nice,
but the effect sizes could
be all over the place.
Sometimes they're close to zero.
There's still drugs that get approved
with nonrandomized data with effect sizes
that suggest absolutely no benefit,
but there's some biologic rationale
that people defend them with:
well, it works on that mutation,
and we've seen this to be important, so,
therefore, it must be effective.
Systematic reviews and meta-analyses
are not necessarily going to help
much in that situation.
We have lots of them.
We have exceeded 100,000
published meta-analyses nowadays.
It's more of an epidemic that is evolving,
but I think that if you have this type
of starting block of information,
it's unlikely to be useful.
We start seeing some weird phenomena.
We see, for example,
China becoming a champion
in meta-analyses because again,
you can get some money out of each paper
that you publish there.
This is genetics.
Genetic meta-analyses.
Nothing was being published from China
but in 2014 they represented about 17%
of the global production in
English language journals,
serious journals, and now
they're about 85% of
the global production.
There are lots of contractors that publish
meta-analyses, tons of them.
There are a hundred
companies that you can pay
to get a meta-analysis, and
then if you're the industry
you can decide to publish it or not
depending on what the results look like.
And it will be very nice looking,
you pay a little bit, but
that's not a lot of money
compared to other areas of
research and development.
As a result, we get tons of meta-analyses.
These are network meta-analyses
which is the most difficult design to do,
but in some cases, we have
up to 28 meta-analyses
on the same topic like biologics
for rheumatoid arthritis.
None of them is exactly the same.
They cover different treatments
and none of them gives the same coverage.
None of them covers more than 50%
of the studies that have been published.
And none of them agrees with each other.
These are the results.
Again, if you want to pay,
you can get a meta-analysis
to give you the result
that would fit your agenda.
This is one practical example,
since I surely got you depressed
with what I have been describing.
Which is the best antidepressant to use?
These are four network meta-analyses
done by very good friends of mine.
They're the best
meta-analyses in the world.
I know that because they're my friends,
so I want to boost them
before I destroy their work.
Paroxetine, according
to this meta-analysis,
is the best antidepressant.
According to that one,
it is one of the worst.
Sertraline is the second best
here; there, it is the worst
or one of the worst.
The choice of how exactly the analysis
is going to be done can make a difference.
In a way, meta-analyses
can be the last step
of a marketing pipeline, where you can get
the conclusions that you want to see.
Among 185 meta-analyses of
antidepressants for depression,
when an author,
who was an employee of the manufacturer,
was involved, there was
a 22 times higher chance
of not having any negative statements.
Actually, there was only one case
where you had industry involvement
and some negative statement.
And I'm not even sure that these people
are still employed by that company.
Only 3% of meta-analyses are both decent
and clinically useful.
I think this is the best design,
and when it works, it can help a lot.
So evidence based medicine is not dead.
When you identify these
3,000 meta-analyses papers,
they can really be helpful
and they can tell you
what to do for population averages
and perhaps also for a little bit
of individual effects, but
they're not that common.
A last possibility may be getting, from
individual effects, much
larger population effects
for the same pathway.
The classic example here is
PCSK9 inhibitors where
seemingly, we have a
very nice success story.
We started from a genetics discovery.
We identified a gene that seemed
to be very important in
familial hypercholesterolemia.
And it also had variants that would
affect the risk at a population level.
Then a drug was developed based
on that target, and we see that wow,
we can really bring LDL
down across the population.
Was that really a success story?
The price of that intervention
is so high that, clearly,
its cost-effectiveness
is not something that is desirable.
And obviously, we still don't know whether
the benefits in lowering LDL
and in some clinical
outcomes, would also translate
to mortality benefits, but apparently,
they don't seem to be so prominent.
I would close by saying that
we may need to reverse the paradigm.
Maybe move from poor primary data,
retrospective reviews, and this type
of fragmented information
to prospective meta-analyses
as the key type of primary research.
Think about what are the problems
that we want to solve and design
a clinical agenda where
everybody working on that field
will join forces worldwide,
the data will be
incorporated prospectively.
There will be individual data,
but they will contribute towards
the same overall analysis.
We can design what type of
comparisons we want to test.
We would design the next study
based on what we have in the
composite evidence until now.
I think it's unlikely we will be able
to fuel precision medicine or health
in individuals until we can obtain
large-scale coordinated evidence
on large populations to start with.
To conclude, evidence based medicine
has been with us for a long time.
It has acquired tremendous power,
but most of the medical
evidence is either problematic,
spurious, or false or has no utility
for medical and shared decision making.
The main utility of systematic reviews
and meta-analyses has
been to reveal problems
with the biomedical evidence.
Precision medicine or health aims
to specify one of the main pillars
of EBM to deal with
individuals, which is nice,
but by definition is likely
to have minimal impact
on life expectancy and other
major population outcomes.
Still, a synergy between
large-scale evidence
and precision approaches would be useful
to tell us what we can learn from both.
Expectations of replacing experimental,
randomized evidence
with nonrandomized data
need to be tempered.
I'm not saying that it cannot happen,
but most of the time, we
will need randomized trials.
Conflicts of interpretation
need to be minimized
for primary data, for trials,
for meta-analyses, and for guidelines.
And prospective building
of research agenda
may become the gold standard, hopefully,
for primary clinical research, both at
the individual and the population level.
Special thanks to some of my colleagues
who contributed to some of the work
that I shared with you today,
and special thanks to all of you
for tolerating me at 5:00 p.m.
Thank you.
(audience applauding)
- Any questions?
I can start with one.
So John, you've been
thinking about these issues
for several years now,
and painting a little bit
of a depressing picture.
Do you think that, in some ways,
this movement towards precision health
and big data will actually solve some
of these issues in just
kind of a general sense?
Or do you think it's going to be fuel
that's just going to add to the fire
and it's gonna make these problems
even many folds worse
as we think about things
over the next decade?
- I think it's up to us.
It can go either way.
I think that if we realize the challenges
and build on circumventing them,
and generating more
openness, more transparency,
reduced conflicts, larger databases
with more accurate measurements,
we have a chance of addressing some
of the longstanding problems
that we have not solved with
clinical research until now.
Conversely, there's a great risk
of having more data, and
therefore more errors,
rather than more discoveries
and more useful observations.
So I don't want someone
to come out of this as
oh, we're doomed and nothing can be done.
We have lots of opportunities.
We have lots of tools.
We have lots of possibilities.
The question is how exactly are we going
to use them to get something useful?
These eight criteria could be some sort
of guidance of how to use these new tools.
- Essentially I'd like to build on
the previous question a little bit.
And this actually comes more
from your talk yesterday,
but it relates to this topic as well.
I work in implementation
science, implementation research.
At the end of your talk yesterday,
you outlined a number of
steps that could be taken.
Things could be implemented.
In a sense, your criteria are
also things that could be implemented.
I'd like your thoughts on
what it will take to do that,
if you think of implementation
as behavior change,
and you think about the scientific and
medical research complex
in this country, internationally,
as the substrate of people
whose behavior needs to change.
What will it take to actually
get that change to happen?
- I think to achieve change,
you need to have the major
stakeholders aligned.
And the major stakeholders are scientists,
funders, journals, professional societies,
institutions, and also the general public
and patients could be quite influential.
You don't necessarily need to have
all of them aligned to
do exactly the same thing
but they need to be sensitized
and they need to recognize
that these are issues
that need to be tackled and it's important
to move in that direction.
Once you have some of
these stakeholders moving,
the others will follow.
If you have just one,
it's not gonna happen.
So no single journal alone
can change the world.
No single scientist alone
can change the world.
But if you have multiple stakeholders,
you will see things happening.
And every successful transformation,
like registration for clinical trials,
happened when you have multiple
stakeholders being aligned.
So you had journals saying
that I'm not gonna publish
your paper, if it's a trial,
unless it's preregistered,
and you had funders who would say that
you need to do that, and
you had ClinicalTrials.gov
saying that here is how to do it,
and you have to do it.
And you have regulations trying to
incentivize and ask for it.
So we need, I think, a critical mass
for these changes to happen.
Data sharing.
It was very rare and this is why
we only came up with 37 reanalyses
that had been published in 2014.
Right now, there's more
than 10,000 clinical trials
worth of data that someone
could access as raw data.
It's still a small portion
compared to roughly 1 million,
but I think we've seen some action.
Statistical tools analysis, again we see
some evolution over time.
I'm not necessarily pessimistic.
I think we do see changes, we
do see some paradigm shifts.
At the same time, our challenges
are becoming incrementally,
and sometimes geometrically,
more pressing.
Because we have more data, more analyses,
more people eager to create
these sort of results.
- Can you comment a little bit about
how heterogeneity can influence
the results and how it can be incorporated
into meta-analyses or trials?
One of the thoughts that I found,
one of the stories I found mind-boggling
is for about 20 years, there was a drug
that worked in a very
specific kind of lung cancer,
and it might have been,
I forgot the exact name,
it was acting on the tyrosine
kinase domain of EGFR,
and it worked really well.
It worked in about 3% or 5%
of cancer patients, depending
on ethnicity, etc., but it failed
every single clinical trial,
and therefore, for a long time, in Europe
it was allowed to be prescribed
in certain types of cancer,
but in America, it was
only allowed to be taken
by patients who had a positive
reaction to it before.
I believe that has since changed.
My worry, with your example
of the antidepressants
or whatever, is that there
may be tons of such drugs
that work really well
in 3% or 5% of patients,
but that are
failing clinical trials,
and the only reason this
particular one was used
is because we understood the
mechanism and it made sense.
As long as we don't
understand the mechanism
there may be such heterogeneity all over,
and we may just be throwing out
all of those drugs because
they only work in a subset.
- I take your point, and
I think that it's likely
that there are drugs where heterogeneity
is masking effectiveness
for specific subpopulations
and we don't know the
mechanistic substratum
that would be the answer to that.
I don't know how common that is, though.
So, for example, cancer
is the one discipline
that has probably made the most progress to date
in terms of applications of these sort
of personalized treatments that are based
on this type of biological heterogeneity,
where you have a mutation
well matched to the treatment.
But you can see that in
the super umbrella trial,
NCI-MATCH, only about 2.5%
of patients with cancer
can be matched to such a mutation
that would be recognizable
and you can have molecularly
targeted treatment.
Even in a pretty successful paradigm,
it's still a small minority.
If you look across all cancer trials
at the moment in oncology, there are about
150 trials that have
personalized designs in progress.
Not finished, but registered
on ClinicalTrials.gov.
So basket trials and umbrella trials
and personalized designs
that might fit to that concept.
As opposed to about 50,000 trials
that just have the
typical bread and butter
average treatment effect approach.
We need to do more of those.
So we need to go after this
type of matching biology,
matching mechanism to
understand heterogeneity.
But I don't want to also
reach the other conclusion
that in every case, we will
be able to find something.
Antidepressants, for
example, have been out there
for more than 50 years.
We recently published a meta-analysis
in The Lancet with 522 trials
and 120,000 people.
If you look at the literature,
there's hardly any biological marker
that you can reliably use
to individualize treatment.
It's like a holy grail.
We want to individualize and
there are some possibilities
but nothing really that is as rigorous
as a mutation linked to a specific
biologic monoclonal antibody in cancer.
So I think the challenges are different.
Maybe in some cases, we are just
looking down the wrong path.
Maybe our whole thinking about what type
of treatments we want
to go after is wrong.
In other cases, maybe
we're on more solid ground
and we need to be open to challenge that.
- One more question.
Anyone?
Right down here.
- John, have you found
any sense of trials,
especially cancer trials,
that have taken advantage
of the evidence of
differential expression,
splice isoforms, and key
steps in key pathways?
There is this notion that our
transcript analysis is sometimes
called gene analysis.
Which it's not.
And if you recognize
that evolution has developed
an elaborate structure of genes,
with splicing that goes on to produce
more products from each gene,
and then you just lump them all together
and assume they're all equivalent,
which they are generally not.
So I believe the
(mumbles)
in trying to confirm a biomarker result,
which is generally fruitless,
or to reliably treat,
even with a mechanism,
there's a lot of variation in the response,
most aligned to this splicing
variation that is neglected.
- I think it's a clear possibility.
It's a mechanism that I think
has not been explored
to its full potential.
I would argue it still needs to be matched
to problems that remain unsolved.
Coming back to the very first criterion
or feature of what we
really want to get, because,
in terms of tumor
biomarkers, there's about
2,000 new tumor biomarkers
proposed every year.
And we had actually published a paper
a number of years ago where we found
that 99% of that
literature, which is even higher
than the average 96%,
claimed significant results.
Even the few that don't
claim significant results,
I remember that the 1% included a paper
that had 125 p-values of that biomarker
scattered throughout the text and tables,
none of them close to even 0.05,
but the conclusion of
the abstract was that
this is a very important tumor biomarker
as we have shown in our previous study.
We have lots of leads and some of them
are more exciting than others,
but we don't have a very rational way
to prioritize leads that
would be more fruitful.
And we just drown in a sea of
tens of thousands of biomarkers,
only a couple of dozen already adopted
and used in the clinic,
and very big space of a gray zone
that is unexplored and very fragmented.
So I think that this is one such example
where you may have a clear winner,
but it's just lost within
that space of noise.
- That's wonderful.
So, maybe one more round of applause.
(audience applauding)
