Welcome to the 2019 Robert S. Gordon Jr.
Lecture in Epidemiology. My name is
Stephanie George. I'm an epidemiologist
in the Office of Disease Prevention, and
I'm pleased to represent our director
Dr. David Murray today and introduce our
speaker. Before I do that I wanted to
invite everyone to join us after the
lecture for a reception in the NIH
Library, directly to my left. The Robert S.
Gordon Jr. Lecture is awarded each year
to a scientist who has made major
contributions in the area of research or
training in the field of epidemiology or
the conduct of clinical trials. The award
was established in tribute to Robert S.
Gordon Jr. for his dedication and
contributions to the field of
epidemiology and his over
30 years of distinguished service to NIH, during which
time he served in numerous senior
leadership positions. Now let me
introduce our speaker, Dr. John Ioannidis. John comes to us from Stanford
University, where he is the C.F. Rehnborg
Chair in Disease Prevention, professor
of medicine, of health research and policy,
and, by courtesy, of biomedical data
science and of statistics. He's also the
director of the Meta-Research Innovation
Center and the director of the PhD
program in epidemiology and clinical
research. Dr. Ioannidis received his MD
from the National University of Athens
in 1990 and also received a doctorate in
science in biopathology from the same
institution. He trained at Harvard and
Tufts in internal medicine and
infectious diseases. His awards are
numerous, as are the disciplines his work
spans. I encourage you to check out
his biography online, and I will now, with
no further ado, introduce our
speaker. Dr. Ioannidis's presentation
today is titled "In Scientific Method We
Don't Just Trust, or Why Replication Has
More Value Than Discovery." Please join me
in welcoming our 2019 Robert S. Gordon
Jr. Lecture award recipient, Dr. John Ioannidis. Thank you for this tremendous
honor and for asking me to give this
lecture today on this topic. It's always
a pleasure to be back at NIH and to
meet with colleagues and to meet more
colleagues, very bright people, and this
is a stellar institution, you know, very
unique in the world.
So in scientific method we don't just
trust. Trust is great to have, but we
need to have more than just trust.
Science is a fantastic enterprise, very
successful: about a hundred and eighty
million papers floating around. This is
just 20 million of them from the last 16
years. They create a universe along with
2 million patents and 200,000
disciplines of science. I'm pointing with
an arrow here and it's very tiny font,
but let me magnify that: "Hi there, my best
paper is a speck of dust in a speck of
dust in a speck of dust somewhere around
here." No single paper can ever compete
with its surroundings or with
science at large. Science is a communal
effort. It's that whole galaxy that
matters. It's a living galaxy, an
evolving galaxy with plenty of data, with
plenty of hypotheses, with plenty of
inferences. At the same time that galaxy
also has plenty of empty space, plenty of
dark matter. And what would that dark
matter be? It would be analyses that
have never been published, you know, that
have been done but are not published. It
would be data that are available but
not really available: they were
accessed maybe once by someone, but then they were lost. It would also be science
that was not done. It would be
replications that were not done because
they were thought to be "me too" research, and
also lost opportunities,
research that never flourished
because it was not funded, because there
were not enough resources to do it. So
how can we look at the existing matter
and try to understand how we can expand
that universe further? Much of the
emphasis has been on making discoveries,
but to be honest I think that discovery
is a boring nuisance. Almost every single
paper that you will read in the
literature will claim that it has some
novel results. In every grant that I
submit to NIH I promise to find
something novel, and I know that probably
I'm lying, statistically speaking. This is
a text mining exercise probing the
entire PubMed from 1990 to 2015:
96% of the biomedical literature claims
significant, statistically significant,
and mostly novel results whenever
p-values are used either in the abstract
of the paper or in the full-text papers
that are available in MEDLINE.
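A minimal sketch of the kind of text mining just described, not the pipeline actually used in that study. It assumes that p-values can be picked up with a simple regular expression and it runs on four made-up abstracts. The names `P_VALUE`, `extract_p_values`, `claims_significance`, and `share_claiming_significance`, as well as the 0.05 threshold, are illustrative assumptions.

```python
import re

# Hypothetical pattern: matches forms such as "p < 0.05", "P=0.03", "p<.001".
P_VALUE = re.compile(r"\bp\s*([<=>])\s*(0?\.\d+)", re.IGNORECASE)


def extract_p_values(text):
    """Return (operator, value) pairs for every p-value reported in the text."""
    return [(op, float(value)) for op, value in P_VALUE.findall(text)]


def claims_significance(text, alpha=0.05):
    """True if any reported p-value is below alpha (or stated as '< alpha')."""
    for op, value in extract_p_values(text):
        if op == "<" and value <= alpha:
            return True
        if op == "=" and value < alpha:
            return True
    return False


def share_claiming_significance(abstracts, alpha=0.05):
    """Among abstracts that report p-values, the share with a significant one."""
    with_p = [a for a in abstracts if extract_p_values(a)]
    if not with_p:
        return None
    return sum(claims_significance(a, alpha) for a in with_p) / len(with_p)


# Made-up abstracts, for illustration only.
abstracts = [
    "Treatment reduced mortality (P=0.03).",
    "No difference was observed between arms (p = 0.41).",
    "The biomarker was associated with the outcome, p < 0.001.",
    "We describe a cohort of 120 patients.",
]
print(share_claiming_significance(abstracts))
```

Only abstracts that report a p-value enter the denominator, which mirrors the "whenever p-values are used" condition in the talk. A real corpus would need a far more forgiving pattern than this one.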
Translation proceeds at a glacial pace
despite having all these millions of
statistically significant and seemingly
novel results. We do make progress, but
it's not millions of discoveries that
move to have that tremendous impact on
medical care, on patient outcomes, on
health, on living better and longer lives.
Several years ago we looked at the crème
de la crème, the most highly cited
clinical research ever published. These
are shown here with red landmarks,
milestones, and we asked: how long did it
take to get there, to have one of the
most highly cited papers in the history
of medicine? On average it took about 25
to 30 years. In the case of nitric oxide
it took 200-plus years. There are a
few exceptions where things materialize
faster. For example, I consider it one of the
most exciting experiences in my career
when, early on as a scientist, while I was at
NIH, I was involved in ACTG 320,
a randomized trial
that showed that with triple therapy we
could dramatically decrease the risk of
death and disease progression for HIV
infected patients. That trial was
published within less than four years
from the time that protease inhibitors
had been developed based on a
concentrated and communal effort to try
to do something about HIV and it did
work. So how can we shift from stories
where it takes 200 years or 30 years to
stories where it takes four years? And
also, how can we save not only the timing
but also the credibility of these
efforts? Because in that slide here I
also have some black milestones, which mark
when larger, better controlled studies were
done that showed that that crème de la
crème paper was actually either
completely wrong or exaggerated in terms
of what it was promising to have found. I
think that we are incentivizing a fake
narrative. The narrative that is dominant
is that we have an oversupply of major
true discoveries, and I think that this
is currently untenable. I don't need to
remind you how few new substances, new
drugs get licensed. Even though we're
trying to push that agenda and we have
the brightest minds in the world working
on science, the effort that we put in is
tremendous and the real progress, which
you can measure in any way you want, is
far more limited. Now this narrative
coexists with some additional urges. One
of them is: make haste.
You know, rush, patients are dying, license
new drugs immediately. I do feel that
pressure, and I think that yes, we need to
make haste, but this doesn't mean that we
should cut corners in terms of the
evidence that we need to accrue. The second urge is to be methodologically sloppy: just
get an answer right away. We have all
these huge, routinely
collected datasets, so no need to do
randomized controlled trials. Just run
something through the mill of your
computer and that's gonna be accurate.
Very often this is not accurate. And
the third urge is: avoid replication.
Don't waste time with replication, this
is "me too" effort, we need to move forth to
even more discoveries. Now the value of
discovery can be modeled. So here's a
very simple model. Let's say that R is
the pre-study odds of the research
finding being true, BF is the Bayes factor
that is conferred by the discovery data, and
H is the ratio of the weight of negative
consequences from a false positive
discovery claim versus the positive
consequences from a true positive
discovery. Then the value of the
discovery process is proportional to
true positives minus H times false
positives, or, if you work through that, it
is proportional to R times the Bayes
factor minus H. Obviously you can add
a constant: if you have two scientists,
2,000 scientists, 2 million scientists, you
can multiply that by how much effort
is going into this. R and H are rather
field-specific.
If I have wasted my life in a field
where there is nothing to discover,
that's very sad, but that field just has
nothing really useful to discover. If I
want to make progress I just need to
change fields. Sometimes maybe there are
things to be discovered, but they would
require a completely disruptive approach:
abandoning what we do and just taking a
completely new approach to trying to
attack the same question. H is also
pretty much field-specific. So the focus
should be, must be, on increasing the
Bayes factor of the work that we do, the
information content of the work that we
do. The options for increasing the Bayes
factor are numerous and they depend on
what field we're working in, but
typically running larger studies, so
getting more evidence, would help, and
also ensuring greater protection from
biases would help, which means that we
optimize our design, our statistics, our
thinking ahead of time about how we
should go about answering a question. In
the very same equation it's very easy to
get negative values. If you want to avoid
having negative values from discovery,
then one needs the Bayes factor to be more
than the ratio of H over R, and often
this is very difficult. Why is that? Most
original discoveries are claimed to come
from small studies where biases are very
common, and therefore the Bayes factor
often is even less than five, which in
the Bayesian world means very, very little.
Also, most fields currently are
working in areas where the pre-study
odds are pretty low. We're attacking
massive spaces of exploration where
there are signals to be discovered, but
the denominator of all the potential
things that we can measure and all the
things that we can analyze is probably
very large. If you have that combination,
most discoveries are operating in a
space where they have negative
scientific value. It's far more likely
that they will confuse us, that they will
generate false positives that would lead
people astray, with more resources wasted
downstream building on something that is
a false positive claim or an exaggerated
claim, rather than that they will save
the world. How do we sort out where to go
next?
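The simple value model described before this question can be put into a few lines of code. This is a sketch under stated assumptions, not the speaker's own implementation: the function names are made up, BF is taken to be the Bayes factor, and the numbers for R and H are hypothetical values chosen only to show where the sign changes.

```python
def discovery_value(r, bf, h):
    """Relative value of a discovery claim, up to a positive constant.

    r  : pre-study odds that the finding is true
    bf : Bayes factor conferred by the discovery data
    h  : harm of a false positive relative to the benefit of a true positive

    Post-study odds are r * bf = TP / FP, so
    TP - h * FP = FP * (r * bf - h), which is proportional to r * bf - h.
    """
    return r * bf - h


def break_even_bayes_factor(r, h):
    """Smallest Bayes factor for which the value is non-negative: bf = h / r."""
    return h / r


# Hypothetical numbers, chosen only to illustrate the sign of the result.
r, h = 0.1, 1.0
for bf in (3, 5, 10, 20):
    print(f"BF={bf:>2}: value proportional to {discovery_value(r, bf, h):+.2f}")
print("break-even BF:", break_even_bayes_factor(r, h))
```

With pre-study odds of 1 to 10 and equal weight on harm and benefit, a Bayes factor of 5 still leaves the value negative. Under these assumed numbers it takes a Bayes factor above 10 before a discovery claim starts to pay off.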
We need replication. We need to take all
of these tentative discoveries and try
to replicate and see what still survives
different efforts to reproduce the
results, either exactly the same way or
with different angles of triangulation.
Is that new? Not necessarily. This is
pretty much how science started in the
Western world: you had to replicate and
reproduce findings in front of the Royal
Academy, and people were watching to see
whether the apparatus would work.
Currently we mostly trust that what was
done behind closed doors or behind a
closed computer is something that we can
put trust in. We do have empirical
studies on fields where replication is
not just considered to be a "me too" type
of effort and actually replication
practices are common and these empirical
studies suggest that most of the
initially claimed statistically
significant effects are either false
positives or substantially exaggerated.
One such field is genetics. Genetic
epidemiology went through an immense
transformation over the last ten years,
moving from candidate gene studies, where
people had to come up with single
hypotheses, to testing in an agnostic
fashion, with much larger sample sizes
and consortia, the whole genome more or
less in terms of association with
phenotypes. By doing this one could go
back and assess how often the papers
that were published in the candidate gene era
were successfully replicated through
genome-wide association approaches. The
replication rate on average was 1.2
percent. This may be an underestimate
because of power considerations. I'm
willing to take that to five percent or
maybe ten percent at most, but more than
90% of these papers that were published
for many, many years, for over 20 years, in
the very best of our journals were
probably not saying much and were probably
just false positives.
Here's another evaluation: animal studies.
There are tens of thousands of animal
studies being done and I think that
they're tremendously important because
animal research is an essential,
indispensable gateway to human clinical research.
If we decide to just move directly
from very early bench discovery to
humans, I think we will be testing lots
of noise inadvertently on humans, and we
definitely don't want to do that.
However, you're all aware that for many
fields where we have hundreds if not
thousands of potential leads from animal
studies, like neurological diseases, we
hardly have any successes when it comes
to humans. For example,
dementia, or treatment of stroke: we
have a couple of treatments for stroke,
but we have hundreds of treatments that
seem to work in animal models, at that
level. One might say that the explanation
for that is that animals are very
different from humans, and I think that
there are differences. I hope I'm a
little bit different than a mouse. But
you know, my genome is not that different,
and I think that most of the problem is
not necessarily the dissimilarity of
the experimental system so much as the way that
research is done, in ways that bias can
creep in. This is a slide from a paper
where we looked across the entire
literature that we could get on animal
studies of neurological diseases, and we
found that very prominent signals of
excess significance bias, you know,
pressure to deliver statistically
significant results, could be detected
across that field. Once you started
sorting out different hints or patterns
of bias, there were very, very few pieces
of evidence that remained pretty
strong. Preclinical research: in the last
eight years we have seen rapid change in
our understanding about replication and
reproducibility. This time most of the
leads came from the industry, and the
industry had every right to feel
uncertain about not being able to
reproduce papers published in top
journals by academic teams. For
academic investigators maybe this is
curiosity, maybe this is a paper in
Nature, maybe it is a way to get tenure.
For the industry it was an issue of
spending half a billion dollars on
something
that would lead nowhere. So we had
several companies that launched
reproducibility checks on high-profile,
highly cited papers coming from academic
teams, pretty much summarizing what they
had already been seeing at large scale:
most of the results could not be
reproduced in their hands.
The reproducibility rates ranged from
zero to 25%. In one such famous example,
where Amgen could only replicate six out
of fifty-three landmark studies for
oncology drug target projects, Glenn
Begley concluded that "the failure to win
the war on cancer has been blamed on
many factors, but recently a new culprit
has emerged: too many basic scientific
discoveries are wrong." We have seen
similar reproducibility checks across
very different scientific disciplines.
Psychological science has also gone
through a major transformation. This is a
paper that was published three years ago
in Science, a collaboration of 273
psychologists and their teams trying
to reproduce a hundred of the crème de la
crème papers from their top journals.
You can read these results in different
ways, but the summary, no matter how you
look at it, is that about two-thirds of
the time the original result could not
be reproduced, and obviously the
effect sizes were much smaller compared
to the original. What if you only read
Nature and Science? This is another
reproducibility check of 21 papers. On
average the reproducibility effort
revealed an effect size of 50% of the
original, and in many cases there was
nothing there: the effect was completely
in the vicinity of the null with
pretty tight confidence intervals. Does
it mean that the reproducibility check was
correct and the original was wrong? I
mean, all of these cases are efforts
where people systematically tried to
follow the exact recipe of the original
study, and they even communicated with
the investigators, they even tried to
make sure that they had all the details
that would be critical. So there could be
many explanations. It could be that both are
correct, but for some reason that we don't
know they disagree. It could
be that both are wrong, because again
there are some biases creeping in
that we're not familiar with. Or it could
be that one of them is not correct. But
clearly, if you have a situation where,
under the very best efforts to reproduce,
you cannot get something to work again,
one has to wonder: would it work if we
were to use that widely, in real life, in
patients and communities in the real
world? Reproducibility efforts can be
tricky and they can also be emotional.
They can lead to what I call the
reproducibility wars. The reproducibility
effort on cancer biology, for example,
started publishing a number of papers
where a very meticulous effort was made to
reproduce high-profile cancer
biology papers.
Everything had been
prespecified, there were preregistered
reports, the results were very clear, but
then most of the time, if the original was
not reproduced, the original
investigators would fight back, and you
end up in a situation where you feel
that reputations are at stake. There are
very fierce emotions about who is right,
who's wrong,
careers are thought to be at stake, and
interpretations can be different on what
is successful and what is unsuccessful.
This means that people do care about
reproducibility, and they should care.
It's really a central piece of the
scientific method. However, what exactly
do we mean by reproducibility? What is
research reproducibility? If you look
across all 22 major disciplines of
science there's a rapidly increasing use
of that terminology, if you do a text
mining exercise like what I'm showing
you here. But people mean very different
things.
Basically you can separate
reproducibility into three main clusters.
One is reproducibility of methods, which
is the ability to understand or to
repeat as exactly as possible the
experimental and computational
procedures. So availability of software,
of scripts, of data, and the ability to put
them together and get the same result
from the very same data. Reproducibility
of results, which means that we're doing
yet another study with new participants,
new samples, new observations, and we hope
to get a result that is consistent,
compatible, ideally as close as possible to the
original. And reproducibility of
inferences, which means that we have one
study, a replication or multiple studies,
or a body of evidence, and I ask people
in the audience, what do you conclude out
of this? And we may or may not agree
about what these data and what these
results mean. Reproducibility can be
affected by the recipe of research
practices that we apply, and you can
think of two extremes of research
practices, two stereotypes. Of course this
is a stereotype. It doesn't mean that I
have some particular researcher in mind
who's doing things wrong. But if you
think of small data and big data, this is
what that often looks like. So with
small data, which is still the majority
in most scientific fields, we have the
prototype of the solo, siloed
investigator with a small team, very
limited resources, working in competitive
environments where there's limited
funding. Unavoidably the studies that
will be done will be pretty small. One
needs to be successful. You know, you have these
three or four years of funding and
you need to say I have something major
to say, and something major that would
lead me to my next grant. Actually you
need to get going probably not after
three years but after three days to
write the next grant. And how do you do
that? Much of the time the results will
be, quote unquote, negative or
unimpressive. You need to start exploring,
searching whatever space you have
available.
There will be a lot of cherry-picking of
the best-looking results, a lot of
post hoc interpretations. P-values of
just slightly less than 0.05 are
typically considered to be enough, and
with p-hacking it's almost always
possible to get there. There's no
registration because that decreases the
degrees of freedom. No data sharing
because it offers ammunition to
competitors. And no replication, because
if you try replication and it fails, you
have invested twice the effort and
you're back to square zero, and people
say you found nothing, so that's
it for you. Some of the ways to improve
the validity of small sample size
research are very easy to implement. They
cost nothing. They would save us from a
lot of trouble. However, they are not
implemented. For example, we have known about
experimenter bias for over half a
century now.
Rosenthal published his papers more than
half a century ago, and we know that in
animal experiments or in other in vivo
studies, but also in vitro, it's great to
have the people reading the results be
blinded to the experimental conditions.
However, that happens less than 10% of
the time. Randomization, we know about it. It's
not about humans here, it's about animal
work or other experiments. It costs
nothing, it's very easy to do, and it would
save us from a lot of trouble from
imbalances that would be consciously,
subconsciously, or unconsciously
created between the
compared groups. Again, less than 30% of
the time, even in the most recent studies,
is this implemented. With big data we're
challenging the status quo of small
study research, but it doesn't
necessarily mean that our odds of success are
better. In that situation,
which is becoming more common with
electronic health records, with large
omics
databases, and other such efforts, we have
extremely large sample sizes. We have
overpowered studies that are likely to
give signals no matter what. Even if no
signals are worth detecting,
you will detect tons of them. This means
that again there needs to be a
cherry-picking process, and again a lot
of that is done post hoc. Actually many,
most of these databases are not even
sampled for research purposes. They just
happen to accrue overnight. You know, I'm
sleeping and I wake up in the morning
and there are so many more patients
that have been added to the electronic
health records that one or more
researchers could tap into. Statistical
inference tools are a bit different.
People recognized very quickly that if
they just use a p-value of less than
0.05,
everything will be statistically
significant, so there's a little bit
more sophistication much of the time in
that space, but very often you see
idiosyncratic statistical inference
tools without much consensus. People may
be working in the same or very similar
fields, but nevertheless they're using
very different approaches and very
different thresholds for claiming
success. Registration remains an
exception also in that space, and data
sharing does happen, probably more often
than in fields that use small datasets,
but unfortunately much of the time
there's no understanding about what
exactly is being shared. Most of the time
these datasets are just dumped somewhere
with some access, where someone could
more or less easily get access to
them, and then you ask, does anyone know
what these data are and what they show?
And literally even the investigator
who has generated the data probably does
not know, because they're just a big
black box, and it is very, very hard to
interpret what exactly has been
generated, let alone how valid it might
be. Replication does happen in some
fields, and I showed you some examples.
Even when it does not happen, we may
have a situation where people say I'm
doing a study that is somehow different.
Actually there's an incentive to justify
that I'm doing a different study. But
then you look at the two studies and say,
well, you're asking exactly the same
question, but to get it published someone
has to say it is different. Meta-analysis can take these
similar studies and examine them in
their totality, studies on the same
question with pretty much the same
comparisons with pretty much the same
interventions or the same risk factors
or the same hypotheses being addressed
and give us a sense of heterogeneity
between these different studies that
address fairly similar questions.
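One common way to put a number on that heterogeneity is Cochran's Q together with the I-squared statistic. The talk does not name a specific measure, so this choice is an assumption, and the function name and the five studies below are made up purely for illustration.

```python
def fixed_effect_summary(effects, std_errors):
    """Inverse-variance pooled effect with Cochran's Q and the I^2 statistic."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I^2: share of the variation across studies beyond what chance would give.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i2


# Made-up log hazard ratios and standard errors from five hypothetical studies.
effects = [-0.40, -0.35, -0.05, 0.02, -0.10]
std_errors = [0.20, 0.18, 0.06, 0.05, 0.08]
pooled, q, i2 = fixed_effect_summary(effects, std_errors)
print(f"pooled={pooled:.3f}  Q={q:.2f}  I2={100 * i2:.0f}%")
```

In this invented example the two smallest studies (largest standard errors) report the largest effects, which is the kind of pattern that can hint at bias rather than genuine differences between studies.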
Heterogeneity may be genuine. Much of the
heterogeneity may reflect genuine
differences across these studies. However,
it can also point to differences
that stem from biases that are
differentially expressed or that affect
different investigations differentially
on the same question. If that's the case,
you should be able to pick up some hints of
the presence of a bias because it leaves
a pattern of a particular type of
diversity a particular type of
heterogeneity across the results that
are being combined in a meta-analysis. So
this is pretty much what we did here. We
pre-specified 17 patterns of bias and we
thought about what the literature would look
like if these biases were present. Then,
for all the meta-analyses that we could
find across all scientific fields (when
there were many, we
just selected a random sample thereof), we
tried to ask: do we see these patterns?
Most of these biases could be seen in
most fields. However, some biases seemed to
have more common hints of their
presence in some fields compared to
others. For example, small study effects,
which is a pattern where small studies
give more prominent, perhaps exaggerated
results compared to larger studies, was a
pattern that was seen very prominently
in the social sciences, was seen
prominently in the biomedical sciences,
and was not really seen, or only very soft
signals were seen, in the physical
sciences. Physical sciences are much
more used to working with large
datasets, with communal science, with CERN
or big telescopes, sharing all the
information across all the participating
physicists or astrophysicists who are
working on a particular domain of
science. The same applied to other
disciplines. Sometimes one or another
field had more of a signal of one bias
or another, but most biases were seen in
most scientific fields. What are some
potential solutions? There are many
solutions that are being proposed.
They're integral to the process by which we
do science, and you can think of the
reproducibility problems as a central
concern in everyday experimentation, in
everyday study design, in everyday ways
of running science. This means that lots
of people are thinking about them and
some very smart solutions have appeared,
but also many of them are very
speculative.
They have no empirical support to tell
us that they will make things better
rather than make things worse because of
collateral damage that they may cause.
Here are 12 families of solutions:
large-scale collaboration, adoption of
replication culture as a sine qua non,
registration, sharing of data,
reproducibility practices and checks,
containment of conflicted sponsors
and authors and other people involved,
more appropriate statistical methods,
standardization of definitions and
analyses whenever this is possible, more
stringent thresholds for claiming
discoveries or successes, improvement of
study design standards, improvement in
peer review, reporting, and
dissemination of research, and finally
better training of the entire scientific
workforce in methods, in the scientific
method and how it should be applied, and in
statistical literacy and numeracy. I hope
I have convinced you by now that some of
the most successful fields in terms of
credibility are those that have adopted
large-scale collaboration and a
replication culture. This is an
example with Manhattan plots. Genetic
epidemiology was transformed from a
field where almost nothing could be
replicated to a situation where we can
pretty much safely reproduce signals. I'm
not dwelling on whether these signals are
useful. We don't know whether that's
something that you can take to patients
and really change their lives, but at
least they're there. Sometimes the
signals may be true but maybe they're
not going to be useful. This is perfectly
fine. You know, we know what we're dealing
with, we know what the true complexity is
of the research questions that we're
facing. Registration can be tricky in
many situations. The last thing that I would
like to see is people who are doing very
interesting exploratory research feeling
that they need to say, "I have
registered my study." If you wake up at 3
o'clock after midnight with a fancy idea,
and you cannot go back to sleep and you
need to run to the lab to try it out, and
you're just messing right and left with
different possibilities, probably you
didn't write a protocol before you
started doing that. But you know,
something very interesting may come out
of it, and maybe you are the next
Alexander Fleming with penicillin.
How likely is that to be the case? Not
very likely, but there are 20 million
scientists authoring scientific papers, so
a few among those 20 million will be
Alexander Fleming, hopefully. What we need to do in that case is just say that that
was exploration. It was mad, wild, crazy,
exciting, fascinating exploration. That's
what came out of it. Now someone needs to
reproduce it in a prospective
reproducibility effort. Level 1 would be
registration of a dataset. I feel very
uneasy when I'm working in a field where
I don't know how many datasets are out
there that could be probed and in how
many different ways they can be probed.
Registering a dataset is like
registering a nuclear arsenal. So I'm
telling you that I have
this big dataset in my computer, it
includes observations on two million
people, and I have two thousand variables
on each one of them, which means that
tonight, if I feel depressed, I can press
a button on some statistical software
and launch so many billions or trillions
of chi-square p-values against you. So
it's one way to convey the
breadth of the possibilities of analysis
that can be done.
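A rough count shows how quickly two thousand variables reach the billions mentioned here. This is only a sketch: what counts as one "analysis" is an assumption (a test of one pair of variables, optionally stratified by exactly one further variable), and the function names are made up.

```python
import math


def pairwise_tests(n_variables):
    """Number of distinct variable pairs that could be cross-tabulated."""
    return math.comb(n_variables, 2)


def pairwise_tests_with_one_stratifier(n_variables):
    """Pairs, each stratified by exactly one of the remaining variables."""
    return pairwise_tests(n_variables) * (n_variables - 2)


n_variables = 2000  # the figure quoted in the talk
print(f"{pairwise_tests(n_variables):,} pairwise tests")
print(f"{pairwise_tests_with_one_stratifier(n_variables):,} with one stratifier")
```

Under these assumptions there are 1,999,000 unstratified pairs and 3,994,002,000 pairs with a single stratifier. Adding subgroups, transformations, or further adjustment variables multiplies the count again.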
Level 2 would be registration of a
protocol, if a protocol exists. There's
lots of research where there's no
protocol, and sometimes it's just blue
sky exploration, but most of the time a
protocol is feasible and should be there
before something is done. Time spent on
coming up with a protocol I think is
always well spent. Level 3 would be registration of the
analysis plan, if there is an analysis
plan. Sometimes you have to acknowledge
that this is the analysis plan that I
could think of ahead of time, and these are
some modifications that arose because of
some peculiarities that I had not
anticipated. Level 4 would be
registration of both the analysis plan
and the raw data, and Level 5 would be
open live streaming, where you iteratively
communicate with the rest of the
scientific community about what
experiments you are planning to do, you
receive feedback, they are revised, they're
done, they're shared again, and this is an
iterative open process with the whole
community. This is pretty much how the
claim by NASA that they had identified
bacteria using arsenic instead of
phosphorus for their DNA backbone was
refuted. It was a paper in Science. In the
absence of having some rules in the
science game, in many fields any result
can be obtained. What you get is what I
call extreme vibration of effects, and at
the extreme you get the Janus phenomenon,
Janus being a Greco-Roman god who could see
in opposite directions.
These are some data from the National
Household Survey extremely meticulously
collected information and if you focus
on the right panel
this is hazard ratio and there is a
horizontal axis minus log 10 p-value on
the vertical axis, this is the
association of alpha tocopheryl or
vitamin E levels with the risk of death
and there's 1 million points on that
plot, there's 1 million different results
that one can get in the very same data
set on the very same question just
analyzing the data slightly different
For example, death can be affected by
zillions of other factors, so if you
count for a factor or not you have two,
choices if you have 19 such choices to
make this is 2 to the 19th power this is
1 million different possibilities of
analyzing the very same data on the same
data set on the same question. 70% of the
time vitamin E decreases the risk of
death, 30% of the time vitamin E
increases the risk of death. If I have a
strong belief on what vitamin E should
do, I can get that result no matter what.
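A minimal sketch of this vibration-of-effects idea in Python, using simulated data rather than the survey data on the slide. Everything in it is an assumption made for illustration: the function names, the four covariates, the sample size, and a continuous outcome fitted by ordinary least squares instead of a hazard ratio from a survival model. The exposure has no true effect on the outcome, yet the sign of its estimated coefficient depends on which covariates are adjusted for. As a side note on the counting, 2 to the 19th power is 524,288 and 2 to the 20th is 1,048,576, so the one million quoted in the talk is best read as a round figure.

```python
import itertools
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ols_coefficients(columns, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(columns)
    xtx = [[sum(u * v for u, v in zip(columns[i], columns[j]))
            for j in range(p)] for i in range(p)]
    xty = [sum(u * v for u, v in zip(columns[i], y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def vibration_of_effects(n=500, n_covariates=4, seed=0):
    """Fit one model per subset of covariates; return the exposure estimates."""
    rng = random.Random(seed)
    covs = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_covariates)]
    # Hypothetical data: the first two covariates drive the exposure in the
    # same direction and the outcome in opposite directions. The exposure
    # itself has no effect on the outcome.
    exposure = [covs[0][i] + covs[1][i] + rng.gauss(0, 1) for i in range(n)]
    outcome = [covs[0][i] - covs[1][i] + rng.gauss(0, 1) for i in range(n)]
    intercept = [1.0] * n
    estimates = {}
    for size in range(n_covariates + 1):
        for subset in itertools.combinations(range(n_covariates), size):
            columns = [intercept, exposure] + [covs[j] for j in subset]
            # Index 1 is the coefficient of the exposure column.
            estimates[subset] = ols_coefficients(columns, outcome)[1]
    return estimates


estimates = vibration_of_effects()
n_specs = len(estimates)
share_negative = sum(b < 0 for b in estimates.values()) / n_specs
print(f"{n_specs} specifications, {share_negative:.0%} with a negative estimate")
print(f"range: {min(estimates.values()):.2f} to {max(estimates.values()):.2f}")
```

With k yes-or-no adjustment choices the loop fits 2 to the k models, which is why the number of possible analyses grows so quickly as choices are added.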
This is really happening on a daily
basis; it is happening in some of the
most influential papers that you will
see in the literature. This is a paper
that last year was one of the 20 highest
impact papers across all science and
practically it concluded that with 3
cups of coffee per day your risk of
death decreases by 17%.
If these wonderful colleagues could give
me their data set I can make it be that
your risk of death will increase by 17%;
it's an open pledge. Transparency:
how can we find out that we can even
trust the data that there's not an a
missing from transparency for example. We need to find ways to make sharing easier.
This is a pivotal study, study 329;
it sounds like a submarine, U-329, but
actually it's a randomised trial and
when it was done and published in 2001
by SmithKline Beecham
it showed that paroxetine and imipramine
are very effective and very safe for
major depression in adolescents. 15
years later the very same data set was
reanalyzed by independent, non-industry
affiliated investigators and they
concluded that paroxetine and imipramine
are not effective and they are not safe
for major depression in adolescents. How
often does it happen? A few years ago we
looked at all the reanalyses that had
been done on the same clinical question
from the same clinical trial data set.
These are reanalyses rather than
replications, but if we cannot reanalyze
and have some confidence, why should we
even take the second step of doing a
separate independent replication in a
new study? We found 37 such reanalyses,
and 35 percent of the time the
conclusion the main conclusion of the
reanalysis was different compared to the
conclusion of the original paper. This
treatment should be used;
no, this treatment should not be used. It
should be used in this subgroup; no, it
should be used in that subgroup. Was it
research parasites who had run these
reanalyses? Was it rogue analysts who
were trying to make a career for
themselves or put, you know, the great
original investigators to shame?
Almost always it was the same original
investigators who published the
reanalysis, but they published it in an
environment in a publication environment
where if they were to say that I tried
to reanalyze my data and I found exactly
the same thing and I conclude exactly
the same conclusion someone will tell
them why did you waste your time you
know this is duplicate publication. So
our incentive system selects for
confusion, selects for discordant results
even for things that are done by the
same investigators. Recently we revisited
that pattern
looking at another extreme where
reanalysis actually should have been the
default. PLOS Medicine and BMJ have
policies in place that if you publish a
randomized trial you need to make not
only the full protocol but also all the
raw data available to anyone who would
ask for them. So along with
Florian Naudet and other colleagues from
my METRICS team we invited investigators
who had published under this policy to
share the data with us and we promised
that we would reanalyze them at no cost
for free. It's a little bit like getting
an invitation from the IRS, but I'm
really glad that almost 50% of these
investigators actually did send us their
data and they also were very helpful
with that, and we re-analyzed the data. We
found a few errors but nothing major; the
conclusions would still remain the same.
So you have two extremes: one is a highly
selective environment where people are
just trying to impress and share very
little, and the other is also a very
selective environment where everyone is
willing to share, at least 50% are
willing to share, and they're very open
to having their data reanalyzed and then
everybody looks fine. One might argue
that maybe these people who did
contribute their data took an extra look
and if they found any major errors they
made sure that they sent us a version
that would be compatible. I don't want to
become paranoid. I believe that if we
create the culture of sharing, most of
the results that we will get, if people
are well trained, hopefully will be
reanalyzable and reproducible at that
level at least. So how do we improve
sharing? Sharing is a challenge in itself
and when we tried to retrieve
the raw data behind the most cited
papers in psychology and psychiatry we
met with quite a lot of resistance.
You know, these trials and these
studies were not bound by the PLOS
Medicine and BMJ
rules so they might or might not
share their data with us. We
realized that in some cases they had made
their data available already and in a few
more they were willing to share that
information, but very often they could
not or did not. What were the reasons for
that? The most common reason was that
it was outside of the researchers'
control. I have come across many, many
situations where researchers do not
control their own data. Sometimes
researchers published as first or last
authors or both and they have never
seen the data that they put their names
on; it's really scary. Legal and
ethical concerns, you know the consent
did not allow that and that was done in
the 1990s or so. Preparing our own
sharing system, the data no longer exists--
the classical example in the literature
the data has been eaten by termites it's
a famous quote--insufficient resources or
researchers are still using the data, and the
question is for how long: for six months,
for two years, for 20 years? Are we making
any progress in sharing? We are. A few
years ago, along with Muin Khoury and
Sherri Schully from NIH, we looked at the
reproducibility research practices and
transparency practices across the
biomedical literature and we found that
hardly anything was being shared between
2000 and 2014. There are some niches like
genomics that are doing this, but in the
big view, zooming out, that was very
very uncommon. Conversely, when we looked
at 2015 to 2017 there was real progress
and in some cases that progress can be
explained because journals did change
their policies. For example in psychology,
with that revolution of reproducibility
some journals switched to encouraging
routine sharing with a badge system and
you see a rapidly increasing
rate of sharing data sets in the
published papers in these journals. But
in biomedicine I think it has been more
of a diverse movement where multiple
journals, multiple fields, multiple
investigators, multiple institutions are
incentivizing or facilitating sharing
so the rate has gone from literally
close to 0% to something like 25 percent
by 2017. We also see some other
concomitant changes in transparency
indicators. For example a disclosure of
funding or disclosure of conflicts of
interest has gone up over the years. Most
people still claim that they have
something novel to say even if you look
at the abstract it's very prominent but
there are more people who say that I'm
trying to replicate something or at
least trying to do something novel but at
the same time replicate existing
knowledge. Computational methods can
facilitate much of our reproducibility
quests. A couple of years ago we
published these guidelines trying to
enhance ways that journals, investigators,
and institutions could move forward in
improving their reproducibility for
computational methods including software,
scripts, and links to data. Very often you
see a link and you click on that link
and there's nothing there, you get an
error signal, so there's some very easy
interventions that can improve the
availability of these functional links
but there's also some more sophisticated
ones that can take us substantially
further. Better statistics and methods. We
have seen a transformation of research
over the last several decades and data
science has taken a central role across
multiple scientific fields. Is that new?
Well science has always been about data
but I think that while in the past it
might have been easy to work with small
datasets for descriptive purposes
currently you really need a license to
kill, a license to analyze, and most of the
time most of the papers that I see
probably are done by investigators who
don't have that license to analyze. It's
very uncommon to see transparent
statistical analysis plans. We repeatedly
lament that we don't invest much in
training our investigators in statistics
and issues of design, and
good study designs are underutilized. I
mentioned the very simple choices of
randomization and blinding of
investigators that are so underused. How
do we do that? Do we just use some
checklists? Do we add another level of
bureaucracy for example and we say well
fill in that checklist? Is that enough? If
someone has done something in a way that
is very substandard, if it's really silly,
how likely is it that it will be
acknowledged rather than someone just having
checked that checklist: so yeah, I did
that, I'm okay? It's a
continuous tension. Are we using the
right statistical methods? There are many
pleas to try to change the way that we
make statistical inferences and I'm not
going to spend much time on them because
each one of them has a different
philosophy in mind so one such plea is
to become more stringent. Many fields
that have improved their track record
did become more stringent; genetic
epidemiology moved from p-values of 0.05
to a genome-wide significance level of
10 to the minus 9 and you know things
seem to be working much better in terms
of reproducibility. For many fields that
are working with p-values of 0.05 a very
simple move would be to shift to 0.005
for statistical significance and that
would immediately eliminate probably a
very large segment of noise but that
would also take away some genuine signal
so this needs to be balanced in each
field in terms of whether it's a good
idea or not. Most fields are still using
null hypothesis testing but probably
this is not a good choice for many
applications including for example
developing a prognostic score or
assessing a diagnostic test or
evaluating a therapy or mining
electronic health records or mining big
data. We need to find statistical methods
that are fit for purpose and very often
these may be Bayesian, they may be false
discovery rate based, sometimes they may
well be frequentist. Who's going to do
that? We don't really invest in training
our investigators and retraining them
with continuous education on
statistical methods and design issues;
we just try to catch up with the next
tool or next technical tool that may be
available but not the core of the
scientific method. Conflicts of interest:
I think that there is improvement in
transparency of reporting conflicts of
interest but very often I wonder is
transparency enough. Can we leave the
generation of original knowledge and
replication to conflicted stakeholders?
Who are the stakeholders who should be
running sensitive studies like
randomized trials, meta-analyses,
cost-effectiveness analyses, guidelines? NIH
has been shifting away from most of
those,
and there's some empirical evidence that
if you look at large randomized trials
supported by NIH, before registration a
good proportion of them were
statistically significant in their
results; after registration this happens
very very uncommonly. Should public
entities like NIH resume and expand
their role in supporting sensitive
research where conflicts really need to
be avoided thoroughly if we want to have
full trust in them? Should we wait for
perfection? No. Science will never be
perfect. It's the best thing that has
happened to human beings, to Homo sapiens
sapiens but it's a process in evolution.
We need to use the best science that we
have. Very often I hear, okay, let's get
rid of all that junk, that horrible
research, and we know nothing; that's not
true, you know, we have lights, we have
this wonderful amphitheater, the
projector is working, I can move my
slides, lots of things have happened and
I think that we need to defend science
and there will be many anti-science
voices that are trying to dismantle the
scientific effort. There are two ways
that we can go about it. One is to say
that science is perfect, and probably
that's not going to leave us very much
room to play because very quickly we
will meet with the scientific method
itself, which says that science is
falsifiable and very often we may be
wrong. Or we can say that science is our best
shot, and I think that this is what we
need to defend. Sometimes our best shot
probably will have lesser credibility
compared to other times. We know with
high certainty about climate change, we
know with high certainty that tobacco is
going to kill people, we know far less on
whether broccoli is gonna make me live
longer. So we need to be transparent
about what we know and what we do not
know. We need more research on research
and I'm clearly biased here because this
is where I am investing my effort so
probably it's like I'm asking for more
funding but I think that we need to
study how exactly to evaluate our
research practices. We need to find ways
that you can refute everything that I
told you today with more empirical data
and with better science. We need to find
how we can best perform research,
communicate research, verify research,
evaluate research, and reward research.
What scientific workforce and what world
of science are we envisioning? We can model
that; we can model science in 2030 or
2040. Is it going to be accurate? Well, I
think that we may well be wrong, but this
is one such model of science in the
future.
We used 11 equations to try to
describe a simplified universe of
science where you have three types of
scientists: you have the diligent
scientists who are the majority,
you have the careless cohort of
scientists who might be cutting a few
corners or maybe they're not very well
trained and they may be following some
suboptimal research practices. How many
are they?
There are many surveys about that. When
you ask people, are you cutting corners,
the answer is usually no; when you ask
people, do you know other scientists in
your environment who are cutting corners,
the answer is almost always yes. But I
think that, let's say, these are the
minority, and then you have the unethical
cohort, which is clear fraud, you know,
creating data that don't exist. Fraud
is very, very uncommon; we're talking
about less than 1%.
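A toy sketch in Python of a three-cohort setup like this one. It is not the 11-equation model mentioned in the talk: the function name, the update rule, and every number in it are hypothetical and chosen only for illustration. Each round, a cohort's share of the workforce grows in proportion to the credit it earns, and the sketch compares a reward that counts only output with a reward that weights output by how often the work reproduces.

```python
def simulate_cohorts(reward_reproducibility, generations=30):
    """Return workforce shares of three cohorts after repeated reward rounds."""
    # All values are invented for illustration, not estimates from any study.
    shares = {"diligent": 0.80, "careless": 0.19, "unethical": 0.01}
    papers_per_resource = {"diligent": 1.0, "careless": 1.5, "unethical": 2.0}
    reproducible_fraction = {"diligent": 0.8, "careless": 0.4, "unethical": 0.05}
    for _ in range(generations):
        reward = {}
        for cohort, share in shares.items():
            credit = papers_per_resource[cohort]
            if reward_reproducibility:
                # Only the reproducible part of the output earns credit.
                credit *= reproducible_fraction[cohort]
            reward[cohort] = share * credit
        total = sum(reward.values())
        # Next round's shares are proportional to the reward each cohort earned.
        shares = {cohort: r / total for cohort, r in reward.items()}
    return shares


equal_incentives = simulate_cohorts(reward_reproducibility=False)
weighted_incentives = simulate_cohorts(reward_reproducibility=True)
for label, result in [("output only", equal_incentives),
                      ("reproducibility-weighted", weighted_incentives)]:
    summary = ", ".join(f"{cohort} {share:.1%}" for cohort, share in result.items())
    print(f"{label}: {summary}")
```

Under these made-up numbers the cohorts that produce more per unit of resource crowd out the diligent one when only output is counted, and the ordering reverses once credit depends on reproducibility. Different parameter choices would change how fast this happens, so the sketch shows the mechanism rather than any forecast.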
If you incentivize these cohorts with
the same incentives if you discover
something if you make a claim you've
published a paper in Nature you will be
fine and you don't have differential
incentives based on their
reproducibility what you will get is
that the unethical and the careless
cohort will take over. The reason for
that is that they can get there faster
with fewer resources with cutting
corners. So we need to re-engineer the
reward system and I will leave you with
a couple of slides on how we do that. We
try to incentivize productivity and
productivity is wonderful but we need to
think about the whole electrocardiogram;
we need to think of P-Q-R-S-T, which is also
quality, reproducibility, sharing, and
translational impact. We need to find
opportunities to change the way that we
do science in our everyday environment. I
think it needs to be a grassroots
movement; we cannot just wait for some
king or queen of science to impose with his or
her authority what
exactly should be done.
I think it's scientists who realize what
makes their scientific work more
credible more reproducible more
applicable. We also need to convince
other stakeholders and they may have
other priorities than scientists.
Scientists want to publish a lot, they
also want funding for their work but
there's also the industry there's
private investors, public funders,
not-for-profit funders, editors,
publishers, societies, universities,
research institutions, non scientific
staff, hospitals, insurance companies,
governments, federal authorities, people.
Some of them want to see papers others
want to see funding, some want to see
things that work, and others want to make
profit. It's all fine but it needs to be
integrated to try to get the best
possible science. To conclude I think
that the presumed dominance of original
discovery over replication is an anomaly;
it may happen occasionally but it's
really the exception. Original discovery
claims typically have small or even
negative value and science becomes
worthy mostly because of replication. The
reproducibility and the usefulness of many
disciplines of scientific investigation
have substantial room for improvement.
This doesn't mean that science is not
good; it's the best that can happen
to humans, but we can really make it better.
There are many possible interventions
that may improve the efficiency of
research practices and the
reproducibility and utility of the
evidence and transparency, openness, and
sharing are likely to help but details on
how to do it can be important and we need to
find out about how these details play
out in different settings. A million
thanks to all of you for listening and
special thanks to a number of
collaborators that have joined forces
with me over the years to generate some
of the empirical evidence that I shared
with you today. Thank you.
I'm one of the biggest fans you have,
and my lab is definitely, you know,
reading your papers. I think it's
absolutely true that the incentive system is
completely wrong. Science, as you showed
it, is really a system, right? And it
works based on the incentives that are
put into it. I was recently tenured here;
in ten years of my tenure track nobody
ever gave me any credit for
reproducibility. Almost, when we present
data at the conferences, I think we are
enemies of the majority. When we publish,
I mean, my lab probably hasn't
published a study where we didn't include
an independent validation cohort in the
past five years,
we usually get from reviewers, oh, they
should have a second independent
validation cohort, right, when, you know, 90%
of everything that is published or
presented doesn't have even the first
one, right? So unless we change the
incentives, unless we use for example
these H factors that factor in what
kind of methodological, you know, advances
we have used, or this checklist, right,
what we did, whether our work is
reproducible, irrespective of whether
it's positive or negative, the science
will continue the way it is continuing.
I fully sympathize with what you
described and I think that this is a
feeling that many scientists working in
very different fields convey to me all
the time, but I think that there is
progress. I think that there are fields
where you have a dominant view that this
is important to do and I think that
these fields are likely to be more
successful in the long term, and it's not
going to happen overnight. I think that
we need to focus on training younger
scientists,
hopefully retraining some of the older
ones, about what really matters and
why we are doing this and why it is
important to do good science rather than
just sloppy science. I think this is
like an everyday effort and everyday
struggle; it's not one course that one
will take, because I sometimes hear the
question, is there one course that I can
take? No, this is science in continuity,
it's your everyday scientific
method in its application, and
there's no single course that can
replace your experience as a
scientist. Thank you for that talk and I
agree it's going to take a cultural
change to move forward. Considering we're at
NIH and there might be a lot of program
staff in the audience, do you have any
specific suggestions on the funding side
as the major funder of biomedical
research? So I think all of the
suggestions that I made clearly apply
to NIH, and NIH has tremendous power to
facilitate some of these principles, and
I think that NIH has moved in that
direction on many fronts. Probably not
all the movements have been equally
evidence-based, but I think that there are
people who are receptive to that message
and they do want to change the way
that things are done. Clearly if you have
NIH giving priority to these incentive
structures, giving priority to openness
and sharing and focusing on good work as
opposed to just quick and quote-unquote
successful, it can have a tremendous impact.
Yes, there was a high-profile case of
Theranos where
statisticians called them out and said
that this was not going to work, Theranos
being the biotech company, and so
this was done with incomplete knowledge
and without statistical
data from the company. So the question is,
what does the gut instinct of a
reviewer tell you with respect to how something
is going to turn out? And it would be
interesting if you took the gut
instinct of a reviewer saying, I
think this is going to turn out to be
profound,
versus, this has turned out to be fake,
regardless of the underlying statistical
value. Could you comment on that?
Because I think that would be really
important in terms of biotech IPOs. So
transparency applies just as well in
the case of biotech and startups and
unicorns. Actually I was the first to
write a critical paper on Theranos in
JAMA, about a year before John Carreyrou
started publishing his Wall Street
Journal investigations. I sent that
to JAMA in 2014; it went through not only
review but also legal review because I
was challenging the highest valuation
startup in the country, one that was visited
by vice presidents, and all the
influential people were on the
advisory board and everybody was so
excited, and I was saying they have no
evidence to support what they claim, I
don't see any paper that they have
published, I think their valuation is
nine billion but it could be just nine
dollars. The paper was published
eventually in JAMA. I got a lot of
pushback at that time. I heard from their
general counsel asking me to recant and
saying that they were getting FDA approval. When
they did get their first and single and
only FDA approval, all of that way
before the Wall Street Journal, I was
told again to recant and write an
editorial with their CEO that I was
wrong, and I remember the Washington Post
writing a story like, the insanely
influential Stanford professor who's
asking too much of Theranos, and, you
know, what a shame, you don't understand
basic things, this is how innovation is
happening. People now think that Theranos
was an exception, you know, it was
fraud and everybody's fine otherwise. I
don't believe that. A couple of months
ago we published a paper where we looked
at every single unicorn in the health
care space. The majority of them looked
pretty much like Theranos in
terms of their transparency and
availability of published, peer-reviewed
science. So I'm not saying it's fraud, but
unless we improve transparency and
science for these entities
I think that we are running the risk of
having deja vu Theranos times two times
three times four in the near future.
Thank you. Hi, thank you very much for
that talk. I'm Ian Hutchins, a data
scientist here at the NIH. I spend a lot
of time developing and using research
assessment metrics for portfolio
analysis and one of the things that I've
observed is that there seem to be a lot
of cultural and policy barriers to
reproducibility efforts. For example,
journals will often have in their
editorial policies that they won't
publish replication studies; I think even
PLOS One until very recently had that
policy in there. And I haven't looked
specifically, but one imagines that
applications for funding that focus
specifically on replicating a previous
study don't fare particularly well in
the general peer-review system. So
how do you think that that logjam can
best be broken up? So I think we need
more training and we need more people to
understand what is at stake here. I
believe that many journals have changed
their stance over time and many of them
are very sympathetic. Many fields that
have adopted replication massively, like
genetics, would not allow you to publish
something unless it's extensively
replicated, so it's become a sine qua non,
it's the most integral part of the whole
world. There's many other journals that
are probably lagging behind and we need
training. In a recent study where
editors-in-chief of respectable journals
were asked to perform an investigation,
about a third of them could not even
tell that a study was a randomized trial.
So the methodological expertise is often lacking
and we need to just keep pushing that
message and educating and training
people to be more knowledgeable about
how things are working in the way
that we do science. I think that the best
journals will be the ones that publish
the best science eventually and I think
that these will survive. I'm thinking in
an evolutionary mode in that regard. I
don't want to think that the best
journals are going to disappear and
we'll just get more and more noise. Thank
you.
So we've been told we need to exit the
auditorium now for another group coming
in. However, we have people in line
to ask questions; please come join us in the NIH
library and ask your questions there.
Thank you.
