JONATHAN SHEFFI: Thanks so much.
Good afternoon, everyone.
AUDIENCE: Hello.
Hi.
JONATHAN SHEFFI: They
tell me the first rule
of public speaking is to make
sure the audience is awake,
so I just wanted to test.
I know we're late
in the day here.
But thank you so
much for coming.
And please, of course, at
the conclusion of this talk,
don't forget to fill
out your feedback.
It's right in the
Google Cloud Next app.
So, my name is Jonathan Sheffi.
I am our Senior Product
Manager for Life Sciences
at Google Cloud.
Now, over the last couple of days, you've heard about a lot of the amazing technologies that we're developing at Google Cloud, spanning the whole gamut.
You've heard about advances in BigQuery, our containerization technologies, and our compute technologies.
We also have a set of
teams that are really
focused on verticals--
focused on advancing
specific industries.
So today, I'm going to tell you
a bit about the organization
that I'm a part of, the Google
Cloud Healthcare and Life
Sciences team.
Tell you, specifically, about
our genomic data processing
solution and what that
entails-- the set of products
and partnerships that
make that come to life.
And then I've got a couple of
special guests, so Anita Nandi
from Oxford BDI--
Oxford Big Data Institute--
will be joining to
talk about her work
with the Mosquito Data Atlas.
And then Vivek Iyer who is from
the Human Genetics Informatics
team at the Wellcome
Trust Sanger Institute
will be talking about his work.
So, at Google Cloud
Healthcare and Life Sciences,
we are thinking about a
lot of interesting problems
because there's so many great
data problems in this industry.
So, not just life sciences,
but also clinical data sets.
If you stopped by the Healthcare and Life Sciences industry booth earlier, you may have heard about our de-identification API, which makes it really easy to strip out PHI from your EMR data or medical imaging data.
You may have heard about our Healthcare API, which makes it really easy to ingest EMR data from any FHIR or HL7v2 endpoint.
And also our DICOM API,
focused on medical imaging--
makes it really easy to extract
metadata from medical images
and work with it
in Google Cloud,
enabling our customers
to do analytics
on millions of patients.
So, our team is composed
of product managers,
such as myself, a large
set of software engineers,
our solution architects who
help our customers bring all
this to life and
get it deployed,
and our commercial team,
who brings deep domain
expertise from the healthcare
and life sciences field.
Most of us come from the
industry, myself included.
I, myself, was really
originally trained
as a computational
biologist, spent many years
at some of the pharma
companies in the industry.
So, in life sciences,
it turns out
the problems don't stop
just at the wet lab.
The amount of data that we're
seeing in the life sciences
space is exploding.
That's not news to anybody who's come to this talk. With the decreasing cost of whole genome sequencing, the amount of data being stored, processed, analyzed, and managed is doubling roughly every seven months.
That creates new problems for us
to help solve as Google Cloud.
More importantly, the applications really matter when it comes to drug development.
How do we make this
happen in the real world?
Genomics is quickly becoming
very much a part of the drug
development process.
Francis Collins predicted almost 15 years ago that most drugs would come from an understanding of human genomics.
We're about up to
a third these days,
and that's really
quite something
to know that our understanding
of the human genome
is already informing therapies
that are being approved now.
So, that brings us to our
genomic data processing
solution.
We see lots of organizations
who are generating data off
of DNA sequencing instruments,
and they need the storage
and compute
technologies in order
to process this data at scale.
They need to be able
to ingest their data,
manage workflows, and control
access to the data sets,
and be able to run those
workflows over and over again
on millions of samples.
So, a great example of
this is our collaboration
with the Broad
Institute back home--
or my home-- in
Cambridge, Massachusetts.
So the Broad Institute currently
manages about 60 petabytes
on Google Cloud.
They process about
600 genomes every day,
and have now processed
on Google Cloud well
over 100,000 whole genomes.
And the way that they do this
is through our genome processing
solution, which includes both
products and partnerships.
So, this is a diagram of
the human genome sequencing
workflow that many of
you may be familiar with.
So, to run these defined workflows, which is what's usually used here, you may have heard of the Genome Analysis Toolkit, which was pioneered by the Broad.
You may have heard of DeepVariant. DeepVariant is a TensorFlow-based variant caller. As opposed to using Bayesian inference to call variants from mapped reads, it treats each read pileup as an image and applies machine learning to that image.
And the accuracy numbers we're seeing from DeepVariant are very exciting.
I definitely encourage you to
check out the publications.
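To give a flavor of the pileup-as-image idea, here is a toy encoding of a small read pileup as a numeric matrix, the kind of tensor an image-based caller could consume. The single-channel, base-to-integer encoding below is invented for illustration; DeepVariant's real encoding uses multiple channels capturing base, quality, strand, and more.

```python
# Toy illustration only: encode a tiny read pileup at a candidate site
# as a numeric matrix, as if it were a single-channel image.
BASE_TO_INT = {"A": 1, "C": 2, "G": 3, "T": 4, "-": 0}

reads = [
    "ACGTA",
    "ACGAA",  # mismatch at position 3: evidence for a candidate variant
    "ACG-A",  # deletion encoded as a gap
    "ACGAA",
]

# One row per read, one column per position, integer-coded bases.
pileup_image = [[BASE_TO_INT[base] for base in read] for read in reads]
print(pileup_image)
```

A real caller would stack several such matrices (base identity, base quality, strand, and so on) into channels and feed them to a convolutional network.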
Once that data has been processed to BAM and VCF formats, there are also opportunities for tertiary analysis and population genomics.
This is also an area--
I'm not going to talk too
much about that today,
in the interest of time--
but we're doing a lot
in this space, as well.
How do you deal with your processed genomic data and do analytics on it? We see customers in clinical labs and research settings using the tools we've built to ingest their VCF, or Variant Call Format, files into BigQuery.
And BigQuery, it turns out, is
a pretty amazing variant store.
And they're joining
their variant data
to their clinical data
to their phenotypic data
to their medical imaging data
and doing analytics, performing
machine learning, and making
scientific discoveries
using the power of BigQuery.
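The kind of join described here can be sketched with a toy example. The schema, table names, and values below are all hypothetical, and I'm using SQLite as a stand-in so the snippet is self-contained; in practice the query would be BigQuery SQL over far larger tables.

```python
import sqlite3

# Toy variant store joined to toy phenotype data. In the real workflow the
# variants table would live in BigQuery and the join would be BigQuery SQL.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE variants (sample_id TEXT, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT);
CREATE TABLE phenotypes (sample_id TEXT, phenotype TEXT, value REAL);
INSERT INTO variants VALUES ('S1', 'chr17', 43045712, 'G', 'A');
INSERT INTO variants VALUES ('S2', 'chr17', 43045712, 'G', 'A');
INSERT INTO phenotypes VALUES ('S1', 'ldl_cholesterol', 201.0);
INSERT INTO phenotypes VALUES ('S2', 'ldl_cholesterol', 138.0);
""")

# Join carriers of a variant to their phenotypic measurements.
rows = cur.execute("""
    SELECT v.sample_id, v.chrom, v.pos, p.phenotype, p.value
    FROM variants v
    JOIN phenotypes p ON p.sample_id = v.sample_id
    ORDER BY v.sample_id
""").fetchall()
print(rows)
```

The same shape of query, scaled up, is what lets a lab ask questions like "what phenotypes co-occur with carriers of this variant" across millions of samples.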
There we go.
So some of the
benefits here-- what
makes Google Cloud ideal
for genomic data processing?
Very flexible pricing-- one of the features of our core product, Cloud Life Sciences, which I'll talk about in a minute, is flexible machine types and highly optimized pricing.
We make it really easy to
spin up a lot of containers.
Basically, that's what
Cloud Life Sciences does.
I'll talk about that
more on the next slide,
but Cloud Life Sciences--
essentially, you
go to Cloud Life
Sciences and say,
I've got 30,000
Docker containers
from all the different
containerized workflows
I'm running.
Please run them all.
It finds the right machines and machine shapes, mounts each Docker container, hooks it into storage, localizes each file, processes it with the relevant step, and returns the output back to storage.
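The flow described above might look roughly like the following request body for the Cloud Life Sciences pipelines run method. The field names follow the v2beta shape as best I can recall, and the images, bucket paths, and tool names are hypothetical; treat this as an illustrative sketch, not authoritative API documentation.

```python
# Sketch of a localize / process / delocalize pipeline request.
# All names here (buckets, images, the "my-tool" binary) are hypothetical.
pipeline_request = {
    "pipeline": {
        "actions": [
            {
                # Localize the input file from Cloud Storage onto the VM's disk.
                "imageUri": "google/cloud-sdk:slim",
                "commands": ["gsutil", "cp", "gs://my-bucket/sample.bam",
                             "/mnt/data/sample.bam"],
                "mounts": [{"disk": "data", "path": "/mnt/data"}],
            },
            {
                # Run the actual processing step in its own container.
                "imageUri": "my-registry/my-tool:latest",
                "commands": ["my-tool", "--in", "/mnt/data/sample.bam",
                             "--out", "/mnt/data/out.vcf"],
                "mounts": [{"disk": "data", "path": "/mnt/data"}],
            },
            {
                # Delocalize the result back to Cloud Storage.
                "imageUri": "google/cloud-sdk:slim",
                "commands": ["gsutil", "cp", "/mnt/data/out.vcf",
                             "gs://my-bucket/out.vcf"],
                "mounts": [{"disk": "data", "path": "/mnt/data"}],
            },
        ],
        "resources": {
            "regions": ["europe-west2"],
            "virtualMachine": {
                "machineType": "n1-standard-4",
                "disks": [{"name": "data"}],
            },
        },
    }
}
print(len(pipeline_request["pipeline"]["actions"]))
```

The service takes a body like this, provisions a matching VM, runs the actions in order, and tears the VM down when they finish.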
We also provide important
security controls,
a lot of important
workflow engine partners
and other technology
and service partners,
and access to key data sets.
I'll talk a little more
about that in a minute.
So, some of you may
have heard of what used
to be called Google Genomics.
And Google Genomics is great,
but we've actually taken things
far beyond that
now, and that's why
we've re-branded this as Google
Cloud Life Sciences and added
features.
So now, Cloud Life Sciences
is fully regionalized.
What I'm excited to announce on this stage is that whereas Google Genomics had been a global service, Cloud Life Sciences is now a regionalized service.
So, as we moved into beta,
which just happened last month,
we launched in Iowa.
And this quarter, we're
launching right here
in our London region.
So this gives the
power for organizations
that are based here in the
UK to process their data
and know that the data, the
compute, and the metadata
are all staying in
the London region.
We'll also be launching to
other regions in Q1 and Q2 2020.
We're really excited about
the global rollout here.
So, the benefits of Cloud Life Sciences for life sciences organizations and academic research institutions: the ability to build on open standards and containerized workflows.
So, some of you may be familiar
with open source workflow
engines such as Cromwell,
Nextflow, and Galaxy.
These have all been made to work on top of Cloud Life Sciences.
So you're able to use
workflow engines that you're
familiar with right away.
We're also working with proprietary partners-- SaaS partners-- who offer researcher-friendly interfaces on top of Cloud Life Sciences and Google Compute Engine.
So Terra, Seven Bridges, BC
Platforms, and many other life
science research platforms
that you may be familiar with.
For researchers, this means the ability to deal with that data in a friendly format and get to answers fast.
And for IT
organizations, they know
that they have a solution
that's safe, secure, and can
scale with their organization.
I talked a bit about our partners. I also want to talk a little more about some of our data sets.
So, in 2018, we announced a
very important partnership
with the National
Institutes of Health.
There's a program that we
announced called the STRIDES
program.
It stands for something,
I can't remember
what the acronym stands
for, but what it does--
we've reached an
agreement where we
are onboarding the most
valuable biomedical data
sets into Google Cloud.
This includes the Cancer Genome Atlas, the Cancer Imaging Archive, and TOPMed.
The Sequence Read Archive-- we recently made available its entire public portion, and we will soon be making available its controlled access portion as well.
And the important thing is
that these data sets are now
available to
researchers at no cost.
Especially for the controlled access setting, we've also done important engineering work to integrate the authentication, authorization, and workflows you're used to with the NIH.
So, we know when Professor Jane
Doe logs into her Google Cloud
account, we are able to
associate her identity
in Google Cloud
with her identity
at NIH, and soon also at ELIXIR,
and therefore, provide access
to the right data sets
to the right researcher
at the right time.
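The identity-linking idea above can be sketched as a toy lookup. The linking table, dataset names, and grant sets below are all hypothetical and are not how the actual NIH or ELIXIR integration is implemented; this just shows the shape of the check.

```python
# Toy sketch: map a cloud identity to a linked research identity,
# then check whether that identity has a grant for a given data set.
# All identities and data set names here are made up.
linked_identities = {
    "jane.doe@university.example": "nih:janedoe",
}
dataset_grants = {
    "nih:janedoe": {"topmed-controlled", "sra-public"},
}

def can_access(cloud_identity: str, dataset: str) -> bool:
    """Return True if the linked research identity holds a grant."""
    research_id = linked_identities.get(cloud_identity)
    return research_id is not None and dataset in dataset_grants.get(research_id, set())

print(can_access("jane.doe@university.example", "topmed-controlled"))
print(can_access("someone.else@example.com", "topmed-controlled"))
```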
So, we're changing a flow many of you may be familiar with: filing data access requests, going to the dbGaP portal, fumbling with the credentials and the login, going to an FTP server, downloading the data, and finding space in the data center or in a cloud account before you can even get to work.
It adds a lot of delays
to the research process.
We're changing that flow and replacing it with: log in to Google Cloud and get to work.
We think this provides a
lot of important benefits
for the research community.
Of course, for the researchers, it makes it easier to get started doing their work.
It provides a very
important benefit
for the research
computing leaders
who support these researchers.
I can't tell you how many times
I've been in a meeting with
the research IT director
and a set of researchers--
whether at an academic institution, an independent research institution, or a commercial institution, such as a pharma company--
and we find out in the
meeting with three researchers
and the research IT
director that there's
10 copies of the
1,000 Genomes Project,
eight copies of the
Cancer Genome Atlas,
five copies of the
Cancer Imaging Atlas,
all this multiplicity of data being stored in the existing data center or possibly the existing cloud account.
And so now we're taking that problem away and saying, well, why don't you just log in, get started on Google Cloud, and get to work.
This is just the
beginning of the data sets
that we're making available.
You'll see a lot more
of these in 2020.
We're investing heavily
into this program.
So I want to talk
a bit about what
does this look like in action.
We're very excited
about this project.
So recently, Autism Speaks, which is a major patient advocacy group based back in the States, worked with DNAstack, one of our Google Cloud technology partners; Sentieon, another Google Cloud technology partner; and Verily, another Alphabet company involved in biomedical research, along with Google Cloud, on a very important new project.
They performed alignment and variant calling on over 11,000 whole genomes of people with autism and their family members.
They discovered many new
important genetic variations
associated with autism.
And importantly, through
the sophisticated use
of Google Cloud and our
preemptible VM technology,
they reduced their cost of
processing to less than $3
a sample, which is
pretty remarkable.
For anyone who's been in
the bio-informatics industry
for some time, it
wasn't that long ago
that processing a
genome cost $50, $100.
And through advances in
compute, through sophistication
in workflow execution, we've
brought that cost down,
and we continue to invest
in reducing it further just
to enable the scale of research
that everyone wants to see.
Importantly, the
MSSNG project, which
is the name of
the data set, even
created an interactive
data discovery portal
for researchers to go and
use the data themselves.
They've gathered not only the
genomic data but clinical data,
as well.
And as I mentioned, they're
using Google BigQuery
under the hood in order to
power this analytics engine
so researchers can go ahead and
get started using this data set
and exploring it on their own.
We're also seeing a similar project-- I'll just mention this as a bonus here, I don't have a slide on this-- anyone who was at the ASHG-- the American Society of Human Genetics-- meeting in 2018 may have seen an announcement from Color.
Color is a genetic testing lab
based in the States, focused
on cancer, and they have
processed tens of thousands,
I think--
it was about 50,000 samples--
and said OK, we've got all
these samples and a good chunk
of clinical data from
the physicians ordering
the reports.
We want to understand more,
and potentially expand the test
and make discoveries
on this data set.
Well, they were able
to use BigQuery.
They ingested this
genomic data into BigQuery
alongside their
phenotypic data, which
was in a Postgres database.
And then for their
company Hack Week,
they stood up a bunch of machine
learning tools and analytics
tools on top of
that joint data set.
They actually ended up coming up
with a very sophisticated model
of clinical disease that
outperformed existing
clinical models of disease.
They also found out that their
test predicted a biomarker that
has nothing to do with cancer.
They won't let me tell
you which one it is,
but they're very
excited about it
and may be publishing
that at some point.
So we're seeing organizations,
both commercial and
independent, who are using
the power of analytics
to make true
scientific discoveries.
So, Thomas Frazier, the chief scientific officer for Autism Speaks-- and I think this says a lot about what we're trying to do as Cloud Healthcare and Life Sciences-- talks about this research enabling researchers to identify genetic influences on autism that can lead to tailored medical care for those individuals.
And really getting
to the opportunities
for true improvements
and treatments
and diagnosis is what
we're here to accomplish,
and highlights our mission
as an organization.
So with that, I'll say thank you so much.
And now-- I'm just
the warm-up act.
Now you're going to get
to hear the people doing
the real work, some
of the researchers
who are using Google
Cloud today to do science.
I've got a few minutes left.
I'll take a few questions now
before I turn things over,
and then we can also
do some Q&A at the end.
I've got about eight minutes
for questions, it seems.
Not where I want to land.
Stunned, I know.
You're so upset.
Oh, right here.
AUDIENCE: So when you
showed that number
about the processing
cost coming below $3
in the preemptive settings,
how long, roughly,
does it take to
process a whole genome?
JONATHAN SHEFFI: So, it depends.
And you mentioned--
part of that's
because you mentioned
preemption, right?
And so, for those folks who
are not familiar with that,
I just want to explain
a little bit what
preemptibles are because
that affects the answer.
So, preemptible virtual
machines are compute instances
that are offered to you at a
substantial discount, roughly
80% off, in exchange
for the fact
that they might be
taken away if there
is a spike in demand
from folks who are using
non-preemptible instances.
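As a back-of-the-envelope sketch of that trade-off: the roughly 80% discount comes from the talk, but the hourly rate and the retry overhead below are made-up numbers, purely for illustration.

```python
# Hypothetical numbers: only the ~80% discount is from the talk.
on_demand_rate = 0.20          # assumed $/hour for some machine type
preemptible_discount = 0.80    # "roughly 80% off"
preemptible_rate = on_demand_rate * (1 - preemptible_discount)

base_hours = 10.0              # assumed runtime with no preemptions
retry_overhead = 1.3           # assume preemptions add ~30% wall time

on_demand_cost = on_demand_rate * base_hours
preemptible_cost = preemptible_rate * base_hours * retry_overhead

print(f"on-demand:   ${on_demand_cost:.2f}")
print(f"preemptible: ${preemptible_cost:.2f}")
```

Even after paying the retry overhead, the preemptible run comes out far cheaper, which is why turnaround time and cost trade off against each other here.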
So, for this
particular project, I
don't know the average
turnaround time,
although I'm happy to
find out and follow up.
There's nothing
confidential about it.
What we see often in practice-- for GATK, it can be on the order of 12 to 24 hours.
It's on the longer side, partly
because preemption sometimes
delays the process.
We've also seen
some implementations
that speed it up quite
a bit and get it to
on the order of three hours,
or sometimes even less.
I'm not giving you, I think,
the exact answer you might like
and the short answer
is, it depends.
But I would be happy
to discuss with you.
AUDIENCE: These
are human genomes.
JONATHAN SHEFFI: These
are human genomes.
All we're talking
about today-- we're
looking at human genomics.
If you're interested
in other species,
there's a lot of work happening in plants, like rice, and in other species.
In fact, we're going to hear
a bit about some other species
in a minute.
Yeah.
Yes?
Hi.
AUDIENCE: Thanks for the talk.
One of the practical problems
that I'm currently facing
is data ingestion.
Current biological data is scattered across many research institutes.
JONATHAN SHEFFI: Sure.
AUDIENCE: They are quite big.
In order to make use of the capability in Google Cloud, I would have to figure out ways to move that data in-- pipe that data in, ingest it, and then use it as-is right there. I want to know if you have any comments on that.
JONATHAN SHEFFI: Sure.
So, I just want to make sure I
understand what you're asking.
So, I think what you're asking
is you have data sets that
are, say, under
the desk of one lab
or under the desk of another
lab or in that sort of a lab
server, and migrating
that data to Google Cloud,
or are you talking about
moving from a data center
to Google Cloud or both?
AUDIENCE: More from the
different data centers.
Let's say, one data center at the Sanger Institute and another data center, say, at EBI-- and I want to move that data to Google Cloud in order to do something meaningful.
JONATHAN SHEFFI: Yep.
So, we have existing
transfer appliances
that make it really easy
to bring some large data
sets over.
If we're talking
about really, really,
really big data sets, there
are special arrangements
that our customer
engineering team
can work with you on to get
those data sets into our data
regions.
You get a follow up.
Go ahead.
AUDIENCE: One quick
follow up that you--
sorry, Google just made an
announcement today about data--
oh--
JONATHAN SHEFFI: There were
a lot of announcements.
I can't-- even I can't
remember them all.
AUDIENCE: [LAUGHING] Sorry,
it's about data ingesting--
Data Fusion, or--
Data Fusion.
Yeah.
JONATHAN SHEFFI: OK.
AUDIENCE: Will there
be any Data Fusion
adapters to be able
to make use of--
to help us that way?
JONATHAN SHEFFI: So, I don't
have a good answer for you.
We haven't talked as
much about integrating
with the Data Fusion team,
but it's a great question.
Let me follow up with you.
We've exchanged emails, but
we can talk about that more
separately.
Yeah.
Yeah?
AUDIENCE: So, with the public data sets, have you curated them in any form? And have you published the curation steps, or are they provided as-is?
JONATHAN SHEFFI: They
are provided as-is.
So I mentioned we have product managers, engineers, and solution architects. You'll notice I didn't mention scientists. We don't really have the background to scientifically curate some of these data sets and make judgments on the metadata, so we are not providing that as part of Google Cloud.
However, a number of our technology partners-- I mentioned Terra, a platform developed through a collaboration between the Broad Institute and Verily-- do offer some curation tools, and we're also discussing how we can provide more effective metadata management of scientific data sets.
Is there a particular use
case that you're thinking of?
AUDIENCE: Any one of them.
The usual use case in pharma
is that everyone curates them
over and over and over--
JONATHAN SHEFFI: That they do.
It's true.
AUDIENCE: Everyone's
got [INAUDIBLE]
JONATHAN SHEFFI: Yeah.
Yeah.
And I think that if we curated
them, to be perfectly direct--
I think if we curated them, it
would be yet another curation,
and I don't know
if we could do it
any better than a pharma company
could or a scientist could.
AUDIENCE: It's [INAUDIBLE]
having a common source of--
a common data set that
everyone works off,
a common curated data
set, rather than--
because each group's rules might differ.
But anyway, that's--
JONATHAN SHEFFI: If
you can get the world
to standardize on a
common metadata ontology
for describing
scientific data, I'll
give you 20 pounds out
of my pocket right now.
[LAUGHING]
AUDIENCE: I won't
take you up on that.
JONATHAN SHEFFI: All right.
I think I've got time, I
think, maybe for one more,
if we've got one.
Great.
OK.
So next, I'm going to
introduce Anita Nandi, who's
a research scientist from the
Oxford Big Data Institute.
She's going to talk
a bit about her work
with the Mosquito Data Atlas.
Anita.
[APPLAUSE]
ANITA NANDI: OK,
thanks very much.
Yeah I'm Anita.
I'm a researcher at the
University of Oxford,
and I work on mapping malaria.
So, just first for a bit of background: as you know, malaria is a disease that's transmitted by mosquitoes. A mosquito carries a parasite; if it bites a human, it transmits the parasite into the human's blood. If a non-infected mosquito then bites that human, the parasite is transmitted to that mosquito, and the cycle continues.
And this causes a major public health burden globally, causing several hundred thousand deaths every year, most of them in sub-Saharan Africa.
And so there's a global effort
to help reduce and eliminate
malaria.
To do that, we need to know
where the malaria burden is,
and that's where our
research group comes in.
We're called the Malaria
Atlas Project, or MAP,
and we are a very
interdisciplinary team
of researchers.
We're epidemiologists,
statisticians, ecologists,
physicists, and others.
And we produce malaria estimates to try to quantify the global burden of malaria.
And while malaria is decreasing
globally, which is great,
this does mean that
it becomes harder
to model because as
burden decreases,
we have to develop
new techniques
to model low-burden areas,
which is more complex.
And so this is something
our group does.
And we model at very high resolution, at very fine scale, because these kinds of maps are vital for policymakers to understand where to target interventions and how to start eliminating malaria.
So, we have a lot
of malaria data.
We have both point data-- surveys conducted across the world testing people's blood for malaria, telling us what proportion of people have malaria at a given location.
But we also have
data in regions,
so, how many people came
to hospital with malaria?
How many cases are
there in this region?
And so we deal with large
amounts of different types
of data across the world.
This data, can all, by the way,
be found on the MAP website
and downloaded.
And you can see from
the map that some places
have some types of
the data, not others.
And so this results in a very
interesting and difficult
problem of how to model
malaria risk across all regions
at fine scale.
And that's what we work on.
And to do that, we use a
lot of environmental data.
So, as we've seen, malaria
is transmitted by mosquitoes,
so suitable habitats
for mosquitoes
are more likely to have malaria.
So a warm, tropical climate near stagnant water means a higher chance of malaria risk-- things like this.
And so it's very useful to
input environmental data
into our models to
give us information
about malaria distribution.
But not just environmental
data-- maybe human data, things
like accessibility to cities
and other such variables.
And so we have global satellite
images of these variables
and we're mapping at a five-kilometer resolution.
And so this means
that in these maps,
there's over 18 million
pixels in each one,
and we use many more
variables than this.
And so all of this is
going into our models.
So how do we use Google Cloud?
So, we're taking all of this
data and what we want to do
is we want to run our models
on this data to produce
global maps of malaria risk.
And using Pipelines
API, what we can do
is we can run a single command
telling it where our input
data is in cloud storage,
what size of machine we want,
what container image to use
to define the environment,
and where to output our results.
And with that single command,
it then spins up a VM,
runs our model, saves
the output where we asked
and then shuts down the VM.
And so you only pay for
that machine for the time
your model is running.
And so that's very
useful for us.
Now, this kind of model is not too computation-intensive for us to run on our on-premise servers, although it is a bit slower there.
However, what this produces is a single map of malaria risk, and that's not good enough for us to say to policymakers, we think 40% of people have malaria here.
They'll want to know, does that mean between 39% and 41%, or could it be 50%?
They want to know a
level of uncertainty.
And these are very
important questions
when thinking about where to
target your interventions.
What is your uncertainty?
And so to quantify uncertainty,
what we do is we run many--
in this case, 100
similar models--
varying our input slightly.
And with Pipelines
API, what we can do
is we can run the
same single command
but with just an extra
argument to tell it
how to run those
models differently.
And we run that and it does the
same thing, similar to before,
except it spins up 100 VMs,
runs each of the models,
and outputs 100 maps
of malaria risk.
And so then what we
can do with that is we
can summarize those 100 maps
to produce our predictions
and with uncertainty.
And so that takes exactly the
same time as it would for one.
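The fan-out-and-summarize pattern described above can be sketched in miniature. The toy model, the perturbation, and all the numbers below are invented for illustration; in the real workflow each realization is a full malaria-risk map produced on its own VM.

```python
import random
import statistics

# Stand-in for one model realization producing one pixel's risk value;
# in reality each realization is a whole map produced by one cloud job.
random.seed(42)

def toy_model(perturbation: float) -> float:
    base_risk = 0.40  # hypothetical underlying risk at this pixel
    return base_risk + perturbation

# "Run the same model 100 times with slightly varied inputs."
realizations = [toy_model(random.gauss(0, 0.02)) for _ in range(100)]

# Summarize the realizations into a prediction with an uncertainty band.
mean_risk = statistics.mean(realizations)
srt = sorted(realizations)
lo, hi = srt[2], srt[97]  # roughly a 95% band

print(f"predicted risk: {mean_risk:.3f} (95% band {lo:.3f}-{hi:.3f})")
```

Because the 100 realizations are independent, they can all run at once, which is why producing the uncertainty band takes about the same wall-clock time as producing a single map.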
And so this is where the
power of Google Cloud
comes in to help us.
So if we compare this with our in-house on-premise servers, we might have a single 64-core machine.
So I've been talking
about these global maps.
We actually run the
maps region by region,
and we have seven regions.
So instead of one run for the whole globe, we do seven separate runs.
So, our on-premise
servers to produce
100 maps for a
single region would
take about a day
at full capacity,
and then we have to do it for
each of the seven regions,
so producing global maps
would take about a week,
whereas on Google Cloud we can use many small machines and run all of the regions and realizations-- all 100 maps per region-- simultaneously, which produces our outputs in less than a day.
And we can also use
preemptible machines
that were mentioned earlier
to reduce our costs.
And so we're going
from our turnaround
time of producing global
maps from about a week
to within the day.
And this really
benefits us because we
work with organizations
like the WHO, who
are a big organization
with many tight deadlines,
quick turnaround times.
And so it's important
for us to be
able to produce high-quality
maps fast, basically.
And so the less
time it takes for us
to run the models, the more
time we can spend scrutinizing
the results, iterating over the
results, improving the model,
improving our predictions,
basically, for malaria.
So here I'll just show
some of our results.
So this is malaria
prevalence, the proportion
of people that have malaria
globally from 2000 to 2017.
So our models are
varying in time, as well.
And you can see that the
prevalence is much higher
in sub-Saharan Africa.
It's also going down with time.
And so what's also important in these predictions is, where is it increasing? We want to know that, and why it would be increasing. Is there a lack of intervention?
Is there development of
insecticide resistance?
So these maps can help us answer
these kind of key questions
in order to reduce malaria.
We don't just produce prevalence estimates; we also produce other metrics, like mortality.
So this is the number of deaths
per 100,000 people each year.
And so, similarly,
you can see it's
high in sub-Saharan Africa.
But this may differ from our prevalence estimates, in the sense that you might not get high mortality even where you have high prevalence.
It might also be affected by whether these people have access to health care, whether they have access to effective treatments-- things like this.
And our group also works on modeling those factors and understanding their contributions.
So, in summary, we produce
global malaria risk maps
using Pipelines
API, and it makes
the process about 10 times
faster using Google Cloud.
And this allows us to continually improve our models at a faster rate, meet our deadlines, and it also gives us scope to develop more complex and computationally intensive models going into the future.
Thanks.
[APPLAUSE]
Shall I move on?
Is that the plan?
OK, so next I'll introduce Vivek from the Wellcome Sanger Institute.
Thank you.
[APPLAUSE]
VIVEK IYER: Thank you very much. OK.
Hi, everybody.
I'm Vivek.
I run the Human
Genetics Informatics
team at the Wellcome
Sanger Institute,
and I'll briefly go
over the way that we've
been using the GCP, and
specifically, Spark on the GCP
to run a large-scale
analysis of human variation.
I've got to say I'm actually
chuffed that I was actually
asked to do this, and I'm
very happy that I can actually
share the story with you today.
So, thank you, Google.
OK.
So what is Sanger?
Sanger is a research institute in Cambridge, in the UK. It's got about 1,000 staff. It sits on the Genome Campus, which also includes the European Bioinformatics Institute and various scattered biotechs.
It's a research institute
based on a faculty model.
Basically, it's
subdivided into programs
like HumGen, pathogens,
cancer, and CellGen
and they rest atop a
core of facilities--
the sequencing and the IT.
Specifically, I run a team that supports the human genetics program, and there are different research groups inside that program that research, for instance, neurodevelopmental disorders-- a rare disease-- IBD, which is a common disease, and complex traits: cardiometabolic traits, blood pressure, diabetes, this kind of thing.
The broad aim of Sanger
is to use genome sequences
to advance the understanding
of the biology of humans
and pathogens and to
improve human health.
That's an extremely
broad aim, right?
But very specifically, the
project I want to talk about
is called Interval.
It involves the whole
genome sequencing
of a cohort of 12,000
healthy individuals.
These are walking, talking people-- they're not afflicted by any sort of nasty disease.
We took those 12,000 individuals and whole-genome sequenced them at about 15x, possibly a little more, ran standard GATK variant calling on them, and got 185 million variants across 12,000 people.
Critically for this cohort, though, it's not just the variants: there are also thousands of measured phenotypic traits per person.
So there are hundreds of blood cell counts, lipid levels, and whatnot, but also a full proteomic panel and a metabolic panel run on each person, and together they add up to a couple of thousand phenotypes.
So, those phenotypes
actually give this study
a quite unique window
into the molecular causes
of complex disease,
and it's a sort
of a leg up on the
average huge GWAS.
OK, so I've mentioned
what the science is,
but really, this is sort of
the packaging of the work.
I've talked about the
science, but really the thing
I want to point out
about the science
is that it's
actually urgent now.
And the reason is that the
landscape is competitive.
And whilst this study was
actually interesting and unique
when it popped up
two years ago, it's
now getting steadily eclipsed
by other large cohorts.
OK, the field is
competitive, especially
if you think of something
like UK Biobank.
Specifically then, the
late delivery of this data
means that it loses relevance.
And whilst nobody is going to
get fired or lose their jobs,
we're all at risk of
loss of future funding
if this stuff actually
shows up late.
And the work involved, as
far as we are concerned,
after the variant calling, which
is sort of pretty standard,
was again, sort of
standard variant QC.
What makes it non-standard
is that you're
starting with a very
large number of variants
and a relatively large cohort.
So when the QC runs, about
30% of those variants
will actually be
filtered out using
learned thresholds
on various parameters.
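That threshold-based filtering step can be sketched in a few lines of Python. This is purely illustrative-- the metric names and cutoffs here are hypothetical, not the actual learned thresholds from the INTERVAL QC:

```python
# Illustrative sketch of per-variant QC: keep a variant only if every
# metric clears its learned threshold. Field names and cutoffs are made up.

def passes_qc(variant, thresholds):
    """True if every QC metric meets or exceeds its threshold."""
    return all(variant[metric] >= cutoff
               for metric, cutoff in thresholds.items())

# Hypothetical learned thresholds on common QC parameters.
thresholds = {"call_rate": 0.95, "mean_depth": 8.0, "qual": 30.0}

variants = [
    {"id": "chr1:12345", "call_rate": 0.99, "mean_depth": 14.2, "qual": 250.0},
    {"id": "chr1:67890", "call_rate": 0.80, "mean_depth": 15.0, "qual": 300.0},  # fails call rate
    {"id": "chr2:11111", "call_rate": 0.97, "mean_depth": 6.1,  "qual": 180.0},  # fails depth
]

kept = [v for v in variants if passes_qc(v, thresholds)]
print([v["id"] for v in kept])  # → ['chr1:12345']
```

In the real pipeline this logic runs inside HAIL over a Spark cluster rather than over Python dictionaries, which is what makes it tractable for 185 million variants.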
And then after
that, the fun starts
because that's when all the
association analysis starts
and the statistics
gets going, which
actually gets you the associations
with the phenotypic traits.
The tool we decided
to use for this
was the HAIL
framework by Broad.
I like it.
It's an expressive language.
It's sort of basically like
doing R, so it's actually
critical for us to use
this because it means
that different
people can actually
look at each other's logic
and very quickly understand
whether that's
actually appropriate
or not, and that's a constant
interaction between us
and the postdocs and
the faculty groups,
so we have to be able to
actually share the logic around
and actually critique it.
And if you code it up in
a bundle of dense Python,
that's actually hard to do.
Secondly, we're writing R-like
stuff, but it operates on very
high volumes of data, so
it does things that R cannot,
specifically because it's
actually designed to run on top
of an Apache Spark cluster.
So that's actually
why I chose that tool.
I should point out
again, before I
leave this slide, that urgency
is actually critical here.
The landscape is competitive and
I can't afford delays of months
while we sort out
technical hoopla.
So it turns out there's
plenty of informatics support
at Sanger.
If you like, this is
what our tech stack--
the group's stack--
looks like.
We're in the middle.
We're the HumGen
program, but above us
we are servicing the
Human Genetics faculty
and we rely on the services of
high-performance specialists
and on-prem cloud
specialists below us,
and they rest on an
IT core below them.
So you might ask
why on earth we're
turning to GCP in
the first place.
And really, it's a
combination of two things.
So we're going sideways out
to GCP for our Spark cluster
services.
So what's really happening
is driven by urgency.
That's the first thing.
I can't afford to faff around
waiting for my Spark clusters
to come up locally.
Second thing is there's a
strong precedent from Broad.
So Broad wrote the HAIL software,
which is, of course,
what we're using.
They have a strong precedent
for using Google to do
all of their Spark processing.
So given that, it's
actually very easy
to sort of follow
down a well-worn path,
and I'll explain
why in a minute.
In a sense, this is how
the tech stack evolved.
We started with traditional
high performance
compute, which at Sanger we
understand extremely well.
We're really, really
good at spraying stuff
across large clusters
and gathering up results.
But I don't think that suits
this particular task very well.
We did build, and have,
local Spark clusters.
They are straightforward to
build-- as straightforward
as anything else in a
local, on-prem cloud.
We found that the
performance requires
a lot of tuning,
especially when you
get to genome scale
data sets-- that's
the first thing-- and
actually, big cohorts.
We settled on using
Spark clusters
in Google Cloud in Dataproc.
Why?
Because it's easy to deploy
and we have Broad recipes
that spin these
things up in minutes.
The provisioning is reliable--
if it came up yesterday,
it'll come up today.
And the performance--
this is critical--
is actually predictable.
If it took an hour
yesterday at $100 an hour,
it'll do the same thing today.
And we're not getting buffeted
by other users competing
for these resources.
And finally, when
we hit problems,
and certainly we
hit problems, they
can actually be
debugged with Broad.
And that's quite
important because they've
hit the problems before.
We write to them, they tell us,
oh, try this obscure parameter,
and hey presto, it actually works.
So what I'm going to take
you through in the last part
of this talk is, in a
sense, the little bit
of the jitter in the
data flow and a couple
of things that were cool
and a couple of things
to watch out for.
So, as you can see, Sanger is
on top, Google's on the bottom,
and there's stuff
flying back and forth, from left
to right, from top to bottom.
The first thing
that had to happen
is we had to get legal
and ethical approval
to get our human data
to enter the cloud.
This is important, OK.
We took a set
of due diligence questions
about security and privacy
to Google.
They promptly shot back a
bunch of comprehensive answers
and then eventually, our
legal and ethical folks
went, yup, you can do this.
I should point out
that whilst Google
was prompt-- they answered
within a matter of days--
the process internally to
get through legal and ethical
was actually anything
but prompt.
The timescales are kind of--
yeah, it can take a while.
It can take a really long while.
You just have to
be ready for that.
Then the good bits started.
We got the data into Google.
We ran Spark and HAIL
to filter the variants.
We took the filtered
variants back out again,
ran lots of plots to make sure
the filtration was effective.
When we realized
it was effective,
we ran Spark and HAIL again to
do the statistical analysis,
and then took the
results back out.
And that all went swimmingly--
it was reproducible, it was
efficient, it was timely.
Everybody wins.
It's a good thing.
So GCP is doing really well.
A few things you need to bear
in mind when this happens.
Firstly, we have
tens of terabytes
of data, which is not
petabyte scale,
for a single project.
Even so, storing
hot data at the tens-of-terabytes
scale
inside Google will add up
pretty fast when you have it
there for months, for instance.
That's one thing.
Downloading terabytes
of data will actually
add up pretty fast, as well.
And of course, there's the
processing time in Google.
I mean, to do this
effectively and quickly
we had to spin up
hundreds of cores
to actually get the
job done, and that
actually adds up as well.
All of this can be
managed, no problem.
It can be predicted
and managed, you just
have to keep it in mind.
And you have to
be ready for that.
It's all worth it though,
because the results come out
in a way that we expect.
And you can use this
to write papers,
which is what the game is.
It was the whole point to
actually sort of understand
underlying biology
behind the variants
and to actually eventually
improve human health.
I should say, and
it's worth emphasizing
why the timeliness of this whole
thing is actually a big deal.
Firstly, research groups
have staff who turn over--
often within three years for
a typical postdoctoral fellow.
It's not OK to sit on top
of data for six months
because the postdoc
might be leaving,
or that's six months less
that the postdoc gets actually
to write the paper.
It's totally not
OK to actually sit
on top of something
for six months
while your tech stack spins up.
The second thing
is, at the moment,
this data set is the most deeply
phenotyped whole-genome
data set in the world.
It's actually effective
for two reasons.
First, you can add value
to bigger GWAS studies,
because it can actually
point to molecular mechanisms
behind associations that those
big GWAS studies have
simply indicated-- where
they've indicated
a link between a variant
and a phenotype, [INAUDIBLE]
can actually show the
molecular mechanism.
The second thing
is that it can act
as a pilot for those
big GWAS data studies
when they're actually
pondering deep phenotyping.
This is a relatively small
study with very deep
phenotyping.
All the more reason that this
data actually has to get out
and the science has to get
out as fast as possible.
The longer we sit
on top of that,
the less relevant
all of that becomes.
So that's basically
what happened,
and what I'm saying is that
GCP and Google and Dataproc
have actually facilitated the
rapid scientific turnaround
of this data and actually
helped us actually
do the science faster.
So with that, that's pretty
much all I wanted to say.
I wanted to close with
a few acknowledgments.
Firstly, Hatem Nawar and
Brian Gibson from Google Cloud
have made this much,
much, much easier.
They've facilitated this pretty
much every step of the way.
The scientists involved
in the INTERVAL project
are Nicole Soranzo,
who's the PI,
and her postdocs Kousik,
Klaudia, and Hannes.
Our local on-site specialists
Pete Clapham and Tim Cutts
and their teams.
And of course, Tim Poterba
and the HAIL team at Broad
have helped us immensely
along the way, as well.
Thank you.
[APPLAUSE]
We can take questions
for any of us actually.
JONATHAN SHEFFI: Might as well.
Or not.
VIVEK IYER: It's late.
AUDIENCE: Nice job, both of you.
And so my first
question is to Vivek--
the statistical
framework and the variant calling
framework that you
use inside HAIL.
VIVEK IYER: Yeah.
AUDIENCE: These
are all predefined?
VIVEK IYER: Yeah.
They have been written
by the Broad HAIL team.
So the variant processing
framework is pretty standard.
It's just filtration with
various different query
predicates and so on.
The statistics is
actually pretty basic.
You either run your bog-standard
linear regression
or you run SKAT, which is
the frequency-weighted burden-testing
kind of thing.
And that's about it.
It's about as good
as it gets right now.
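The "bog-standard linear regression" Vivek mentions can be illustrated with a toy example in plain Python-- regressing a quantitative trait on genotype dosage (0/1/2 copies of the alternate allele). The data here are made up, and the real analysis runs per variant inside HAIL on Spark, with covariates:

```python
# Toy single-variant association test: ordinary least squares of a
# quantitative phenotype on genotype dosage. Values are illustrative.

def linreg(x, y):
    """Closed-form OLS slope and intercept for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    return beta, my - beta * mx

# Hypothetical dosages and trait values for six individuals.
dosage    = [0, 0, 1, 1, 2, 2]
phenotype = [5.0, 5.2, 5.9, 6.1, 7.0, 6.8]

beta, intercept = linreg(dosage, phenotype)
print(round(beta, 2))  # per-allele effect on the trait → 0.9
```

SKAT, by contrast, aggregates rare variants in a region into a frequency-weighted kernel test rather than testing one variant at a time.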
AUDIENCE: Have you
tried DeepVariant?
VIVEK IYER: DeepVariant--
OK, so we're talking about
two different things.
We haven't tried DeepVariant.
That's the first thing.
But secondly, what
we're doing here
is actually downstream,
if you like.
It's the secondary,
tertiary analysis downstream
of all the basic variant
calling, which is just
absolutely bog standard GATK.
AUDIENCE: So you
took the decision
to use the GCP because the
Broad Institute influenced you,
or was it the comparative
study with the other--
VIVEK IYER: Oh, gosh, I wish
we had been that systematic.
Actually, no.
We took the decision to use
the GCP because we could afford
the cost and that Broad had
actually done this before,
and we could spin stuff
up like a piece of cake.
It was just easy.
There's like a one line
recipe they give you,
and up come your Spark clusters.
Not just that
because we actually
had local Spark clusters and
they were not performant.
And honestly, I
don't have the time
to mess around trying
to make them performant
when you've got a PI breathing
down your neck going,
where's my data?
Where's my data?
Yeah, so that's
basically what happened.
Yeah.
AUDIENCE: I was trying to
follow up the question.
In your presentation,
you said that you
have to transfer tens
of terabytes of data
and you say it's
easy, doable, just
have to be prepared for it.
VIVEK IYER: Yeah.
AUDIENCE: Can you
elaborate on that?
VIVEK IYER: Well, OK.
We pushed something of
the order of 10 terabytes
into GCP, which is easy, right.
It's getting the
stuff back out again.
So going in is free.
Getting it back out again
cost 12 pence per gigabyte.
And that's one thing.
Storing it is 83 pence
per terabyte day.
So again, you just have to
add this up across the month.
And then finally, when we run
clusters, we run at about $100
an hour, but we
run big clusters.
Our total INTERVAL filtration
and QC and statistical analysis
takes about 24 hours.
So, it's OK.
I mean, the point is, we know
that's what's going to happen.
We're just ready for it.
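The numbers Vivek quotes can be tallied with some quick arithmetic. The 10 TB figure and the one-month storage window are assumptions for illustration, and the pence/dollar rates are as quoted in the talk-- check current GCP pricing before budgeting:

```python
# Back-of-envelope cost tally from the figures quoted in the talk:
# egress 12p/GB, storage 83p/TB-day, cluster ~$100/hour for ~24 hours.

data_tb = 10                           # ~10 TB pushed into GCP (ingress is free)
egress_gbp = data_tb * 1000 * 0.12     # 12p per GB to get it back out
storage_gbp = data_tb * 0.83 * 30      # 83p per TB-day, held for one month
cluster_usd = 100 * 24                 # $100/hour cluster, ~24-hour run

print(f"egress  £{egress_gbp:,.0f}")   # → egress  £1,200
print(f"storage £{storage_gbp:,.0f}")  # → storage £249
print(f"cluster ${cluster_usd:,}")     # → cluster $2,400
```

None of these lines is individually alarming, but as Vivek says, they add up fast if the data sits there for months-- the point is to predict them, not to avoid them.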
Does that kind of get to
what you're asking for, or?
AUDIENCE: Yes, to one aspect.
Another aspect is that
how did you do it?
Do you push just
through the wires,
or you shipped it physically?
VIVEK IYER: We definitely
did not ship it.
We just used gsutil to copy one
way or the other, or rsync.
Yeah.
AUDIENCE: Thank you.
VIVEK IYER: OK.
JONATHAN SHEFFI: Uh-oh.
VIVEK IYER: Shout.
Hello.
JONATHAN SHEFFI:
Ah, there we are.
AUDIENCE: So I'm wondering
more on a human level--
from your experience,
do you foresee a future
with, for instance,
your institute
would have no clusters
and you would work only
on Google Cloud Platform?
VIVEK IYER: At the
moment, I think
actually having both is
actually pretty good because we
can actually play
locally and then
execute stuff for real
in production in the sky.
Are you asking whether
I can see a trajectory?
At the moment, I'm quite happy
to have both ecosystems going,
but maybe that's just
my mindset right now.
It's hard to change, that's all.
AUDIENCE: So there is a clear
advantage on your computing
times and having a--
VIVEK IYER: That's a
super-hard comparison
to make, right,
because it is clear
that we can run stuff faster
in the sky than we can locally.
It's just true.
But it's not clear
that that would always
be the case if we actually
put lots and lots of effort
into making stuff run
locally more performantly.
The question is,
is it worthwhile?
And that's a bogglingly
hard thing to actually get.
What is true is that
running it in the sky
was actually
completely worthwhile
when you count the scientific
value and actually just
the sort of client service
value of actually getting data
to scientists in a hurry.
But that's really
hard to quantify.
AUDIENCE: Thank you.
JONATHAN SHEFFI: Cool.
Great.
Well, we'll do one
more, only one more,
but I think we are
about out of time.
AUDIENCE: I wanted
to ask this question,
but I just wanted
to give other--
so my question is
to Anita Nandi.
You mentioned you train multiple
models or hundreds of models
by varying different parameters.
Are these sort of predictive
models, machine learning
models, or--
ANITA NANDI: In the case
I was talking about,
these were not machine
learning models.
They were more Bayesian-like,
geospatial kind of modeling.
But we also run machine
learning models and different--
but not specifically
on our global ones.
JONATHAN SHEFFI: Great.
All right.
Thank you all so
much for coming,
and thank you all for
coming to Google Cloud Next.
Enjoy the rest of your day.
[THEME MUSIC]
