>> All right, I am delighted to be able to
introduce friend and colleague, Jonathan Dursi,
who's coming from the Center
for Computational Medicine
and the Hospital for Sick Children in Toronto.
He started his academic career with an
undergrad at St. Mary's University in Halifax.
He did a master's in physics at Queen's.
He started out doing lattice QCD and then
wound up going into galaxy simulations,
which as he put it, was stepping
up a few scales.
He had a year at Waterloo that he doesn't speak
about, and then transferred to the University
of Chicago where he was in the
FLASH Center with Mike and me.
He was there on a Krell Computational
Science Graduate Fellowship.
And the three of us, while we were there,
were inducted into the
Joseph Sheffield Memorial Society.
Following that, he did a postdoc at CITA.
And then at the SciNet High Performance
Computing Center, he was on the staff.
He then went to the Ontario
Institute for Cancer Research.
And as I said, is now at the Hospital for
Sick Children, where he's going to tell us
about HPC big data and large scale genomics.
Dinner is going to be at the
Bench, and if folks want to join,
we're going to converge there around 6:30 or so.
Please feel free to come along.
>> Thanks Alan.
>> So I love coming to places
like this because my interest is
in computational science pretty
broadly as you may have guessed
from Alan's recitation of my checkered past.
But it's a tough crowd to give a talk to, right,
because everyone has deep computational
backgrounds but in very different areas.
So a deep dive into the genomic stuff
I'm doing isn't going to be interesting.
So I'm going to try doing something
a little different this time.
So I'm going to start by talking about the
high-level stuff that I've been sort
of worrying about, or sort of doing something
about, and talk about why I think that matters.
And then towards the end sort of talk a little
bit about the sort of things I've been doing
and why these plug-in to
what I'm thinking about.
And so what I've really come here today
to tell you is that there's a lot of talk
about the convergence of HPC and big data.
And if this is actually going to be a
thing that's actually going to happen
in the near term, or, I suppose, in
some dystopian exascale future,
it's going to be because of the
needs of large-scale genomics.
And if it does happen for large-scale
genomics, it's going to happen because of places
like this, where there's a bunch of people with
very different expertise in very different areas
who can actually all talk to each
other and get some cool things done.
So I've been working at this particular
coalface for a little while now.
Alan's already told you, I was
at the DOE ASCI Flash Center.
I was doing real hearty, stick-to-your-ribs
type HPC: Fortran, MPI, adaptive meshes.
Joined an HPC center after a postdoc doing,
you know, a broader range of science,
a slightly broader range of tools.
And it was really, really cool to see all
the different sorts of problems people had.
So at the start of
my career, things were very stable.
There were then about 20 years of
things kind of being the way they were.
Big computing was MPI over racks of x86.
No one outside of academia was
doing much of this sort of thing.
There was a pretty stable set of problems.
This is my picture of the state of
the world in this more innocent time.
It was a simpler time.
Statistical computing was
largely the realm of social scientists
and a little bit of experimental science;
physical scientists did the sort
of thing that we were just describing.
Not a lot of sort of database
stuff in academia but, you know,
a little ferry that ran along this river
between the statistical computing side
and the sort of enterprise computing side.
And maybe very infrequent ferry service between
the statistical side and the MATLAB side.
And then they came.
So the widespread adoption of computing
and networking brought a ton
of data really, really quickly.
The first people off the
bat were people who stood
to make money off it, these
internet scale companies.
And so they generated things like Hadoop and
HDFS, which just generated an entire ecosystem.
And just sort of by dumb luck, genomics
was just becoming a thing at the same time.
Right place, right time.
The human genome project had just
sort of wound up successfully in 2003.
High throughput sequencing was
starting to become a big thing.
There was lots and lots of
data, and the question of how to process it.
And crucially, these sorts of
problems were arriving on the scene
after HPC was already quite mature and quite
optimized for the problems it was solving.
On the other hand, big data
was just this wide open thing.
And I do want to remind everyone
how new this is.
I've already been asked once
about like, you know,
can a physical scientist meaningfully
transition into something like bioinformatics?
Bioinformatics is like 10 years old.
Nobody knows how to do bioinformatics now.
It is very, you know, the field is wide open.
So I started gravitating toward genomics in 2013.
You know, they have huge computing needs,
super interesting algorithmic problems.
And hey, you know, HPC to the rescue, I got this.
So I made the move in 2014 to the Ontario
Institute for Cancer Research,
working with Jared Simpson.
Who, amongst other things,
was the author of ABySS,
which was one of the first
open source genome assemblers
that could actually work on human size genomes.
And to do that it was distributed
memory, it was MPI based.
So I'd clearly found my place.
ABySS 2.0 just came out.
[inaudible].
But it came out with the new
default mode being non MPI.
And the phrase "never again" came up.
In the meantime, GATK, basically
one of the canonical genomics toolkits,
now finally supports distributed memory
computing with Apache Spark, not with MPI.
So this is what I think the
world looks like sort of today.
Right now genomics is sort of wedged
in pretty close to the big data expanse,
but it's not thriving there as well as it could.
So I claim that if anything is going to bridge
these two solitudes, it's going to be genomics.
It's going to be genomics
because it's fundamentally data,
and not simulation intensive.
Truly large amounts of data.
Super irregular.
The problems on this data look
more like NoSQL databases
or social networks than they
do like PDEs on grids.
But it's fundamentally science.
It's fundamentally doing really subtle analyses
on this data, not just subtle analyses,
but asking questions that have never
been asked before on this data.
And amongst other things, this means that
testing and discovery of methods means
that your methods will need
to scale down super wide.
>> What do you mean by scale down?
>> So I mean the following.
So anything will scale if you
throw enough data at it, right.
But that's not how scientists work, right.
When we're trying to work out a problem,
when we're trying to figure out, okay, first,
is this even a question that's sensible to ask?
Two, if it is, is this a method
which can be used to ask this question?
You can't do that on terabytes or
petabytes of data, it just doesn't --
you're just going to be [inaudible].
You need it to kind of work on a laptop
or a workstation or a couple nodes, right.
And none of the big data
sort of frameworks do that.
They're not designed for that.
And this is one of the big reasons I think why
Spark -- which has other problems for science --
but some of the reasons it's just
not a physicist's friendly tool.
Because you can in principle run Spark on
your laptop, but it [inaudible], right.
It's just such a heavyweight framework.
And that's fine if your use case
for it is petabytes of data.
But the beauty of HPC tools is that they
typically can be meaningfully tested on kind
of a laptop workstation sort of thing.
And this is a feature that is utterly lacking
in most of these big data frameworks.
>> Why did it evolve that way?
I mean, if you look at this sort of simulation
stuff that you used to do [inaudible]?
>> So why did it evolve that way?
I think because -- and maybe this
is changing a little bit with things
like TensorFlow and deep learning or whatever.
But I think up until this point, the questions
that were being asked were pretty
well understood, right.
You were going to do a clustering algorithm.
You were going to use one of the five
that exist in the literature.
If you wanted to test it, you could do
it in Python on your desktop or something
and just immediately splat
it to Spark as a framework.
But I think for real sort of science --
as opposed to sort of a production thing.
Like once you have a method you know that
works, you can build some production thing
and invest whatever time and energy you need to,
because it's going to be run 10 million times.
So you can kind of do that for science.
But when you're just getting
started, you're asking deeper questions
than most of the big data stuff is.
I think this is less true
with deep learning stuff,
where they're still sort of in that same mode.
They're trying to figure out, well,
does this whatever convolutional
neural network make sense in this case.
There you're starting to see that.
And so TensorFlow will run on
your laptop, but Spark just won't.
It's a big statistics package, mainly
implementing fairly well understood tools.
So that would be my take on that.
So eventually I claim these two worlds
are going to end up touching each other.
But I think genomics is making this
bridge faster than anything else.
And it comes out of immediate needs.
For HPC, they recognize that it would
be good to have some of these big data
like approaches, but it's focused on exascale.
Which is fine, but it's more
future looking than genomics needs.
Big data is sort of trying to
relearn these HPC lessons as needed,
but it's doing it one bit at a time.
And so you have individual projects which
maddeningly have a piece of the puzzle
that genomics would want, but not all of them.
So yeah, it's going to take a lot of people
with a lot of different kinds of knowledge
to make this happen to help
with genomic research now.
And to help transfer expertise between
big data and HPC so they're not each,
you know, relearning the same thing.
And it takes places like this
to bring these people together.
>> What if we build a wall?
>> And make them pay for it.
>> Yeah.
>> Well, which them?
>> All of them.
>> Big data.
>> So I'm going to start with why I
think this convergence is inevitable.
Why I think genomics is a meaningful driver.
Then a number of, I won't say increasingly many,
but a number of hybrid genomics projects
that have a little bit of each.
And a couple of things that I'm
excited about in the future.
All right, so I claim this is
inevitable because the two camps started
in different places and they're
on different paths.
Simulations, as scale goes up, are
getting more complex, more dynamic.
They just have to.
As resolution goes up, there's
more physics that's involved.
And big data problems, they have, you know,
amounts of data that are going
up, but that's not the driver.
They want to ask deeper and deeper
questions of the datasets they have.
And all the while, the underlying mathematical
tools you have available to you are the same.
So HPC's path over the last 20 years
has largely been about scale.
From this ancient, ancient machine, scaling up.
The story's mainly been about hardware.
And the fundamental software stack
has remained largely unchanged.
Although the adoption of things
like OpenMP, OpenACC, OpenUH,
things like directive-based programming for
accelerators, does show that there's willingness
to adopt new approaches when
they're compelling enough.
Big data's path, like I say, is different:
the volume of data is more or less a given.
And its scaling is less about physical scale
and more about extracting more information
from the data, asking deeper
and deeper questions.
It started with MapReduce, where, you
know, PageRank was sort of the height
of sophistication for what you
could really pull off with it.
Start moving to streaming data.
Start moving to, you know,
actual machine learning,
actually fitting fairly complex
models of some sort.
And so eventually those things
have to hit each other, right.
Because whatever you do with simulations
or data, eventually you're going to have
to solve a linear algebra problem.
And in the big data world, the
sparsity is very unstructured.
And as you go to larger and larger
scale for something like a PDE,
your sparsity patterns get
less and less structured too.
You just start being constrained
by other things than the geometry
and the sparsity patterns start getting weirder.
You see the same thing in
sort of graph problems.
Social network graphs are, you know,
maybe not power-law distributed,
but they're very irregular.
Nodes can have very varying degrees.
Inexplicably, even though we're both Canadian
and my tweets on HPC are extremely hyperbolic,
Justin Bieber has many more followers than I do.
And these are the sorts of graphs
you'd like to be able to do big data on.
But when you actually do
something on these complex graphs,
you actually do exactly the same
thing you do in an unstructured mesh.
You do some neighbor exchanges,
you could do some calculations.
It's not a super weird thing.
And in fact, if anything, these social network
calculations are more latency sensitive
than they are in HPC.
Because the calculations are lighter.
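To make the parallel concrete, here is a toy sketch, in Python, of that kind of neighbor exchange on an irregular graph. The graph, values, and update rule are all invented for illustration, but structurally it is the same gather-from-neighbors step you would do in a halo exchange on an unstructured mesh:

```python
# Toy "social graph" computation: every node gathers its neighbors'
# values and does a small local update. Degrees are deliberately uneven.
graph = {  # adjacency list with very different node degrees
    "a": ["b", "c", "d"],
    "b": ["a"],
    "c": ["a", "d"],
    "d": ["a", "c"],
}
values = {"a": 1.0, "b": 0.0, "c": 0.0, "d": 0.0}

# One sweep: average over neighbors, exactly the shape of a mesh
# neighbor exchange (gather, then a light calculation per node).
new_values = {
    node: sum(values[nbr] for nbr in nbrs) / len(nbrs)
    for node, nbrs in graph.items()
}
print(new_values)
```

On a distributed run, the gather step is where the latency sensitivity shows up: the per-node arithmetic is tiny, so communication dominates.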
And of course, time series,
time series is time series.
You're going to be doing FFTs.
You're going to be doing
some correlation analysis.
There's only so much math available to you.
So these problems end up being the same.
And these two groups are going
to collide into each other.
On the HPC side, the main deliverable of this
has been white papers and vendor press releases.
But there's starting to be some cool stuff out
there, which I'm actually really pumped about
and would be more pumped about if it wasn't
something aimed for five, 10 years from now.
So data flow is starting to become a common
way of expressing complex calculations
on both the big data side and the HPC side.
You set up the flow of data and you let
your compiler and runtime maybe with hints
from you actually decide how that data gets
distributed and what the execution plan is.
When it works, it sort of abstracts
away the architecture,
trying to get some performance portability.
This is not a crazy thing to do.
SQL databases have been doing
this sort of execution planning
and optimization of steps for a very long time.
The trick is doing it in a distributed way.
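As a minimal sketch of the dataflow idea being described, with all the names invented: you declare the computation as a graph of nodes, and a separate evaluation pass (standing in for the compiler and runtime, which in a real system would also plan distribution and placement) decides how to execute it.

```python
# Toy dataflow graph: nodes declare what to compute, not when or where.
class Node:
    def __init__(self, fn, *inputs):
        self.fn, self.inputs = fn, inputs

    def run(self, cache=None):
        # A trivial "execution plan": recursive topological evaluation,
        # computing each node exactly once via the cache.
        cache = {} if cache is None else cache
        if self not in cache:
            args = [n.run(cache) for n in self.inputs]
            cache[self] = self.fn(*args)
        return cache[self]

a = Node(lambda: 2)
b = Node(lambda: 3)
total = Node(lambda x, y: x + y, a, b)
scaled = Node(lambda s: s * 10, total)
print(scaled.run())  # 50
```

A real system would inspect this graph before running it, fusing, reordering, and distributing steps; the point is just that the user declares structure and the runtime owns execution.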
Also, on the HPC side there's starting
to be these extended PGAS languages,
with very deep runtime stacks to try to do
some of this reordering at different levels.
Again, super cool but genomics
would like something sooner.
Big data, meanwhile, really just sort
of thinks about improving performance.
It doesn't really think about
trying to converge with HPC.
And there's a few things
going on on the big data side.
There's databases where they're starting --
the core engines are starting to look very HPC
in terms of extracting every little
bit of performance on the node.
And on the machine learning side,
they're starting to have more coupled
iterative computing [inaudible] on the hardware.
Again, infuriatingly they're different projects.
ScyllaDB is very HPC like.
It extracts every little bit of performance
out of commodity hardware with C++
and every little network trick you can imagine.
FaRM, from Microsoft Research, uses RDMA
and some remote procedure calls over SSDs
to get just absolutely insane,
very HPC like latency.
Even when the network is practically
flooded with requests,
it's doing arbitrary distributed
dictionary lookups over a
couple terabytes of data on SSDs,
over 20 machines, in 30 microseconds.
I mean, this is starting to be
performance you don't necessarily scoff at.
This is actually pretty good.
So on the database side, if all we wanted
to do in genomics was query a database,
we would be done; there would be tools
for that. But there's more to it than that.
We want to be able to make these queries but
actually integrate them into calculation.
There's tools like Spark which are getting
more and more performant, which is great.
And you can build things on.
But it's still fairly slow.
There's things like TensorFlow,
which are, again, building
dataflow execution over tensors.
It drives me nuts.
They're [inaudible].
Right, right?
>> Yeah, it's crazy.
>> I'm going to write a letter.
So, you know, this is starting to
look more HPC, and you're
actually doing real
calculations on GPUs over
multidimensional arrays; that's
starting to look better.
But it's all so primitive.
You know, the legally mandatory
[inaudible] example, a little PDE demo.
But, you know, you have to distribute a
computation, you actually write the chunks
of a computation out task by task.
So both of these sides are sort of groping
towards some of the same solutions.
Both sides are sort of capable in some cases
of taking full advantage of the hardware.
Dataflows and models are starting to be useful.
Higher-level languages are starting
to take place here and there.
But no one toolset has all the pieces.
And this is what's sort of
frustrating to us in genomics.
So I tend to think of genomics as having
two broad sort of types of workloads.
So one is very search heavy.
Indices, graphs, databases, sort
of typified by genome assembly.
Has everyone here seen my talk about
genome assembly and mathematics?
>> Everyone here is assembled from the genes.
So the idea is, you know, you take a
bit of a person, turn it into little segments
of genome, maybe 250 base pairs, which is
sort of the standard, out of 3 billion.
And you try pasting these together.
And you have to look up, you know,
if you have some chunk here, GGGA,
does the potential neighbor,
GGA something, exist.
And so you need to do these sorts of operations.
Assembling isn't actually something
you do necessarily super often,
but most of this sort of workflow happens a lot.
You're trying to do string indices or
string databases somehow and one way
or another build some implicit
graph and traverse it.
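As a toy illustration of the kind of string-index operation just described, here is a minimal k-mer neighbor lookup in Python. The reads and k-mer length are made up, and a real assembler would use compressed indices and packed encodings rather than a Python set:

```python
# Minimal de Bruijn-style neighbor lookup: index every k-mer in the
# reads, then ask which k-mers overlap a given one by k-1 bases.
K = 4
reads = ["GGGAT", "GGATC", "GATCC"]  # invented toy reads

# Index every k-mer seen in the reads.
kmers = {read[i:i + K] for read in reads for i in range(len(read) - K + 1)}

def neighbors(kmer):
    """Which indexed k-mers extend this one by one base? Each hit is
    one edge in the implicit assembly graph."""
    suffix = kmer[1:]
    return [suffix + base for base in "ACGT" if suffix + base in kmers]

print(neighbors("GGGA"))  # ['GGAT']
```

Traversing the graph is then just repeated calls to `neighbors`; the hard part at scale is that the index holds billions of k-mers and the lookups are completely irregular.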
It's always much messier
than pretty little diagrams.
You always end up getting fragments.
This is a mosquito genome, the best available,
but it's broken into like,
you know, 300, 500 pieces.
>> Do you need to use strings?
I mean, there's only four letters.
So no.
They'll use, well, strings, but it'll
be four-bit or four-character alphabets.
Yeah. Although sometimes
more than four characters for the bases.
[inaudible].
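A quick sketch of what such a packed representation looks like, assuming a plain four-letter alphabet at 2 bits per base; this is generic bit-packing for illustration, not the encoding of any particular tool:

```python
# 2-bit packing for a four-letter DNA alphabet (toy code).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def pack(seq):
    """Pack a DNA string into an integer, 2 bits per base."""
    value = 0
    for ch in seq:
        value = (value << 2) | CODE[ch]
    return value

def unpack(value, length):
    """Recover the string from the packed integer."""
    return "".join(BASE[(value >> (2 * i)) & 3] for i in reversed(range(length)))

packed = pack("GGGA")
print(packed, unpack(packed, 4))  # a 4-base k-mer fits in one byte
```

Alphabets with more than four symbols (for ambiguous or modified bases) need more bits per character, which is part of why the four-bit variant comes up.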
This is -- here, let me skip it.
I may not be able to.
These are pieces of a graph.
So, you know, these can be
very small nodes in this graph.
This is one of those extended a great deal.
But assembly, especially with these
short reads, you can't resolve repeats.
Like if there's a cycle, you have no idea
how many times that cycle actually executes.
And also just [inaudible], you
will get fragments, you'll get --
you have this but you don't have this or this.
So a real assembly with just short
reads ends up being sort of fragmented.
And you have to use a much more
[inaudible] approaches to sort of map
on where an actual physical
chromosome is actually [inaudible].
You have to use... now you're
leaving the realm of computation,
and now you've actually got
people in the lab trying to figure
out where that is actually going.
[inaudible].
Things are a little better.
I'll talk about nanopore reads.
There's long read technologies which
are starting to help [inaudible].
So there's that and there's sort of
more traditional statistical stuff.
But the stats problems get really big.
So instead of ripping my genome up into a
bajillion little pieces, say 100 million --
so now let's say we're leaving
just plain DNA, and noting
that RNA gets transcribed from it
sometimes on the way to protein.
So take that from one of my cells,
the RNAs, and sequence that.
Figure out which genes are being expressed.
But now do that for the different types
of cells everywhere in my body, right.
They all have the same genome, that same DNA,
but they're expressing RNA very differently.
And now do that for, you know, for a
bunch of cells in my body, maybe in a sick
or a healthy state, and then
do that across a population
and start correlating it with other things.
So you get these truly large genomic sort
of statistical problems, which are cool.
But they have a bit more structure.
Okay, so those are the sorts of
things people try to do in genomics.
So I think anyone who's ever worked
in an HPC type center has asked themselves
this question, and this is sort of the result.
These tools are not made for
these problems, which is fine.
These tools are made for
more regular data access.
You know, an OpenMP parallel for doesn't make
any sense in something like a database, typically.
That's not the sort of thing you have.
These sort of neighbor operations, they
can but it's a little more complicated.
Distributed databases don't typically
use libraries that look like MPI or OpenMP.
People do in fact try and they
succeed and nobody dies from it.
But it's really, really hard, and
it's really, really hard at a time
when people are still developing the method.
So going back and adding a new assembly method
to one of these codes is super, super tough.
On the other hand, people have
had somewhat more luck with PGAS.
On the other hand -- okay, so why don't
genomics tools just use these databases I talked
about that had really good performance?
Well, this is again, sort of, see figure
one, it just doesn't work out so well.
There's these very performant low level pieces,
but they're databases, they're
not real libraries.
There's no meaningful way to compose
them into something meaningful.
>> What about like the past [inaudible]?
>> Like what?
[inaudible].
Yeah, so I think that's actually
starting to look more useful I think.
And I think that sort of ties into a little bit
of -- I guess maybe because I learned about them
at the same sort of time, I tend to
put them in the sort of execution sort
of data flow model, which isn't really right.
But I think those do have some
promise, but they're just so new.
On the other hand, using these
tools like Spark isn't any better.
They're too slow, and this is what we were
saying earlier, especially at small scale.
There's this actually scathing paper,
which I really recommend if you enjoy
reading papers that tear apart other papers.
It's easy to scale when you have
really, really big overheads
that you gradually overcome
with more and more data.
This is a paper just exploring the scalability
of some of these tools, these things
that are shown in the literature to scale very well.
But comparing them to a single thread
execution, you know, a single thread running
on a single node for some of the same
benchmark problems they published.
And they just get absolutely slaughtered
by a well written single code
for these small problems.
Scaling really large scale
isn't necessarily that hard.
Running at small scale turns
out to actually be super slow.
[inaudible].
Spark, let's take Spark as an example.
It's lovely.
It has very nice architecture.
But when you're building something for
petabyte scale, you don't [inaudible].
So I mean, for one thing, just these
JVMs spinning up, it's already crazy;
you're talking about,
you know, dozens of seconds.
But they have all this infrastructure
for distributing computation,
you know, right at the beginning.
And it's just super, super heavyweight.
And it's not even that it's bad.
It's not [inaudible] for these size problems.
Like if your dataset isn't filling
up dozens and dozens of disks,
you have no business [inaudible].
And yet people do it.
So this infuriates me, so I'm
going to share it with you.
So the Broad Institute in Boston
put together this Hail project.
And it's a tool for, you know, a user,
a researcher, you know, asking questions
and doing aggregations: does
person x have genetic variant y in them.
It's a two-dimensional array.
You know, you're interactively
doing this data exploration.
And it's all good, it's all fine.
There's nothing wrong with that.
But it's in Spark because it's all
they could manage to find to use
to tackle this problem [inaudible].
A big problem in this case,
it's like a billion entries.
And each of these entries has --
I don't know -- a dozen numbers.
This is not a big problem, right.
It's very unwieldy for researchers
on smaller sets.
And I claim that just the fact this even exists
is a failure, an admission of failure by those
of us who, you know, work at HPC centers
or whatever, at providing decent parallel
computing tools for these sorts of genomics problems.
Right now doing large genomics means
going out and buying huge RAM machines.
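For a sense of scale, the kind of query just described is conceptually an indexed lookup or a column aggregation over a samples-by-variants matrix. Here is a plain-Python sketch with invented toy data; real tools like Hail store much richer per-entry records, and at billions of entries you obviously would not use Python lists:

```python
# Toy samples-by-variants matrix of 0/1 genotype calls.
import random

random.seed(0)
n_people, n_variants = 1000, 500
# rows = people, columns = variants; 1 = person carries the variant
genotypes = [[random.randint(0, 1) for _ in range(n_variants)]
             for _ in range(n_people)]

person_x, variant_y = 42, 137
has_variant = bool(genotypes[person_x][variant_y])   # does x carry y?
carriers = sum(row[variant_y] for row in genotypes)  # how many carry y?
print(has_variant, carriers)
```

The operations are that simple; the frustration is that the standard distributed backends for them are heavyweight out of all proportion to queries on a modest matrix like this.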
So the good news is there's
tons of amazing work being done,
including by people here,
on these sorts of problems.
Just like data structures,
sketching and streaming methods
that are approximate or even exact.
There's fantastic work, but they're
doing it because they have to,
because there's no meaningful way
to do distributed memory problems
in some sort of fairly flexible way.
>> So I have a completely tangential question.
>> Yeah.
>> What about the privacy
implications of these data sets?
>> I am super glad you asked that.
And I would be happy to answer
that question in 15 seconds.
[inaudible].
I knew you would ask that question.
Might be because I've trawled your
troublemaking social media history.
So genomics data is doubling every
seven to 12 months or whatever.
Again, this is the traditional slide to show;
it's basically obligatory now in all genomics talks.
So here it is.
Genomic volumes are doubling every seven months.
And my usual sunny way of framing this
is that it underestimates how terrible
it's all going to be.
So there's new devices arriving which
are going to move some sequencing
from specialized core facilities to small labs.
So some of us have seen this movie before.
Instead of having to, you know,
send your data to some center that does this
and get it back three weeks later, biologists
are going to be able to do it on their laptop.
Entirely new data types are becoming
available, like direct RNA sequencing.
And all of these problems,
changes in research genomics,
are going to be small potatoes
compared to the actual real elephant
in the room, which is genomic medicine.
So whole genome sequencing is already
starting to become part of the toolkit
for rare diseases or some oncology cases.
$1000 for a whole genome or whatever is a
lot for a test but it's not unfathomable.
And as this becomes cheaper and faster
and easier, it's going to become part
of the standard of care for some treatment.
And medicine is absolutely
enormously mind-boggling huge.
Like in the US, just hospital spending
is 10 times the entire NIH budget.
When medicine starts using
genomic data routinely,
it's going to swamp any research [inaudible].
So if this were just about doing
the larger number of the same sorts
of analysis, that wouldn't be so bad, right.
You'd just need more of the same
sorts of resources as we have.
But it's actually worse than that.
More data and better techniques make possible,
and necessary, different kinds of analysis
that involve looking at lots of forms of data.
So, you know, metagenomics is one case
where -- here's another one of these graphs.
This is a similar sort of graph.
But in this case each of these things
is legitimately a different critter.
So you take the sample of
stuff with microbes in it --
with human health it could be
wounds, looking for infection.
It could be from the environment,
you could be looking for pathogens.
It could be environmental.
It could be gut flora.
And you're trying to figure out just from these
pieces of genomes what all critters you have,
without even knowing how many critters you have.
And, you know, this is already complicated.
But, you know, now imagine starting
to do this with a time series.
Or try to imagine having the
data set of every known sort
of microbe you might want to probe for.
These start getting really absurdly huge.
On the population genomics side, right now in
human health we tend to think in terms of
a single reference human genome.
Which is just an absurd concept.
There's huge amounts of variation
and diversity in people.
And especially in some genomic regions.
Anything to do with the immune
system, you know, is wildly variable.
And there's known differences
between populations.
You know, you get founder effects, where you
have certain variants that just stick
with a population; they tend to
get fixed in that population.
So rather than having a single human reference,
it makes more sense to build a graph reference
that contains large amounts
of the variant genome.
But now creating these graphs is nontrivial,
and using them is much more complicated
and requires much more memory
than a linear reference.
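A toy sketch of the graph-reference idea: a linear backbone plus a variant "bubble", so that two paths through the graph spell two alleles. The sequences and structure here are invented for illustration:

```python
# Tiny sequence graph: node 0 -> (1 or 2) -> 3, where nodes 1 and 2
# are alternative alleles at the same site.
nodes = {0: "ACGT", 1: "A", 2: "G", 3: "TTCA"}
edges = {0: [1, 2], 1: [3], 2: [3]}

def paths(node, prefix=""):
    """Enumerate every sequence the graph can spell from this node."""
    prefix += nodes[node]
    if node not in edges:  # sink node: one complete allele spelled out
        return [prefix]
    return [p for nxt in edges[node] for p in paths(nxt, prefix)]

print(paths(0))  # ['ACGTATTCA', 'ACGTGTTCA']
```

Even this toy shows the cost: aligning a read now means searching over paths rather than positions, which is why graph references need more memory and cleverer indexing than a linear one.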
And we've already talked about how quickly
genomic sequencing data is growing.
That's crazy.
So let me tell you about some projects.
So the approach in genomics has been --
so far has been basically to try to
pick and choose bits of technology.
So there's excellent tools that
make full use of a single node.
And often you're trying to integrate -- you're
trying to run this very complex pipeline
of tools using sort of cloud or big data tools.
So I'm going to talk about three
projects, how they're straddling this divide,
and what would be possible across it.
And I was not careful enough
in checking my slides.
My GitHub page has not updated
so it doesn't have some slides.
So I'm going to talk to you
for the rest of this project --
for the rest of our time
together without the slides.
So I want to tell you about three projects
that take different pieces of this.
So one of the projects is nanopore sequencing.
And I'm going to go back and
show you that nanopore slide.
So this is a straight out -- this is a
straight out numerical computing problem.
Genomics data types are getting,
you know, much, much richer.
They're much more complicated than
they used to be, which is awesome.
So we have this little device
here, stapler size.
And DNA gets fed into it with a little
protein, much like what happens
when DNA is actually being transcribed
in your blood and your cells.
And it feeds into a little pore,
and this entire thing is filled
with an ionic fluid; current goes through.
And as one of these bases works its way through
here, it interacts with the pore, and the amount
of current that's going through changes.
So we have a straight-up signal processing
problem where actual honest-to-god [inaudible].
And so now we have to calculate [inaudible].
So this ends up looking like an
absolutely bog standard sort of HPC thing.
You end up using GPUs.
You end up sort of having loops over log-likelihoods.
[inaudible].
And so you can distribute
these sorts of data sets.
But, again, you need to train these models.
And this training ends up looking
much more like a big data problem.
You have large amounts of data.
You orchestrate the [inaudible].
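To give a flavor of that signal-processing step, here is a deliberately naive sketch: segment a noisy current trace into "events" wherever the mean level jumps. The trace is synthetic and the threshold rule is a stand-in; real basecallers use far subtler statistical and neural models:

```python
import statistics

# Synthetic pore current: three flat levels plus deterministic "noise".
levels = [80.0] * 20 + [95.0] * 20 + [70.0] * 20
noise = [((i * 7919) % 13 - 6) * 0.3 for i in range(60)]
signal = [l + n for l, n in zip(levels, noise)]

# Call an event boundary where the mean of the next 5 samples jumps
# away from the mean of the previous 5 (with a refractory gap of 10).
events, start = [], 0
for i in range(5, len(signal) - 5):
    before = statistics.mean(signal[i - 5:i])
    after = statistics.mean(signal[i:i + 5])
    if abs(after - before) > 10 and i - start >= 10:
        events.append(signal[start:i])
        start = i
events.append(signal[start:])
print(len(events))  # 3 events, one per current level
```

Each detected event would then be mapped to the k-mer most likely to have produced that current level, which is where the GPU-heavy likelihood calculations come in.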
Another problem I want to tell you about is
the Pan-Cancer Analysis of Whole Genomes,
which is a project that's just about to come
out -- or the papers will be announced tomorrow.
Where we needed to analyze -- it was a
huge international collaboration involving
about 20 countries, about 30 cancer types,
sequencing data from all over the world.
Some of that data could not
leave their home country.
And we needed to run an analysis
pipeline, uniform analysis pipeline,
on all 2,700 of these paired
tumor-normal samples.
And we needed to do it in such a way that
people could actually make biological inferences
at the end.
That people could take a look at this data and
say with some certainty, you know, holy cow,
looks like these pairings
are actually associated
with these forms of cancers
and not these others.
So this -- again, you have specific tools,
individual tools, that look like HPC,
but trying to orchestrate this
over an international consortium,
over a bunch of different
clouds and HPC clusters --
these are tools that the big data side,
the cloud side, has, and they work really well.
And so we needed to adopt those.
So we were very early adopters of Docker.
There's this store, Dockstore (dockstore.org),
where you can actually go and get
the Docker containers for each of the tools.
And the Common Workflow Language,
the orchestration language,
allows these things to run.
And when you're trying to, you know, deploy
updates across all these different clouds,
and you're trying to make sure everyone
is running exactly the same pipeline,
this is the only way to do
these sorts of analyses.
These tools exist in the big data
cloud world, and they're just starting
to make appearances in the HPC world.
And that's great.
That's an example of technology transfer coming over.
But what really gets into this convergence is
this, the problem we're currently working on,
which is CanDIG, a distributed
infrastructure for genomics.
So in Canada, health -- and thus health
data -- is a provincial jurisdiction.
And it's a very big deal to
ship health data [inaudible].
Provinces take privacy extremely seriously,
and the privacy regulations are
different between provinces.
For research data, it's not such a big deal.
Because when you're putting together a research
project, and when you're consenting data
for a research project -- which is
a very labor-intensive process --
you can write the data-sharing agreements
for that particular research project,
so that these particular researchers
can all access the data.
But by 2020, 2025, the majority of genomics
data is going to be health related.
And you simply can't negotiate these
arrangements for every single person.
You wouldn't want to -- it's a crazy idea.
You wouldn't want health data to be
routinely visible all across the country.
So we're building a distributed infrastructure
where each hospital, each data steward,
keeps its own data and provides
programmatic analysis of the data.
You can query the data with a list of analyses,
and the sites send results back,
making sure that not too much
identifying information is in them.
So we're building up this infrastructure,
and the stack doesn't look
anything like HPC or even big data.
It's all Go and web services.
And it might seem completely sort
of irrelevant to this conversation,
except the privacy implications are
getting finer and finer grained.
So in this post-Facebook-scandal world,
in North America and Europe,
patients are getting more and more
ownership of their health data,
and in some places are starting
to be allowed to determine
who gets to see it and how
much access they have.
So now this sort of computation --
performing a distributed analysis
of data through this exchange
of messages over web interfaces --
as more and more people get control
of the privacy of their data,
which is a good thing, it's no longer
a question of exchanging messages
between a hospital in Nova Scotia
and a hospital in British Columbia;
it might actually be two nodes in the
same rack in some cloud data center.
And the issue isn't that the data can't
be in the same memory for technical reasons;
the issue is that the
data can't be in the same bit
of memory because of privacy regulations.
So now what we have is a bunch of nodes
in a rack, each with different views
of a larger dataset, that are exchanging
information via messages with each other
to create some distributed computation.
So this is now actually an HPC problem.
We're actually doing distributed
computations with messages, you know, in a rack,
or in a data center, or across the country.
And we need to be able to do that efficiently.
Except the messages aren't,
"here are my guard cells."
The messages are the DNA equivalent of Go Fish:
"do you have any variants that look like this?"
So this is very much HPC plus big data.
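A minimal sketch of what that Go Fish message pattern might look like: a coordinator asks every data steward a prescribed question, and each site answers only with an aggregate over its own data, never raw records. The function names, the site layout, and the suppression threshold are all invented for illustration; real CanDIG sites sit behind authenticated web services rather than in-process lists.

```python
# Federated "Go Fish" variant query sketch. Each site keeps its data
# and answers a prescribed query; small counts are suppressed so the
# aggregate doesn't identify individuals. All names are hypothetical.

MIN_COUNT = 5  # assumed small-cell suppression threshold

def site_query(local_variants, chrom, pos, alt):
    """One data steward counts carriers of a variant in its own data."""
    n = sum(1 for v in local_variants if v == (chrom, pos, alt))
    return n if n >= MIN_COUNT else 0  # crude small-cell suppression

def federated_count(sites, chrom, pos, alt):
    """Coordinator: exchange messages with every site, sum the answers."""
    return sum(site_query(data, chrom, pos, alt) for data in sites)

# Two sites hold (chromosome, position, alt-allele) records locally.
sites = [
    [("17", 123456, "T")] * 7,  # 7 carriers: large enough to report
    [("17", 123456, "T")] * 3,  # 3 carriers: suppressed at the site
]
print(federated_count(sites, "17", 123456, "T"))  # 7
```

The interesting part is exactly the HPC part: doing many such exchanges efficiently, whether the two sites are across the country or two nodes in the same rack.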
>> [Inaudible audience question]
>> So there is -- there's a big literature on that.
And any scheme that totally ruled
that out would probably be
so restricted you couldn't do anything with it.
But there are really interesting
things you can do --
so the first way you avoid this
is you only have a prescribed list
of queries you can ask.
That's how you start.
And you do what you can with those.
But there's a bunch of interesting
things in differential privacy,
which we can talk about a little bit too.
Which means you can't guarantee that some
information won't leak, but you can put,
you know, a probability bound on it.
And that starts to seem like
a fairly reasonable way forward.
>> But who sets the vulnerability?
>> So that has to be some regulator, or
it could actually be the user, right.
If it's going to help research
into this rare disease that my son has,
[inaudible] that you can find
out who I am or something.
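To make the differential-privacy idea concrete, here is a minimal sketch of the textbook Laplace mechanism for a count query; the epsilon parameter is exactly the knob a regulator, or the user, would set. This is standard differential privacy, not CanDIG's actual implementation, and the function names are invented here.

```python
import math
import random

# Laplace mechanism sketch: add noise scaled to sensitivity/epsilon so
# any one person's presence changes the released answer's distribution
# only slightly. Smaller epsilon = stronger privacy, noisier answers.

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) by inverting the CDF of a uniform draw."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count, epsilon, rng=random):
    """Release a noisy count satisfying epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one record
    changes it by at most 1), so noise with scale 1/epsilon suffices.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)
print(dp_count(100, 1.0, rng))  # ~100, plus noise of scale 1/epsilon
```

Averaging many releases converges to the true count, which is why deployed systems also have to budget how many queries any one user may ask.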
So to solve the problems that I want
to solve, I want there to be tools
that have big-data-type flexibility
but actually perform like HPC code.
And there are things that I'm really
genuinely uncharacteristically optimistic
about that can make that happen.
And a lot of them are the sorts of things that
are happening by people affiliated with ISES.
With the introduction of nonvolatile memory,
external-memory algorithms start
becoming extremely exciting again.
And with nonvolatile memory plus
RDMA over Converged Ethernet,
you can start being able to do
distributed external-memory algorithms.
So you start being able to do these
very complicated string databases,
but in a performant way.
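As a toy illustration of the external-memory style, here is a sketch that binary-searches a sorted, fixed-width k-mer index through a memory map, so only the touched pages ever enter memory -- the access pattern that fast nonvolatile storage makes attractive again. The 2-bit k-mer encoding and record format are invented for illustration.

```python
import mmap
import os
import struct
import tempfile

# External-memory lookup sketch: the index lives on storage and is
# binary-searched via mmap, never loaded wholesale into RAM.

RECORD = struct.Struct("<QI")  # (encoded k-mer, count), 12 bytes

def encode_kmer(kmer):
    """Pack a DNA k-mer (k <= 32) into a 64-bit int, 2 bits per base."""
    code = 0
    for base in kmer:
        code = (code << 2) | "ACGT".index(base)
    return code

def write_index(path, counts):
    """Write (k-mer, count) pairs as sorted fixed-width records."""
    records = sorted((encode_kmer(k), c) for k, c in counts.items())
    with open(path, "wb") as f:
        for key, count in records:
            f.write(RECORD.pack(key, count))

def lookup(path, kmer):
    """Binary-search the on-disk index; only touched pages are read."""
    target = encode_kmer(kmer)
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        lo, hi = 0, len(mm) // RECORD.size
        while lo < hi:
            mid = (lo + hi) // 2
            key, count = RECORD.unpack_from(mm, mid * RECORD.size)
            if key == target:
                return count
            if key < target:
                lo = mid + 1
            else:
                hi = mid
    return None

# Demo: build a tiny index in a temporary directory and query it.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "kmers.idx")
    write_index(path, {"ACGT": 3, "TTTT": 7, "GATT": 1})
    print(lookup(path, "TTTT"))  # 7
    print(lookup(path, "AAAA"))  # None
```

The distributed version replaces the local mmap with remote reads over RDMA, but the algorithmic shape stays the same.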
With improved PGAS languages -- like the
ur-PGAS language, SHMEM, and its successors
like OpenSHMEM, but also things like
Chapel that have built-in distributed arrays
and distributed associative arrays --
that starts to look promising.
And with these new dataflow-type execution
models, including these task-based models,
it starts to be promising.
And with the, you know, the genomics work
and with the algorithms work that's going on,
that's starting to -- it's really exciting.
This is probably the most
exciting time in, you know,
biological science there's ever been.
But the only way we can routinely get
answers to the new questions we're thinking
of asking is having centers like this that
bring people with very different expertise
and interests together so
they can talk to each other.
So, you know, the fact that,
you know, being invited to talk
at IHIS is actually super exciting.
Because this is the sort of place
where stuff like that happens.
And that's it.
Thanks.
[ Applause ]
