Josh Stuart:
I want to thank the organizers for inviting
me.
So I'm going to talk about our method called
PARADIGM, which integrates multiple types of
data on patient samples for inferring what's
going on in these cancers.
So as folks know in TCGA, we generate lots
and lots of data and it's often referred to
as a flood.
The Broad calls their system Firehose
for an appropriate reason.
And my point is that when you participate
in these projects you often want to do lots
of different types of comparisons, from comparing
expression and methylation to figure out
why something's not expressed or, you know,
looking at the copy number and expression
and methylation all together.
This quickly gets out of control.
You have lots of combinations of things you
want to look at and it can be overwhelming.
More importantly, when you're thinking about
a gene and trying to figure out what's going
on with that gene is it active, is it not
active, you've got all these different pieces
of data telling you different things, you
feel like you're at this stop light and you
don't know whether to go or not.
Well, at least that's how I feel and this
is often how many of us feel [laughs].
This is what it makes you want to do.
This is your brain on all these types of data.
So our particular approach is to say let's
use a knowledge based approach and the analogy
I like to use is you're kind of like a detective
or a car mechanic in this example and each
patient is a different accident, let's say.
And something different went wrong and some
things are more serious than others here.
And if you could try to do data mining on
these car wrecks, if somebody handed you a
ream of data, how fast the car was going,
the direction, what people were saying.
Some of it's relevant; some of it's not relevant.
You're going to be better off if you use knowledge
about how the system works and I like Car
Talk so I'm showing Click and Clack here,
right.
People call them because they know a lot about
cars and can figure out and diagnose the problem.
And this cartoon shows a radiator running
off and the mechanics looking in the engine
and saying, "I know what the problem with
the car is, that you don't have a radiator."
Now you laugh, but with this data set, you
know, it took a little bit of knowledge in
this case to know what was missing.
So in cellular systems obviously we have put
together at least some of the circuitry and
the machines inside cells and so we should
use those.
And I'm going to show you a system that defines
a computational model to represent these types
of systems and we benefit from all these efforts
out there, and there are many I didn't list,
that's the ellipsis at the end. We've drawn
from Reactome, KEGG, BioCarta, NCI PID, many
different institutions, and our favorite, of
course, combines all of them: Pathway Commons
from Memorial Sloan-Kettering.
And so we try to suck in all that data to
learn something about what's going on in a
cell.
So to motivate why we want to do this beyond
just data fusion just think of a simple example,
we've got a transcription factor and you're
looking at the expression of the transcription
factor and there's let's say three different
transcription factors shown here.
You know, you've got two that have high expression
shown in red and one that's lower expression.
And we know that expression isn't everything
and so it's almost a teleological argument,
but how do you figure out whether something's
working or not?
How would you figure out that an enzyme's
working?
You know, even if you had magic goggles and
you could look inside a cell and see that
it's bumping around and moving in a cell and
chewing things up, you're going to look at
its secondary effects.
You're going to look, did it actually metabolize
substrate?
Or, if it's a kinase, did it actually
phosphorylate a target?
And for a transcription factor, is it turning
on its targets?
Right, and so that secondary evidence tells
you something about the activity of the transcription
factor so in this case you assume or you infer
that the transcription factor's on and that
might confirm your expression evidence.
In another case you might see that, oh, well,
the targets aren't doing anything downstream
of the factor, and in this case you would think
it's off, either post-transcriptionally
or even post-translationally.
We didn't activate this protein, or it's not
localizing correctly, or there's a mutation
that's blocking its function, or its
co-activators, right, aren't around.
On the reverse, you could have a low level
of expression of the factor and yet it's still
enough to have potent transcriptional activity.
So you want to look around the neighborhood,
is the argument here, to figure out what's
going on in these things.
So that's one piece of the puzzle,
to look at neighbors.
And the other idea too is, you know, in this
previous example we infer that the factor
was on because of its downstream targets.
But suppose I ask Gaddy
to give me GISTIC plots. Now this is a different
type of data, copy number data, and,
just serendipitously, all the targets
are amplified now.
And so I could explain away that overexpression
via amplification, and so I'm less
likely now to think that the factor's on.
Maybe I still think, you
know, over my prior expectation, it's on.
But it's not as high anymore because I have
another piece to explain the up-regulation
of those targets, via a cis-regulation type
of machinery.
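
Editor's sketch: the "explaining away" argument above can be checked with a tiny two-cause model. This is a minimal illustration under assumed numbers, not PARADIGM itself. The priors, the noisy-OR form, and every name in the code are hypothetical choices made only for this example.

```python
from itertools import product

# All numbers below are hypothetical, chosen only to illustrate the reasoning.
P_ACTIVE = 0.3      # prior belief that the transcription factor is active
P_AMPLIFIED = 0.2   # prior belief that the target gene is amplified
LEAK, STRENGTH = 0.05, 0.8  # noisy-OR leak and per-cause strength


def p_target_up(active, amplified):
    """Noisy-OR: either cause alone can push the target's expression up."""
    p_stays_down = (1 - LEAK) * (1 - STRENGTH) ** active * (1 - STRENGTH) ** amplified
    return 1 - p_stays_down


def posterior_active(amplified=None):
    """P(factor active | target up [, amplification status]) by enumeration."""
    weight = {0: 0.0, 1: 0.0}
    for active, amp in product((0, 1), repeat=2):
        if amplified is not None and amp != amplified:
            continue  # skip worlds that contradict the copy number evidence
        prior = (P_ACTIVE if active else 1 - P_ACTIVE) * (
            P_AMPLIFIED if amp else 1 - P_AMPLIFIED
        )
        weight[active] += prior * p_target_up(active, amp)
    return weight[1] / (weight[0] + weight[1])


p_expression_only = posterior_active()             # only see the target is up
p_with_amplification = posterior_active(amplified=1)  # also see it is amplified
print(f"expression only: {p_expression_only:.2f}")
print(f"with amplification: {p_with_amplification:.2f}")
```

With these made-up parameters the belief that the factor is on stays above the prior once the amplification is seen, but drops well below what the expression evidence alone suggested, which is the pattern described in the talk.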
So to model those two pieces of information,
we're also standing on the shoulders of giants
here.
There's been lots of development in the '80s
and '90s and even currently, with seminal work
in the field from Judea Pearl and Heckerman
in the early '90s and more recently by Daphne
Koller and Nir Friedman and Eran Segal.
There's lots of people in this list and I
would recommend folks read this really nice
review article by Nir Friedman in Science
in 2004, so it's getting dated, but it's still
a very nice read.
So these Bayesian networks and probabilistic
graphical models that they describe give us
a very nice way of modeling lots of different
data and dependencies and we can -- we can
learn something from data where we might have
had a knowledge bottleneck before.
And so just a simple example here, let's go
back to the diagram we had from the -- from
the nice work from Sloan-Kettering and the
GBM study and we have an oncogene, MDM2, that
is known to inhibit p53.
So there are two parts to this system that
we model, one has to do with the regulation
of MDM2's activity and the other part has to
do with the interaction between it and p53.
And just as a quick toy
example, the model that we have, so when you
see our activities for genes it's actually
a little bit more of a rich representation
that looks something like the central dogma
for a gene, right?
You could -- you have a certain number of
copies in the genome, you can express it,
you can have a certain level of protein and
a certain activity in that protein and all
these variables are beliefs that you infer
from data and these little black boxes show
you constraints that help you infer those
beliefs from data or from other beliefs in
the system.
And you can propagate this information to
infer something about a higher level thing
like apoptosis or activities for these genes.
And that's what we use downstream for our
downstream analysis.
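
Editor's sketch: the central-dogma chain described above (copies, expression, protein, activity, tied together by constraints) can be mimicked with a brute-force toy. This is an assumption-laden illustration, not the PARADIGM model: the three-state encoding, the `agree` factor, and the `EPS` and `OBS_NOISE` values are all hypothetical.

```python
from itertools import product

STATES = (-1, 0, 1)  # down, unchanged, up
CHAIN = ("copies", "mrna", "protein", "activity")
EPS = 0.2        # hypothetical chance that a step disagrees with its parent
OBS_NOISE = 0.3  # hypothetical chance that a measurement is wrong


def agree(child, parent, eps):
    """Constraint factor: high weight when two states match, low otherwise."""
    return 1 - eps if child == parent else eps / 2


def activity_belief(observed):
    """Marginal belief over 'activity' given data on some chain variables."""
    belief = dict.fromkeys(STATES, 0.0)
    for assignment in product(STATES, repeat=len(CHAIN)):
        state = dict(zip(CHAIN, assignment))
        weight = 1.0
        # Constraints between consecutive variables in the chain.
        for parent, child in zip(CHAIN, CHAIN[1:]):
            weight *= agree(state[child], state[parent], EPS)
        # Constraints tying hidden variables to the observed data.
        for name, value in observed.items():
            weight *= agree(value, state[name], OBS_NOISE)
        belief[state["activity"]] += weight
    total = sum(belief.values())
    return {s: w / total for s, w in belief.items()}


concordant = activity_belief({"copies": 1, "mrna": 1})
conflicting = activity_belief({"copies": 1, "mrna": -1})
print("copy gain + high expression:", concordant)
print("copy gain + low expression: ", conflicting)
```

When the two data types agree, the belief that the gene is active is strong. When they conflict, the measurement closer to the activity variable (expression here) wins, but with less confidence. Enumeration is only feasible for a toy this small; a real pathway needs a proper message-passing inference method.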
And so the big picture looks like we take
a cohort of patients, various types of data,
run it through our pathway models and then
we produce one matrix that we can now do analysis
on.
So we don't have to think about all these
different modalities anymore, we can just
think about is the gene active in the sample
and provide this new matrix for analysis.
So for the ovarian study, the obvious signature
here from the PARADIGM analysis was this FOXM1
signature, so when we zoomed in on this, all
the patients pretty much had an up-regulation
of this known mitotic regulator, FOXM1.
The slightly more interesting story about
it is that it has two isoforms and one part
feeds into proliferation, the other part feeds
into DNA repair, and there's a lot of disruption
in the genome in all the ovarian samples.
They're getting constitutive signaling
through, like, ATM and ATR, turning on genes
like FOXM1 that, if they're not being spliced
correctly, are promoting two very different,
opposite kinds of things that you want to
happen in a cell: you know, this proliferation
switch and this DNA repair switch.
So FOXM1 also regulates BRCA2, for example.
So, very interesting story surrounding FOXM1,
if you take the pathway activities and you
try to define subtypes for the ovarian samples
then the good news there was that we could
actually start seeing a delineation of meaningful
subtypes so this purple cluster shows you
that they have slightly better survival patterns
than the rest of the patients.
We've recently worked on the colorectal paper
led by Rajinder Kaul [spelled phonetically]
and David Wheeler [spelled phonetically] and
in this case the story isn't so much FOXM1
but activated MYC throughout, and that's an
interesting piece of information.
As we see in the mutation data and other types
of genomic perturbations, TGF-beta signaling
pathway genes are mutated, and those all impinge
on this mis-regulation of MYC, and that also
bears out in the pathway analysis.
And so one other type of analysis that we're
doing with the pathways is we can take two
groups of samples or patients and look for
markers of one subtype versus another say,
and then home in on sub-networks that are
markers for a particular cohort.
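
Editor's sketch: one simple way to look for markers of one group versus another is to score each gene's inferred activity with a two-sample statistic and rank the genes. The talk does not say which statistic is used, so Welch's t is an assumption here, and the activity values are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic: positive when group a is higher than group b."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


# Hypothetical inferred activities per gene, split by subtype.
ACTIVITIES = {
    "FOXM1": {"basal": [1.2, 0.9, 1.1, 1.4], "luminal": [0.1, -0.2, 0.0, 0.3]},
    "ESR1": {"basal": [-0.8, -1.0, -0.6, -0.9], "luminal": [0.9, 1.1, 0.7, 1.0]},
    "GENE_X": {"basal": [0.1, -0.1, 0.2, 0.0], "luminal": [0.0, 0.1, -0.1, 0.1]},
}

scores = {
    gene: welch_t(groups["basal"], groups["luminal"])
    for gene, groups in ACTIVITIES.items()
}
# Strongest markers first, regardless of direction.
ranked = sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)
for gene in ranked:
    direction = "basal" if scores[gene] > 0 else "luminal"
    print(f"{gene}: t = {scores[gene]:+.1f} (higher in {direction})")
```

Genes with large scores of the same sign that are also connected in the pathway graph would be the candidates for a subtype-specific sub-network. That graph step is not shown here.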
And we're working on this for the
luminal-basal comparison.
So in the breast cancer model, and just to
show you, this is the closest we get to the
dreaded hairball, but you can see
that, you know, blue is more expressed
or more active in luminal.
And you can see the expected sort of ER signaling
pathways and then you have some other intriguing
pathways among the proliferative ones for
basal shown in red, like HIF1-alpha, for example.
So, the way we can use that hairball is to
do something like a master regulator analysis
like Andrea Califano
likes to do with ARACNE.
You can look upstream in this example of a
-- of a basal marker such as FOXM1, like I
showed and sort of by chain of reasoning,
up the regulation hierarchy you see that there's
a polo kinase.
And so the prediction there is that basal
cells will be more sensitive to a polo kinase
inhibitor.
And this actually pans out in a cell line
model shown in Joe Gray's
lab with his cell lines.
So this plot here shows you sensitivity to
a polo kinase inhibitor for basal and
claudin-lows contrasted against
those in luminal cells.
And the reverse is true as well.
You can look up a marker for luminal, like
a luminal hub and in this case it was an HDAC.
And so the prediction is that luminal cells
would be more sensitive to an HDAC inhibitor, and
that's what turns out to happen in these cell
line models.
And you saw a nice example yesterday from
Sam Ng.
Just to go through that real quick because
I wanted to show you one more result that
Sam didn't have time to show.
So he's developed a clever method where you
can run our pathway analysis twice.
In one, you connect the gene
to its downstream targets and infer an activity
for it; in another, you connect it to its
upstream regulators and infer an activity.
And just look at the difference to get what
he calls the discrepancy in the activities
that are inferred.
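
Editor's sketch: the discrepancy idea reduces to a per-sample difference between two inferred activities, compared across mutated and non-mutated cases. The sign convention (targets minus regulators, so negative suggests loss of function) is an assumption chosen to match the examples in the talk. The sample values are invented, and the real method's scoring and significance testing are not shown.

```python
from statistics import mean

# Hypothetical per-sample results: (has_mutation, activity inferred from
# downstream targets, activity inferred from upstream regulators).
SAMPLES = [
    (True, -0.9, 0.4),
    (True, -0.6, 0.5),
    (True, -0.8, 0.2),
    (False, 0.3, 0.4),
    (False, 0.5, 0.3),
    (False, 0.1, 0.2),
]


def discrepancy(from_targets, from_regulators):
    """Assumed convention: negative values suggest loss of function."""
    return from_targets - from_regulators


def mean_discrepancy(samples, mutated):
    """Average discrepancy over samples with the given mutation status."""
    return mean(
        discrepancy(targets, regulators)
        for is_mutated, targets, regulators in samples
        if is_mutated == mutated
    )


mutated_shift = mean_discrepancy(SAMPLES, mutated=True)
wildtype_shift = mean_discrepancy(SAMPLES, mutated=False)
print(f"mutated: {mutated_shift:+.2f}, wild type: {wildtype_shift:+.2f}")
```

In this toy, the upstream regulators say the gene should be on in the mutated samples, but its targets behave as if it is off, so the mutated group has a clearly lower discrepancy than the wild-type group.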
And he showed you an example, sort of a positive
control for Rb.
You can see that in the mutated cases, he's seeing
a lower discrepancy, which corresponds to a
loss of function event.
And he showed you the pathway surrounding
these things.
So we've tried this for a few positive controls
and he showed you p53.
And you can kind of squint and see that for
the cases in red around the circle plot, the
tick marks are patients, sorry, I didn't mention
that, you can see a lower activity being inferred.
And so I asked Sam late last night actually,
"Can you please run this for the lung squamous
results?"
And as you saw before he was predicting for
NFE2L2 this known oncogenic gene that he's
getting a positive discrepancy.
And there are 30 mutations in CDKN2A and consistent
with, you know, other deletions, homozygous
deletions in CDKN2A, he's predicting loss
of function.
So that's interesting.
But now the power is, and these are sort of
for the more frequent events, but you can
now start actually drilling into some of these
lower-frequency events, and there are
some intriguing stories I think in there.
And I wanted to just point out that some
of the highest-scoring discrepant genes now
are not the most frequent, right?
So you actually have a HIF,
a hypoxia-inducible factor, up here in seven
samples.
Why would that be?
And among these up here are going to be possible
new targets that you could go after for your
druggable genome, for example.
So we even have a MAP kinase kinase up there
that might be worthwhile.
And on the other end of the spectrum, there
are some other loss of function events that
we might want to pay attention to.
So you might ask, "What do you do if you don't
have good pathway models for genes?
How can you infer activity?
Or do these mutations mean anything?"
You can plot them against clinical information.
And so this is just sort of an overview of
-- you can show some phenotypic information
against these pathway activities and infer
a connection between mutations or phenotypes.
And just really quickly, since I'm almost
out of time, we've done this for -- piloted
this in the colorectal study and you can cluster
the mutations based on these signatures and
you can see that APC and p53
tend to have the same correlations in the
colorectal study, for example.
And it confirms that APC mutations are correlated
with MYC activity, in this case anti-correlated
with the repressed targets of MYC.
And on the other end of the spectrum you have
TGF-beta pathway mutations, so those cluster
together.
And in the middle you have RTK and PI3 kinase
pathway mutation.
So, the obvious idea here is if you have a
mutation in gene X and it has a -- and it
looks like it's associated with the same activities
in possibly different patients,
perhaps it's also acting in the same pathway
based on this type of association analysis
that Ted's doing.
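
Editor's sketch: the association idea can be illustrated by giving each mutation a "signature" (its correlation with each pathway activity across patients) and then comparing signatures between mutations. The talk does not specify the statistics, so Pearson correlation and cosine similarity are assumptions. The data is invented, and SMAD4 is used only as an illustrative TGF-beta pathway gene.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def cosine(u, v):
    """Cosine similarity between two signature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / sqrt(sum(a * a for a in u) * sum(b * b for b in v))


# Hypothetical mutation calls (1 = mutated) for eight patients.
MUTATIONS = {
    "APC": [1, 1, 1, 1, 0, 0, 0, 0],
    "p53": [1, 1, 1, 0, 1, 0, 0, 0],
    "SMAD4": [0, 0, 0, 0, 0, 1, 1, 1],
}
# Hypothetical inferred pathway activities for the same eight patients.
ACTIVITIES = {
    "MYC_activity": [0.9, 0.8, 1.0, 0.7, 0.6, 0.1, 0.2, 0.0],
    "MYC_repressed_targets": [-0.7, -0.9, -0.8, -0.6, -0.5, 0.2, 0.1, 0.3],
    "TGFB_signaling": [0.5, 0.6, 0.4, 0.5, 0.6, -0.8, -0.9, -0.7],
}

# One signature per mutation: its correlation with each activity feature.
signatures = {
    gene: [pearson(calls, values) for values in ACTIVITIES.values()]
    for gene, calls in MUTATIONS.items()
}
similar = cosine(signatures["APC"], signatures["p53"])
opposite = cosine(signatures["APC"], signatures["SMAD4"])
print(f"APC vs p53: {similar:+.2f}, APC vs SMAD4: {opposite:+.2f}")
```

Mutations whose signatures point the same way would land in the same cluster, and mutations with opposite signatures end up at the other end of the spectrum. A real analysis would use many more features and a proper clustering step.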
And so I'm basically out of time.
I'm going to skip to the end.
Obviously we want to use these to look across
multiple cancers.
The pathway activities give us a way to do
that and we're working on pan-cancer analysis,
a basal comparison to ovarian for the breast
work and so on.
So I hope I showed you that we have a nice
model for integrating a lot of different data
sets.
We use knowledge about pathways.
We're trying to expand that with predicted
interactions now.
We can stratify patients with that, find predictive
sub-networks and so on and use it to predict
hopefully more of these rarer mutations.
And the beliefs allow -- the inferences allow
us to connect cancers across different data
sets.
And hopefully, the last slide that I just
skipped there, it was just trying to make
a point that we can connect subtypes together,
maybe get a clue about therapies.
So, I wanted to just say a special thank you
to the Broad team here.
They've got PARADIGM working in Firehose
and this is not a trivial
feat.
And a lot of these big network methods, by
the way take a lot of CPU time to run so this
is really nice that it's going to put the
results in the hands of the public actually.
And so you don't have -- you don't have to
go off and implement these yourself.
And this is my group that worked on the integration
analysis.
I've highlighted the work of the folks circled
there, especially Sam Ng who you saw speak
earlier.
And this is work in collaboration with David
Haussler, who actually heads the whole team,
and Chris Benz, and Jane Ju [spelled phonetically],
who ran a tutorial yesterday and runs the
engineering staff.
So thank you and I'll take any questions.
Sorry I went a couple minutes over.
[applause]
Male Speaker:
Time for one quick question for Josh.
Josh Stuart:
Crystal clear.
Male Speaker:
No, okay, well I'm sure he'd be happy to take
it up over coffee if something emerges.
So thank you, Josh.
