[MUSIC]
>> Yeah. Thanks for
letting me be here today.
I'm excited to share
some of our work on
explainable AI specifically
applied to science and medicine.
I'm defending at U-Dub
here this quarter.
so I'm just wrapping up. I work with Su-In Lee,
if you happen to know her, over in
her lab there in
U-Dub computer science.
So maybe you guys,
this group, are already very
sympathetic to this question,
but I'd like to start out and ask,
why do we care about
explainability in ML?
Because a lot of our work
has focused on this,
and to take a look at that,
we can start with just a simple
example of a guy named John.
So he is a typical bank customer,
and like most customers today,
whenever you interact with
a company oftentimes data
about you ends up getting
sent into some kind of model.
In this case, if you're in
a financial institution,
often these models are trying
to predict things that
you care about in terms of outcomes.
So in this case, predicting whether
John has any repayment problems.
So here, a chance of
55 percent leads to
the bank denying his loan.
So in this case, there's
a standard thing that happens
thousands of times every
day around the world,
and it leads to questions from John,
obviously, like, why
did you deny my loan?
If you have a good manager,
you should also ask
the same question, right?
Why are we making these really
business-critical decisions?
But unfortunately, sometimes
for the data scientist who came
in and built this model, it was
all about accuracy,
and so they don't have good answers
sometimes when they're trying to
explain what's going on here.
So why does this happen?
Well, it happens often because many
times you run into a trade-off
between a complex model,
where because you have a very,
very large potentially
complicated dataset,
you can take advantage of the
flexibility of these complex models,
and get a lot of accuracy.
But that can lead to a lack
of interpretability.
In contrast, some simple models
can be interpretable
in the right contexts,
but if they are too restrictive,
they'll lead to a lack of accuracy.
Okay? This is what often leads
people to run into this in
practice, particularly in finance.
So if you have interpretability
or accuracy and you can only choose
one, that's a painful trade-off.
For the bank, it's
particularly painful because
accuracy directly corresponds
to money for them,
because accuracy will correspond
to default rates for the loans.
But interpretability
is really important.
These are wonderful icons that
are different on Mac and Windows,
but this is meant to
be a happy person.
So if you're interpretable,
you have a happy customer,
happy manager.
There are also really
important legal concerns,
such as GDPR and other things.
So there are strong drivers
on both sides of this trade-off
for many companies.
Now, one thing you can
do is try and make simple
models better, right?
So that you can improve
their ability to retain
the interpretability and
also move towards accuracy.
That's a great approach.
We're actually going
to focus on the other one
though, which is basically,
taking what's already
considered very complex
models and trying to extract
interpretability from them.
Okay? So if you do that though and
you just look at
a complex model by itself,
it could be very
complicated in the mapping,
and trying to explain
an entire mapping space could
just be hopelessly complex.
So instead what we're
going to do is focus on
explaining individual
predictions one at a time,
because those involve, perhaps,
just a small piece of
the overall complexity of the model.
So I don't have to describe
how a model behaves in
all circumstances in order to tell
John what was going on with his loan,
I just need to tell him what
was going on that affected him.
So if we do that,
one way to go about it is to say,
"Let's start with a simple model
and see what this means,
be really specific and concrete."
So here's a a linear model.
Why do we consider
these things interpretable?
Often, it's because they have
this giant summation sitting in
the middle of the model, right?
So we have a bunch
of terms coming in.
These are things about
John, let's say,
these input attributes, and
a bunch of terms come
together and they just get
summed up and sent to
you as some output.
Okay? So we think of these as
interpretable because we can look at
that and see a bunch of things
coming together additively,
and what's coming into
the summation can be viewed as
a credit that's attributed
to each input feature.
In contrast, if we look
at a complex model,
say a neural net or a random forest,
maybe gradient boosted
decision trees, things like this.
Often, there's so much going
on in there that there's not
a nice term that we can
just surface and say,
"This is how the model is behaving."
So what we're going to talk about is,
how we can explain
an individual prediction in a way
very similar to how
a linear model works,
where again we're going to have
a summation sitting inside there.
But now, instead of having
just terms that are straight from
the definition of the model,
we're going to have to come
up with our own definition
of a credit attributed
to each feature.
Okay? So this Phi function
is going to be indexed
by the feature and it depends both on
the model and the current input.
Okay? So we're essentially replacing
the input to a summation
that you would get in
a linear model with something that
represents the importance of
that feature in
the complicated model.
Okay? So this is
a high-level motivation.
So we can look at this,
and this is actually
really just a form of
an explanation, if
you think about this.
So that's really what we're
doing is just saying,
"Why did this model
make this prediction?"
We say, "Well, whenever
you give an explanation
it has to have some form,
and this is the form
we're saying it has."
Perhaps not surprisingly,
this form has
been used by a lot
of previous methods.
Okay? This is something
we noted before.
So Lime, from Marco here
uses this approach,
and there's a variety
of methods here from
Shapley values based on
ideas from game theory
that I'll get back to,
that are really interesting, also,
produce explanations of this form.
These are all model agnostic,
but there's also ones
that are specific to
particular popular model types that
are typically hard to explain.
So Saabas is one targeted at trees,
also came out of Microsoft here,
and then there's a variety that
target deep learning methods.
All of them come out with
explanations that looked like this,
they have a sum of feature
attribution values
that they sum up and
represent the output.
Okay? So this is
interesting, and people hadn't
previously appreciated this
unity among these methods,
and so we named this class,
the class of additive feature
attribution methods.
Which is nice because it gives you
some connection and
understanding about how
this part of the literature
relates to one another.
But you can learn a lot
more than that as well.
Because it turns out there's
various strengths and
weaknesses in these methods.
One really nice thing about
the approaches based in
game theory is that they
come as the unique
solution to a set of
properties that can be
specified in terms of fairness.
When you're explaining something,
fairness as defined
by these properties,
turns out to be
really, really useful.
So we'll get back to that.
I should mention, the game theory
methods are nice because of that,
but they can be a bit slow,
because of the way
they have to be computed.
In contrast, these other
methods are typically
based on what could be viewed as
a bit more heuristic approaches,
but they tend to be much faster.
Okay? So on one side, you have
some better theoretical grounding
but slower computation.
On the other side, you have faster
estimation but less guarantees.
So what we've attempted to do is,
combine these strengths together
to propose a method
that we call SHAP,
and this is something that we
presented at NIPS in 2017.
So in order to find this,
we need to say, "How do we define
these feature attributions?"
Okay? So I talked about,
we're going to make an explanation
by summing together a set
of values that represent the
importance of each input.
How can we define those things?
How do we define the credit?
Let's go back to John. So we're going
to explain to John why
his loan was denied,
and we're going to do it using
this type of explanation.
Of course, there are
many other types you could use.
It's good to start at a base rate.
Okay? So in this case, we have
a base rate of 20 percent,
and this is just, how
often do people get
their loans denied on average?
Okay? So that's a base rate
of loan rejection,
or at least, trouble in repayment.
Now, the prediction for
John was 55 percent, right?
So if we're going to explain
what's so special about John,
what we need to do is
explain how we got from
the base rate to our current
prediction, all right?
Because if we predicted 20 percent,
there might not have been
anything special about
John, just because we
always predict 20 percent.
Okay. But when John says,
"Why am I 55 percent?"
What we really have to explain is
this 35 percent difference here.
Okay? So how can we do this?
Well, if we assume our
model's fairly accurate,
then we should just
take the expected value
of the output of our model,
that's going to be the base rate.
Okay?
So we can just say what's
the expected value of
the model, let's say,
over our training dataset?
That's the base rate.
Then what we can do is, since
this is an expectation,
we can just introduce a term into
that conditional expectation.
That term here will say, "Let's see.
We condition on the fact
that he is 20 years old."
So on the condition on
the fact that John's 20,
his risk jumps up by 15 percent.
Well, we can attribute
that 15 percent jump,
if we'd like, to John's age.
Now, we can condition on the
fact that he's a day trader,
and that's a very risky profession,
and that jumps up to 70 percent.
Then we condition on the fact
he only has one open account,
like a financial disaster waiting to happen,
but he made a ton of money in
the stock market last year.
So his capital gains push
him down to 55 percent.
So what we've done
is, we've basically
divided up how we got
from here to here
by conditioning one at a time on
all the features until we've
conditioned on all of them,
which means we of course end
up at the output of the model.
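This walk from the base rate to John's prediction can be sketched in code. Everything here is a hypothetical stand-in: a made-up model, made-up background data, and an independence assumption so that conditioning reduces to fixing the known features and averaging the model over the background set. The toy model is rigged so John's prediction comes out at 0.55, matching the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the bank's model: any black box f(X) -> risk.
def model(X):
    age, trader, accounts, gains = X[:, 0], X[:, 1], X[:, 2], X[:, 3]
    return (0.2 + 0.15 * (age < 25) + 0.2 * trader
            - 0.1 * gains + 0.1 * (accounts < 2) * trader)

# Background data standing in for the training set.
background = np.column_stack([
    rng.integers(18, 70, 1000),   # age
    rng.integers(0, 2, 1000),     # is a day trader
    rng.integers(1, 6, 1000),     # open accounts
    rng.integers(0, 2, 1000),     # high capital gains last year
]).astype(float)

john = np.array([20.0, 1.0, 1.0, 1.0])

def cond_exp(known, x=john, f=model, bg=background):
    """E[f(X) | X_known = x_known], assuming independence: fix the
    known features and average the model over the background set."""
    X = bg.copy()
    X[:, list(known)] = x[list(known)]
    return f(X).mean()

# Introduce features one at a time, crediting each jump in the
# conditional expectation to the feature just introduced.
base_rate = cond_exp([])
credits, prev, known = {}, cond_exp([]), []
for name, i in [("age", 0), ("day trader", 1), ("accounts", 2), ("gains", 3)]:
    known.append(i)
    cur = cond_exp(known)
    credits[name] = cur - prev
    prev = cur

# The credits telescope from the base rate up to John's own prediction.
assert abs(base_rate + sum(credits.values()) - model(john[None])[0]) < 1e-9
```

Note that the individual credits depend on the order the features are introduced; that order dependence is exactly the issue discussed next.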
So any questions at this point?
>> Yeah. This is amazing.
Is it all independent?
>> Good question. So let's
assume they're independent
for the moment,
and let's talk about
whether the model
itself is linear or nonlinear,
because both of those are going
to impact how this works.
Because this is not the final
way we want to do this,
because the order really matters
as we introduce these things.
Okay? It matters whether there's
independence or dependence between
the input features,
and whether the model itself is nonlinear.
So let's assume that the inputs
are fairly independent
for the moment,
and just think about the model:
maybe it's particularly
bad to be a young day trader.
Okay? If that's the case,
there could be some sort of
interaction effect between
day trader and age.
What would have
happened here is that,
when we saw age we don't know
you're a day trader yet,
but when we see day trader, we
already know your age.
So we get the extra boost
from that interaction effect,
maybe it's particularly bad
to be young day trader.
If we were to reverse the order,
age will get the interaction
effect and day trader won't.
Okay? So there are potentially n
factorial different ways
of allocating credit if
you were just to start throwing
things in one at a time
into the ordering.
>> If you are using the training
data to compute these marginals-
>> Yeah.
>> Or conditionals.
The more you condition
on past features,
I'm going to be slicing and dicing
my data in a way that
I'm going to have
few [inaudible] estimating
the conditional next.
>> Yeah.
>> How does that actually
factor into the approach?
>> Yeah. Good question. There's
two ways to go about it.
Because what you're getting
at is the challenge
of being able to fully
estimate the whole
joint distribution,
because in order to compute
this exactly you would have to
know the full joint distribution
of the input features,
and then you could compute
all these conditional expectations.
In practice, I don't
think it's wise to
assume you can do that,
at least accurately.
So in practice, what happens is,
you often assume independence
between different input features
in order to calculate
the conditional expectation.
>> But then I'm imagining,
given that he's 20 years old, what is
the probability that he's
a day trader, right?
>> Well, so let's say, for
example, we go back here.
So we've introduced age.
Now, what makes this tricky is if
age is conditionally dependent
with other features,
that would make it hard
to impute all the other features.
But if I assume independence
between x1 and all the other xs,
I can simply run an expectation
over just those terms.
Now, if I assume independence
between x1 and x2,
I can do the exact same thing
again. Does that make sense?
>> Yeah, I'm just
trying to understand
the assumption because I think
I'm hearing two different things.
I think I'm hearing that I can assume
complete independence and it will
compute this thing like that,
or I can go it sequentially,
where the ordering
will really matter.
I'm going to be using
my condition at the start,
but then I have actually a data
scarcity problem as I go forward.
For a given problem,
how do you decide which way to go?
Are you measuring
conditional independence between
the variables somehow or just
picking and assuming one
and going with that?
>> Yeah. So in this case here,
we'll compute this expectation
by simply taking the mean
over the dataset.
So just evaluate your model on
the entire training dataset,
take the mean, you're done.
In this step here,
what we can do is we can say,
John is 20 years old.
So one way to do this is like
a partial dependence approach.
You simply plug 20 in, and then
for all the other values you plug in
everything that's in your training
dataset, and you'd get that.
Now, that makes
an independence assumption
between age and the other features,
and so that's an assumption
that we are making.
Then you can repeat that here,
where now you're fixing age to be 20,
employment to be day trader,
and then you sample from
the rest of the features.
Again, independent from these two.
>> There's no data scarcity problem
like this because you're
taking the expectation of f
of x, rather than of x.
That is how the model is
going to be behaving.
So it can [inaudible] generate as
much data as you want [inaudible].
>> But then, do I need
a generative model?
>> Yes, you do need the model.
You don't need a generative
model to impute data,
but you do need the ability to
re-evaluate the model you're
explaining, because that's
going to give you your new labels.
>> Okay.
>> Yeah.
>> So it seems to me like in
terms of human interpretability,
making it depend on the ordering
in some sense spoils some of
the charm of an additive model,
which is that I can think about
all the different summed
contributions
independently, in an
informal sense, right?
>> Right, yes.
>> Particularly I'm thinking
about the actionability
for poor John here.
So John gets denied and
wants to know what can I
do and maybe the last thing
on the list was life like
capital gains or something,
which is something I maybe
could do something about.
Well, I can't change my age,
changing my occupation is
a pretty big move for just getting
a loan but somethings [inaudible]
>> Maybe open accounts.
>> Change your- right, I could
open an account but now,
it's not just what's going to
happen if I open an account,
it's going to be what's happening
if I open an account given that I'm
20 and that's much
harder to think about.
>> Yeah, exactly. So hold on to that
for just a sec if you don't mind.
>> Just one more question.
>> Go ahead.
>> It seems like you're
assuming that in this case x1,
x2, x3 are conditionally
independent of everything else.
>> The other x's, yes.
>> The other x's, but you're
also assuming that x1, x2,
and x3 are somehow dependent
because the ordering matters.
>> Well, so the dependence is
driven by the nonlinearities
in the model,
not by the input distribution.
In the sense that, let's
imagine I have an AND function:
the order will matter even
if my input features are
independent, because whenever I
see the second input
to my AND function,
that's going to change my model's output.
So the only way that the ordering
will not matter is if
all my input features are
independent and my model is linear.
So non-linear models are what
we care about explaining,
and so that's why the order matters.
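As a tiny concrete sketch of that order dependence, take a hypothetical AND-style model with two independent binary inputs: whichever feature is introduced second captures the interaction effect.

```python
import numpy as np

# Hypothetical AND-style model: output is 1 only when both inputs are 1.
f = lambda X: X[:, 0] * X[:, 1]

# Background where each input is 0 or 1 equally often (independent inputs).
bg = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
x = np.array([1.0, 1.0])  # the instance we're explaining

def cond_exp(known):
    X = bg.copy()
    X[:, list(known)] = x[list(known)]
    return f(X).mean()

# Introduce x1 first: x1 gets 0.25 of credit, x2 gets the interaction (0.50).
x1_first = (cond_exp([0]) - cond_exp([]), cond_exp([0, 1]) - cond_exp([0]))
# Introduce x2 first: the credits swap.
x2_first = (cond_exp([0, 1]) - cond_exp([1]), cond_exp([1]) - cond_exp([]))
```

So even with fully independent inputs, the two orderings hand out different credits, purely because of the model's nonlinearity.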
Which gets back to what
you're saying about,
"I don't want to have
a reason about orderings
when I'm looking at an explanation."
There are many ways to
do the ordering, and there's not
necessarily even a good way to pick
the right ordering.
So I think this is a good place
to actually step back and ask,
have other people asked
this question before?
Have other people asked, "Is
there a good way to allocate
responsibility among a set of
inputs to a function for
the output of that function?"
It turns out that,
perhaps not surprisingly, there have
been people who thought about that.
If you look back in the 1950s,
there's a guy named Lloyd Shapley,
who worked on this in the context
of cooperative game theory.
There the idea was to say,
we have a set of
players that are coming
together and they're going
to play a cooperative game.
The output of this game is going
to be let's say some money,
and I need to divide that money
among the players in a fair way,
but these people they
interact differently.
So some people clearly
deserve more than others,
how can I fairly divide that money?
It turns out you can extend that,
it doesn't have to just be
positive values like money,
it can just be any output
of the function.
When you do that, you
can put it in a variety of ways, but
the way Lloyd approached
it was he said,
"Let me write down
some properties about
fairness that I think should hold.
It turns out that if I write down
only a few properties about fairness,
there's only one way
to divide up the money
such that these fairness properties
are not violated."
That, I think, is really compelling.
He did a lot of good work on this
and a bunch of other things.
He got a Nobel Prize in 2012.
So this is like an economics thing,
but these are very
very well known values
called the Shapley values now,
that are the unique solution
to these fairness properties.
That's what gets at how we get
around this whole
dependence on ordering.
So what are these properties,
first of all, that he wrote down?
These are actually
later versions of the properties, updated
in the '80s, because they're a bit
easier to connect with
machine learning.
One of them is called local
accuracy, or additivity
in game theory.
Essentially, it's
pretty straightforward.
It just says, "You're giving
me a set of feature attributions;
I want them to sum up from
the base value to
the output of the model."
So if you're giving
me some summation,
I would like the sum of
these local feature attributions,
the sum of these Phi values,
to equal this difference.
The thing we set out to do at
the beginning was to
explain that difference.
So it's a natural property
we'd like to hold.
I mean, you can certainly
violate it, but
let's assume we would like this
to hold for our explanation.
Then, the second property
here is called
consistency or monotonicity
in game theory.
For this one,
essentially what it says
is that if you were to have a model,
and then you're going to
reach in and make that
model depend more on a feature,
then your attribution for that
feature shouldn't go the other way.
It shouldn't actually decrease.
This is really important because
if you violate consistency,
it means you can't trust any
of the feature orderings
in your attributions even
within the same model.
So really this is a core like,
don't be wrong kind of axiom.
Then there are a couple of others
that are basically trivial;
they're things like,
"I'm allowed to swap input features,
zero means zero," and
things like that.
So these are the two
fundamental properties that
really define these values.
What are these values?
Well, they're easy
to say: it's very easy to
specify that the Shapley
values result from averaging
exactly what I just
told you to do, except
over all n factorial
possible orderings.
So for every ordering, I'm going
to get a Phi_1
whenever I introduce
the first feature,
in whatever position, and
then I just average those.
So it's very simple to say but of
course very painful to compute.
You wouldn't actually want
to enumerate all n factorial
possible orderings in
order to get this result.
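Under the same independence assumption as before, that brute-force definition can be sketched directly; it's only feasible for very small n, which is exactly the point being made.

```python
import itertools
import numpy as np

def shapley_values(f, x, background):
    """Exact Shapley values by brute force: average each feature's
    credit over all n! orderings (feature independence assumed for
    the conditioning step)."""
    n = x.shape[0]

    def cond_exp(known):
        X = background.copy()
        X[:, list(known)] = x[list(known)]
        return f(X).mean()

    phi = np.zeros(n)
    perms = list(itertools.permutations(range(n)))
    for order in perms:
        known, prev = [], cond_exp([])
        for i in order:
            known.append(i)
            cur = cond_exp(known)
            phi[i] += cur - prev
            prev = cur
    return phi / len(perms)

# For a hypothetical AND-style model f(x1, x2) = x1 * x2 with a uniform
# binary background, both symmetric features end up with equal credit:
# phi = [0.375, 0.375], summing to f(x) minus the base rate of 0.25.
f = lambda X: X[:, 0] * X[:, 1]
bg = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
phi = shapley_values(f, np.array([1.0, 1.0]), bg)
```

Averaging over both orderings splits the interaction effect evenly, which is what the symmetry axiom demands.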
So it's natural to ask the question,
maybe there's a way to do
this faster, quicker, easier.
How do we do this? Well,
it turns out it's NP-hard.
So there are a couple of options
we have for NP-hard problems.
We were really motivated
by the theory, and we
really wanted it to hold, but
how do we approach
solving this for ML?
Well, one, of course, is to prove
that P equals NP that's one option.
The other is to find
an approximate solution.
I really thought this
would have helped our NIPS paper's
impact a lot, but no.
So we focused on
approximate solutions.
In reality, an approximate solution
is pretty trivial if you
just look at the problem.
It's an n-factorial thing where
I'm taking a mean,
so just draw random samples from
those permutations.
That's exactly what
previous methods have done.
Basically, they say we have
a certain number of permutations,
let's just draw maybe 100 or 1,000
permutations for each feature
and then we'll take an average.
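That permutation-sampling baseline can be sketched as follows (again under an independence assumption; the estimate converges to the exact Shapley values as more orderings are drawn, but with the sampling variance discussed below).

```python
import numpy as np

def sampled_shapley(f, x, background, n_perms=1000, seed=0):
    """Monte Carlo Shapley estimate: average per-feature credits over
    randomly drawn orderings instead of all n! of them."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]

    def cond_exp(known):
        X = background.copy()
        X[:, list(known)] = x[list(known)]
        return f(X).mean()

    phi = np.zeros(n)
    for _ in range(n_perms):
        known, prev = [], cond_exp([])
        for i in rng.permutation(n):
            known.append(int(i))
            cur = cond_exp(known)
            phi[int(i)] += cur - prev
            prev = cur
    return phi / n_perms
```

Each sampled ordering still telescopes exactly from the base rate to f(x), so the estimate always satisfies local accuracy; only the split of credit among features carries sampling variance.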
But what we're going to do
here is say, we can do better.
How can we do better? By looking at
the unification that
we just talked about,
and trying to draw strengths
from some other methods.
In particular, there are connections
with all of these methods,
but I'm going to focus on
the connection with Lime,
and talk about how we use some
of the advantages in Lime
to improve our ability to
estimate these classic values
from game theory.
So maybe you guys are
familiar with this,
but if you don't remember what
Marco wrote in his paper,
he's got a minimization function.
It looks like this,
where essentially you're
trying to fit a linear model
locally to your original model
that you're trying to explain.
The details aren't important here,
but what is important
is that you have
parameters you need to choose.
You have to pick a loss function
that pushes these things
close to each other.
You have to pick some regularizer,
and most importantly you have
to pick a local kernel
that defines what local means.
Perhaps naturally, you
just choose these things
heuristically when
you're trying to build
this type of explanation method.
But if we go back and
we say, now we know
that when dealing with
these toggle-type inputs,
we're going to be inside
this class, and that means that
all these parameters are forced.
There's only one answer
that matches these axioms
so that means there
must be only one set
of parameters that we
can choose in order
to maintain our local accuracy
and our consistency.
So that's a
non-constructive statement.
But, well, here is another difference
between Mac and Windows:
there is a formula that we
derived, and we're simultaneously checking
consistency between Mac and
Windows PowerPoint here.
So this formula here is
essentially a specific local
weighting kernel that we derived,
that allowed us to estimate the
Shapley values in a new way.
So these are values
that have been around
for the better part of a century,
but now we can estimate them
using linear regression,
which is fundamentally different than
just random sampling
from a permutation.
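If I recall the published form correctly, the weighting kernel gives a coalition with s of M features present a weight of (M − 1) / (C(M, s) · s · (M − s)); a small sketch of those weights (fitting a weighted linear regression of the model output on coalition indicators with these weights then recovers Shapley value estimates):

```python
from math import comb

def shapley_kernel_weight(M, s):
    """Weight for a coalition with s of M features present in the
    weighted-regression formulation. The empty and full coalitions get
    infinite weight, i.e. they are enforced as hard constraints."""
    if s == 0 or s == M:
        return float("inf")
    return (M - 1) / (comb(M, s) * s * (M - s))

# The kernel is symmetric in coalition size and puts the most weight on
# the smallest and largest (most informative) coalitions.
weights = [shapley_kernel_weight(4, s) for s in range(5)]
```

This is what makes the regression estimate land on the Shapley values rather than on whatever a heuristic locality kernel would give.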
So that's cool because these values
have come with a lot of
important properties and
people care about them a lot
not just in ML but elsewhere.
So what's nice about this is that,
it's not just a new way, but
it's a way that can be helpful.
So here, you can see that
permutation sampling,
so just drawing from a permutation
has fairly high variance.
So here I'm just drawing
more and more and more permutations,
which involves
evaluating the original model
more and more and more.
So it's computationally
painful to do this,
and this shows the standard
deviation of my estimate over time.
I have right here the right answer
for two different models.
So this is an individual feature
we're estimating a value for.
So permutation sampling
has a high variance.
If you look at Lime,
it has a much lower variance due
to its regression formulation,
but of course, it's not
converging to the values that
we would like based on these axioms.
So what we're able to do with
SHAP is actually have
our cake and eat it too.
So we get to keep the low variance of
a regression-based formulation but
also converge to the values we want,
so we get to keep
the axiomatic agreement.
>> How many features are
there in the [inaudible]?
>> Yeah, good question. So each of
these is just looking at
the convergence of a single feature.
This is looking at
the convergence in a dense model.
This is looking at the convergence
in a sparse model.
Because in a sparse model,
you're allowed to use regularization
to improve the
convergence, like lasso,
for example, for regression,
and lots of other
regression tools that allow you
to have better sample efficiency.
So this is demonstrating that.
This is just a dense model
where we don't get to use
any regularization but we
still get a lower variance.
So in this case,
I think there were about 20 features,
and in this one
there were maybe 200.
So this is fun.
Maybe we'll stop here
and do a quick review of
where we've been.
So we're going to talk about
contributions in all these areas,
but we just talked about in theory
a unification of explanation methods.
So we just talked
about a wide number of
explanation methods that were
in the literature and how
they all connect together
because they have
the same type of explanation
that they produce as an output.
Then we talked about some strong
uniqueness results that now apply
to the whole class and allow
us to go into where we
used to make heuristic choices
and instead have
theoretical basis for choosing
parameter such that we obey
these classic Shapley axioms.
Then in practice, this led to
a new estimation method for
these classic Shapley values.
It has lower variance and hence
requires less computation in order to
explain individual
predictions from the models.
At this point, I want us to move
over to an application
of these techniques.
This is actually the application
that motivated all of this
work in the first place.
So I've told the story in
reverse, if you will.
The application is anesthesia safety,
so several doctors
came to my advisor and
myself and they wanted to
improve the safety of anesthesia
through machine learning.
I don't know if you guys
have ever stopped
to think about it but anesthesia is
this wonderful thing that's
only been around for maybe 150,
200 years or so in
a broadly available sense.
We've made a lot of progress, but
there are still
more safety improvements to be made.
It's still not
a perfectly risk-free event
to keep you asleep but not dead.
So why would ML help here?
Well, the operating room,
it turns out, is
a very data rich environment.
It could be one of the places
you get the most data recorded
about you, in a health sense, in
your entire life if you go
into an operating room.
There's tons of high-frequency
measurements from lots and
lots of sensors hooked up
to you over a time series.
Of course there's a lot of background
medical data that comes in as well.
So the anesthesiologists
who came to us said,
"If you could predict
adverse events just before
they happen that would
allow us to be proactive."
Being proactive would help them
better manage the patient.
So the particular adverse event
they cared about here was
hypoxemia, which
is low blood oxygen.
So essentially, this used to
be the number one cause of
anesthesia-related death
about 50 years ago.
But then pulse oximetry
got invented.
Now, you can actually see where
the status of the patient is
in terms of their hypoxemia or
their blood oxygen content.
That just totally revolutionized
the safety of anesthesia.
But it's still retrospective,
it's still looking backwards.
So if we could just look maybe
five minutes in the future and guess,
"Will you have hypoxemia
in five minutes?"
They can make less severe
interventions
during the care of
patients in the hospital.
So we built a system called
prescience that predicts hypoxemia.
Again, five minutes is the number
that the doctors gave us;
that's about the time
range they'd like to know ahead.
Then how does it work?
Well, so it predicts hypoxemia
by taking a variety of inputs.
So we have text data,
numeric data like age,
categorical data, and lots of
the other different types of
data input you would think
of, except for images.
Then we also have a bunch of
time series data coming in.
It's minute-by-minute data
coming in from U-Dub and
Harborview here over
the course of several years.
So we have about
eight million minutes
of training data that we get
to feed into this model.
So a large number of minutes.
Then out of the model,
we're predicting
at every minute the
risk of the patient.
So at this time, we're
predicting odds ratios
for the patient over time.
This is forward in time.
So this is five minutes,
negative five minutes, negative
10 minutes in the past.
You can see their risk
is going up and down
over time where the risk is,
"Will they fall into this
desaturation region?"
So this is the saturation of O2.
It's pretty high and
then it dropped here.
So this is an actual
desaturation event.
Then we're here, and the question
is, are you going to have
another desaturation event or not?
This is from an actual patient
and the risk right now is 2.4.
>> Do you know of any intervention?
So when there was the first drop,
was there an intervention?
>> Here?
>> Yes.
>> I do not know at
this point from this.
>> [inaudible]
>> The doctors are
constantly intervening.
So that is a good question.
I mean, in an OR, it's not like "will
there be an intervention," it's
"what are the current interventions,"
because you're on a ventilator;
there are like seven knobs that
are set to certain settings.
I'll talk about some of
those knobs in just a moment.
But yeah, there are lots
of possible interventions.
There's lots of ongoing intervention.
So this is definitely
risk with respect to
the current standard of care;
that's definitely what we're predicting.
>> It's varying over time.
>> It is varying over time, yeah.
So the prediction at
the current moment is
about twice the normal
likelihood and it
has recently gone up in
the last couple of minutes.
But just like John or John's boss,
the doctor has the same question.
You can see now why we got pushed
into model explainability.
They want to understand why.
Well, if you want explainability,
let's start with a linear model.
Maybe logistic regression
would be a great,
simple, get-the-job-done model.
If we do that, we could produce
an explanation in an odds ratio form.
So here's 2.4, odds ratio
at the current moment.
We could express that as a product of
terms if we're doing
logistic regression.
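The product-of-terms idea follows directly from the model form: in a logistic model the log-odds is a sum of terms, so the odds ratio relative to the base rate factors into one multiplier per feature. The coefficients and feature names below are made up purely for illustration.

```python
import numpy as np

# Hypothetical coefficients (log-odds scale) and current feature values.
coef = {"overweight": 0.45, "tidal_volume_low": 0.30, "pulse_ok": -0.12}
x = {"overweight": 1.0, "tidal_volume_low": 1.0, "pulse_ok": 1.0}

# log-odds = sum of coef_i * x_i, so exponentiating each term gives a
# multiplicative odds-ratio factor per feature.
factors = {k: float(np.exp(coef[k] * x[k])) for k in coef}
odds_ratio = float(np.prod(list(factors.values())))

# Sorting the factors surfaces the biggest multiplicative risk drivers.
ranked = sorted(factors.items(), key=lambda kv: kv[1], reverse=True)
```

Sorting these factors is what gives the ranked list of risk terms described next.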
So let's sort the terms, and we see
the biggest term is the fact
that they're overweight.
The biggest risk factor for
having breathing problems
or hypoxemia issues is being overweight.
Now, that's not something
I expect, or hope,
to be intervened on at
this moment in the hospital.
You're not going to
fix that instantly.
The next one however
is low tidal volume.
This is how much you're
breathing in and out.
So tidal volume is essentially
a knob sitting on the ventilator.
How much air are we putting
in and out of your lungs?
It turns out that this
spike here was
almost entirely driven by the
doctor's setting of that knob.
Now, there are good
reasons to set the knob at
a low setting but this quantifies
the risk that goes into
it very explicitly.
That's what doctors
found very helpful,
much more helpful than just
staring at a number and
wondering what's wrong
in the room somewhere.
There are some things, of course,
that help the patient; in this case
their pulse is looking good.
Then there are thousands of
other features pulled out here that
all together have a moderate effect.
So it turns out that at
each point there's going to be
a different set of features
but those features are going
to be a fairly small set
of features that really are
driving the risk at this moment.
But maybe we don't want to use
a linear model because it turns out,
if we plot the ROC curve,
if you're familiar with
this, higher is better.
Plot the ROC curve for the logistic
regression model we were just
talking about and you get
a reasonable performance at 0.86.
But then if we throw
a complex model in this case,
gradient boosted trees, at
this we get a significant jump
in performance.
This is a 15 percent jump in
the true positive rate if
we fix our FPR at 0.1.
So that's a very non-trivial change
in accuracy of the system.
We tried a lot of marginal
transformations to try and
make a linear Lasso work
better but we did not succeed.
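A small sketch of that comparison metric, TPR at a fixed FPR, with synthetic scores standing in for the two models (the actual numbers in the talk come from the real clinical data, not this simulation):

```python
import numpy as np

rng = np.random.default_rng(0)

def tpr_at_fpr(scores_neg, scores_pos, fpr=0.1):
    """TPR when the threshold is set so only `fpr` of negatives alarm."""
    thresh = np.quantile(scores_neg, 1.0 - fpr)  # top 10% of negatives
    return (scores_pos > thresh).mean()

# Synthetic scores: the "complex model" separates the classes better.
neg = rng.normal(0.0, 1.0, 20000)
pos_weak = rng.normal(1.0, 1.0, 20000)    # stand-in for logistic regression
pos_strong = rng.normal(1.8, 1.0, 20000)  # stand-in for gradient boosting

weak = tpr_at_fpr(neg, pos_weak)
strong = tpr_at_fpr(neg, pos_strong)
assert strong > weak   # the jump in TPR at the same FPR
```

This is the sense in which a fixed-FPR comparison turns two ROC curves into a single accuracy delta.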
So this left us with this question
of how do we best address
this trade-off?
Again, that's what pushed us
into explainability methods and
led us to this understanding of
unification, and we kept pulling that string.
So now, that we have
pulled that string,
we came back to this problem
and said, "All right.
How would we address this now that we
have this approach called SHAP?"
Well, how will we use these SHAP
values in the operating room?
Well, one way to convey
these values to a doctor for
example would be to start
with the base value of one.
So this is odds ratio,
this is normal typical risk,
and then we can show you
your current risk of 2.4.
What we can do is we can take all of
the features that are positive.
That is they are increasing
your risk and we can plot them where
the width of the bar is the impact
of that feature on your risk.
So tidal volume has this much
impact, height, and weight.
We can sort them, and you can see
that almost all of them have
very little weight, because there are
thousands of features here, and
stacked against that we can see
features lowering your risk.
So the sum of the purple plus the green,
pushing against each other,
of course equals the output 2.4.
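A minimal sketch of the accounting behind this plot, using a toy linear model where Shapley values have the closed form w_i(x_i - E[x_i]); all weights and data here are invented, not the surgical model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear risk model in log-odds space; weights and data are invented.
w = np.array([0.8, -0.5, 0.3, 0.05])
X = rng.normal(size=(1000, 4))           # background population
f = lambda x: w @ x

x = np.array([1.0, 2.0, -1.0, 0.4])      # the current patient

# For a linear model with independent features, the Shapley value of
# feature i is exactly w_i * (x_i - E[x_i]).
phi = w * (x - X.mean(axis=0))
base = f(X.mean(axis=0))                 # the "base value": typical output

# Local accuracy: base value plus all attributions equals the output,
# which is why the green and purple bars meet exactly at the prediction.
assert np.isclose(base + phi.sum(), f(x))

pushing_up = phi[phi > 0].sum()          # total width of risk-raising bars
pushing_down = phi[phi < 0].sum()        # total width of risk-lowering bars
```

The positive and negative groups here are exactly the two stacks of bars pushing against each other in the plot.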
>> I find it interesting that
this presentation doesn't
make any distinction between
time-varying measures
and these kind of like
BMI and other things that are
going to be the same throughout.
>> Yeah, that's true.
>> Those are in
the sense not actionable
so I don't know
what the right thing to do is
because I don't engage with doctors,
but I would think that a distinction
would be interesting there.
>> So there are at least
two ways that we thought about that.
One was do we want to
actually go back into
the time series and try and
figure out where in the time
series we're getting the signal.
Because in a time
varying thing, maybe
your tidal volume is actually
really a time series of things.
So you could maybe try and tell
me what part of it the signal came from.
Some of the people in my lab
are actually working on that
as an extension but we
didn't do that right away.
In terms of which ones are
intervenable and which ones are not,
this is something that I'll
talk about the user studies
that we did do on this.
But I think there's still
more questions to tease out there
because doctors almost instantly
recognize that right
away when they see it.
So I can of course label
which ones are intervenable
and which ones are not, but
they found it nice to know,
because they also have a risk model
in their head, and even the
unintervenable ones that
they carry in their head,
they want to also see here.
>> I can imagine those serving
as landmarks or something,
they might even be like thresholds
relative to your BMI risk,
you just went above your BMI risk.
>> Yeah, that's a really good point.
Yeah, exactly because the
fact that your tidal volume
is at equal risk to the fact
that you're overweight,
suddenly it's something where,
well, as a doctor I've pretty much
quantified how much I think
this weight is affecting you.
My turn of that knob is now equal
to that? That's interesting.
>> This model, you're focusing
on one particular outcome, hypoxemia.
But if I change, say, the tidal volume,
that might have effects on
other measures of
the likelihood of success.
How do you begin to bring those in?
So I love this, but it's
preconditioned on the fact that I'm
following one thing and
I care about that.
>> Yeah, that's
an excellent question.
Essentially, it's saying
how do you best measure
a whole handful of outcomes or
dozens of outcomes potentially
that you constantly care about.
I think we thought about that
and said this is like I agree,
that's totally where
you want to be when
you're in a hospital setting
because you don't want them
myopically affecting one risk and
totally messing up another one.
>> [inaudible]
>> Yeah. So I
could hypothesize how best
to display that to doctors,
but I haven't actually tested that.
So I'm not sure it
would be helpful for me
to go at it except for the fact
that of course you can run many of
these models or train
one multitask model if you prefer.
Then you're going to have lots and
lots of these things showing up.
I think, at some point,
the real important point is
people are not going to be
staring at this all day probably.
At some point, you're going to raise
some warning or bring
their attention to something.
I think that should
probably be one of
a whole panel of things
that you're interested in.
At some point, one of those things in
the panel is going to light up
and you're going to
go and look at it.
At that point, we were trying to
say we want to explain at least what
the model was concerned about at
that moment, rather than
just a warning number, yeah.
>> I could have framed the problem.
>> Okay.
>> So right now you're framing it as,
"There is a risk factor
via the thing."
What are the features,
both actionable and un-actionable,
that explain why I'm predicting
this risk to be like this?
I could have framed the problem as,
"We are at this risk,
my goal is reducing this risk."
Give me interventions.
Give me concrete things I can
play with so that I
can reduce this risk.
Because you can't give
me an explanation like
this where there is nothing
for me that is actionable.
What do you think is the [inaudible]
between framing the question as
pure explanation versus framing
the question directly targeted
on the intervention?
>> Yeah. So what you're asking is
obviously what the doctor wants.
I think the issue is oftentimes
we don't have that to give them,
in the sense that this
isn't a causal model.
It's trained on historical data,
it's full of confounding,
and that's actually one
of the things going
forward that I think is
a lot of open work here.
I think in order to properly
interpret this model,
you really probably have to have
some sort of causal
graphical model sitting
in your head about where or how
all these variables
relate to one another.
You have to actually
sit down in front of
this with experts before
it ever goes into
the operating room, and look and see,
like, maybe peak pressure actually helps
people but the model found it
hurt people, because of
an association with some confounder
that you didn't think about.
That's all stuff that needs to be
sorted out before it goes
into the OR and it has to be
done beforehand because you
can't expect people to
do that on the fly.
But that being said,
I'm not sure where the boundary is
but I'm hesitant to hide too
much of the actual association.
I would probably rather tell them.
I don't know how
to visualize this, but I
really think that it's
future work how to
integrate causal modeling
with this to help
people not make causal claims
about things that are not causal.
But I guess what we came
away with was oftentimes,
something could go wrong
but doctors are there for
six hours in some of
these surgeries and
there can be really simple obvious
things that would be obvious if
they were paying
100 percent attention
for six hours, but
they can't be.
So just seeing why the algorithm
is concerned can get
you a long way there but it's not
going to get you to causality.
So I think that if we wanted
to answer the causal question,
we would actually have
to be in the loop
in order for me to trust it at least.
We'd have to be in the loop,
and there would have to
be some understanding
of what's randomized and what isn't.
So from our perspective,
we said, "Well,
how far can we get with
an observational study?"
That's why we're out here. Does
that answer your question?
>> Yeah. That answers my questions,
it's just that there has been
this more recent line of work.
I'm sure you know about
counterfactual explanations and
actionable explanations
and [inaudible],
and I think the core insight
there is,
"I just want to base my explanation
on things that are actionable."
In my mind, I also don't
have an answer.
I'm just trying to weigh the trade-off
between giving this kind of
an explanation versus giving
a counterfactual actionable one,
whatever that is, and
it will be, I guess,
good to do the same
observational studies
with professionals to see how
they're interpreting them.
>> Yeah. No I agree. I guess
the one thing we want
to avoid is presenting
things that are
observational as though they
were true counterfactual things.
>> I think there's [inaudible]
that they look at this.
>> I agree, yeah.
>> They're going to look at
the tidal volume and say, "Oh, yeah.
I'll drop this and then I should
expect to see this risk go down."
>> My question's along those lines
too but perhaps maybe
it's an easier question.
So let's make the assumption
that this model is
a perfect model and
it truly represents the real world
in every possible way.
I'm wondering, maybe I just
missed this in the presentation
earlier and processes,
how simulatable is this as a doctor?
Like, if I know the kind of
Shapley score I get from
tidal volume and I know that
the tidal volume is two right now,
can I make assumptions about what
the model would predict if I
dropped the tidal volume to one?
So I've got a black-box model,
I see a score, like an attribution,
at tidal volume two, and I say, "Okay.
I can turn this knob down to one."
Do I know how the model will change?
Is that valid?
>> There's two questions there.
One I'm going to postpone to later
in the talk where we actually
try and understand the
individual features one at a time,
which this plot does not tell
you. That's the first thing.
Just by looking at
this all you know is
what the current value is doing.
However, you can draw
curves with these things.
First you could of course do
a partial dependence plot,
just literally drag
the curve and see it,
but it also turns out that you
can make interesting plots
using lots and lots of
these explanations and
that's what I'll talk
about a little bit later.
So maybe ask that question
again when you see
that and then tell me
if that answers it.
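A minimal sketch of that partial dependence idea: sweep one hypothetical "knob" (here feature 0, standing in for something like tidal volume) over a grid while averaging a black-box model over background data. The model and data here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# A black-box model we can only evaluate; its form is invented here,
# chosen so that low values of feature 0 raise the output ("risk").
def model(X):
    return 1.5 * np.maximum(0.8 - X[:, 0], 0) + 0.2 * X[:, 1]

X_bg = rng.uniform(0, 2, size=(500, 2))  # background "patients"

def partial_dependence(model, X_bg, feature, grid):
    """Average model output as one feature is swept over `grid`."""
    out = []
    for v in grid:
        Xv = X_bg.copy()
        Xv[:, feature] = v               # intervene: set the knob to v
        out.append(model(Xv).mean())
    return np.array(out)

grid = np.linspace(0, 2, 21)
pd_curve = partial_dependence(model, X_bg, feature=0, grid=grid)
# Risk falls as the swept feature rises, mirroring "low tidal volume"
# carrying risk; the curve answers "what if I turned the knob to v?".
assert pd_curve[0] > pd_curve[-1]
```

This is the "literally drag the curve and see it" option; the SHAP-based dependence plots mentioned later combine many local explanations instead.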
>> Yeah. I guess I have
one other question too
about taking base values.
So I guess like the guidance is
to pick it based on your
training data set but
if the people I'm evaluating on
are different than the people that
I've seen in the training set,
is there any problems there?
>> Yeah. So for example,
let's go to a financial case.
It might make a lot more sense to
use the background distribution,
not to be your training
data set but to
be the distribution of
people who got a loan.
Now everything is with respect
to people who got loans.
Here is why I didn't get a loan.
So I think it really depends
on what is the reference frame
because if you're sitting in
a room as a doctor and
someone comes in and says,
"I have a fever," and your risk
for something goes up,
that's because you mentally
have some background model in
your head of what the people
typically come into your office.
If you live in the jungle you
might have a different
background distribution than
someone who lives in Alaska.
But you should think
the same way when you're
building this algorithm.
It's like the person has
something in their mind.
You'd like your background
distribution to
match their mental model
as much as possible.
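A minimal sketch of how the background choice changes the story, using the closed-form Shapley values of a toy linear model; the "got a loan" filter and all the numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

w = np.array([0.5, 1.0])                 # invented linear "risk" model
f = lambda X: X @ w

everyone = rng.normal(0.0, 1.0, size=(5000, 2))  # all applicants
approved = everyone[f(everyone) < 0.5]           # e.g. "people who got a loan"

x = np.array([1.0, 1.0])                 # the person asking "why?"

def linear_shap(w, x, background):
    # Exact for a linear model with an independent-feature background.
    return w * (x - background.mean(axis=0))

phi_all = linear_shap(w, x, everyone)
phi_loan = linear_shap(w, x, approved)

# Local accuracy holds against either reference frame...
assert np.isclose(phi_all.sum(), float(x @ w) - f(everyone).mean())
assert np.isclose(phi_loan.sum(), float(x @ w) - f(approved).mean())
# ...but the attributions, and so the story told, differ by reference.
assert not np.allclose(phi_all, phi_loan)
```

Both explanations are "correct"; they just answer "why am I different?" relative to different mental models of the typical case.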
All right. I'm going to
move pretty quickly.
I think 11:45, is that
when we're trying to wrap up,
or somewhere in there?
I'm not sure of the exact timing.
>> I think you have until 12
but you're getting
a lot of questions.
>> Okay. I'll aim 11:45 and do that.
So let's move quickly here.
So we have this plot explaining
an individual prediction at
individual point in time.
If you rotate it 90 degrees and
then run it over time,
you get things that look like this.
So essentially, this is
going forward in time.
This is the current time,
so this is actually
this slice right here
and remember, this junction
is the current prediction.
So you can look at this as
the prediction and the junction
between green and purple over time,
and we can see the area
of different features
changing throughout
the course of the surgery.
In fact, tidal volume
is actually something
that expands and then goes away,
and just in this little
piece right here.
So that tells us it's
a transient effect
at this point in time
due to the model.
So it's nice to see the flow of
risk for a patient over time
during a surgery.
>> Can I ask a question about this?
>> Yeah.
>> So if you are
explaining a prediction?
>> That's right.
>> Are you explaining
a prediction a point in time
because when I create this
explainable models for each point,
the model may take
different features?
>> That's correct.
>> Is it representative?
So, barring effects,
which features are meaningful?
Because, for example,
tidal volume may go away and
something else may come in.
>> Yeah. So if I'm looking at
an individual prediction,
I just sort them.
So tidal volume maybe here
because it's really big but
over here it's going
to be buried somewhere
in this tail, so you'll never see it.
>> But I mean the different features.
Does the explanation use
the same features or?
>> Yeah. So this is all features.
All the features are here.
Just some of them have
larger widths than others
because they're more important
at that moment than others.
>> So in this sort order here is
based on some sort of average?
>> The sort order here?
>> Yeah. Are we doing
crossovers [inaudible]?
>> No. There's no crossovers here.
So I just sort by
their global importance,
and then tidal volume may be
farther down in the bundle
when it pops out.
That roughly puts most of
them out in the tails.
>> So if I have two features that
I believe are highly correlated,
I may randomly pick
one over the other at
any given time and I can't see all
of these things
fluctuating very much?
They are actually
representing the same thing?
>> Is that possible?
>> Well, what you're explaining here
is what the models
are actually using.
So that would only fluctuate
if the model itself was
actually randomly at each time
point picking different features.
>> But if I have two features,
they're identical, they're
giving me the same message.
Can I really ensure that I'm
going to be- I'm trying to
understand whether [inaudible]
I'm reading it the right way.
>> Essentially, you have to think
about like this is a model,
that is the function
that is the model making
a prediction and instead of inputs
let's say two of them
are very correlated,
and I want to know how much do
those inputs matter for my output.
That depends a lot on the function;
so if it's a lasso, maybe it only picked
one and ignored the other one.
In which case you'll
never see one of them
in this plot because the model
doesn't actually touch it.
Maybe it was a ridge model,
ridge regression.
In that case, it will spread them
out equally perhaps between the two,
and then you would
have seen both doing
all the same things together
throughout the whole plot.
But in order for them to swap,
the model itself would have to
change from one time
point to another.
It would have to have
some sort of like if this,
then that kind of statement and
that would be very unlikely.
>> I think I need to understand how.
So the predictor model itself
is a black box at this point?
>> Sure, it's just a
function you can evaluate.
>> So I think I'm questioning
how the explanation can
really understand the internal logic
of the model because things
are getting highly correlated.
>> It's essentially by breaking
that correlation, perturbing
them independently.
That's getting deep into it, I guess.
I don't know if we should go into
that conversation at the moment,
but essentially it's basically
perturbing the inputs to
the model to understand.
That's how all black-box methods
have to understand the
function, because there's
no other way to look
inside it without
making some assumption
about the internals.
But ultimately that's
what's going to happen.
So that's how it's
going to perturb what's
going on and then assuming
that your model is consistent,
you'll get consistent
explanations across time.
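A minimal sketch of that perturbation idea: a Monte Carlo permutation estimator of Shapley values for a black-box function, run on two identical inputs with toy "lasso-like" and "ridge-like" models (everything here is invented, and a single background point is used for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)

def sampled_shap(f, x, background, n_perm=200):
    """Monte Carlo Shapley values: average the marginal contribution of
    each feature over random orderings, perturbing toward a background."""
    m = len(x)
    phi = np.zeros(m)
    for _ in range(n_perm):
        order = rng.permutation(m)
        z = background.copy()
        prev = f(z)
        for j in order:
            z[j] = x[j]                  # switch feature j "on"
            cur = f(z)
            phi[j] += cur - prev
            prev = cur
    return phi / n_perm

x = np.array([1.0, 1.0])                 # two identical (correlated) inputs
bg = np.array([0.0, 0.0])

# "Lasso-like" model touches only input 0; "ridge-like" splits the weight.
lasso_like = lambda z: 2.0 * z[0]
ridge_like = lambda z: 1.0 * z[0] + 1.0 * z[1]

phi_lasso = sampled_shap(lasso_like, x, bg)
phi_ridge = sampled_shap(ridge_like, x, bg)
# The untouched duplicate gets zero credit; the ridge model splits evenly.
assert np.allclose(phi_lasso, [2.0, 0.0])
assert np.allclose(phi_ridge, [1.0, 1.0])
```

This is why the explanation tracks what the model actually uses: a feature the model never touches gets no credit, no matter how correlated it is with one that does.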
So we tested how this
worked by going back
and replaying historical data
to practicing anesthesiologists,
like you'd have at Harborview
and at Children's Hospital.
We showed them cases both with and
without assistance from
Prescience and then asked
them to anticipate, on a scale of
1-100, the risk in the next five minutes.
So that allowed us to make
ROC curves not just for Prescience,
but also for the doctors,
and not just doctors,
but doctors with and without
help from Prescience.
So what we can do is run that for
all five doctors and then
average their ROC curves to get
the solid green curve here,
and we can do the same
thing when they were
assisted by Prescience
and get a blue curve.
It's hard to say exactly what
the false positive rate of a doctor
is because they don't
have that in their head,
but let's assume that they
anticipate 15 percent.
Then that provides
a very significant improvement
in their ability to anticipate
which we were encouraged by.
>> Do we really find
individual differences at
low false positive rates?
>> Yeah, exactly, and that's because
once you get to lower and lower
false positive rates you need more cases,
and we didn't want to
wear out the doctors
by asking them to run forever.
Yeah. So that's why we highlighted
these things here and tried to pick
an FPR high enough that we are
confident with this distance
of separation,
which is significant in that case.
This is something we wrote
up; you can read more
about it in Nature BME.
I'm going to move fairly quickly here
to make it through
the rest of this stuff,
but there's still room
for improvement here.
Why is there room for improvement?
Because as we just talked about,
model-agnostic methods
can only explain models by
perturbing their inputs and
observing how their output changes.
It's really the only way
that you can explain things,
if you don't make any assumptions
about what's inside them.
But how does that hurt us?
Well, it turns out if you want
to explain a whole data set,
it can be inconvenient to do this.
Let's imagine we're trying to explain
a whole data set just
with a very fast model.
This is going to be XGBoost,
and inference on sets of
trees is very quick.
But here what we're doing is taking
simulated data and essentially
increasing the number of
features in our model,
so we can control that all of
these features really matter,
and then retraining
an XGBoost model, and then
we're explaining 10,000 predictions.
What you can see, this is
minutes of run time, and it's
just a linear increase in the number
of features, as you might expect,
because I essentially have to
perturb things for each feature,
and this is the lower bound.
This is just the time
to run the model,
so whatever estimation stuff you
do on top is in addition.
Here, we're talking a couple of
hours by the time we're up to 90
or so really important features,
which is certainly doable
for a whole data set of
10,000 predictions,
but it's unpleasant
and would totally
change the way a data scientist
does their workflow.
It's not that we were using
too many samples either.
This is the previous permutation
sampling approach that we've talked
about and then this is the new one,
the regression-based approach.
This has lower variance, but still
a couple percent of the magnitude of
these bars is twiddling
around a little bit.
So it's non-trivial.
You have to think about the
fact that there is noise,
there's this balancing act
that's going on.
So if we go back to what we
did for an NP-hard problem,
there really is
a third option and that is
we can restrict
the problem definition.
So we don't have to solve this
for all possible functions,
we can maybe just solve it
for a class of functions,
which is very attractive
in the context of
machine learning because machine
learning models come in classes,
many of which are very popular.
So we decided to choose the class
of trees and this leads to
a new estimation method for
these classic Shapley values
for tree-based machine
learning models.
So why did we pick trees?
Not just because we're using them,
but also because other people,
it turns out, are using them.
So this is Kaggle 2017,
they basically asked
data scientists at work,
what do you use for your
machine learning models?
Not surprisingly, logistic regression
is right up there at the top.
But then if you look at
it, it's Random Forests,
Decision Trees, Ensemble Models,
Gradient Boosted Machines.
All of these involve or are
totally dependent on trees,
so these are tree-based
machine learning models.
So whenever you're thinking of
explaining a complex model,
people typically think of
a non-linear model.
So that means almost all the
nonlinear models that are
deployed and used currently in
practice today are based on trees.
So if we can make significant
advances on this class of models,
that will impact a lot
of current applications.
So how will we do this?
Well, let's imagine we wanted to
explain it directly with no sampling.
What would we be facing?
We have T trees in
our ensemble, let's say,
and each of them has L leaves,
and then we have M input features.
So now we have M factorial because we
have all these permutations
we want to do,
and then we still have to
compute expectations.
So that's another factor of N,
however many samples you want to
draw in order to compute that expectation.
So that's factorial.
It turns out there's a way to
rephrase it as exponential,
but we're not really much closer
to tractability yet.
But now we can restrict
ourselves to the class of trees,
and it turns out
there's a fairly cool,
but involved algorithm that's
recursive method that I
won't go through today.
It's interesting because, of course,
the solution here depends
on an exponential number
of conditional expectations,
but they fall out in the trees
in such a way that we
can design an algorithm
that's TLD squared,
where D is the depth of the trees.
TL is the number of nodes in
all the trees, so that part
is linear, and then D squared
is the depth squared.
This polynomial-time run time
gives us the exact solutions,
just like you had done this,
which is very nice because
now we don't have any of
the problems we had before.
In fact, if we plot the run time of
our new method against the lower
bound for the agnostic methods,
it's like indistinguishable
from zero,
which is nice, and of course,
there's absolutely
no explanation variability.
So if you're a data scientist who's
thinking about explaining your model,
suddenly this thing
is done in seconds
and you never have to worry
about sampling variability,
which really changes how
people can use this stuff.
>> It turns out it matters too,
because if you think about trees,
you might think trees have
been around for a long time,
like we should know how to
explain trees by now, right?
They have been studied
for quite a while.
A lot of people have
looked into explaining
and understanding feature
importance in trees.
But almost always people are thinking
about global feature importance.
What is the importance of
this feature over my whole model,
over my whole training data set,
gain, Gini importance,
permutation importance.
These are all things that are
global feature importance methods.
There's only one approach
that we found out there
in the literature to explain an
individual prediction from a tree,
which is what we need for John,
or what we need in
our medical examples, etc.
It's heuristic and
it's inconsistent.
So what do we mean by inconsistent?
So this goes back to
those two properties.
We had local accuracy
and consistency.
Consistency is one of
those properties you want to hold.
Here's an example of when it fails,
and this is from
the heuristic methods
we have for trees right now.
So here is a set of AND functions.
So my model is very simple,
it's just an AND function.
It takes some binary inputs,
all independent, and it just
outputs one if they're all true,
and zero otherwise.
A two-way, a three-way,
a four-way, all the way up
to 10-way AND functions,
and then, to be consistent,
of course I'd have to allocate
credit fairly between them.
If my background is typically zero,
then I'm going to just
split up my 1 into 0.5 and 0.5
for a two-way AND, into thirds, fourths, fifths, etc.
So this is like we know
the right answer because
it's such a simple model.
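A brute-force sketch that checks this fair split: enumerating every feature subset is exponential in the number of features, which is exactly why it only works for tiny models like this AND function, but it gives the exact Shapley values to compare against:

```python
import numpy as np
from itertools import combinations
from math import factorial

def exact_shap(f, x, background):
    """Exact Shapley values by enumerating all 2^M subsets (tiny M only)."""
    m = len(x)
    phi = np.zeros(m)
    for i in range(m):
        others = [j for j in range(m) if j != i]
        for size in range(m):
            # Shapley weight for subsets of this size.
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for S in combinations(others, size):
                z = background.copy()
                z[list(S)] = x[list(S)]  # features in S take their values
                without = f(z)
                z[i] = x[i]              # now add feature i
                with_i = f(z)
                phi[i] += weight * (with_i - without)
    return phi

# A 3-way AND with an all-zeros background: each input should get 1/3.
and3 = lambda z: float(np.all(z == 1))
phi = exact_shap(and3, np.ones(3), np.zeros(3))
assert np.allclose(phi, 1 / 3)
```

The same enumeration on a 2-way AND gives 0.5 each, matching the fair split described above; the heuristic tree methods fail exactly this check.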
But then what we do is we say,
"Let's run the heuristic methods."
Well, we get something
very different.
It turns out that most
of the credit actually
goes to the leaves of the tree,
just the way that this method works.
It's actually very similar
to the gain method.
Essentially, as you go down the tree,
the purer and purer your group gets.
All of a sudden you get more and
more gain credited the further down you go.
So this is exactly the opposite
of what you want because
these are probably
the most important features
because you greedily
selected them for your tree-building.
But by the time you get to
like a depth-four AND function,
almost no credit is going to the root
and almost all of its
ending up in the leaves.
So this is the definition
of inconsistency.
When we run our exact TreeSHAP,
of course it identically
produces exactly this,
because it comes with guarantees.
So this is one strong motivation for
using these values over
what we had before.
Another one is to
basically go back and say,
"Well there's lots of metrics
out there that we can use
to measure the performance of
a local explanation method."
Particularly ones that assign
numbers to their inputs;
so, local feature attribution methods.
No individual metric is going
to be the right metric,
but there are many
ways to go about it.
So what we did is we defined
lots of metrics that,
I'm not going to go into
the details of all of them,
but they're all about
perturbing or removing
features and seeing:
did I get the right one?
You said that it was
positive; I'll take it out.
Did the output
go down? Okay, etc.
Then, we're going to
plot that against lots
of explanation methods designed
specifically for trees,
because we're not necessarily
looking at model-agnostic ones;
we're just looking
at the tree-based ones.
When we then look at
our exact explanations
for trees, which we call TreeExplainer,
we can see it consistently
performs well
across all these metrics in general,
which we found very encouraging, both
for a decision tree
and a random forest,
and we also tested this
on gradient boosted trees.
So I think this broad benchmark
evaluation shows these axiomatic
things are not just nice properties;
they lead to
practical improvements in
the explanations you get.
Another thing that we
found actually surprising,
and weren't setting out to do,
was that you can of course take
local explanations and put them
back together to get a global one.
Because we can take
the mean absolute value,
or if I'm explaining
the loss function,
I can just take the mean of
the loss attributions, for example.
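A minimal sketch of that aggregation: given a matrix of per-sample attributions (invented here, standing in for what an explainer would return over a dataset), the global bar is just the mean absolute value per feature:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented matrix of local attributions: one row per sample, one column
# per feature, with per-feature scales controlling typical impact.
n, m = 1000, 4
phi = rng.normal(0.0, 1.0, size=(n, m)) * np.array([0.1, 2.0, 0.5, 0.05])

# Global importance for the bar chart: mean |attribution| per feature.
global_importance = np.abs(phi).mean(axis=0)
ranking = np.argsort(-global_importance)

# Feature 1 has the largest typical impact, so it tops the bar chart.
assert ranking[0] == 1
```

Keeping the full matrix around, rather than just this one number per feature, is what enables the per-individual views discussed later.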
When we do that, we can
get a single number,
but now we were doing
this axiomatically
motivated attributions
for each individual explanation.
So when we put them back together,
it turns out we get better
feature selection power than
previous approaches like gain
and permutation importance.
Gain is like Gini importance,
if you're more familiar
with that term.
So these are the ways that people
do feature importance
for trees today,
which is fairly widely
used in a lot of domains.
So we found it
pretty surprising.
This is a simulated data
set we pulled from
some other paper that was being used
to check feature selection power,
and this is when the
interactions are minimal,
and you can see
reasonable jump both of
the decision tree and
with a random forest.
So we've found that
surprising and nice.
We've actually got better
global feature selection power
when we went back to global
from our local methods.
The last thing we checked:
these are explanations,
so they should probably
be consistent with
human intuition. How
can we measure that?
What we can do is go back to
the simple functions like AND,
OR, and exclusive-OR, things like this.
These are things people, in theory,
if explained well, should
understand themselves.
They know how the whole thing works.
Then, you can ask them how
would you define credit?
How would you fairly
allocate credit
among the input features?
If you do that for an AND function,
unsurprisingly, this
is an AND function
mixed with some additive effects.
They fairly divide it
between fever and cough,
which is part of a story problem
that we used on Mechanical Turk
to get people to understand
what this model was doing.
Of course, we didn't tell them
it was a machine learning model.
Then, we can compare that to
the SHAP values that we get,
which line up nicely
with human intuition,
but then we can compare that
to the heuristic values
and get significant differences.
This significant difference
happens not just
for AND but for OR, exclusive-OR,
all the non-linear functions.
So it agrees when you have
a linear function but
otherwise we see significant deltas
between what we observed,
and a consensus from Mechanical
Turk studies on simple models,
which is another reason that
we felt we should prefer
these SHAP values.
So, in review: we've narrowed in on
a particular type of model, trees.
We have really fast exact
computation methods
that allow this to be very practical.
They come still with these
attractive theoretical guarantees
because you're not doing
any approximations now.
They have really good
performance across
a broad breadth of
explainable AI metrics.
Surprisingly, they actually improved
global feature selection
power which is fun,
and they have strong consistency
with human intuition.
So, using all those results,
we basically said, what
can we do with this?
How can we use these to
improve our ability to do
machine learning in practice?
So we built a set of
explainable AI tools to do that.
So, to explore this,
let's look at a couple data sets.
This one here is a classic data set
called NHANES I.
From the 1970s, it's a very
classic one, where 20,000 people in
the United States were
given lots of standard medical tests,
and then 20 years later they
were followed up for mortality.
So we can train a Cox
proportional hazards model,
with gradient boosted trees
in this case,
to predict mortality
in the US in the 1970s.
So again, as I say,
we can take the mean
absolute value to
get a global feature importance.
If we do that, we can learn what's
the number one risk factor
for death in the 1970s in
the United States, and, drumroll,
it's age. Big surprise.
What was most surprising
to me at least in
magnitude was that the number
two killer was being a guy.
Then, after that comes,
the things that you might think
about in terms of blood pressure,
inflammation, body mass, and
a whole bunch of other things.
So this is what you would get if you
just explained the model today,
maybe not with
the theoretical guarantees,
but you would get
bar plots like this
if you just got feature importance
from one of these tools.
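As a sketch of how that bar plot is computed, assuming you already have per-sample SHAP values (the numbers here are made up for illustration), global importance is just the mean absolute attribution per feature:

```python
import numpy as np

# Toy per-sample SHAP values: one row per person, one column per feature.
# In practice these come from explaining every prediction in the dataset.
shap_values = np.array([
    [ 2.0, -0.5, 0.1],
    [-1.5,  0.4, 0.0],
    [ 3.0, -0.6, 0.2],
])
feature_names = ["age", "sex", "blood_pressure"]

# Global importance = mean absolute attribution for each feature.
global_importance = np.abs(shap_values).mean(axis=0)

# Sort descending for a bar plot.
order = np.argsort(global_importance)[::-1]
for i in order:
    print(f"{feature_names[i]}: {global_importance[i]:.3f}")
```

This is exactly the collapse discussed next: everything about each person's attribution gets reduced to a single number per feature.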
>> But now that we have explanations
for each individual sample,
we can do better than just
put a bar chart here.
Okay? Because one thing
this bar chart does,
is it conflates the prevalence of
an effect with the
magnitude of an effect.
Whenever you have
a global feature measure,
you have to come up with one number.
So you have to pick a trade-off and
somehow combine
these two things together.
That's really important
because we often want to
find rare high magnitude
effects in a population.
You could, of course, train lots
of models on lots of subsets
of your population and then look
at their global feature importance.
But now we can actually look at
each individual in the population,
and look for groups of people that
have high magnitude effects,
even though they may be
a small group of people.
So as an example,
let's go down to the
one at the bottom of
our list here, called blood protein.
If you look at this
you think, well,
the global importance is fairly low,
so maybe this is not an important
factor for life expectancy.
What we can do over here is,
on the x axis we're going to plot
our feature impact or SHAP value.
So negative is good,
meaning it's lowering
your risk, positive is bad.
It's essentially years of life
in a linear sense on the scale.
What we can do is, we plot
a dot for every person,
because for every person
we're going to have
a value assigned to the
blood protein measurement,
like an attribution assigned to it.
Then, we're going to color those dots
by the actual blood proteins.
So we can see whether
it's high or low.
When we do that, if we do
a little beeswarm-style plot
where the dots pile
up to show density,
we can see a long tail
stretching off to the right
here of red dots.
So there's a pile-up here right at zero,
which means for almost everybody
their blood protein
has no impact on their
risk of mortality,
at least with respect to
the general US population.
But like for that
particular person right
there blood protein is
extremely important,
it's probably
the most important factor
in their life expectancy perhaps.
These rare high magnitude effects
are things that pop out when you can
explain everything at an individual
level across the population.
Now, you can say, well, of
course, we could do that before,
but it's really nice now
that we have these very
high-speed, trustworthy
explanation approaches for
trees, because this makes it really
practical: we can
quickly explain whole datasets
and then plot them like this.
We could do this for
all the features.
If we do, we see interesting trends.
So here on the right side
you'll see there's lots of
ways to die young by
being way out of range.
Okay. So these are people with
very high blood pressure,
these are actually
very underweight people.
There's lots of reasons
to be out of range.
But in case you're browsing
articles online,
there are not a lot of ways to be
wildly out of range and
somehow live way longer,
so I would urge skepticism.
Except being young, okay?
That's, I guess, the best way.
But these are like
insights that come when we
tease apart prevalence
and magnitude of effect.
Okay. We can do that now that we've
explained individual samples.
But we also might want
to zoom in closer
on one of these features and
understand more about it.
So one way we could do
that is to literally
put that feature's
value on the x-axis,
like a partial dependence
plot would.
But instead of scrubbing
things around on a PDP plot,
instead what we're
going to plot now is
the SHAP value for
that feature on the y-axis.
So for every person,
we're going to have
the impact of their value
and the actual value of
their systolic blood pressure.
When we plot them, we get this.
So this looks very much
like the standard risk curve
you would get for
systolic blood pressure.
So here's 120; above 120 you
start going up in
your risk of mortality.
But what you notice is that there's
dispersion here. All right.
There's not like a PDP plot
we just get a line.
Now there's a vertical dispersion
and this dispersion
is driven by interaction
effects in your model.
If you had a linear model, you would
never see any vertical dispersion.
But what this means is, there are
a lot of people with
a blood pressure of 180.
But for that person, it's
less concerning than for
that person, and the question is why.
Why is it more concerning for
someone than someone else?
Well, it turns out,
and I don't have time to
go into the details,
but you can extend the ideas from
game theory about putting
attribution on individual features
to interactions of any order.
What we did is implement
an algorithm,
a high-speed way to compute
all pairwise interactions,
so now we're essentially
assigning credit not just
among like a vector of
features attributions,
but a matrix of feature attributions,
where the diagonal
is the main effects
and the off-diagonals are
all your pairwise interactions.
When we do that, now we can
look in and see well what
are the interactions that are
here and what's driving that.
It turns out in this case is age.
So now we can color by age in order
to see what's the effect here.
We can see that this is
highlighting how early onset
high blood pressure is
much more dangerous than
late onset high blood pressure
in terms of your mortality risk.
We don't need to go into all the
medical reasoning behind that.
But I think this is something
that kind of highlights again
another example of when
you have lots and lots
of local explanations that you trust,
you can pull out signal
from your data that you
might not otherwise have
seen from models that would have
otherwise been considered opaque,
in this case, several thousand
depth-six decision trees.
Now, I said we had
interaction effects.
So we could also plot a feature,
let's say age on the x-axis,
and here we can plot one of
the off-diagonal things in
our interaction matrix,
this time between age and sex.
If we do that we'll see
the varying impact of
being men versus women
over a lifetime.
So here we have risk being
fairly stable, and then they
cross and peak at age 60,
which, when we talked to doctors,
is strikingly similar to
when cardiovascular risk also peaks,
and that predominantly affects men.
It's not causal so I don't know
exactly what's driving it.
But these are the types of very
interesting interactions
that just pop
out from just plotting
many many local explanations,
in this case, from a GBM tree.
>> Why does it go down
from [inaudible]?
>> Well, what's happening
here is remember this is
the off-diagonal effects.
There's always a main effect.
It shows that women are
always better off than men.
But this is of course
going to be centered,
because the main effect
subtracts the mean.
So by definition it's simply
showing the relative difference.
>> Yeah.
>> So once I know this one
I kind of know this one.
Last thing I want to end on here is
probably actually one of the more
fun applications of this stuff,
and it comes from when you have
machine-learning models and you
deploy them, it turns out they break.
I know you've never heard this,
but it turns out sometimes
they have problems like
most software, bugs show up,
features drift, someone
changed something in
the data pipeline that messed up the
way the thing got
deployed and no one knew.
Sometimes that costs a lot
of money depending on
what pipeline we're talking about.
In a hospital, that can cost
not just money but lives,
depending on what
you're looking at.
So we're asking how
can we better improve
the safety profile of
ML models when they're used
in high-stakes decisions?
We can talk about
model monitoring in a hospital.
So one popular thing people
do with ML models in
hospitals is trying to
predict the duration of
things so they can do
better scheduling.
That's a fairly benign task.
So here we're predicting
the duration of
procedures in a hospital over
the course of five years,
so this is data from
U-Dub and Harborview.
We use the first year
of data for training.
Okay. We trained to predict
based on all that kind
of static features,
how long is this procedure
going to take?
So for the doctors, how long do
they need to be scheduled?
Then what you can see, of course,
is a natural jump in error
once I go from train to test.
So this is training error,
this is test error; it's
just naturally higher.
But if you just look
at this over time,
you're like well, I don't know if
my model is doing great or not.
But this is typically what people do:
they just look at the loss of
the model over time and ask,
is it going down? Is it going up?
You'd never see if
there was a problem
here unless it totally
destroyed your model.
To demonstrate that we actually
introduce a bug into the model here,
we actually went in and we changed
the codes of two rooms in a hospital.
It's a really easy change that
can happen in any data pipeline:
just swap the names
of those two rooms.
This is training, this is test.
Can you see where we
introduced the bug?
Of course, the answer is no.
You could make a guess
in your head,
but then, before I show
you that... I guess
I already did; I can't
take it away now,
it's already there.
>> What we do is we said, let's
explain the loss of the model.
So not the output of
the model but the loss.
If we explain the loss of
the model with these SHAP values,
what we're going to do
is we're going to take
that value which is the loss,
and we're going to allocate
it among the input features.
We're just going to say, for
each of these input features,
how much did you hurt or
help the loss of my model?
Of course, I can do that on
every single input prediction.
So I'll do that, that'll
essentially deconvolve
my loss backwards through
the model onto the input features,
and then I can look at
these individual input features
and room number six was one
of the ones that we swapped.
So here, red: I've colored the ones
where this room is on,
so they're in the room, and
blue is when they're not.
So it's not very
informative to know when
they aren't in the room because
there's lots of rooms
in the hospital.
But when they are in the room,
you can see it's typically
lowering the error of the model,
so it's got negative value.
That means it's lowering the loss.
Then all of a sudden here,
now it's hurting the loss.
So if you look up here,
you'd never know that
this bug had been introduced
into your model pipeline.
But if you look down here,
you can see a very clear
signal that pops up.
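The detection step can be sketched like this; the per-sample loss attributions for one feature are faked here for illustration (in practice they would come from explaining the model's loss as just described), and the window size is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
# Fake loss attributions for one feature ("room 6") over time-ordered samples:
# negative (helping the loss) at first, then positive (hurting) after a
# simulated pipeline bug at sample 600.
loss_attr = rng.normal(-0.5, 0.2, size=n)
loss_attr[600:] = rng.normal(0.8, 0.2, size=n - 600)

# A rolling mean over a window flags the sign flip.
window = 50
rolling = np.convolve(loss_attr, np.ones(window) / window, mode="valid")
flagged = int(np.argmax(rolling > 0))  # first window whose mean hurts the loss

print(f"drift flagged near sample {flagged}")
```

The overall loss curve barely moves, but the per-feature rolling attribution crosses zero shortly before the bug's onset window is fully covered.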
Importantly, you might
be able to find this by
just monitoring the input
statistics of your data,
because that's basically what
people do today from
my understanding,
in order to monitor models.
They look at this, and they look at
the input statistics of
your data over time.
But if these rooms had equal usages,
you would never see a change
in the input statistics,
their marginals look the same.
But the effect on the model's
prediction is quite dramatic.
Not in overall sense,
because there are many procedures
happening all over these hospitals,
but it's certainly hurting
a lot of the predictions,
hundreds if not thousands of
predictions are impacted by this.
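A toy illustration of that point, with made-up rooms and durations rather than the hospital data: swapping the codes of two equally used rooms leaves the feature's marginal distribution untouched, so input-statistics monitoring sees nothing, while the prediction error jumps:

```python
import numpy as np

n = 2000
room = np.tile(["A", "B"], n // 2)  # two rooms with exactly equal usage
true_duration = np.where(room == "A", 30.0, 60.0)

def predict(r):
    # The "model" learned the pre-bug mapping from room code to mean duration.
    return np.where(r == "A", 30.0, 60.0)

# Bug: the two room codes get swapped somewhere in the data pipeline.
swapped = np.where(room == "A", "B", "A")

# Marginal statistics are identical before and after the swap...
assert (room == "A").mean() == (swapped == "A").mean()

# ...but the prediction error explodes.
err_before = np.abs(predict(room) - true_duration).mean()
err_after = np.abs(predict(swapped) - true_duration).mean()
print(err_before, err_after)  # 0.0 30.0
```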
So this, I think, is
really interesting
and could really be helpful in
basically taking
explainable AI and using it
to impact model
monitoring in practice.
Now, it's not just bugs
that we introduced.
So this is actually a batch of data
that was in theory
cleaned, but of course,
data is never fully clean.
So what we found was,
we just plotted this for lots
of features and I just pulled
out a few examples
that were interesting.
Here's one where we plot,
are you under general anesthesia,
a flag in the electronic
medical record.
It turns out somehow it was
wildly unhelpful for
the algorithm right in here,
for a subset of the rooms
at one of the hospitals.
That's something you would
never have found otherwise,
turns out we went back and we
found out it was because of
some transient EMR connection issue
between various things.
Exactly the thing that you would
want to fix if you were in
the middle of it because it's
hurting your prediction performance.
Here's one where we actually
observed drift over time.
So this is another binary feature
where we're saying,
"Are you undergoing
atrial fibrillation ablation?"
This is a procedure where
they zap a part of your heart
in order to stop
AFib from happening.
We're trying to predict the duration,
and what you can see is a general
upward trend, where particularly in
the training period, but also
in the early test deployment,
we're helping things, and by
the time we get to the end of it,
this procedure's duration
has actually changed,
and we went back to the hospital and
the cardiology department
and they were like, "Oh,
well actually,
different people came in,
and we've got new technology,
and we're much faster now."
Well, that's nice. That's great for
the patients but it's
bad for the ML model,
because now your feature has changed,
and again, the marginal statistics
have not changed at all.
So another example of how
you can use this stuff.
So to review, I guess I'm
going to wrap it up here.
In theory, we talked
about how we could
unify a variety of explanation
methods into a single class,
and that gave us insights
into how they are related.
We talked about strong uniqueness
results from game theory that now
spread to that whole class
that we connected together,
and how that can help
us pick new parameters.
It's not just a
theoretical perspective;
it also impacts
practice, because we have
a new way of estimating these values
through regression instead
of simply random sampling,
and that applies to black-box models.
Then we also proposed a new way of
estimating these exactly,
at very high speed, for trees.
Based on these things, we
are then able to build
a whole set of explainable
AI tools that help you
build and monitor and understand
the models that you're building,
particularly if you're using
these trees, it's very convenient.
We talked about a variety
of applications,
the motivating one
that started us down
this track in anesthesia safety,
and also some work in
mortality risk and hospital
scheduling demonstrating how
these tools work in practice.
So I'd like to
highlight that you don't
have to take my word for it,
you can obviously try it
yourself. It's on GitHub.
It also supports
and is directly integrated into
XGBoost, LightGBM, which is
a Microsoft thing, CatBoost,
and scikit-learn.
These are all tree-based integrations
implementing the fast exact
tree explanations,
and there's a whole bunch
of work that we did
on deep learning models:
if you have the classic deep
learning models, what can we do?
I didn't talk about that, but we have
integrations with TensorFlow,
Keras, and PyTorch.
Also, as I said, we started with
hospital medical problems
that got us down this road,
but it turns out it got
used all over the place.
You stick things on GitHub, who
knows what they'll do with it.
Turns out you can help optimize
performance in sports teams.
Microsoft actually uses it in
their Cloud pipeline
right now as does Google.
Cleveland Clinic uses it for
identifying different
subtypes of cancers,
which I thought was cool,
close to what we started with.
It's been used for
optimizing manufacturing processes
for jet engines;
they build complicated
models and try to
figure out what's breaking and why.
I had a chance to work with
a very large international bank,
and this was really fun, because
I actually got to go in and
help them do augmented
intelligence work.
They have models and they have
people, and at
a certain risk threshold,
they use people, not models,
but you don't want to
lose everything that
the model had when you
transition over to people.
So I did consulting with
them, essentially to help
them use these values to better
collaborate between humans and
machines for their decision-making
in financial risk prediction.
It's also been used for economics
research and a bunch of other stuff.
So I'm going to skip over
future work here, I'll skip
over what's next, but
there's a lot of places to
go from here as you can
probably imagine, a
lot of fun things.
On the theory side, there are
a lot of fundamental
interpretability trade-offs
that exist when you
have correlated features, okay?
So we touched on
this a little bit, but
sometimes you have to
decide: am I going to
violate my data distribution, or am
I going to understand what
my model actually did?
So there are fundamental trade-offs,
that I think will be
really interesting.
There's a lot of interesting
work going on,
that I'm involved
with back in
our lab too, on using explanation
constraints to guide model training.
It's really a whole
new way of saying:
the model is doing something,
but do it this way, such that
your explanation looks like this.
That's a really fun way
of doing things.
In practice, I think there's
a lot of work to do;
this model monitoring stuff
is fairly new, and I think there are
a lot of things that
we could do there to
make it much more practical and
usable for a large number of people.
Also, I think when
we're talking about actually
deploying stuff into hospitals,
causal structure and
causal modeling assumptions
are essentially expert
knowledge that needs to
be integrated into these systems
before you ever put
them in hospitals,
because you can't expect
people to suddenly
think through how confounding
could happen on the spot.
In applications, I already talked
about the financial one;
high-stakes decision-making
in hospitals is very similar.
It's the same: "I have an AI, I'm
here, and I need to make
a decision; how can you help me?"
We're doing a lot of
interesting things understanding
adverse drug interactions,
work with genomics,
and work on understanding
protein folding;
these are all ongoing
collaborations here at U-Dub.
I think I talked about
the finance stuff.
So I'm going to skip over
this because I'm out of
time, but all of this work of course
has been in close collaboration with
my advisors here at U-Dub.
I've had the opportunity to
mentor a number of younger grad
students in the PhD program here,
also some in math, as well as in the
MD-PhD program here at U-Dub.
So there are a lot of papers
that we're working on together,
from basic genomics
research to better
understanding time series data
to estimating drug-drug interactions,
which I just talked about,
and even cancer work.
Of course, there's always
external collaborations
to make any medical stuff happen,
you can't do that on your own.
So we've worked with
some great anesthesiologists at
Children's, U-Dub,
and Harborview, as well as,
in work I didn't talk about at all,
the Kidney Research
Institute here in Seattle, which
has done some great work.
Cardiology, and then a bunch
of work I didn't talk
about because of time: we had
fun learning
large-scale graphical models
to understand genomics with some
people at the University of Toronto.
So thanks.
