Hello, I am Hussein Mozannar, and I am going to present recent work, in collaboration with my advisor David Sontag, on consistent estimators for the learning to defer to an expert problem. To explain the problem we are going to tackle today, we will take as an example the task of detecting pneumonia from chest X-rays. This task is currently performed by trained radiologists, who are given a patient's X-ray and their medical records and make a decision about whether this patient has pneumonia or not.
However, the collection of huge chest X-ray datasets like CheXpert has enabled deep learning algorithms to perform as well as, and even better than, trained radiologists at detecting various observations from these X-rays. These algorithms, however, are only given access to the X-ray and not to other historical medical records. The question now is: given these really good machine learning algorithms, how do we integrate them into real-world applications so that we achieve better performance than relying solely on the expert or only on the algorithm, and in such a manner that we ease the burden on the radiologist, whose time is costly? One possible solution to this problem is to learn a module that, given only the patient's input X-ray, routes the decision to either the expert or the model, so that only one of them is queried. If the deferral module decides to defer the decision to the expert, then the expert is queried and makes the final decision; on the other hand, if the module allows the classifier to predict, then only the machine learning classifier is queried and the expert is not involved in making any decision regarding this patient. This kind of interaction between algorithm and human expert through deferral has already found applications in content moderation; our work aims to give a theoretical foundation for future applications, for example in health care.
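To make the routing concrete, here is a minimal sketch of such a deferral pipeline in Python; the function names and the toy classifier, expert, and rejector below are hypothetical illustrations, not the learned components described later.

```python
# Minimal sketch of the deferral pipeline: a rejector decides whether the
# classifier or the human expert makes the final prediction.

def predict_with_deferral(x, z, classifier, expert, rejector):
    """Query the expert (who sees extra context z) only when the rejector defers."""
    if rejector(x) == 1:   # defer: route the decision to the expert
        return expert(z)
    return classifier(x)   # predict: only the classifier is queried

# Toy usage with hypothetical stand-ins.
classifier = lambda x: 0                # a model that always predicts 0
expert = lambda z: z["label"]           # an expert who reads richer context
rejector = lambda x: 1 if x > 5 else 0  # defer on "hard" inputs

print(predict_with_deferral(3, {"label": 1}, classifier, expert, rejector))  # 0
print(predict_with_deferral(9, {"label": 1}, classifier, expert, rejector))  # 1
```

Only one of the two parties is ever queried per input, which is exactly the cost-saving property described above.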
In this work, we formalize the learning to defer setting and propose a novel consistent surrogate loss for the combined human and machine system, which is obtained through a reduction to cost-sensitive learning. This loss settles an open problem posed by Ni et al. on finding a consistent surrogate for multiclass rejection learning. We analyze previous approaches in the literature from a consistency point of view and give a generalization bound for minimizing the empirical objective. Finally, we provide a detailed experimental evaluation of our method on various tasks. This problem has been attracting more and more attention recently: in 2018, Madras et al. proposed a mixture of experts loss, but the resulting loss is not consistent and fails empirically; Raghu et al. proposed a confidence-score approach that compares expert and algorithm confidence, but this approach does not allow the classifier to adapt to the expert's weaknesses and strengths; even more recently, De et al. gave an approximate algorithm for deferral in ridge regression, and Wilder et al. combined the mixture of experts loss with the confidence-score comparison. This problem is also a generalization of selective classification and learning with a reject option.
So let us start by formalizing our problem. We have a covariate space X and a label space Y, and we want to learn a rejector r(x): this is our deferral module, which decides which of the two routes to take. If the output of the rejector is 0, the decision is routed to the classifier: the classifier predicts h(x), and we incur an algorithm loss l that depends on the covariate x, the label y, and the algorithm's prediction h(x). On the other hand, if the rejector defers by outputting 1, then the expert is queried. The expert is given additional context z beyond x; for example, the decision context contains the medical records in addition to the X-ray symbolized by x. The expert predicts m(z), and the system as a whole incurs an expert loss that depends on the covariate x, the label y, and the actual expert prediction m(z). Our goal is to jointly learn the classifier h(x) and the rejector r(x) to minimize the combined system loss, which is the classifier cost if we predict and the expert cost if we defer.
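As a sketch, the combined system loss on a single sample can be written as follows, here instantiated with 0-1 (misclassification) losses; the function names are ours, not the paper's notation.

```python
# Combined system loss on one sample: classifier loss if the rejector
# predicts, expert loss if it defers (both 0-1 losses in this sketch).

def system_loss(x, y, z, h, m, r):
    if r(x) == 1:                     # defer: incur the expert's error
        return 0 if m(z) == y else 1
    return 0 if h(x) == y else 1      # predict: incur the classifier's error

# Toy usage: the rejector always defers, and the expert is correct here.
h = lambda x: 0
m = lambda z: z
r = lambda x: 1
print(system_loss(x=4, y=7, z=7, h=h, m=m, r=r))  # 0
```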
The insight of our approach to this problem is a reduction to cost-sensitive learning. In cost-sensitive learning, we are given a covariate x and a vector of costs c whose dimension is the number of classes K plus 1, and the goal is to pick the class among 1 up to K+1 that has the least cost among c_1 up to c_{K+1}. The reduction is accomplished by taking the first K classes to signify the label space Y: the cost for these K classes is the cost of the algorithm predicting each one of these classes, depending on the actual covariate x and label y, and the cost for class K+1 is the cost of deferring to the expert, which additionally depends on the expert's prediction m(z). What this reduction means is that if we predict any of the first K classes, the classifier makes the prediction, and if we predict the (K+1)-st class, we defer the decision to the expert and incur the expert loss. So how do we solve this cost-sensitive learning problem?
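With misclassification costs, the reduction's cost vector can be sketched like this (the helper name is ours):

```python
# Build the (K+1)-dimensional cost vector of the reduction: entries 0..K-1
# are the costs of the classifier predicting each class, and the last entry
# is the cost of deferring to the expert (all 0-1 misclassification costs here).

def cost_vector(y, expert_pred, num_classes):
    c = [0 if i == y else 1 for i in range(num_classes)]  # classifier costs
    c.append(0 if expert_pred == y else 1)                # deferral cost
    return c

print(cost_vector(y=2, expert_pred=2, num_classes=4))  # [1, 1, 0, 1, 0]
```

In this example the expert is correct, so both predicting class 2 and deferring have zero cost.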
Cost-sensitive learning is a generalization of multiclass learning, and we use the usual strategy of finding an easily optimizable surrogate loss for the true objective. We propose a natural extension of the cross-entropy loss to this setting: we learn K+1 functions g_i, for i from 1 to K+1, and the classifier h(x) picks the maximum of these. The surrogate loss that we introduce, which we call L̃_CE, is the sum of the negative log-softmax of each g_i, weighted by the difference between the maximal cost among all K+1 classes and the cost of the particular class i. The introduction of the maximum makes sure that all terms in this loss are positive and that we minimize something convex. We show that this loss L̃_CE is a consistent loss function: if we minimize it over all measurable functions g, the minimizer we obtain will pick the class that has the least expected cost given the covariate x, meaning we pick the class minimizing the expectation of c_i given x.
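A small sketch of this surrogate, written from the description above (pure Python, with our own helper names):

```python
import math

# Cost-sensitive surrogate: the negative log-softmax of each score g_i,
# weighted by (max cost - c_i), so every term is nonnegative and the
# cheapest class receives the largest weight.

def log_softmax(g, i):
    return g[i] - math.log(sum(math.exp(v) for v in g))

def surrogate_loss(g, c):
    c_max = max(c)
    return -sum((c_max - c[i]) * log_softmax(g, i) for i in range(len(g)))

# Scores concentrated on the cheapest class give a much smaller loss.
costs = [1, 0, 1]
print(surrogate_loss([0.0, 5.0, 0.0], costs) < surrogate_loss([5.0, 0.0, 0.0], costs))  # True
```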
Let us now return to our learning to defer setting. We assume we have data containing the target y and the covariate x, as well as the expert's prediction m_i, for each sample in our dataset. We make the natural instantiation of the classifier and expert costs as the misclassification error with respect to the target. The objective is to learn a classifier and a rejector that achieve the smallest error for the combined system; this is denoted by the loss L_{0-1}, which is simply the misclassification error of the combined system. As usual, we learn |Y| functions g_y, one for each label in our label space, and our classifier picks the maximum among these |Y| functions. We also learn an additional function g_⊥, and we define our rejector to output 1 when g_⊥ is bigger than the maximum of the remaining functions g_y. This casts things as a multiclass problem augmented with an additional class for deferral. The surrogate L̃_CE previously introduced now becomes the loss L_CE, which is made up of two terms. The first term makes sure the classifier fits the target: this is simply the cross-entropy loss, but normalized also over the additional function g_⊥. The second term only appears on examples where the expert is correct, that is, when m_i equals y_i, and it encourages the system to defer to the expert instead of predicting the target; so on this set of examples, deferring and predicting the target have equal cost. As a consequence of the previous proposition, this loss is a consistent surrogate for the 0-1 loss, and furthermore it is a convex upper bound, which is reassuring.
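Written out from that description, a sketch of the resulting loss L_CE (the scores g are a list of K+1 values, the last one playing the role of g_⊥):

```python
import math

# L_CE on one sample: a cross-entropy term that fits the target, plus a
# deferral term that only appears when the expert prediction m equals y.

def log_softmax(g, i):
    return g[i] - math.log(sum(math.exp(v) for v in g))

def l_ce(g, y, m):
    loss = -log_softmax(g, y)                 # fit the target
    if m == y:                                # expert correct: reward deferral too
        loss -= log_softmax(g, len(g) - 1)    # g_⊥ is the last score
    return loss

# With uniform scores over 3 outputs, each -log-softmax term equals log(3).
print(abs(l_ce([0.0, 0.0, 0.0], y=0, m=1) - math.log(3)) < 1e-9)      # True
print(abs(l_ce([0.0, 0.0, 0.0], y=0, m=0) - 2 * math.log(3)) < 1e-9)  # True
```

When the expert is wrong, only the first cross-entropy term remains, so the loss pushes the classifier to predict.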
As I just mentioned, the loss L_CE has equal cost for deferring or predicting the target on examples where the expert is correct. We found that, as a heuristic, it is helpful to be able to encourage or discourage the action of deferral by adding a weighting parameter alpha. This parameter alpha only appears on examples where the expert is correct: if the expert is correct, we weight the classifier loss, that is, the first term of the loss, by alpha, and if not, the loss simply makes sure we are predicting the target rather than deferring. One can think of it this way: if alpha is zero, then on examples where the expert is correct our only choice is to defer, and if the expert is not correct then we have to predict the target.
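The alpha-weighted variant can be sketched the same way; here alpha scales the classifier term only when the expert is correct (our own function names again):

```python
import math

# Alpha-weighted deferral loss: alpha < 1 discourages fitting the target on
# samples the expert gets right, so deferral is encouraged there; alpha = 0
# makes deferral the only choice on those samples.

def log_softmax(g, i):
    return g[i] - math.log(sum(math.exp(v) for v in g))

def l_ce_alpha(g, y, m, alpha):
    if m == y:
        return -alpha * log_softmax(g, y) - log_softmax(g, len(g) - 1)
    return -log_softmax(g, y)

# With alpha = 0 and a correct expert, only the deferral term remains.
print(abs(l_ce_alpha([0.0, 0.0, 0.0], y=0, m=0, alpha=0.0) - math.log(3)) < 1e-9)  # True
```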
Moving on to the experimental evaluation, we first present a set of experiments using the CIFAR-10 dataset. We will be classifying images into ten classes, and we parameterize our model g with a WideResNet with 11 output units: the first ten are our functions g_y, and the 11th unit is for deferral. For training, we simulate multiple synthetic experts as follows: let k be an integer from 1 to 10; if the image belongs to one of the first k classes, the expert is perfect, and otherwise the expert predicts uniformly at random. So we have a clean region of expertise for the expert, namely the first k classes.
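The synthetic experts can be simulated in a few lines; the function below is a sketch of this construction:

```python
import random

# Synthetic expert with a clean region of expertise: perfect on the first k
# classes, a uniform guess over all classes otherwise.

def synthetic_expert(y, k, num_classes=10, rng=random):
    if y < k:                          # inside the region of expertise
        return y
    return rng.randrange(num_classes)  # outside: uniformly random guess

# With k = 5 the expert is always right on labels 0..4.
print(all(synthetic_expert(y, k=5) == y for y in range(5)))  # True
```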
We compare our approach to three baselines. The first is the mixture of experts baseline of Madras et al., and the second is the confidence approach of Raghu et al. The third baseline, which we call LearnedOracle, tries to capture the intuitive behavior in this setting: we should be deferring to the expert on the first k classes and predicting on the remaining 10 - k classes, and this baseline tries to do exactly that, with the additional knowledge of k.
On the left we show a plot of combined system accuracy on the y-axis versus k, the number of classes the expert can predict perfectly, on the x-axis; each point here corresponds to a particular expert. As we can see, the mixture of experts loss and the LearnedOracle approach both struggle: the mixture of experts loss fails to learn any deferral behavior for k up to seven, where it learns to never defer to the expert, while LearnedOracle performs a bit better for higher k but fails initially. The confidence approach, in aqua, improves on these two baselines and gives a consistent type of behavior. We show our approach for two different alphas, 0.5 and 1, in red and black respectively, and you can see that our method dominates all of the baselines across all values of k; the gains are 1 to 2 points on average over the confidence baseline and 2 to 3 points on average over LearnedOracle. To gain more insight into the behavior of our method, we plot on the right the accuracy of the classifier on non-deferred examples versus the coverage. We can see that our method not only achieves higher coverage, meaning we are predicting a higher percentage of the time, but also has higher accuracy when we do actually predict.
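The two quantities in the right-hand plot can be computed as follows; this is a sketch with our own naming:

```python
# Coverage: fraction of samples the classifier handles itself (no deferral).
# Non-deferred accuracy: accuracy of the classifier on exactly those samples.

def coverage_and_accuracy(preds, defers, labels):
    kept = [(p, y) for p, d, y in zip(preds, defers, labels) if d == 0]
    coverage = len(kept) / len(preds)
    accuracy = sum(p == y for p, y in kept) / len(kept) if kept else 0.0
    return coverage, accuracy

# Toy usage: 4 samples, one deferred, 2 of the 3 kept predictions correct.
print(coverage_and_accuracy([1, 0, 2, 2], [0, 1, 0, 0], [1, 0, 2, 0]))  # (0.75, 0.6666666666666666)
```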
Given these results, why do we outperform the baselines? The first reason is sample complexity: in the figure on the right, we plot the training dataset size on the x-axis and the accuracy on the y-axis, and as we restrict the training dataset size, our gains over the confidence baseline improve by a factor of three to four. The second reason is that our method takes into account the classifier's confidence; the LearnedOracle baseline does not look at the confidence of the classifier and hence suffers. Finally, the third reason is consistency: the mixture of experts baseline is not consistent, and there is a clear mismatch between its loss and the actual error of the system. Further experimental evidence in support of our method is available in our paper: we have additional experiments on chest X-rays, on content moderation of tweets, and further results on CIFAR-100. This method has also been employed in a recent paper from our lab. We also have further results on generalization bounds for the 0-1 loss objective. Thank you for listening, and if you are interested in our work, please do not hesitate to contact us and reach out. Thank you again.
