let's get
started welcome to our data science
seminar series and this is our first
seminar in spring 2020 and we are really
glad to have
Dr. Al Hero here today to come
visit us
Al is the John H. Holland
Distinguished University Professor and
also
the R. Jamison and Betty Williams Professor
of Engineering at the University of
Michigan
and he was the founding co-director of
MIDAS
which stands for the Michigan Institute for Data
Science
not the car repair mechanic
and he is currently a section editor of
the newly created
SIAM Journal on Mathematics of Data
Science
so Al is undoubtedly a national leader
over the broad landscape of data science
so uh we are very fortunate to have him
today
and let's give him a round of applause
welcome
can you hear me is this on uh can you
press that button
oh maybe it's on mute
is that better yeah i guess i hear an
echo so that must be better
unless this is just someone
well-designed here in the box can you
hear
me because if you're recording of course
you're going to need
this to be working
but i think it probably is great
well thank you for inviting me to
Texas A&M i haven't been here in a
while last time was in 2013
and i've seen a lot of changes
and of course when i was here last it
was probably about
maybe 30 degrees hotter
this is more of my style temperature
coming from michigan
i'm going to talk to you today about
a different point of view
on machine learning ai which
focuses not on designing the next widget
the next gadget to classify
or design a recommender
system
or a discriminator or estimate a
variable
but rather to benchmark
what the best possible
accuracy is that one can squeeze out of a
data set
that's what benchmarking means in the sense
that i'm using it
here right so we're not trying to
classify or test hypotheses
we're trying to get a handle on
what is the fundamental value
of the data if the value isn't
sufficiently high
given the benchmark that one extracts
directly from the data
then the experimenter the
designer the engineer
can conclude i need more data or i need
a different type of data to improve
this type of design
so i first have to acknowledge
the part of my group that contributed to
this
work over a period of about eight or nine
years
starting with Kumar Sricharan
who's now at Intuit and
much of the work presented today is their
contribution and then some of the
sponsors of this work that i just
wanted here
to acknowledge so we're going
to talk first about benchmarks
in machine learning and why one
should be interested in those uh this
leads
naturally to a problem that's been
around
for at least 20 30 years
which is something called divergence
estimation
so it's going to turn out that the
benchmarks that we're going to
prescribe depend upon our ability to
estimate something called a divergence
or a difference
between classes between class a
data and class b data which can be
directly related to
the accuracy as measured by
the Bayes probability of error
and then we'll talk about a way to
accelerate
divergence estimation using a method
called ensemble estimation
which is an analogous
technique to boosting and
bagging for people that have taken
machine learning
using lots and lots of different
estimators of this benchmark combining
them together
like the meteorologists do in
predicting
the weather multiple model
aggregation and then we'll talk about a
few applications
uh ending with an application
in what's known today
as deep learning so
that's the schedule so what do we
mean by a benchmark
well i just told you we're trying to
figure out
just how much information is
in a data set so we're going to focus on
classification so we have a couple of
data sets
maybe three here one
coming from
a red class a green class and a blue
class
and when one
looks at the data the typical reaction is
oh let's design a classifier and
see how well it does
that'll tell us if we need more data or
not or if we have the right kind of data
the classification of a
new point a new data point that comes in
with features in two dimensions right lies on
the plane
if you measure from an accelerometer
let's say
or an optical signal
it falls into this point between those
classes
and the k nearest neighbor classifier
will classify
that point as having a class
equal to the majority
class of its nearest neighbors
so that's a classifier of considerable
uh history that goes back
uh at least to the
50s when nearest neighbors were
used as in the perceptron
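as a tiny aside, not from the talk: the k-nearest-neighbor majority vote is a one-liner with an off-the-shelf library; a sketch assuming scikit-learn is available, on made-up two-dimensional data:

```python
# minimal k-NN illustration: classify a new 2-d feature vector by the
# majority class of its k nearest labeled neighbors
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_green = rng.normal([0.0, 0.0], 1.0, size=(50, 2))  # class 0 features
X_blue = rng.normal([3.0, 3.0], 1.0, size=(50, 2))   # class 1 features
X = np.vstack([X_green, X_blue])
y = np.r_[np.zeros(50, dtype=int), np.ones(50, dtype=int)]

knn = KNeighborsClassifier(n_neighbors=5)             # k = 5 majority vote
knn.fit(X, y)
print(knn.predict([[1.5, 1.5]]))                      # label of a new point
```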
the support vector machine classifier
which was
you know the great revolution of
Vapnik back in the
early 90s that created
this paradigm of taking the data
in your feature space of two dimensions
and
expanding that data out to multiple
dimensions really high dimensions
perhaps to get rid of any
non-linearities
that might occur and this non-linear
embedding then gives you a feature space
which
you can define a linear separating
hyperplane
to classify the data and again you can
figure out what the
accuracy is right of that classifier
and then today we have deep networks
right that
are taking the world by storm in
AI and they're being applied to
all kinds of products and classification
and learning
and again then one can implement a
classifier
a cnn convolutional neural net and
look at how well it performs it could be
overfitting
it could be a mismatch um
but in the end every
conclusion that we come up with about
how well any one of these does
is intrinsically dependent upon the
approach
that you take to define
the mapping of your data
to a class whether it be k-nearest-neighbor
majority vote or whether you're above or
below a hyperplane
or whether the discriminator output of
a
deep learning network
gives you a probability greater or less
than 0.5
the problem we want to
solve is
getting at the accuracy
directly without having to introduce
a new
machine right to supplement these
machines we've implemented that
perhaps are giving us increasingly
improved
accuracy because we're increasing the
complexity the intrinsic complexity of
each one of these
machines is increasing as we go from
the k-NN classifier to the SVM to the CNN
so uh recently there was a a new
paradigm for classification
called um the moth net
which is yet another paradigm
that uses the um
the sensory receptors of a moth
as a model for building
neural networks is that going to
create a new revolution i don't know
but um according to this paper the moth
net does
uh considerably better than
previous methods so we want to sort of
get some um leverage on this
rat race right of developing
newer and and more complex and maybe
more bio-inspired ideas such as moth
nets
and go directly at this
here we have on one axis
what you can look at as
the number of samples available
maybe on average from all
the classes
and we have the achievable accuracy that
one can
observe either by a human trying to do
the classification by hand
on a few examples by the latest machine
learning algorithm
mothnet or some other thing that
comes out and the Bayes
optimal accuracy is
an accuracy that is
an upper bound on what anyone
can construct either human or machine in
terms
of achieving the
fundamental limits right of the best
possible classifier
probability of error and so that's our
benchmark that red
dotted line is the benchmark we're
trying to get at
correct all right so does it make sense
to even conceive of getting at a
benchmark directly without defining
an algorithmic model for a
classifier well this example is one
example which shows
that one can without
any assumptions on the
structure of the classifier directly
get at the classifiability
of a two-class problem this is a
technique that's
due to Friedman and Rafsky two
statisticians
Stanford statisticians
who decades ago developed this
Friedman-Rafsky statistic which is a
generalization of
a runs statistic in one
dimension
to several dimensions to try to
discriminate between
the um
the hypotheses that
the green points and the blue points
belong to the same distribution versus
they come from different distributions
so here it's pretty obvious they probably
come from two distributions right
these features uh in the blue and the
green
are separated and so one would
assume that with this number of
realizations 50 realizations that
you would at least get some
uh interdigitation right of these two
data sets if they were coming from
the same distribution and uh on this in
this case on the other hand
you know maybe they do maybe they don't
but what is certain
is that if you build a minimal spanning
tree
which those of you in OR know what
that is
and most others should also have a
notion of what that is which is
simply a graph that connects all these
points
in the feature space two dimensions
edges
have length equal to the Euclidean
distance between
points you build the graph
that connects all the points together
but that has minimum
total length the sum of all the edges is
minimized
there are always going to be n minus one
edges if you have n points
in this green and blue
merged class and what this statistic
does is it says let's label all those
edges according to a
dichotomy so if the edge connects
two points from the same class
two green points or two blue points together then
we're going to
eliminate that edge if it connects two
different class points together
we're going to retain the edge and we'll
count the number of
dichotomous edges and what you can see
here
is that the number of dichotomous edges
is in an
intuitive sense a measure of
the difference
right between the two distributions
fewer edges
if the distributions are far apart than
if they're close together
and so why is this relevant
well it turns out
that this statistic this count of
dichotomous edges in the minimal spanning
tree
converges as you let the number of
realizations draws from these two
classes get large
it converges to a
tight bound on the Bayes error
on the minimum probability
of misclassification
so we haven't defined an estimator or
classifier
structure we simply
looked at the data points and
built a graph on those and what we get
is an estimate of the benchmark
the fundamental accuracy
achievable by any classifier if you try
to classify data from these densities
so that's the
sort of example that motivates that this
is possible
you really don't need to build a
complicated or simple classifier
in order to be able to estimate
the accuracy limit
that restricts any particular
classifier's
performance the benchmark that is
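to make that concrete, here is a minimal sketch, not the speaker's code, of the dichotomous-edge count on a minimal spanning tree using scipy; the final normalization follows the Friedman-Rafsky / Henze-Penrose form discussed later in the talk:

```python
# count the MST edges that connect the two samples (dichotomous edges)
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def dichotomous_edge_count(X0, X1):
    X = np.vstack([X0, X1])                          # pooled sample
    labels = np.r_[np.zeros(len(X0)), np.ones(len(X1))]
    D = squareform(pdist(X))                         # Euclidean distances
    mst = minimum_spanning_tree(D).tocoo()           # n-1 edges, minimum total length
    return int(np.sum(labels[mst.row] != labels[mst.col]))

# toy example: two well-separated classes in the plane
rng = np.random.default_rng(0)
X0 = rng.normal(0.0, 1.0, size=(50, 2))
X1 = rng.normal(4.0, 1.0, size=(50, 2))
R = dichotomous_edge_count(X0, X1)
n, m = len(X0), len(X1)
# few cross edges -> large estimated divergence between the two samples
print(R, 1.0 - R * (n + m) / (2.0 * n * m))          # Henze-Penrose style estimate
```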
so we'll start out with just some
a little bit of a mathematical
setting that uh describes
classification so we define a label
y zero or one green or blue
which has some priors q and p
where q is the probability of zero p is
the probability of one
and the observed random variable has
a distribution that depends on the class
and this is just one way of writing
that distribution given that the class
is class k
where k is either the zero class or the
one class then the distribution
takes on either f0 or f1 right so those
are the two distributions that we don't
know
but we have clusters of points that are
labeled that come
from these two that may be separated or
may be
intermixed
so the optimal classifier if you did
know
these distributions could be constructed
um and it minimizes the
so-called zero one loss right the
probability of
error misclassification error
and it simply is a very intuitive
classifier it says
form from the prior distribution on the
class
and the distribution of the features
given the class
the posterior distribution of that class
given the data
given the features right and then you
choose
for each feature now you can classify
that feature according to whether it's
in class 0 or class 1
by looking at the most probable
label given
the feature and now you know the
task
of estimating or constructing a
classifier
is to get a hold of a training set
and use the training set to somehow
learn this probability posterior
probability
whether that be through some particular
specific model
logistic regression model which
would be one solution or through
something that's
like a deep net okay we're
interested in the Bayes
classification error rate not in the
classifier itself
so that uh it reduces to trying to
characterize what the average
misclassification error probability of
this
best minimum probability of error
attaining
classifier is and that's the probability
that the classifier
label assignment c of x is not the truth
not equal to y
well there's an integral representation
of this which is a
simple representation for this
epsilon function depends on the prior of
the
one class and of course on the
distributions that are unknown
f0 and f1 but as you can see it depends
upon this integral
of a difference a weighted
difference between the two distributions
so it's like
an L1
difference or what's sometimes called
the total variation
difference between the two classes
weighted by
the probabilities of the associated
labels
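as a math sketch of what's on the slide, reconstructed from the description, with q and p the priors on class 0 and class 1 and f_0, f_1 the class-conditional densities:

$$C^*(x) = \arg\max_{k \in \{0,1\}} \Pr(Y = k \mid X = x), \qquad \epsilon^* = \Pr\bigl(C^*(X) \neq Y\bigr) = \int \min\{\,q f_0(x),\, p f_1(x)\,\}\,dx = \tfrac{1}{2}\Bigl(1 - \int \bigl|\,p f_1(x) - q f_0(x)\,\bigr|\,dx\Bigr),$$

using $\min(a,b) = \tfrac{1}{2}(a + b - |a - b|)$ and $\int \bigl(q f_0(x) + p f_1(x)\bigr)\,dx = 1$.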
so an alternative representation of
this is
an f-divergence where an f-divergence
depends on a function g
and an integral with respect to
the distribution of one of the classes
where this g is simply equal to this
quantity here
so it's a function of
u which is the likelihood ratio between the
two class densities integrated against f0
and we now have an f-divergence
interpretation of the
minimum probability of classification error
that's where divergence estimation and
benchmarking
come together so the f-divergence
is a quantity that goes back to
the early 60s to Csiszár
and to Ali and Silvey
defined by a generator function g
which has these properties which i'm not
going to talk about
but you can get the Kullback-Leibler
divergence which is
one divergence definition the Rényi alpha
divergence
lots of other types of divergences by
simply changing the definition of g
that satisfies these properties
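for reference, the standard Csiszár form being described, in my notation rather than the slide's:

$$D_g(f_1 \,\|\, f_0) = \int g\!\left(\frac{f_1(x)}{f_0(x)}\right) f_0(x)\,dx,$$

with for example $g(u) = u \log u$ giving the Kullback-Leibler divergence and $g(u) = \tfrac{1}{2}|u - 1|$ giving the total variation distance.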
and for instance total variation
distance
which looks almost like the exact
Bayes error probability benchmark we're
looking for except for the fact it's not
weighted by the priors so this would be
equal to the Bayes
probability of error if the prior
probability of each label was
equal right if we have equal probability of
a green or a blue point
being labeled right so the alpha
divergence is another one as i
just said
another one is the Hellinger
divergence there are lots of different
divergences
and the Henze-Penrose divergence is
the limit of
that count of dichotomous edges
so this is the particular divergence
that characterizes
that estimator or
characterization of uh
discriminability between the two classes
of the example where we're constructing
a spanning tree and throwing away all
the edges that connect the
points from the same class
all right the exact Bayes error is
given by
a generalized total variation
divergence which includes those
priors
and there are a lot of different
bounds and methods
to construct from divergences the point
though is that learning to benchmark can
be reduced to estimating f-divergences
and these bounds are important because
if you can get
an
estimate of the Bhattacharyya divergence or
the Bhattacharyya coefficient
then you can sandwich the achievable
accuracy
right between two functions of that
coefficient
so estimating the Bhattacharyya coefficient
gives you a bound
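the kind of sandwich being referred to, in the standard form i know rather than copied from the slide, with $\rho = \int \sqrt{f_0(x)\, f_1(x)}\,dx$ the Bhattacharyya coefficient and $p, q$ the class priors:

$$\tfrac{1}{2}\Bigl(1 - \sqrt{1 - 4\,p\,q\,\rho^2}\Bigr) \;\le\; \epsilon^* \;\le\; \sqrt{p\,q}\;\rho .$$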
what i'm going to talk about in this
talk i've done a lot of work
on um over the past 10 years on
estimating various bounds what i'm going
to talk about here is
an estimate that uses
not a bound but the exact divergence
characterization
okay so this just shows the exact
divergence characterization
in blue the Bhattacharyya
bound in green and this
minimal spanning tree construction bound
the Henze-Penrose bound in red it just shows
that the Henze-Penrose bound from the
minimal spanning tree
is a tighter bound on the true accuracy
on the benchmark
than the Bhattacharyya bound
which is a very commonly used bound in
communications and signal
processing developed in the 60s
actually before that in the 40s
leading to our
thesis that we should be using bounds
that are tight
as tight as possible for benchmarks and
not just arbitrary
bounds
all right so that takes us then to the
fundamental question how do you estimate
these f-divergences right
so and that's what i mean by an estimate
we have data
we don't have the densities we don't
know what the densities of
the underlying classes are right we have
data
can we use that data then to find a good
estimate
of the divergence whatever divergence
we've chosen whether it be the exact
Bayes error or the Henze-Penrose or whatever
so one approach is to estimate the densities
f0 and f1 and plug them in
to the equation for the divergence
where you estimate them based on the
corresponding
outcomes the green and blue
labeled points right you estimate
f0 and f1 from those
with kernel or histogram or other types
of
methods
if you do this we
suffer this curse of dimensionality
and the way this curse manifests is
not as a computational curse
it's Bellman's curse of
dimensionality this is a curse
in terms of the
accuracy in terms of mean squared error
accuracy in estimating the benchmark
estimating this divergence
and that accuracy is a function of the
number of
samples coming from any class
this accuracy or this error
the mean squared error bias squared plus
variance
square-rooted decays gets better
right as you get more samples
but it gets better really really slowly
for large dimensions right d can be
for a typical application
in image classification like for the
MNIST digits
d can be on the order of a thousand so
this is like
basically one over log n for
large d
optimal parametric rates on the other
hand
have a much more
attractive decay
they decay as one over the root n
regardless
of the dimension of the features right
so here d is the dimension
of the feature space that
is limiting our
estimates of this benchmark
so what we'd like to do is overcome this
curse of dimensionality right and
when i say overcome what i mean is
find some reasonable assumptions
on the distributions
smoothness assumptions
that can take
this punitively slow
rate of gain in accuracy as the number of
samples increases up to this parametric
rate
so that's the picture
that um shows that with ensemble
estimation you can do this
if you make make suitable assumptions on
smoothness
of the densities of each class
this is the parametric rate and this is the rate
that you get with any
what we call base estimator which would
be a kernel plug-in estimator
of that benchmark
so there's a there's a bunch of work
that i'm not going to go very quickly
through
that gives uh
these these achievable limits on
being able to estimate divergences
and so this is a smoothness parameter s
over a Hölder class of smoothness of
densities i don't have time to talk
about that but
as this smoothness parameter gets large
the density gets smoother and smoother
down here you're less smooth
for s equal to one which means you can
have
you can have step discontinuities
over here you're smooth in terms of having
lots of derivatives
d over two and d are the breakpoints
d here is equal to eight just to
show
that particular case and so this
is
the punitively slow rate that we get
for d equals eight
it's on the order of n to the power of
negative one over sixteen and so
that's a very slow rate
until the smoothness gets large enough
so that you can achieve
parametric rates by suitable
methods like convex risk minimization
which is
a technique
using dual
optimization plug-in estimators are down
here
you can do linear corrections ensemble
estimators which i'm going to talk about
later
if you restrict the
smoothness so that it's of order
d over two or greater
you can get parametric rates back
and we're going to show how you do
that for a simple example
in this context so boosting is
the method that one uses in order
to improve performance
when one has lots of different
estimators the
fundamental idea behind this
ensemble estimator is going to be
to use a bunch of base estimators a bunch
of kernel estimators of the distribution
all with different kernel widths
so different resolutions on
the estimates of the bumpiness
of your distribution you use them all
and this then gives you an estimate
of the divergence for
each one of the widths of the
kernel indexed from
l equals 1
to l equals capital L and then
you combine them with some linear
combination of coefficients
w sub l
to get an ensemble average estimator
and this is the boosting concept
where the weights are computed according
to some
algorithm that typically depends upon
the data like with AdaBoost
those familiar with boosting
classifiers will know Freund and Schapire's
AdaBoost methods and
the alternative that we
are achieving in this
talk and with these papers
the third paper in particular which was
published just a year and a half ago
is that these weights
are ones that we
can plug in to get parametric rates
that is to get acceleration from
punitively poor almost logarithmically
slow convergence back to the best
parametric rate that is one over
root n
the weights required to achieve that
with kernel density estimates
k-nearest-neighbor estimates or
locality sensitive hashing estimates
which i'm going to talk about in a
minute
can be computed offline those weights
don't depend on the data you can
still achieve
these rates that's important because it
means that we
don't have to have the computational
burden of updating the weights as a
function of the data as it comes in
you can compute them offline and you're
guaranteed to get these rates
so locality sensitive hashing
is arguably the simplest
it is an oversimplification
right of an estimate of a
divergence
because it boils down to
using a binned
histogram estimate of
f0 and f1
where f0 is in
blue and f1 is in red
generating these points and then
the divergence is basically a plug-in
with
n_i over n that's the cell
count the number of red or blue
points that lies in a particular
bin cell of the histogram and you plug
that in as an estimate of the
histogram we'll call n_i
the red point count and m_i the blue
point count
in the i-th cell and then
i don't have all of the tiling
of the two-dimensional d equals two
feature space here
because for these other tiles these
other bins we only have red points no
blue points which would cause this to
explode
so we just throw those away
that's why i say overly simplistic
heuristic
ad hoc whatever you want a poor estimate
a weak estimate those who are familiar
with Schapire's work
know that they refer to boosting as boosting
weak learners so what we're going to
do is we're going to take
various histogram widths on the cells
that are going to
be scaled appropriately
so that we get the best possible
estimator variance and each one of those
widths is going to give us a different
kernel estimate and we'll combine them
all together
and be able to achieve far
better accuracy on the benchmark
than if we used any one of these widths
alone
with its punitively bad almost logarithmic
accuracy
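before moving on, a sketch, my own simplified version rather than the construction in the papers, of that single-width histogram plug-in, the weak estimator the ensemble will combine; cells where one class has no points are dropped, as just described:

```python
# histogram plug-in estimate of an f-divergence at a single bin width epsilon
import numpy as np
from collections import Counter

def plug_in_divergence(X0, X1, eps, g=lambda u: u * np.log(u)):  # default g: KL
    cells0 = Counter(map(tuple, np.floor(X0 / eps).astype(int)))
    cells1 = Counter(map(tuple, np.floor(X1 / eps).astype(int)))
    n0, n1 = len(X0), len(X1)
    est = 0.0
    for cell, count0 in cells0.items():
        count1 = cells1.get(cell, 0)
        if count1 == 0:                 # throw away cells with no class-1 points
            continue
        p0 = count0 / n0                # estimated f0 mass in this cell
        p1 = count1 / n1                # estimated f1 mass in this cell
        est += p0 * g(p1 / p0)          # approximates integral of g(f1/f0) f0 dx
    return est

rng = np.random.default_rng(1)
X0 = rng.normal(0.0, 1.0, size=(2000, 2))
X1 = rng.normal(1.0, 1.0, size=(2000, 2))
print(plug_in_divergence(X0, X1, eps=0.5))
```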
so i'm not going to go into the details
here but
for the locality sensitive hashing plug-in
estimator you
can derive the bias and variance in closed
form
up to a remainder of order
one over n times the resolution the
side length of one of those
bins raised to the
power d the feature
dimension
and the c_i's are what capture
the distribution
capture the g the function that we're
using
in the divergence that's the bias
and that's the variance which decays
as
one over n
and we can now choose epsilon
the width of the histogram
as a particular function of n to make this
remainder term
equal to one over root n which is the
rate we want
right that's what we want in terms of
our bias because the bias squared plus
the variance
square root of that sum is
equal to the root mean squared error
that we want to be equal to 1 over root
n
so we want the remainder
term
to be 1 over root n in order to
be able to get that so if you choose
epsilon equal to this quantity it's
going to decrease
very slowly as a function of the
dimension
d and as a function of n
the slowest term of the bias decay rate
if you
make this choice is
the leading term here the sum over i equals 1 to d of
epsilon to the i
and with epsilon of this order you get
back to this punitively bad rate right
by using ensembles we're going to find a
set of weights
so that with different scale factors
in front of this epsilon
we can sum over the bias by summing
over the ensemble of
estimators of
divergence and knock out successively
each one of these high rate terms
without knowing what c_i
is that's the key
right so that's the ensemble method
you apply the ensemble
weights to each one of these divergence
estimates with a particular
epsilon kernel width
and that estimator is simply a linear
combination
the ensemble estimate is a linear
combination of each one of those
width specific estimates weighted by w_j
the bias is equal to this
term where the sum over these L
different widths in this ensemble
estimator
commutes with the sum over the dimension
index i
from the first order to the d-th
order of the
series expansion that was used to get
that bias
expression and the only effect is through
t_j which is the
parameter that characterizes epsilon
epsilon_j is equal to t_j times n to the minus 1
over 2 d
so if we can define the
w_j's
so that this quantity is equal to
zero
the t's are just some numbers right some
numbers that we select
to set the scale the width then we can
knock out this entire leading term and
we end up with the order
one over root n so
this is a set of
equations in the L unknown weights
and that leads to a system that
is a linear system of equations
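in symbols, my transcription of the system being described: with bin widths $\epsilon_l = t_l\, n^{-1/(2d)}$, pick weights satisfying

$$\sum_{l=1}^{L} w_l = 1, \qquad \sum_{l=1}^{L} w_l\, t_l^{\,i} = 0 \quad \text{for } i = 1, \dots, d,$$

so that each leading bias term $c_i\, \epsilon_l^{\,i}$ is knocked out regardless of the unknown constants $c_i$.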
so we can find the solution w_1 to
w_L
in the null space of
this matrix in order to zero out those
higher order slow converging
terms in the bias so Chebyshev
methods can be used to design
these scale factors
the kernel width
parameters which i don't have
time to talk about
but there's a Chebyshev function
approximation that can be used to
basically form the design of this
matrix
so that we have the most stable
solution as a function of
variations in
the number of ensemble
estimates L
so the variance is on the order of
a quadratic in the weights
it simply follows from the form of the
weighted ensemble estimator
w transpose times the
vector of
these multi-scale
kernel estimates of the divergence so
that's w
transpose times the covariance times w
and if that covariance has a maximum eigenvalue
lambda max then this can always be bounded
above
by lambda max times the l2 norm of w squared
and now we have an optimization problem
that can be done completely offline
to solve for the weights that
simultaneously
achieve low variance and
zero out the bias terms
so L has to be greater than d in order
for this solution to exist
right and this ensures the mean squared
error is now
of order one over n that is the root mean
squared error is at the parametric rate
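putting the pieces together, a hedged sketch, mine rather than the authors' code, of the offline weight computation and the weighted ensemble: the weights sum to one, zero out the leading bias terms, and are taken as the minimum-norm solution as a proxy for low variance; plug_in_divergence is the single-width sketch from earlier:

```python
# weighted-ensemble divergence estimation with offline weights
import numpy as np

def ensemble_weights(t, d):
    """t: L scale factors (L > d). Min-norm w with sum(w)=1 and sum(w*t^i)=0, i=1..d."""
    L = len(t)
    A = np.vstack([np.ones(L)] + [t ** i for i in range(1, d + 1)])  # (d+1) x L
    b = np.zeros(d + 1)
    b[0] = 1.0                           # weights sum to one
    return np.linalg.pinv(A) @ b         # minimum-norm solution of A w = b

def ensemble_estimate(base_estimator, X0, X1, t, d):
    t = np.asarray(t, dtype=float)
    n = min(len(X0), len(X1))            # assume roughly balanced sample sizes
    w = ensemble_weights(t, d)
    eps = t * n ** (-1.0 / (2.0 * d))    # widths epsilon_l = t_l * n^(-1/(2d))
    ests = np.array([base_estimator(X0, X1, e) for e in eps])
    return w @ ests                      # weighted ensemble average

# usage with the plug_in_divergence sketch above, d = 2 features, L = 5 > d
# print(ensemble_estimate(plug_in_divergence, X0, X1, t=[1, 2, 3, 4, 5], d=2))
```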
so i just show this multi-class
classification
application where our benchmark
learner
is here in blue we don't know the
model the benchmark doesn't know the
model we've constructed this
uh as a four-class model
concentric sphere regions uh
characterize the classes so it's not
linearly separable
for each of the four classes a nested
sphere
right a spherical set difference
from
the center sphere to the
next sphere that contains
the center sphere
and then successively containing
that sphere and so forth and it's
over 20 dimensions
and so we implemented here
a neural network
with four layers
ReLU as the activation function and Adam
and so forth as
the method for estimating the gradient
in back propagation
and so you have then all of these
accuracies by implementing boosting
methods XGBoost is a boosting method for
classification
we're not estimating the accuracy here
in the green
curve we're using a boosting classifier
a deep neural network
random forest and so forth and the
benchmark learner
says that we have a substantial
margin of improvement
if we could improve
if we had less than say 3,000
samples
then we could improve the accuracy by
a substantial margin relative to these
curves with the data
that is available from these
realizations
another application is to figure out a
stopping rule how many how many data
points is sufficient
if you're using a stochastic gradient
descent for example
which is successively choosing samples
in your data set
how many are sufficient to uh to know
that you're close enough to benchmark
that you can stop
your iterations so the benchmark learner
in blue
would tell you you can stop at about
850
iterations of the stochastic gradient
descent
for MNIST this shows again
for the MNIST data set we get a 14 percent
error rate from the benchmark learner
and so that means there's still about
six percent
six or seven percent left
relative to the best available
as of 2013 anyway
learners the machine learning
algorithms that classify
digits in the MNIST data set okay
so i want to terminate this how much
time do i have
four minutes okay so uh just on a quick
um example
of applying this benchmark learning to
deep neural nets
to characterize the information
bottleneck i'm not going to cover this
in any detail
uh but the slides i think i can give
them to you and they can be posted
but the fundamental question about
these deep learners
is that we have remarkable empirical
performance
evidence but we don't really understand
to a
sufficient degree why they work
and there are hypotheses one of the
hypotheses is called the compositional
learning hypothesis
which says that the dnns learn in two
phases
they first memorize on all the easy
cases
in the two classes they memorize
patterns the easy classification
easy to classify patterns in the two
classes
and then they generalize to the hard
cases that is they go through a
compression phase
in phase two and so
this phase one phase two uh over a
deep learning structure such as this one
here i'm showing
with an input layer and four
intermediate layers and an output layer
logistic
layer a multi-layer perceptron where
all layers are
completely connected to the previous
ones
this characterizes layer by layer
as a function of time time starts here
in terms of training
the compression versus
memorization point and
i'm not going to explain any more than
just to say that
this is a very
common method today of characterizing
these these memorization versus
compression characteristics of deep
neural nets
and that this characteristic convex
structure between the compression
axis and the discrimination axis
was held up by
Naftali Tishby as confirming this
compositional
hypothesis that you start out by
learning the easy cases and then
compressing
the compression goes back in this direction
so what we showed is that if you use
our method for estimating
divergence mutual information being used
here as the
compression plane quantity you can do
much better
in terms of estimation performance
accuracy in red and orange
is our ensemble estimator of mutual
information
for each of these layers
and then what's in green and blue
are the much noisier estimates
that were used in
Shwartz-Ziv and
Tishby's paper basically confirming
this compositional hypothesis
but with a much better estimator
really we use two meta-learning
concepts the first is
we are learning ensembles so you're
learning
ensembles of learners and the second one
is that we're learning how difficult
the learning problem is right the second
meta-learning problem is learning the
Bayes error rate
you can use this ensemble method that we
introduced to achieve rate-optimal
performance
and we can apply this to
many different problems in learning to
benchmark
so that's all i really have to
present today thank you for listening
just a question when we learn with boosting
you learn both the weak learners and
the weights from the data how would you
avoid the overfitting issue
yeah so here what we have to
keep in mind is that
we're not learning over a feature space
uh a set of
coefficients of a deep learning
network or a svm
right so we're not learning a large
dimensional quantity
what we're learning is a one-dimensional
one
right so it's the benchmark
and the point is that the weights that
are used in the
ensemble learning are computed offline
so there's no generalization error there
you're not overfitting because those are
those are computed based on the
chebychev polynomial
solution stabilizing solution of
the linear set of equations that i
showed
so the problem of overfitting is really
really bad
when you're trying to estimate high
dimensional quantities right
like coefficients of some classifier but
now if you turn the problem on its head
such as we're doing here let's say we're
not interested in the classifier
this high dimensional problem i'm just
interested in estimating the performance
for just one one scalar quantity
this overfitting problem is
much less
of a problem yeah go ahead
thank you and another question if i may so i'm
i'm trying to think what's the practical
guidance value for
practitioners in the deep
learning world so
does it mean that like for image data
i'm
i'm not quite sure how this is
generalizable to a very large and
potentially noisy dataset and
potentially not even i.i.d. that is
like imagine for example i do
a real
image classification rather than MNIST
using this to predict the
error rate and if it's lower than
the error rate ever achieved by all
past models
does that mean that i could continue
working on this data set
without being worried about
overfitting? you stated it better than i
could have yes
that's exactly what this tells you
in that case these methods
allow you to leapfrog over
this rat race that i
referred to in the beginning
of being on the treadmill to develop
more and more
sophisticated learning methods and
worrying about overfitting and so forth
right so by aiming at estimating
performance
directly you can
at the very least define a horizon
that you cannot go beyond right up to
some
confidence interval depending on the size
of the data set
thank you for explaining
so my
question here is you have the method
to
estimate this error performance but
does that immediately imply what kind of
method you can derive or
does it really imply some method you
can
develop to attain that performance that's the
holy grail
yeah and that would
be fantastic if
you could extract information from these
bounds and tell you something about how
to achieve them but
as in many cases
of engineering
systems
going from bounds to practical
applications directly
with practical algorithms is very
difficult
Claude Shannon defined his
bound on the maximal rate that you could
push through
a band-limited channel the channel
capacity
back in 1948
it took 50 years
for turbo codes to come out
to show that there's an algorithm that
can actually
achieve this rate so
those turbo codes were not guided at all
by Shannon's construction his
random construction using typical
sequences
that didn't yield any insight into
how to achieve it they are two
separate problems it's for that reason
that in response to this gentleman's
question
i said that you can sidestep this
overfitting problem if you focus on
accuracy but if you try to turn that
around and specify an algorithm
that'll attain that accuracy i think
there's probably some fundamental
Heisenberg-uncertainty-principle-like
thing lurking there that
says you can't do it right but i don't
want to be so negative
right i have a question so for
the bound
you provide the insight that the
way you can calculate the bound so close
to the
true error rate is because you
eliminate
that slow-converging term yes the bias
reduction part
now by looking at the practical methods like
the CNN the DNN
that perform better than the other methods
does that imply
the DNN actually does that error
reduction more efficiently
than the other methods possibly although
we don't know for
sure we do know that the bias is smaller
because the variance bars on the DNN
and SVM and so forth which i didn't show
in that
plot are pretty much the same so
the gain is virtually 100 percent
due to a decrease in bias so somehow
dnns are moving towards
a reduction in bias and eventually
perhaps
matching these fundamental benchmarks
that can be determined from data
any other questions
well if there's no other questions i
would like to present a
small token of appreciation to our speaker
for coming thank you so much
thank you for coming
