Good morning, today we will have the second
part of the lecture on Ensemble Learning.
In the last class, we gave a broad introduction
to the different ensemble learning methods.
Today, I will talk about two specific ensemble
learning methods: bagging and boosting.
Bagging is a method for ensemble learning;
the name stands for bootstrap aggregation.
So, we saw that in order to use an ensemble
of learners and combine them to get better
accuracy, what is needed is a set of learners
which make independent errors. And we discussed,
in the last class, that in order to force the
learners to make independent errors, you can
make them use different data samples, different
parameters of the algorithm, or different
algorithms, and the outputs can be combined
by some form of voting.
Now, in bootstrap aggregation, the samples
are first generated so that they differ from
each other. Suppose you have a training set,
let us call it D, and you are going to have
k learners; you could split the training set
into k partitions and use one for each learner.
However, often the training set is not large,
so if you make it smaller by splitting it into
k different sets, the individual training sets
will be small. And as we observed while talking
about the foundations of machine learning,
if you train a learner on a small training
set, it can overfit; it can have high variance
and will not generalize well. So, we cannot
afford fully independent data, but we can
sample randomly from the data, allowing some
overlap. We do that by sampling instead of
dividing the data into k disjoint groups.
We treat D as a pool, and from this pool we
sample D 1, D 2, D 3, ..., D k; these are the
different data subsets sampled from the original
data. They may have some instances in common,
but they will also differ. To get each D i,
we draw random samples with replacement. For
example, if we desire that each D i have m
examples, we randomly draw m samples from D
with replacement; that is, the same instance
can be repeated several times and some instances
may not appear in a particular D i, and so
these D i's can be different.
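This sampling-with-replacement step can be sketched in a few lines. This is a minimal sketch; the function name `bootstrap_sample` and the toy data are my own choices for illustration, not from the lecture.

```python
import random

def bootstrap_sample(data, m):
    """Draw m examples from data uniformly at random, with replacement."""
    return [random.choice(data) for _ in range(m)]

# Toy pool of 8 instances, as in the lecture's example
D = [1, 2, 3, 4, 5, 6, 7, 8]
D1 = bootstrap_sample(D, 5)
# Some instances may repeat and others may be missing entirely
print(D1)
```

Because each draw is independent, repeats within a D i are expected, which is exactly what makes the k training sets differ from each other.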
We discussed in the last class that for ensembles
to work, we can start with learners that have
high variance; by combining the k learners,
we can reduce the variance, in certain cases
by a factor of k. We also discussed that in
order that the different learners produce
different hypotheses, we want the learners
to be unstable. Stable learners produce similar
outputs irrespective of the data they get;
that is, they have low variance. Learners
with high variance are unstable.
Among the algorithms we have studied, decision
trees and neural networks are unstable: given
different data, they may give different models.
An algorithm like k nearest neighbor is more
stable. So, when we want to use ensembles,
or specifically bagging, we wish to use unstable
learners. We use bootstrapping to generate
the k training sets, and we train on them.
So, suppose in D we have the instances 1, 2,
3, 4, 5, 6, 7, 8. We sample and get, say,
D 1 = 1, 5, 1, 2, 3; D 2 = 6, 1, 6, 5, 4;
D 3 = 4, 1, 5, 1, 1. On these different data
sets we train learners L 1, L 2, ..., L k,
and each gives a model. These models are then
combined by voting. So, we use bootstrapping
to generate the k training sets, train a base
learner on each, and combine the outputs by
voting.
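The whole bagging procedure, bootstrap sampling plus training plus majority voting, can be sketched as follows. This is a minimal illustration on made-up one-dimensional data; the decision-stump base learner and all the names here are my own choices, not part of the lecture.

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a 1-D decision stump: pick the threshold and sign
    that minimise training error on the sample."""
    best = None
    for thr in sorted({x for x, _ in sample}):
        for sign in (+1, -1):
            err = sum((sign if x >= thr else -sign) != y for x, y in sample)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    _, thr, sign = best
    return lambda x: sign if x >= thr else -sign

def bagging(data, k, m, seed=0):
    """Train k stumps on bootstrap samples of size m;
    predict by majority vote."""
    rng = random.Random(seed)
    learners = [train_stump([rng.choice(data) for _ in range(m)])
                for _ in range(k)]
    def vote(x):
        return Counter(h(x) for h in learners).most_common(1)[0][0]
    return vote

# Toy data: label -1 below zero, +1 above
data = [(-3, -1), (-2, -1), (-1, -1), (1, 1), (2, 1), (3, 1)]
H = bagging(data, k=11, m=6)
print([H(x) for x, _ in data])
```

With an odd number of unstable stumps, the majority vote usually recovers the true boundary even though the individual bootstrap samples, and hence the individual stumps, differ.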
So, in bagging, to get the individual data
sets we do sampling with replacement; from
the original data set, we build each bootstrap
sample. Now, if you look at each instance
in the original data set, what is the probability
that it is included in a given sample?
Suppose the original data set is of size n,
and each bootstrap sample is also of size n.
If you take any instance, the probability
that it is not picked in a single random draw
is 1 minus 1 by n. And if you do n such draws,
the probability that a particular instance
never gets included in the sample is (1 minus
1 by n) to the power n.
This quantity can be approximated by e inverse,
which is about 0.37. So, if we use n random
draws, about 63 percent of the unique instances
will occur in each of these samples, and the
rest of each sample will be made up of repeated
instances; this you can work out. So, this
is about bagging.
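Before moving on, the 63 percent figure can be checked numerically; a quick sketch (the variable names are mine):

```python
import math

n = 1000                               # data set size = bootstrap sample size
p_never = (1 - 1 / n) ** n             # P(a given instance is never drawn)
p_included = 1 - p_never               # P(it appears at least once)
print(p_never)                         # close to e^{-1}, about 0.368
print(p_included)                      # about 0.632: the "63 percent" figure
```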
Next we come to a second ensemble algorithm,
which we call boosting. In bagging, we generate
the data samples D 1, D 2, ... from the original
data set, and every instance has the same
probability of getting included in any of
these subsets.
Now in boosting, again we sample from D to
get each D i, but now we assign a probability
to each instance. Suppose, to start with,
there are n instances; initially we give equal
probability to all the instances, so each
of them has a probability of 1 by n.
In the first iteration, you choose D 1 by
sampling based on this uniform probability
distribution. On D 1 you train a learner L 1,
and you apply this learner to all the instances
in D. If this learner has 100 percent accuracy
on these instances, it is a very good learner
and there is no need to proceed further. But
in most cases, what will happen is that this
learner will label some of the instances in
D correctly and some wrongly.
Now, suppose L 1 labels instance 1 correctly,
2 wrongly, 3 correctly, 4 correctly, 5 wrongly,
6 wrongly, 7 correctly, and 8 correctly. What
we want is for the next learner to make independent
errors, and to do well on those instances
where the current learner has not done well.
So, we will change the probability distribution.
More specifically, we identify those instances
which were wrongly labeled by learner L 1
and increase their probability, and we
proportionately reduce the probability of
those which were correctly labeled.
So, we want to raise the probability of those
instances which were not correctly classified
by learner L 1. We will have the same data
set, but now the probability distribution
will change. We now sample D 2 according to
this new probability distribution, and the
resulting subsample is more likely to contain
2, 5, 6 than 1, 3, 4, 7, 8. On D 2 we apply
the learning algorithm and get the learner
L 2.
Now, again we apply L 2 on the training
instances, and find out which ones L 2 makes
errors on. Suppose L 2 wrongly classifies
2, 3 and 7, and classifies the rest correctly.
So, we now increase the probability of 2,
3 and 7 beyond their current probability,
and reduce the probability of the rest so
as to keep the sum of the probabilities equal
to 1.
Based on that, we sample D 3, on which we
train learner L 3, and like this we go on:
we get different data sets and different
learners. Then we combine the outputs of the
different learners by voting; while voting,
we assign a weight to each learner, and the
weight is related to how good that learner
was. So, this is the basic boosting algorithm.
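One round of this reweighting can be sketched as follows. The doubling factor here is purely illustrative; AdaBoost, discussed next, derives the exact multiplier from the learner's error. The function name and data are mine.

```python
def reweight(weights, correct, factor=2.0):
    """Multiply the weight of each misclassified instance by `factor`,
    then renormalise so the probabilities sum to 1.
    `factor=2.0` is an illustrative choice, not the AdaBoost rule."""
    boosted = [w if ok else w * factor for w, ok in zip(weights, correct)]
    total = sum(boosted)
    return [w / total for w in boosted]

# 8 instances, uniform to start; suppose L1 got instances 2, 5, 6 wrong
w = [1 / 8] * 8
correct = [True, False, True, True, False, False, True, True]
w = reweight(w, correct)
print([round(x, 3) for x in w])   # misclassified instances now carry more mass
```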
So, boosting is an iterative procedure. We
start with a uniform probability distribution
on the given training instances, and we
adaptively change the distribution of the
training data: initially all the training
instances have equal weight, and after each
round of boosting the weights get changed.
In each round we learn a hypothesis, and we
assign a strength to each learner. This strength
is used to decide the weight in the voting,
and the final classifier is a linear combination
of the different learned hypotheses, weighted
by these strengths. What we require, as we
have said earlier, is that the individual
learners may be weak learners; in fact, it
is better if they are weak learners, but they
must have more than 50 percent accuracy for
a 2-class problem.
Now, there are several boosting algorithms.
The most common boosting algorithm is AdaBoost,
which I will talk about in today's class.
We have seen that boosting can turn a weak
learner into a strong learner. AdaBoost takes
a sample as input; let us call this sample
S. It contains (x 1, y 1), (x 2, y 2), ...,
(x m, y m), the training examples, where x
is a vector of the input attributes and y
is the output attribute. And we have a weight
for each training example: D 0 i is the initial
weight of the ith training example, with D 0 i
equal to 1 by m for all i between 1 and m.
So, at time 0, all the training examples have
equal weight, which is 1 by m, and we use
some learning algorithm.
Now, we do capital T number of iterations.
In the first iteration, we construct a sample,
let us call it S 1, by sampling from S using
the distribution D 0 (let us write the
distribution as script D, so this is D 0).
From S 1 we train a particular learner L 1.
Then we update the distribution: we find out
which training examples L 1 classifies correctly,
and for those training examples that L 1
classifies wrongly, D 1 i is increased. From
the updated distribution we sample S 2, and
from S 2, by running our algorithm, we produce
the learner L 2. Let us say that when we apply
the learner L 1 the error is epsilon 1, when
we apply L 2 the error is epsilon 2, and
similarly S 3 produces L 3 with error epsilon 3.
So, epsilon t is the error over the current
sample: this is the error not over the original
data, but over the current sample; epsilon 3
is the error over S 3.
Now, with this, please look at the slide for
the boosting algorithm. We are given S equal
to (x 1, y 1), (x 2, y 2), ..., (x m, y m),
where x is the set of input attributes and
y is the output attribute, and we initialize
the distribution to 1 by m. Then we run this
iteration for t equal to 1, ..., T. We train
the weak learner using D t and get a weak
classifier; let us call it h t. Then we choose
alpha t as the weight associated with h t;
we will see later how alpha t is chosen. And
we update the distribution: we get D t plus 1 i
as the updated distribution, which is obtained
from D t i into exponential of minus alpha t
y i h t x i. Here y i is the target output
of the particular training example i, and
h t x i is what is output by the learner h t;
if they agree, y i h t x i is plus 1, and
if they do not agree, it is minus 1.
This is divided by a normalization factor
Z t, chosen such that the sum of the
probabilities over all i is equal to 1. We
do this capital T number of times, and the
final classifier that we output is given by
h x equal to the sign of sigma over t of
alpha t h t x. So, h 1, h 2, ..., h T are
the different classifiers that are output,
weighted by the alpha t's. The total is
computed; if the total is positive, the class
is positive, and if the total is negative,
the class is negative. So, this is the basic
AdaBoost algorithm.
You can see the slide here, where the algorithm
is given. This is our sample; y i is either
minus 1 or plus 1, and this is the initialization
of the probability distribution. We run the
algorithm for capital T iterations. In each
iteration, we use the current probability
distribution to generate the data; on that
data we train a classifier h t, and we find
alpha t (I will show you how alpha t is
computed). Based on that, the new probability
distribution D t plus 1 i is computed, with
Z t the normalization factor to make sure
that the sum is equal to 1. The final classifier
is taken as a vote of the different h t's
weighted by alpha t.
Now, let us see how alpha t is computed. First,
we need to compute epsilon t, which is the
error of the hypothesis h t on the current
sample. Epsilon t is a summation over the
training examples: D t i is the probability
of the ith training example in the sample,
and delta of h t x i not equal to y i is 1
if example i is misclassified and 0 otherwise.
So, what we are doing is taking the weighted
sum of the misclassifications, weighted by
the probabilities; that is epsilon t.
And alpha t is computed as half log of 1 minus
epsilon t by epsilon t. So, epsilon t is a
measure of the error, and alpha t is a measure
of the accuracy of the particular hypothesis
h t. In the ensemble, when we are combining
the votes of the different hypotheses, we
combine them based on alpha t. So, this is
the mechanism followed by the AdaBoost
algorithm, which is a very popular ensemble
learning algorithm.
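Putting the pieces together, here is a compact sketch of AdaBoost with decision stumps on made-up one-dimensional data. Instead of resampling, it uses D t directly as instance weights when fitting each stump, which is one common reading of the slide's formulas; all the names and the toy data are my own choices.

```python
import math

def weighted_stump(X, y, w):
    """Best 1-D decision stump (threshold, sign) under instance weights w."""
    best = None
    for thr in sorted(set(X)):
        for sign in (+1, -1):
            err = sum(wi for wi, xi, yi in zip(w, X, y)
                      if (sign if xi >= thr else -sign) != yi)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    err, thr, sign = best
    return (lambda x: sign if x >= thr else -sign), err

def adaboost(X, y, T):
    m = len(X)
    D = [1 / m] * m                                  # D_0(i) = 1/m
    ensemble = []
    for _ in range(T):
        h, eps = weighted_stump(X, y, D)
        eps = max(eps, 1e-10)                        # guard against log of 0
        alpha = 0.5 * math.log((1 - eps) / eps)      # alpha_t
        # D_{t+1}(i) proportional to D_t(i) * exp(-alpha_t * y_i * h_t(x_i))
        D = [d * math.exp(-alpha * yi * h(xi)) for d, xi, yi in zip(D, X, y)]
        Z = sum(D)                                   # normalization factor Z_t
        D = [d / Z for d in D]
        ensemble.append((alpha, h))
    def H(x):                                        # sign(sum_t alpha_t h_t(x))
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
    return H

# A pattern no single stump can separate: + + - - + +
X = [0, 1, 2, 3, 4, 5]
y = [1, 1, -1, -1, 1, 1]
H = adaboost(X, y, T=3)
print([H(x) for x in X])   # → [1, 1, -1, -1, 1, 1]
```

Each individual stump misclassifies at least a third of this data, yet the weighted vote of three stumps is exact, which is the point of boosting.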
Now we will talk a little bit more about why
this ensembling works. If you are given a
number of weak classifiers, they can be combined
to get a strong classifier. These weak
classifiers must have error better than random,
that is, epsilon t must be less than 0.5 for
a 2-class problem, and it can be shown that
the AdaBoost algorithm will then achieve zero
training error exponentially fast.
Under the assumption that the errors of the
different hypotheses are independent, it can
be shown that the average training error of
the ensemble produced by AdaBoost, that is
1 by m times the sum over i equal to 1 to m
of delta of H x i not equal to y i, is less
than or equal to the product over all rounds
t of Z t, which in turn is less than or equal
to the exponential of minus 2 times the sum
over t equal to 1 to capital T of (half minus
epsilon t) squared, where capital T is the
total number of iterations of AdaBoost.
So, epsilon t, as we said, has to be less
than 0.5; half minus epsilon t may be a small
number, but each round contributes a term,
and as capital T increases this bound drops
to zero exponentially fast. Therefore, if
the learners are independent, AdaBoost can
become very good as we increase the number
of iterations. We are not going into the
theory, but this is the result which has been
proved.
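To get a feel for how fast this bound shrinks, here is a small numeric check assuming, purely for illustration, that every round has the same error epsilon t = 0.4, only slightly better than random:

```python
import math

def adaboost_bound(eps, T):
    """exp(-2 * sum_{t=1}^{T} (1/2 - eps_t)^2), with constant eps_t = eps."""
    return math.exp(-2 * T * (0.5 - eps) ** 2)

for T in (10, 100, 1000):
    print(T, adaboost_bound(0.4, T))
# The bound on training error decays exponentially in T
```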
Now, please look at this slide; we will
illustrate how AdaBoost works. Suppose this
is the initial data: we have 10 data points,
5 of them positive, shown in blue, and 5
negative, shown in green, and each one of
them has probability 0.1, so the sum of the
probabilities is 1. For boosting, you select
the first set of training examples, and the
ones within squares get selected; these are
the data points used for training. Based on
this you get a classifier, and this classifier,
let us say, predicts these 3 as positive and
these 7 as negative.
So, we see that these three are correctly
labeled as positive, but these two are wrongly
labeled as negative. Now the probability
distribution is updated: the two examples
where the classifier makes an error have their
probability boosted, using the boosting formula,
to 0.4623 each, and to balance this, the
probabilities of the other 8 examples are
reduced to 0.0094. Based on these values,
epsilon and alpha can be calculated, and alpha
happens to be 1.9459.
Next, the second round takes place. We now
have this new distribution, from which the
data points are sampled. Let us say these
6 points within squares are sampled, and the
learner comes up with the classifier B 2.
B 2 again makes some errors; in this case,
B 2 makes errors on these three examples.
Based on this, the probabilities of these
examples are increased and the others are
reduced, and alpha in this case is computed
to be 2.9323. Then the third round takes place
on this new distribution: the third round
chooses some samples, and B 3 is the resulting
classifier. There is an error on these middle
values, so based on that the probabilities
are recomputed, and alpha is now computed
to be 3.8744.
Finally, these 3 classifiers are combined
with weights alpha 1, alpha 2, alpha 3, and
the overall classifier is now able to label
these as positive, these as negative, and
these as positive. So, in this particular
case, boosting is able to combine these three
classifiers into a classifier which is fully
accurate; it may not always happen that you
get something fully accurate.
But what is interesting is that these
classifiers were linear classifiers; each
just made a single division of the interval.
The ensemble classifier does not belong to
the same class of classifiers: it is able
to identify that there are positive examples
here and negative examples there. So, this
is an illustration of the AdaBoost algorithm.
With this, I stop today's lecture on boosting.
Thank you.
