Hello!
In this video, we'll be covering hierarchical
clustering, along with its advantages and disadvantages.
Next, let's look at hierarchical clustering.
There are two forms of hierarchical clustering:
Divisive and Agglomerative.
Divisive is top-down, so you start with all
observations in a large cluster and break
it down into smaller pieces.
Think about divisive as 'dividing' the cluster.
Agglomerative is the opposite of divisive,
so it is bottom-up, where each observation
starts in its own cluster and pairs of clusters
are merged together as they move up the hierarchy.
Agglomeration means to amass or collect things,
which is exactly what this does with the cluster.
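To make the bottom-up idea concrete, here is a minimal sketch of agglomerative clustering in Python, assuming scikit-learn is available; the sample points are invented purely for illustration.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Five made-up 2-D observations; each one starts in its own cluster.
points = np.array([[1.0, 1.0], [1.2, 0.9],
                   [5.0, 5.0], [5.1, 4.8],
                   [9.0, 1.0]])

# Pairs of the closest clusters are merged bottom-up
# until only two clusters remain.
model = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = model.fit_predict(points)
print(labels)  # one cluster label per observation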
A dendrogram shows all of the clusters that
have been grouped together.
Notice the differences in heights.
For example, if we are observing this dendrogram
with hierarchical agglomerative clustering
in mind, then we will be thinking of grouping
clusters rather than splitting them.
So let's say two clusters are grouped here,
and another two clusters are grouped here.
If both of these clusters were grouped
together once, how come one connection is higher than
the other?
If you look at the y-axis, you can see that
it's a measure of distance.
Therefore, the higher the point of connection
on the dendrogram, the further apart the clusters
are from each other.
Note that the same principle of distance measure
applies to hierarchical divisive clustering
as well.
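If you want to reproduce a dendrogram like this one yourself, here is a small sketch using SciPy; the points are again invented, and matplotlib is assumed for plotting.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

points = np.array([[1.0, 1.0], [1.2, 0.9],
                   [5.0, 5.0], [5.1, 4.8],
                   [9.0, 1.0]])

# linkage() performs agglomerative clustering; in the resulting
# dendrogram the y-axis is the merge distance, so clusters that are
# further apart are connected higher up.
Z = linkage(points, method="single")
dendrogram(Z)
plt.ylabel("distance")
plt.show()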
Here's a quick example to reiterate the explanation
on the previous slide.
Let's assume we have these five points.
Each of these points is a cluster of its
own and has not been joined with another
cluster yet.
Since we're joining the points based on lowest
distance, think about the first two points
that would be joined.
Did you say the green and blue points?
That would be correct!
This join has a small distance, so the height
of the dendrogram should also be small.
OK, so what do you think the next join would
be?
Did you say the red and pink clusters?
This distance is larger than the first join
we did, so the height should be higher on
the dendrogram.
So maybe it'll look something like this.
Following this logic trail, the next join
would be on the two clusters we made.
The yellow cluster is kind of off on its own.
Let's say the distance between these two clusters
is larger than the distance between the red
and pink clusters (from the second join);
this will give us a height that is larger
than the second join.
Finally, we have one join left to do!
We need to merge the two remaining clusters, then draw
the final join on the dendrogram!
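As a sanity check on this walkthrough, here is a sketch that prints the join order and heights; the coordinates are hypothetical stand-ins for the five colored points.

import numpy as np
from scipy.cluster.hierarchy import linkage

points = np.array([[1.0, 1.0],    # green
                   [1.3, 1.1],    # blue:   closest pair, joined first
                   [4.0, 4.0],    # red
                   [4.6, 4.4],    # pink:   second join, larger distance
                   [9.0, 0.5]])   # yellow: off on its own, joined last

# Each row of the linkage matrix records one join:
# (cluster index, cluster index, merge distance, new cluster size).
Z = linkage(points, method="single")
for i, (a, b, dist, size) in enumerate(Z, start=1):
    print(f"join {i}: clusters {int(a)} and {int(b)} at distance {dist:.2f}")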
Next, let's review a proximity matrix by using
an example.
Let's start with five different clusters,
based on different colors (red, green, blue,
purple, and yellow), with each color being
assigned a letter (A through E).
To the right of these clusters, we have the
proximity table itself.
This table contains the distance measure of
all points from each other.
So what do you think is the first thing we
can label on this table?
Every proximity table is symmetric about its main
diagonal.
And the cells that the diagonal passes through are
assigned distances of zero, because the distance
from a point to itself is, after all, zero!
With that done, let's start filling in this
table!
(Note, the following measurements are not
accurate and are just for illustration purposes.)
Beginning with the red cluster (or A), let's
find the distance from this cluster to each
of the other clusters.
Let's put these values into the table.
Remember that the table is symmetric, so the
first column's values are equivalent to the
first row's values.
Let's move on from the red cluster since that's
completed.
Looking at the green cluster (or B), we have
the following measurements.
Note that we do not need to re-measure the
distance to the red cluster, because it has
already been measured.
Once again, we'll fill in the table with the
appropriate values.
And then we can remove those measurement lines from the diagram.
By this point, you should have a good idea
of how to build this proximity table, so in
the interest of saving some time, let's simply
fill the rest of the table quickly.
There we go!!
That is our completed proximity table for
this example!
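For reference, here is a sketch of how such a proximity table can be computed programmatically; the coordinates are placeholders, just as the slide's measurements were.

import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[1.0, 1.0], [1.3, 1.1],
                   [4.0, 4.0], [4.6, 4.4],
                   [9.0, 0.5]])
names = ["A", "B", "C", "D", "E"]

# squareform(pdist(...)) gives the full symmetric matrix: the diagonal
# is zero, and entry (i, j) equals entry (j, i), so only one triangle
# actually needs to be measured.
D = squareform(pdist(points))
for name, row in zip(names, D):
    print(name, np.round(row, 2))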
There are three main advantages to using hierarchical
clustering:
First, we do not need to specify the number
of clusters required for the algorithm
(the sketch below shows this in action),
Second, hierarchical clustering is easy to
implement, and third,
the dendrogram produced is very useful in
understanding the data.
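Here is a small sketch of that first advantage: rather than fixing the number of clusters up front, we can cut the dendrogram at a distance threshold (the value 2.0 below is arbitrary) and let the cluster count fall out.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 1.0], [1.3, 1.1],
                   [4.0, 4.0], [4.6, 4.4],
                   [9.0, 0.5]])
Z = linkage(points, method="single")

# Keep only the joins whose height is below the threshold;
# whatever remains separate becomes the final clusters.
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels, "->", len(set(labels)), "clusters")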
There are some disadvantages as well:
First, the algorithm can never undo any previous
steps.
So, for example, if the algorithm clusters two
points and we later see that the connection
was not a good one, the program cannot undo
that step.
Second, the time complexity of the clustering
(at least quadratic in the number of observations
for standard agglomerative algorithms) can result
in very long computation times.
Finally, if we have a large dataset, it can
become difficult to determine the correct
number of clusters by the dendrogram.
Thanks for watching!
