Hello! In this video, we'll be covering density-based
clustering.
Density-based clustering is also known as
DBSCAN, or Density-Based Spatial Clustering
of Applications with Noise.
It works by defining a cluster as the maximal
set of density-connected points.
There are two parameters that are taken into
account: epsilon and minimum points.
Epsilon is the maximum radius of the neighborhood
and minimum points is the minimum number of
points in the epsilon-neighborhood to define
a cluster.
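To make the two parameters concrete, here is a minimal Python sketch of the epsilon-neighborhood test they control; the function name and sample data are illustrative, not from the video:

```python
import math

def epsilon_neighborhood(points, center, eps):
    """Return every point within distance eps of center (including center)."""
    return [p for p in points if math.dist(p, center) <= eps]

# Illustrative 2-D points; the last one lies far outside the radius.
points = [(0.0, 0.0), (0.5, 0.0), (0.4, 0.3), (3.0, 3.0)]
neighbors = epsilon_neighborhood(points, (0.0, 0.0), eps=1.0)
```

With these numbers, the first three points fall inside the radius-1 neighborhood of the origin, while (3.0, 3.0) does not.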
There are three classifications of points:
Core, Border, and Outlier.
A core point has at least minimum points within
its epsilon-neighborhood (including itself).
These are points that are at the interior
of a cluster.
A border point has fewer than minimum points
within its epsilon-neighborhood, but can be
reached by the cluster; that is, it is in
the neighborhood of a core point.
And finally, an outlier or noise point is
a point that cannot be reached by a cluster.
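The three classifications above can be sketched in a few lines of Python; this is an illustrative fragment (not code from the video) and assumes Euclidean distance in 2-D:

```python
import math

def classify(points, eps, min_pts):
    """Label each point Core, Border, or Outlier per the DBSCAN definitions."""
    labels = {}
    # A Core point has at least min_pts points (itself included) within eps.
    core = {p for p in points
            if sum(math.dist(p, q) <= eps for q in points) >= min_pts}
    for p in points:
        if p in core:
            labels[p] = "Core"
        elif any(math.dist(p, c) <= eps for c in core):
            labels[p] = "Border"   # not core, but reachable from a core point
        else:
            labels[p] = "Outlier"  # not reachable from any cluster
    return labels

# Illustrative data: a tight trio (Core), a nearby straggler (Border),
# and a far-away point (Outlier).
pts = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.6, 0.0), (5.0, 5.0)]
result = classify(pts, eps=0.5, min_pts=3)
```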
A point y is said to be 'reachable' from x
if there's a path p1, ..., pn with p1 = x
and pn = y, where each pi+1 is in the
epsilon-neighborhood of pi, and every point
on the path is a core point, with the possible
exception of pn.
An object y is directly density-reachable
from object x, if x is a core object and y
is in x's epsilon-neighborhood.
a is directly density-reachable from b
b is directly density-reachable from c
a is (indirectly) density reachable from d
d is not density-reachable from a, since a
is not a core point
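The reachability definition can be checked mechanically with a breadth-first search that only expands through core points. The following Python sketch uses illustrative names and data and assumes Euclidean distance:

```python
import math
from collections import deque

def density_reachable(points, x, y, eps, min_pts):
    """True if y is density-reachable from x under the DBSCAN definition."""
    def is_core(p):
        return sum(math.dist(p, q) <= eps for q in points) >= min_pts

    if not is_core(x):
        return x == y  # a non-core start can only "reach" itself
    seen, queue = {x}, deque([x])
    while queue:
        p = queue.popleft()
        if not is_core(p):
            continue  # only core points may extend the chain
        for q in points:
            if q not in seen and math.dist(p, q) <= eps:
                seen.add(q)
                queue.append(q)
    return y in seen

# Illustrative chain of points; the last one is isolated.
pts = [(0.0, 0.0), (0.4, 0.0), (0.8, 0.0), (1.2, 0.0), (3.0, 0.0)]
```

Note the asymmetry the video points out: reachability only flows outward from core points, so a border or noise point cannot reach anything else.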
Density-Based Clustering works by picking
a random point that has not yet been assigned
to a cluster or labeled as an Outlier, and
determining whether that point is a Core Point.
If it is not, we label it as an Outlier.
Now, we'll pick another point at random. And
determine if it is a Core Point.
If it is a Core Point, then all the directly
reachable nodes are assigned to the Cluster.
Then do neighbor jumps to all reachable points
and add them to the cluster.
Keep finding neighbors until you are unable
to.
This point would be labeled as a Border Point.
Now keep repeating Steps 1 and 2 until all
points have been labeled or assigned to a
cluster.
In this case, the last two points are also
Outliers.
Now we have our cluster!
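The procedure just described can be condensed into a small sketch. This is a simplified Python implementation under the same assumptions as before (2-D tuples, Euclidean distance), not the exact code behind the video's animation:

```python
import math
from collections import deque

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns {point: cluster id or "Outlier"}."""
    labels = {}
    cluster_id = 0
    for p in points:
        if p in labels:
            continue                      # already labeled or assigned
        neighbors = [q for q in points if math.dist(p, q) <= eps]
        if len(neighbors) < min_pts:
            labels[p] = "Outlier"         # may be relabeled later as a border
            continue
        cluster_id += 1                   # p is a Core Point: start a cluster
        labels[p] = cluster_id
        queue = deque(neighbors)
        while queue:                      # neighbor jumps to reachable points
            q = queue.popleft()
            if labels.get(q) == "Outlier":
                labels[q] = cluster_id    # border point of this cluster
            if q in labels:
                continue
            labels[q] = cluster_id
            q_neighbors = [r for r in points if math.dist(q, r) <= eps]
            if len(q_neighbors) >= min_pts:
                queue.extend(q_neighbors) # q is core: keep expanding
    return labels

# Illustrative data: two small clusters and one isolated noise point.
pts = [(0.0, 0.0), (0.3, 0.0), (0.6, 0.0), (5.0, 5.0), (5.3, 5.0), (9.0, 9.0)]
labels = dbscan(pts, eps=0.5, min_pts=2)
```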
At this point, it's important to know the
difference between Density-Based Clustering
(or DBSCAN) and K-means. DBSCAN is effective
at cutting the noise out of datasets by labeling
those points as Outliers, whereas K-means uses
the entire dataset.
In this example, we have two sets of clusters
resembling 'half-moons', each belonging to
a different cluster set. Now, if we add some
noise, we can see whether DBSCAN or K-means
correctly assigns the clusters.
Let's run DBSCAN. We'll see that it'll correctly
group the top half moon as a cluster. And
it'll do the same for the bottom half moon.
Then it will predict that the rest of the
points are Outliers or 'Noise', which is correct,
of course!
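The half-moon experiment can be reproduced with scikit-learn (assuming it is installed); the dataset size, noise generation, and parameter values below are illustrative choices, not taken from the video:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons, plus scattered uniform noise points.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
rng = np.random.default_rng(0)
noise = rng.uniform(-1.5, 2.5, size=(20, 2))
X_all = np.vstack([X, noise])

# eps and min_samples here are illustrative; -1 marks Outliers/'Noise'.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_all)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```

Points labeled -1 are the ones DBSCAN set aside as noise rather than forcing into a cluster.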
Now let's look at the same half-moon example
using K-means.
This will also include the noise data points.
Now, two random centroids start off.
And each point is assigned to its nearest centroid.
Then, each centroid is moved to the mean of
its assigned points.
And once again the points are assigned.
And this cycle repeats.
Until the centroids settle at positions where
no points change assignment from their movement.
Once K-means is complete, we can see that
it did not correctly predict the half-moon
clusters nor the noise.
K-means has a huge disadvantage in that it
considers the whole dataset and cannot distinguish
noise from cluster groupings. It might also
identify clusters incorrectly due to its reliance
on centroid placements.
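For contrast, here is a sketch of K-means on the same half-moon data (again assuming scikit-learn); note that every point, noise included, receives a cluster label:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# K-means must assign every point to one of the k centroids;
# it has no Outlier/'Noise' label at all.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
```

Because the assignment is driven purely by distance to the two centroids, the split tends to cut across the moons rather than follow their shapes, which is the failure the video demonstrates.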
Thanks for watching!
