Hello!
In this video, we'll be covering k-means clustering,
along with its advantages and disadvantages.
Clustering is a type of unsupervised learning
because it groups data that has no labels,
without being told the right answers in advance.
Clustering is used to determine trends and
patterns in sets of unlabeled data and usually
groups sets of data together by using a centroid
and distances from the centroid to other points.
Clustering will be explained more thoroughly
when discussing unsupervised learning algorithms,
such as k-means and density-based clustering.
From learning these algorithms, you should
gain a better understanding of what clustering
is.
To recap, clustering of this kind groups unlabeled
data around centroids, based on the distance from
each centroid to the points.
These centroids are initialized randomly.
Each data point is then grouped with its
nearest centroid.
Each centroid then moves to the average position
of the points in its group.
The points are reassigned and the centroids
move again.
This continues until the centroids no longer
move.
Please note that whenever the centroids move,
each point's distance to every centroid needs
to be measured again.
Now, let's define the algorithm more concretely.
A K-means algorithm works by randomly placing
k centroids, one for each cluster.
The farther apart the centroids are placed,
the better.
There is no fixed number of centroids that
you need to have.
The number depends on how many clusters you
want to find, since there will be one centroid
for each cluster.
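As a minimal sketch of this first step, assuming the points are plain (x, y) tuples, one simple way to place the centroids randomly is to pick k of the data points themselves. The name `init_centroids` and the `seed` parameter are made up for this illustration; the video only says that the centroids are placed randomly.

```python
import random

def init_centroids(points, k, seed=None):
    """Pick k distinct data points to serve as the starting centroids."""
    rng = random.Random(seed)  # seed is optional, only for repeatable runs
    return rng.sample(points, k)

points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0)]
centroids = init_centroids(points, k=2, seed=0)
print(centroids)
```

Choosing k is still up to you: here it is simply passed in as an argument.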
The next step in the K-means algorithm
is to assign each data point (object) to its
closest centroid, creating a group.
Euclidean distance is used to measure the
distance from the object to the centroid.
Note, however, that you can also use other
distance measures, not just Euclidean
distance.
Euclidean distance is used here because it's the
most popular choice.
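Here is a small sketch of the assignment step, assuming points and centroids are (x, y) tuples. The function names `closest_centroid` and `manhattan`, and the sample coordinates, are made up for this illustration. Euclidean distance comes from Python's `math.dist`, and Manhattan distance is included only as one example of a different measure.

```python
import math

def manhattan(p, q):
    """Sum of absolute coordinate differences (an alternative metric)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def closest_centroid(point, centroids, distance=math.dist):
    """Return the index of the centroid nearest to the point."""
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))

centroids = [(3.0, 3.0), (5.0, 0.0)]
point = (0.0, 0.0)

print(closest_centroid(point, centroids))             # Euclidean: index 0
print(closest_centroid(point, centroids, manhattan))  # Manhattan: index 1
```

Notice that the same point lands in a different group depending on the distance measure, so the choice of measure can change the clusters you get.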
Next, once each data point has been assigned
to a group, recalculate the position of the
k centroids.
The new centroid position is determined by
the mean of all points in the group.
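The update step can be sketched like this, assuming `labels` holds the group index of each point. The name `update_centroids` is made up for this illustration. Leaving a centroid where it is when its group is empty is just one possible choice, assumed here so the function never divides by zero.

```python
def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for i, old in enumerate(centroids):
        group = [p for p, label in zip(points, labels) if label == i]
        if not group:
            # Assumption: a centroid with no points keeps its old position.
            new_centroids.append(old)
            continue
        # Average each coordinate separately.
        new_centroids.append(tuple(sum(axis) / len(group) for axis in zip(*group)))
    return new_centroids

points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
labels = [0, 0, 1]  # group index of each point
print(update_centroids(points, labels, [(5.0, 5.0), (6.0, 6.0)]))
# [(1.0, 0.0), (10.0, 10.0)]
```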
Finally, as noted on the previous slide, the
assignment and update steps repeat until the
centroids no longer move.
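Putting the steps together, the whole loop might look roughly like the sketch below. It assumes points are coordinate tuples, starts from k randomly chosen data points, and uses Euclidean distance. The name `k_means` and the `seed` and `max_iter` parameters are made up for this illustration, and `max_iter` is only a safety cap that the video does not mention.

```python
import math
import random

def k_means(points, k, seed=None, max_iter=100):
    """Plain k-means on a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random starting centroids
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = [
            min(range(k), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: mean of each group (an empty group keeps its centroid).
        new_centroids = []
        for i in range(k):
            group = [p for p, label in zip(points, labels) if label == i]
            if group:
                new_centroids.append(
                    tuple(sum(axis) / len(group) for axis in zip(*group))
                )
            else:
                new_centroids.append(centroids[i])
        if new_centroids == centroids:  # nothing moved, so we are done
            break
        centroids = new_centroids
    return centroids, labels

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = k_means(points, k=2, seed=1)
print(centroids)
print(labels)
```

Which cluster gets index 0 and which gets index 1 depends on the random start, so only the grouping itself is meaningful.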
For simplicity, this example will be looking
at K=2. In other words, the algorithm will
be creating 2 clusters.
One cluster will be blue, while the other
will be green.
So these are the unlabelled data points in
2D space.
We want to create two clusters, so we need
to start off with two centroids.
As you see, they've been randomly placed here.
A good way of determining which points belong
to which centroid is by first drawing a line
between the two centroids.
Then, we need a perpendicular line that crosses
this line at its midpoint.
As you can see, everything to the left of
the red dotted line is considered blue.
And everything to the right of the line is
considered green.
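The dividing line is really just a picture of the nearest-centroid rule: a point falls on a centroid's side of the line exactly when it is closer to that centroid. Here is a small sketch that checks this for a few points. The centroid positions, the sample points, and the function names are all made up for this illustration.

```python
import math

def side_of_bisector(point, c1, c2):
    """Return 0 if the point is on c1's side of the perpendicular bisector, else 1."""
    mid = tuple((a + b) / 2 for a, b in zip(c1, c2))
    direction = tuple(b - a for a, b in zip(c1, c2))  # points from c1 to c2
    # Project (point - mid) onto the c1 -> c2 direction; positive means c2's side.
    dot = sum((p - m) * d for p, m, d in zip(point, mid, direction))
    return 1 if dot > 0 else 0

def nearest(point, c1, c2):
    """Return 0 if c1 is at least as close as c2, else 1."""
    return 0 if math.dist(point, c1) <= math.dist(point, c2) else 1

blue, green = (2.0, 3.0), (8.0, 5.0)  # made-up centroid positions
for p in [(1.0, 1.0), (4.0, 6.0), (6.0, 2.0), (9.0, 9.0)]:
    print(p, side_of_bisector(p, blue, green), nearest(p, blue, green))
```

For every point, the two numbers printed agree, which is why drawing the line and measuring distances give the same grouping.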
Next, we need to recalculate the position
of the centroids, which results in the centroids
moving to these locations.
Once again, let's draw the line that connects
the two centroids and the perpendicular
line that separates them.
At this point, we only see two points that
need to change from blue to green, so let's
get rid of these lines to see where the centroids
need to move now.
Now they'll move.
And again, we need to see if anything will
change.
So the two lines move once more.
So now, we see that there are no points that
change.
Therefore, the centroids will not move anymore,
which means that the algorithm is finished.
There are two main advantages in using K-means.
First, it's easy to understand.
Second, it's very fast compared to many other
clustering algorithms.
K-means also has some disadvantages.
First, there is no prescribed way to initialize
the centroids, and the resulting clusters can
vary a lot depending on where the centroids
start.
Second, getting accurate results depends on
the distance metric that is used.
And finally, there is the possibility of a
centroid not having any data points in its group,
in which case it cannot be updated.
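To make the first and last disadvantages concrete, here is a small sketch that runs the same loop from hand-picked starting centroids instead of random ones. The function name `run_k_means`, the four data points, and the starting positions are all made up for this illustration, and keeping an empty group's centroid in place is an assumption of the sketch.

```python
import math

def run_k_means(points, centroids, max_iter=100):
    """Run k-means from the given starting centroids; return final centroids and labels."""
    k = len(centroids)
    for _ in range(max_iter):
        labels = [
            min(range(k), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        new_centroids = []
        for i in range(k):
            group = [p for p, label in zip(points, labels) if label == i]
            # An empty group keeps its centroid (an assumption of this sketch).
            new_centroids.append(
                tuple(sum(axis) / len(group) for axis in zip(*group))
                if group else centroids[i]
            )
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Four corners of a wide, flat rectangle.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

print(run_k_means(points, [(0.0, 0.0), (4.0, 0.0)]))      # groups left vs. right
print(run_k_means(points, [(2.0, 0.0), (2.0, 1.0)]))      # groups bottom vs. top
print(run_k_means(points, [(2.0, 0.5), (100.0, 100.0)]))  # second centroid gets no points
```

The first start finds the tight left and right pairs. The second start stops right away with the bottom and top pairs, which are much more spread out. In the third run, the far-away centroid never receives a point, so it never moves.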
Thanks for watching!
