Hello everyone, my name is Arham. 
In this video, we will look at the fundamental concept
of clustering and types of clustering methods.
Clustering is the task of grouping data points into partitions based on similarity.
If two things are similar in some ways, they often share other characteristics.
Almost everything we perceive comes in the form of clusters. When we look up at the night sky, we see clusters of stars, and we name them after the shapes they resemble.
Similarly, a cluster is a set of similar data points, or a set of points that are more similar to each other than to points in other clusters.
It is classified as an unsupervised 
learning technique. And the key difference
from other machine learning techniques
is that clustering does not have a response class.
After grouping observations, a human needs to visually inspect the clusters and optionally assign meaning to each cluster.
The ultimate prediction is the set 
of clusters themselves, and this technique
works only with data that is in numeric form.
This means that any categorical variable needs to be converted to a numeric variable by binarization. This is popularly known as one-hot encoding.
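As a quick sketch of one-hot encoding, assuming pandas is available (the "color" column here is a made-up example):

```python
import pandas as pd

# Hypothetical categorical column to be binarized for clustering
df = pd.DataFrame({"color": ["red", "green", "blue", "red"]})

# get_dummies creates one 0/1 indicator column per category
encoded = pd.get_dummies(df, columns=["color"])
```

Each row now has exactly one "hot" indicator, so distance-based methods can treat the categories numerically.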
There are many methods to form clusters by calculating similarity, and I will now introduce you to four different types of clustering methods.
The first one is centroid-based clustering. Each cluster is represented by a centroid, and points are assigned to clusters based on their distance to the centroid of each cluster.
One of the most widely used centroid-based algorithms is K-Means. K here stands for the number of clusters, and it needs to be defined by the user. This method starts by randomly placing the centroids and iterates until the centroids reach positions that minimize the sum of distances from each point to its cluster's center.
It minimizes the aggregate intra-cluster distance, and because the initial centroids are random, every run can result in different clusters.
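A minimal K-Means sketch, assuming scikit-learn is available; the two blobs of toy data are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs of points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

# K (n_clusters) is chosen by the user; n_init restarts mitigate the
# sensitivity to the random initial centroid placement
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
```

After fitting, `km.labels_` holds the cluster of each point and `km.cluster_centers_` holds the final centroids.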
The second one is connectivity-based clustering. The clusters are defined by grouping nearest neighbors, based on the distance between the data points.
The idea is that nearby data points are more related than points farther away. The key aspect is that one cluster can contain other clusters, and because of this structure, the clusters represent a hierarchy.
This method works in two ways. It either starts from the smallest clusters, and at each step the two most similar clusters are combined into a bigger cluster in a bottom-up manner, or it starts from the biggest cluster and at each step divides it into two in a top-down manner.
Clusters are represented by a dendrogram here, which explicitly shows the hierarchy of clusters.
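A small bottom-up (agglomerative) sketch, assuming scikit-learn; the four toy points are hypothetical:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two tight pairs of points, far apart from each other
X = np.array([[0.0, 0.0], [0.1, 0.1],
              [5.0, 5.0], [5.1, 5.1]])

# Agglomerative = bottom-up: each point starts as its own cluster and
# the two closest clusters are merged at every step
agg = AgglomerativeClustering(n_clusters=2, linkage="average").fit(X)
```

The merge history recorded during fitting is exactly what a dendrogram plot would visualize.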
The third one is distribution-based clustering. In this method, each cluster is modeled by a normal distribution. The idea is that data points are divided based on the probability of belonging to the same normal distribution.
It is similar to centroid-based clustering, except that distribution-based clustering uses probability to compute the clusters rather than using the mean. The user needs to define the number of clusters.
This method goes through an iterative process of optimizing the clusters, and a popular example is the expectation-maximization (EM) algorithm, which uses normal distributions for clustering the data points.
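A sketch of EM-style clustering using scikit-learn's GaussianMixture, assuming that library is available; the two Gaussian blobs are synthetic:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Samples drawn from two 1-D normal distributions centered at -3 and +3
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3, 1, (100, 1)),
               rng.normal(3, 1, (100, 1))])

# EM fits one Gaussian per cluster; each point gets a probability of
# belonging to each component rather than a hard assignment
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gm.predict_proba(X)
```

Unlike K-Means, `predict_proba` gives soft memberships, and `gm.means_` should recover the two distribution centers.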
The fourth one is density-based clustering.
Clusters here are defined by 
areas of concentrated density.
This method begins by 
searching for areas of dense data points
and assigns those areas to the same clusters.
It is based on connecting points that lie within a certain distance of each other. A cluster contains all linked data points within this distance threshold, while sparse areas are treated as noise or as borders between clusters.
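A short sketch using DBSCAN, a widely used density-based algorithm, assuming scikit-learn; the isolated point is included to show how noise is labelled:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # dense area 1
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # dense area 2
              [10.0, 0.0]])                          # isolated point

# eps is the linking distance threshold; points not reachable from any
# dense area are labelled -1, i.e. treated as noise
db = DBSCAN(eps=0.5, min_samples=2).fit(X)
```

Note that the number of clusters is discovered from the density structure rather than supplied by the user.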
I will now go through some clustering weaknesses.
In most clustering methods, we need to supply the number of clusters. We can use an approximation method called the elbow method to estimate the number of clusters.
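The elbow method can be sketched as follows, assuming scikit-learn; we fit K-Means for several values of K on synthetic data with three blobs and look at where the inertia (aggregate intra-cluster distance) stops dropping sharply:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three clear blobs, so the "elbow" should appear around K = 3
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in (0, 4, 8)])

# Inertia always decreases as K grows; the elbow is where the drop flattens
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)}
```

Plotting `inertias` against K and picking the bend of the curve gives the estimated number of clusters.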
Lastly, remember that most clustering algorithms are sensitive to outliers.
When you search for something on Google
or go on to Amazon to buy something, you 
are presented with links or products that are relevant
to your search by means of clustering.
All of the methods we looked at today boil 
down to the basic idea that we want to find groups of
similar objects. If you have any other topics 
you'd like us to cover leave a comment down below.
Give us a like if you found this 
useful, and if you want to see more
Check out other videos at tutorials.datasciencedojo.com. Thanks for watching!
