Hello world, it's Siraj, and today we're going to detect who the intruder is in our security system. Say we've got a website and we're monitoring all of this network data, and there's one guy, one bad dude, one bad hombre, who's trying to break into our system. We want to find out who this guy is, and the way we're going to do it is by implementing a machine learning model called K-means clustering. I've never implemented this before, so I'm really excited, and it falls right in line with the rest of the stuff I've been talking about in this series. So let's get started.
First I've got this little image to show you what the algorithm looks like. It's a five-step process, and this gif kind of shows what that looks like. We start off with data that has no labels; it's just a cluster of unlabeled data. We don't really know anything about the data other than its features; we don't know what classes the data belongs to, and that's what we're trying to learn. What the algorithm will do is iteratively take this unlabeled data and create clusters from it. We don't know where these clusters are going to be; the algorithm is going to learn where those clusters should be. Okay, I'll talk about what these steps are,
but first let me define this term, K-means clustering. This is one of the most popular techniques in machine learning; you see it all the time in Kaggle contests, on the machine learning subreddit, everywhere. It's a very popular algorithm, and it's relatively easy, more or less, I mean easier than the other things I've been talking about, so that's a good thing.
But let's talk about what we've learned so far. What we've learned is that machine learning is all about optimizing for an objective, right? We are trying to optimize for an objective; that's the goal of machine learning. And we've learned about first-order and second-order optimization. First-order is gradient descent and its variants: if we were to graph the error of a function versus its weight values, we want to find the minimum of that function so that we can find the ideal weight values, so that our error is minimized. To do that with gradient descent, we compute the first derivatives, right, the partial derivatives of our error with respect to our weights. For second-order optimization we do the same thing, except we compute the second derivative, that is, the derivative of the derivative. There are pros and cons to both, and we talked about when you would use one over the other in the previous videos.
But here's a big question: what happens if you don't have the labels? How are you supposed to compute the error, right? The error is usually the actual label minus the predicted label, and we use it to compute the partial derivatives with respect to each weight value. But if you don't have the labels, how are you supposed to compute the error? That's where unsupervised learning comes into play, specifically K-means clustering. So I've got this diagram here to show the differences.
There are two kinds of outcomes we could possibly want, right? Either a discrete outcome, that is, some contained outcome like red or blue, black or white, up or down, guilty or not guilty. There are these containerized outcomes, and not just binary outcomes (it could be more than 0 and 1, it could be multi-class), but outcomes that are containerized into specific labels. Whereas continuous outcomes are like time-series data, where a value could be between 2 and 3, or between 2 and 2.5, or between 2 and 2.25, and I could just keep going to infinity in that direction of that numerical interval, right?
So with supervised learning, we learned how to predict a continuous outcome using linear regression; that was our first video. Then the next thing we learned was how to predict a discrete outcome, and we used logistic regression for that. That's when we have labels. But if we don't have labels, then we use clustering to predict a discrete outcome. That's what we're going to do: we're going to predict a discrete outcome by defining these discrete classes for people, and then we're going to find the anomaly, that is, the intruder. For a continuous outcome without labels, you'll want to perform dimensionality reduction, and that's next week. So we've already covered those first two, and now we're going to talk about clustering. So let's keep going here.
So, unsupervised versus supervised learning: what are the pros and cons? Well, supervised learning is more accurate. I mean, you've got the labels, of course it's more accurate, right? It's like having training wheels on your bike. But you have to have a human who labeled this data, or it has to have been labeled somehow. Unsupervised learning is more convenient, because most data is unlabeled, right? You don't just have this neatly labeled data like "oh, this is this or that." No, data is messy, the world is messy, life is messy. So ideally that's what we want to run our algorithms on, unsupervised, but the problem is that these algorithms are usually less accurate. They require minimal human effort, though, because no one has to label all these data points by hand.
Okay, so let's talk about how this algorithm works mathematically, and then we'll get right into the code. What I'm going to do is gloss over most of the code and write out a few of the most important bits. So how does this work? We've got a set of points, right, a set of points X. And what I'm going to do is take this gif and make two copies of it, so we can see it while I talk about the algorithm. So let me do this, hold on, just like that, and put that here and then put this here. Okay.
So the way this algorithm works is we've got a set of points X, and then we define a value K. K is the number of clusters that we want, right, the K-means algorithm, that's where the K comes from. So we've got our set of points X, all of our data points, and we've got K; those are our inputs. We're going to say, let's just say two for now, and we'll talk later about how to figure out what the best K value is, but let's say two right now. Once we have that, we're going to place a set of centroids at random locations. How many centroids? K centroids. So if we chose two for K, then these centroids are just points that are randomly plotted on the graph. They're called centroids because eventually they're going to be the center of each cluster that we learn. So there are K of these centroid points, and we just plot them randomly, okay? We've defined K, we have our set of data points, and then our set of centroids, K of them, plotted randomly. Now here are the steps we do, and we repeat them until convergence, which we've predefined beforehand with a threshold value.
So what we do is we say: for each point in the data set, let's say we have 40 data points, so for each one of them, we're going to find the nearest centroid. How do we do that? Well, we compute the distance between that data point and each of the centroid points. There are two in our case, so we compute the Euclidean distance between that point and both of the centroids, and we find the one that's closest to it. That's where this argmin function comes in: what is the minimum value in this set of values? We're going to find the shortest distance, and that's going to determine our cluster. So we assign that data point to the cluster j, where j is the index of the centroid that is closest to that data point, okay? So then what happens is we've got a set of clusters now, and we do this for every single data point. Every single data point will belong to a cluster, and that cluster is defined by the centroid point that is closest to that data point. Okay, so that's the initial cluster assignment.
Then, for each of those clusters j, we're going to take all of the data points in that cluster, add them all up, and then divide by the number of them. What is this called? It's called the mean, right, the average. So now you're getting to see where the name comes from, K-means, right, it all makes sense. K-means is the name of the algorithm, and so we find the mean point, the mean value of all the points in that cluster, and that mean is going to become our next centroid point. Okay, so we've defined centroids, we've grouped our data points around each of these centroids, and then we find the mean of all those values, and those means become our new centroids. In our case there will be two new centroids that we then plot, and then we just keep repeating the process. We go back to "for each point x," and for those new centroids we find the distances and the closest data points again, and that gives us a new clustering; that's what you're seeing here. We just keep repeating that process until, when? Until none of the cluster assignments change, and then we're good. So that's kind of how that works. We can also terminate the program when it reaches an iteration budget, so we'll say, after x number of iterations, just stop running.
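The loop just described can be sketched in a few lines of numpy. This is a minimal sketch of the textbook algorithm, not the exact code from the video; the `init` parameter is my own addition so the starting centroids can be fixed for reproducibility.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0, init=None):
    """Minimal K-means: assign every point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    # Step 1: place k centroids (here: k random data points, or a given init)
    centroids = (X[rng.choice(len(X), size=k, replace=False)]
                 if init is None else np.asarray(init, dtype=float))
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iters):              # iteration budget as a backup stop
        # Step 2: Euclidean distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: each point joins the cluster of its nearest centroid (argmin)
        labels = dists.argmin(axis=1)
        # Step 4: each centroid moves to the mean of its cluster
        # (keep the old centroid if a cluster happens to go empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

With two well-separated blobs of points, the loop settles after a couple of iterations and each blob ends up in its own cluster.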
So then one great question that I hope you're asking is: how do we know what value for K we should use? Well, it's very simple if we know what classes we want to classify, or how many classes: we'll just say that's the value for K. So if we know we want to classify people as being from a certain region or a certain place, then we'll just say, of these three places, you know, Spain, Mexico, and, I don't know, Argentina (random countries, I guess those were on my mind), then we know K should be three, because we have three countries that we're targeting. But if we don't know how many classes we want, these are just unknown unknowns, then we'll have to decide what the best K value is, and that can be a guess-and-check method. But there's actually a smarter way to do it, and it's called the elbow method.
So the elbow method is a very popular method. What you have to do is rub your elbow... no, I'm just kidding. So what happens is, you've got this graph here, okay, and it looks like an elbow, right? It's got this elbow point right here. So what you do, and I've got this part in JavaScript to show you what the algorithm looks like, I know, JavaScript. Right, so here's what it looks like: we define a set of K values between 1 and 10, which is a good starting point; K could be 1 cluster or it could be 10 clusters. We perform K-means once for each of them, so let's try K-means for all ten of those. For as many K values as we have between 1 and 10, we perform K-means to find all those clusters, and then for each of the clusters we find the mean value. And what is the mean value? It's the error value: the distance between the centroid and all of its points. That's going to be the mean, or the error, and what we want to do is compute the sum of the squared error values. That's what this line is right here, the sum of the squared errors: we call Math.pow, which squares, for every data point in our data set, the data point minus the mean value. And what that's going to do is, if we graph that for all of those numbers of clusters, for all those K values, we'll see that the sum of the squared errors for each of those runs of K-means makes this elbow-like graph. What we want to do is pick the K value that is right at the elbow, right at that tipping point right here. So it would be 6 in the case of this graph, and that is our optimal K value, because after that there are very diminishing returns, as you can see. We want to find a minimal error value, and we found that for this K value of 6 the error is not the smallest, but it's at the point where everything after it is just diminishing returns. We could say 10 or 12 or 14, but for computational efficiency's sake we can just say 6, so we don't have to run that many iterations. So we'll just say 6, okay? So that's the elbow method.
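Here's what that elbow computation might look like in Python rather than JavaScript. This is a sketch under my own naming, reusing a bare-bones K-means inside it; the "error" it returns is the sum of squared distances from each point to its assigned centroid.

```python
import numpy as np

def sse_for_k(X, k, iters=50, seed=0):
    """Run a basic K-means for a given k and return the sum of squared
    distances from every point to its assigned centroid (the error)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return float(((X - centroids[labels]) ** 2).sum())

# Try k = 1..10 and look for the bend (the "elbow") in the error curve:
# sse = [sse_for_k(X, k) for k in range(1, 11)]
```

Plotting those ten error values against K gives the elbow-shaped curve described above; the error always drops as K grows, so you pick the K where the drop levels off.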
All right, so three more points and then we're going to get started with the code. How is the distance computed between the centroids and the data points? The Euclidean distance, right? We've talked about the Euclidean distance back in linear regression; it's a very simple distance formula. You take the two data points and compute (x1 minus x2) squared, plus (y1 minus y2) squared, and if you have v values, plus (v1 minus v2) squared, and then the square root of all of it. That's the Euclidean distance, and what it does is take two data points and give us back a scalar value, and that is the distance between those data points. So you can apply the Euclidean distance to a multi-dimensional data set, where you have any number of dimensions: you just subtract the values for each of those dimensions for each of those data points, square them, sum them, and take the square root. Anyway, I'll show you the equation when we get to it. But yeah, the Euclidean distance. And to answer any questions about why the Euclidean distance as opposed to other distance metrics: the answer is that K-means technically minimizes the within-cluster variance. I know we haven't talked about this term yet, and we will, but I'm just trying to be very detailed here. If you look at the definition of variance, which I'll define in more detail later, it is identical to the sum of the squared Euclidean distances from the center. So because the within-cluster variance is identical to that sum of squared Euclidean distances in Euclidean space, we use the Euclidean distance as our distance metric, as opposed to something else like the Manhattan distance.
Okay, so, last point, or two more actually: when should you use this? If your data is numeric, and that means if your features are numeric, right, you have numbers for your features. If you have a categorical feature, like shoe color, or true-or-false, a boolean value, you can't really map that into Euclidean space; those are categorical features. But if your data is numeric, K-means is your algorithm. It's also the simplest algorithm; you'll see it's actually a very simple algorithm, we talked through the pseudocode, it's a relatively simple algorithm. And the advantage it has over other techniques is that it is fast. That's the real key value: it's simple and it's fast, you know, a quick-and-dirty clustering algorithm, and it's great for that, okay? And it really shines when you have multivariate data, that is, more than one dimension. Okay, so lastly, two other examples I've got here: one for fraud detection, and one for MNIST without labels. "Is that possible, MNIST without labels?" Yes, it's possible; anything is possible. So, for credit fraud detection, and for finding labels for these MNIST images: usually we know the labels, but let's just say we don't. If we don't know the labels, it's going to cluster the images into what their respective clusters should be, and those are clusters for ones, twos, threes, and so on, in terms of the images. Okay, so check those other two examples out, and now let's get into the code. Okay, moving in here.
The first thing we want to do is import numpy for matrix math, matplotlib to plot out our graph, and then the animation module of matplotlib, because we're going to animate some graphs in a second, okay? So then let's take a look at our data set. Where is our data set... here it is, this is what our data set looks like, okay? We've got two dimensions, two features in our data set. The feature on the left is how many packets are sent per second, and the feature on the right is the size of a packet, okay? What we're trying to do is detect the anomaly, and that is a DDoS. If you don't know what a DDoS is, it's basically flooding a server with packet requests until the server goes down. So we look at how many packets are sent per second from each user, and what the size of those packets is, okay? Those are our data points, and we have a bunch of them. So that's it, just two features in our data, and like I said, we'll talk about dimensionality reduction later on: if we have a million features, how do we reduce them to two or three so that we can visualize them?
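A data set like the one described is just a whitespace-separated text file with two columns, and loading it is a one-liner. The numbers below are invented placeholders for illustration, not the video's actual data file:

```python
import io
import numpy as np

# Each row is one observation: [packets sent per second, packet size].
# These values are made up for illustration; the video loads a real .txt file.
raw = io.StringIO("""\
0.31 0.40
0.28 0.35
7.90 8.10
""")
dataset = np.loadtxt(raw)  # the same call accepts a file path like "data.txt"
```

`np.loadtxt` gives back a 2-D array, here with shape (3, 2): three points, two features each. The third row, with its much larger packet rate, is the kind of outlier the clustering is meant to surface.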
So that's our data set that we want to load. Okay, then here's what we've got here: this is the Euclidean distance. This is the formula I was talking about. Given two data points p and q, take each of the values, so it could be x, it could be y, it could be z, subtract them to get the difference, square it, add them all together, and then take the square root of that whole thing. That's what the sigma notation denotes: the sum of a set of values, starting at i equals 1, for n values. Find the difference between all the feature values of the two data points, square those differences, sum them, and take the square root of that sum of squared differences, okay?
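That formula translates almost one-to-one into numpy. This is my own sketch; the video's helper may differ in name and details, but the math is the same:

```python
import numpy as np

def euclidean(p, q):
    """Euclidean distance: sqrt of the sum of squared per-feature
    differences, for points of any number of dimensions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(((p - q) ** 2).sum()))
```

Because the subtraction and squaring are element-wise, the same function works unchanged for 2-D points, 3-D points, or a million features.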
So that's the Euclidean distance. Now let's look at the actual algorithm itself. What I've done here is I've defined its hyperparameters, and then I'm going to code out the algorithm itself, okay?
For K-means, we've got a K value, that is, the number of clusters, and we're just going to say two, okay, we want two clusters. We have an epsilon value, and the epsilon value, it's zero here, is a threshold: it's the minimum error to be used in the stop condition, okay, for when we want to stop training, when our error change reaches zero. And then we have our distance metric, what type of distance we want to compute: the Euclidean distance, okay? So let's keep going; let me make it a little bigger.
What we're going to do first is store the past centroids in this history-of-centroids list. Now, this is not really part of the algorithm; it's just for us, so we can graph how the centroids move over time later on, so we can visualize it, okay? Then we say the distance metric is going to be Euclidean, so we define it right here and store it in that variable, and then we set the data set, so we load up that data set from the txt file. We've got those two dimensions, right, the number of packets sent and then the size of each packet. Okay, so we've got our data set, we've defined a distance metric, and now we're going to get the number of instances and the number of features from our data set by taking its shape attribute. Our rows are the number of data points we have, and our columns are the number of features we have, so we want to store those two values.
Then we're going to define K centroids, right? Like I said, we randomly plot these centroids on the graph, and there are going to be K of them. Special K, like the cereal, okay. So we're going to say, from our data set, how many clusters do we want to find, chosen randomly? We pick random numbers between zero and the number of instances we have, that is, the number of data points minus one, and we do that K times, so size K, okay? And we store those centroid values in this prototypes variable. So we've got our centroids, and then we want to add them to our list of past centroids as well. I'm going to take those centroids that we just defined randomly and append them to our history-of-centroids list, so we can keep a copy over time. We'll keep adding the centroids we calculate to this history list so we can graph it later, you know, for our own visualization, okay?
And then we have our prototypes-old list, which is initialized as a bunch of zeros; it's an empty vector, or tensor. That's going to keep track of the centroids from the previous iteration. So we've got our prototypes, which are our current centroids, and our prototypes-old, which is the set of centroids we had before, and those two are what we use in the actual algorithm; the history of centroids is just for us to see how things change over time, okay? Then we have one more list, and that is the belongs-to list, which stores the clusters themselves, that is, which cluster each data point is assigned to. Okay, and then we have our distance method, which takes the current prototypes and the old prototypes, computes the distance between them, and stores it in the norm variable. And then the number of iterations, which starts off at zero.
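Putting that setup together, it might look roughly like this. The names follow the video's description (prototypes, history centroids, belongs-to), but the exact code here is my own sketch:

```python
import numpy as np

def init_kmeans(dataset, k=2, seed=0):
    """Set up the training loop's state: random starting centroids,
    a history list for later plotting, and bookkeeping arrays."""
    num_instances, num_features = dataset.shape      # rows = points, cols = features
    rng = np.random.default_rng(seed)
    # k random rows of the data set become the initial centroids ("prototypes")
    prototypes = dataset[rng.integers(0, num_instances, size=k)]
    history_centroids = [prototypes]                 # purely for visualization
    prototypes_old = np.zeros(prototypes.shape)      # centroids from the previous pass
    belongs_to = np.zeros(num_instances, dtype=int)  # cluster index per point
    return prototypes, prototypes_old, history_centroids, belongs_to
```

Starting `prototypes_old` at all zeros means the very first norm computation comes out nonzero, so the while loop is guaranteed to run at least once.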
Okay, so now let's go ahead and write out this algorithm, shall we? We say: while the norm, the distance, is greater than epsilon, where epsilon is zero in our case, we add one to the number of iterations, because this is our first iteration; we have begun training our model. So the iteration count goes up by one, and then we compute what the norm is: given our prototypes, where we are currently, the distance between those prototypes and the prototypes we had before, and that's our norm. Then we say, for each instance in the data set, let's go through every single instance. So we ask: what is the index of an instance, and what is the actual value of that instance? Those are the two variables we use to enumerate, or iterate, over the data set. We define a distance vector of size K, so now it's time to define what this distance vector looks like. The distance vector is initialized as a set of zero values of size K, okay? It's empty at first, and we add values to it as we go. Then we say, for each centroid value, and we have our random centroid values, right, we already computed them randomly at the beginning, so for each centroid value, let me just make this a comment: for each centroid, for each prototype stored in our prototypes, we take the index and the actual value of that prototype and do this for all the prototypes. So, for each instance in our data set, and then for each centroid: we have two nested for-loops, right? For each centroid we enumerate the prototypes.
What we have to do is compute the distance between each data point and each centroid. For every data point, we compute the distance between it and every centroid, because we want to find the minimum-distance centroid, the closest centroid to it, so we can put the point into that specific cluster. So we compute the distance between x, where x is a data point, and the centroid. We say the distance vector, indexed by the prototype's index, so for that centroid value, gets the distance computed using our distance method between the prototype, which is the centroid, and the instance value. For all those x values, all those data points, we compute the distance and store it in the distance vector, indexed by the prototype index, okay? And we do that for all of them; that's why we have the nested for-loop. That stores all of those distances in that distance vector.
Then what we want to do is find the smallest distance and assign that data point to a cluster. So we ask: what cluster does each data point belong to? We record it in the belongs-to array, indexed by the instance's index. And now we do the argmin step: we find the minimum value among all those distances we calculated, all those distance values, computed with the Euclidean distance that we defined previously in its own function. So that's where we store all the assignments.
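Those two nested loops, sketched out as a standalone function (a minimal version with assumed variable names, not the video's exact code):

```python
import numpy as np

def assign_clusters(dataset, prototypes):
    """The argmin step: for every data point, measure the distance to
    each centroid and record the index of the nearest one."""
    belongs_to = np.zeros(len(dataset), dtype=int)
    for index_instance, instance in enumerate(dataset):
        dist_vec = np.zeros(len(prototypes))      # one slot per centroid
        for index_prototype, prototype in enumerate(prototypes):
            # Euclidean distance between this point and this centroid
            dist_vec[index_prototype] = np.linalg.norm(prototype - instance)
        belongs_to[index_instance] = np.argmin(dist_vec)
    return belongs_to
```

Each entry of the returned array is the cluster index of the corresponding data point, exactly the belongs-to bookkeeping described above.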
And then we have a list of temporary prototypes. Those are temporary centroids, because we're going to update our centroids in a second and we want to save where we are right now, so we can store it in our history list. We create a little tensor right here, and it's of size K by the number of features, right, because K is the number of clusters we have. And we say, for each cluster, and there are K of them, let's get all the points assigned to that cluster, look at all those points assigned to the cluster. I'm just going to paste this part in, because it's a little faster, and we only have a little bit left, okay?
So for each cluster, and there are K of them, we get all the points assigned to that cluster, okay? That's what this line does right here: we gather all the points assigned to a cluster and store them in instances-close. Those are all the data points within a specific cluster, and we do this for each of the clusters. And for each cluster, K of them, we find the mean of those data points, and now you get the "means" part of K-means: for K clusters, compute the mean of the data points in that cluster using np.mean. That is the average, where we add them all up and then divide by the number of them, for all of those instances, and we store that mean as the prototype. That's going to be our new centroid, so we add the new centroid to our temporary prototypes list, and we do that so we can then take those temporary prototypes and assign them to our main prototypes list. The temporary prototypes variable acts as a buffer to store the new centroids as we compute them, and then we can copy it into our main prototypes variable. And then we have our history-of-centroids list, which we'll also append to, just for us, so we can visualize how the centroids move over time later on, and you'll see exactly what I mean by that at the end. At the end we return our calculated centroids, that is, once we have reached convergence, and our history of centroids, all the centroids we've calculated over time, and then belongs-to, which is where we store the cluster assignments for all the data points. Each index holds the cluster that the data point belongs to, and that's how we define all the data points in a single cluster.
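That update step, as its own sketch (again with assumed names; it mirrors the buffer-then-copy pattern described above):

```python
import numpy as np

def update_centroids(dataset, belongs_to, k):
    """The 'means' step: move each centroid to the average of the
    points currently assigned to its cluster."""
    tmp_prototypes = np.zeros((k, dataset.shape[1]))  # buffer for new centroids
    for index in range(k):
        # indices of all points assigned to cluster `index`
        instances_close = [i for i in range(len(dataset)) if belongs_to[i] == index]
        # their mean becomes the new centroid (assumes no cluster is empty)
        tmp_prototypes[index] = np.mean(dataset[instances_close], axis=0)
    return tmp_prototypes
```

One full iteration of K-means is then just an assignment pass followed by this update pass, repeated until the centroids stop moving.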
Okay, so that's it for the algorithm, and then we're going to graph what this looks like. In our plotting function, we have two colors, red and green, for our clusters; we define them here. We split our graph up into its axis and its actual plot, and then for each point in our data set, and we've got several points in our data set, we want to graph all these points by their cluster color, which we're going to define: all the data points in one cluster are going to be one color, and all the data points in the other cluster are going to be another color, okay? We get all those points by looking in the belongs-to list by index. So, for each point in our data set, we get the instances, that is, all the data points that belong to a specific cluster, assign each data point in that cluster its color, and plot it in that color, okay? That's what we do here. Then we log the history of centroids; remember how I said we want to see the history of the centroids that were calculated, so we can see how it changes? That's what we're doing here: in the history points list we say, for each centroid ever calculated, print them all out and then plot them in their own graph, okay?
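A plotting function along those lines might look like this. It's a sketch, not the video's exact code; the color scheme follows the description, red and green clusters with blue centroids:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")            # draw off-screen so this runs headless too
import matplotlib.pyplot as plt

def plot_clusters(dataset, centroids, belongs_to, colors=("r", "g")):
    """Color each data point by its cluster, then draw the centroids in blue."""
    fig, ax = plt.subplots()
    for index in range(len(centroids)):
        # every point assigned to cluster `index`, in that cluster's color
        for i in (j for j in range(len(dataset)) if belongs_to[j] == index):
            ax.plot(dataset[i][0], dataset[i][1], colors[index] + "o")
    for centroid in centroids:   # centroids drawn on top, in blue
        ax.plot(centroid[0], centroid[1], "bo", markersize=10)
    return fig, ax
```

Calling it with the centroids, belongs-to array, and data set from the algorithm gives the two-color scatter plot described next.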
So that's what's happening in that method, and now we can actually execute it. We've written the method for our K-means algorithm, we've written a method to plot out our results, and now we can execute these methods. We load our data set and train the model on that data, where K equals two, so we want two clusters. It gives us back our computed centroids, the history of all the centroids that were computed over time, and our list of data points defined by the cluster each belongs to. We can take those three values and use them as parameters for our plotting function, and that plots our result. So when we run it, we get these two clusters, right? If we hadn't run our K-means algorithm and had just plotted our data, it would look just like this, except there would be no color, because we wouldn't know what the clusters are. But what happened was the algorithm learned that the right classes were red and green for these respective clusters, and the blue points are the center points, those are the centroids, okay? And over time, here's what I mean by over time: what happens is it learns the ideal center points for the centroids. They start out placed randomly, and then it gets better and better iteratively; it finds better and better center points until it finds the most optimal center point for each cluster we're learning, and then it plots it, okay? So that's it. Please subscribe for more programming videos, and for now I've got to go find my centroids, so thanks for watching.
