K-means Clustering & its real use-cases

M G GOVARDHAN GOWDA
5 min read · Sep 8, 2021
K-means Clustering & its real use-cases in the Security domain

Machine Learning tasks can be performed in two ways:
1. Supervised Learning (Labeled Data)
2. Unsupervised Learning (Unlabeled Data)

Supervised Learning

A task is supervised if you are using labeled data. We use the term labeled to mean that the data already contains the solutions, called labels.

Unsupervised Learning

A task is considered to be unsupervised if you are using unlabeled data. This means you don’t need to provide the model with any kind of label or solution while the model is being trained.

Major steps in Machine Learning Process
Step1: Define the Problem
Step2: Build the Dataset
Step3: Train the model
Step4: Evaluate the model
Step5: Use the model

Now it is time to understand what K-means clustering is

K-means clustering is an unsupervised learning algorithm. Unlike supervised learning, where we have labeled data, here the dataset comes without labels.

Here we have a set of data points that we want to group; as the name suggests, we want to put them into clusters. By this, I mean putting together objects that are similar in nature or have similar characteristics. That is what K-means clustering is all about.

The term “K” is a number: we are telling the algorithm how many clusters we want.

Let us consider an example: we will identify the bowlers and batsmen with K-means clustering. In layman's terms, we have a list of players and we want to divide that list into two groups, bowlers and batsmen.

We expect that batsmen will have a higher number of runs and bowlers will have more wickets.

On the Y-axis we have runs scored and on the X-axis we have wickets taken. Here the value of “K” is 2, which means we have 2 clusters: batsmen and bowlers.

Now we add the concept of the centroid. Every cluster has its own centroid, the point at the center of that cluster. On the basis of the centroid positions, we will group the data in the list.

The next step is to calculate the distance of each data point from each of the randomly assigned centroids. For every point, the distance is measured from both centroids. Whichever distance is smaller, the data point is assigned to that nearest centroid. The distance used between a data point and a centroid is the Euclidean distance.
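The assignment step can be sketched in a few lines of plain Python. This is only an illustration: the player names, their numbers, and the two starting centroids are invented, and the centroids are picked by hand here rather than at random so that the result is reproducible.

```python
import math

# Hypothetical players as (wickets, runs); the numbers are invented for illustration.
players = {
    "A": (2, 900),
    "B": (5, 750),
    "C": (60, 120),
    "D": (75, 80),
}

# Two starting centroids (K = 2), chosen by hand instead of at random.
centroids = [(0, 1000), (80, 0)]

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` by Euclidean distance."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

# Assign every player to the cluster of its nearest centroid.
assignment = {name: nearest_centroid(p, centroids) for name, p in players.items()}
print(assignment)  # {'A': 0, 'B': 0, 'C': 1, 'D': 1}
```

Players A and B land in cluster 0 (high runs) and C and D in cluster 1 (high wickets). On real data, runs and wickets sit on very different scales, so the features would usually be scaled before distances are computed.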

Each centroid is then repositioned to the mean of the points assigned to it. If the clusters are not stable, the assignment and repositioning are repeated until no point changes cluster.
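The full loop of assigning points and repositioning centroids until the clusters stop changing might look like the sketch below. It is a minimal plain-Python version for illustration, not the scikit-learn implementation; the function name `kmeans`, the toy points, and the hand-picked starting centroids are all assumptions.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain k-means: alternate assignment and centroid update until stable."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the clusters are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # leave a centroid in place if its cluster is empty
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Invented (wickets, runs) points and hand-picked starting centroids.
points = [(2, 900), (5, 750), (60, 120), (75, 80)]
labels, centers = kmeans(points, centroids=[(0, 1000), (80, 0)])
print(labels)   # [0, 0, 1, 1]
print(centers)  # [(3.5, 825.0), (67.5, 100.0)]
```

Here the clusters are already stable after one update, so the loop stops on the second pass.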

Code for K-means Clustering:

Importing the various libraries that are required:

Now we create blobs: make_blobs, which is readily available in scikit-learn, generates a dataset made up of clusters.
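A minimal sketch of this step is shown here; it is not necessarily the exact code used in the article. Only `centers=3` comes from the text, while `n_samples`, `cluster_std`, and `random_state` are assumed values chosen for illustration.

```python
from sklearn.datasets import make_blobs

# centers=3 follows the article; the other arguments are illustrative assumptions.
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

print(X.shape)                       # (300, 2): 300 points with 2 features each
print(sorted(set(y_true.tolist())))  # [0, 1, 2]: the blob each point was drawn from
```

`y_true` is returned by `make_blobs` but is not needed for clustering, since K-means works on the unlabeled points in `X`.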

To learn more about the scikit-learn library, see scikit-learn.org.

We have set centers = 3, which means we want 3 clusters in the test data; as you can see, there are 3 distinct clusters.

After generating random data with make_blobs, we create our clustering model and call its fit method, as with any other machine learning model, to train it.

y_kmeans holds the cluster assigned to each data point, that is, the index of its nearest centroid.
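The training and prediction steps can be sketched as follows, assuming scikit-learn's `KMeans` estimator is the model in use. The `make_blobs` arguments other than `centers=3`, and the `n_init` and `random_state` values, are assumptions made so the example is reproducible.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate unlabeled test data with 3 clusters (other arguments are assumed).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

# Ask for K = 3 clusters and train the model on the data.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)

# Cluster index (0, 1 or 2) assigned to each data point.
y_kmeans = kmeans.predict(X)

print(y_kmeans[:10])            # cluster labels of the first ten points
print(kmeans.cluster_centers_)  # coordinates of the three learned centroids
```

Plotting `X` colored by `y_kmeans`, with `kmeans.cluster_centers_` marked on top, is a common way to inspect the result.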

Final Output:

K-means use-cases in the security domain

The purpose of a data archive is to preserve data so that it remains available for future studies and generations; it does not perish unless someone deletes it. Data can be lost in quite a few ways. It might be deleted accidentally or maliciously. While plenty of software can watch for specific known threats on an operating system, the approach described here aims to detect unique anomalous behavior, such as random file removal patterns.

Our approach to detecting this kind of problem is machine learning. We can create a machine learning model and train it so that it learns the normal deletion behavior; anything that falls outside this behavior is then treated as an outlier.

This kind of data inspection, flagging anything that is outside the norm, is known as anomaly detection.

We trained a model on file deletion patterns and implemented a k-means clustering solution to detect anomalous file deletions. This approach can also be used to detect other anomalies.
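One common way to turn k-means into an anomaly detector is to flag points that lie far from every centroid. The sketch below illustrates that idea only and is not the implementation referred to above: the two features (files deleted and directories touched per time window), the synthetic training data, and the 99th-percentile threshold are all hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical features per time window: [files deleted, directories touched].
normal = np.vstack([
    rng.normal(loc=[5, 2], scale=[1.0, 0.5], size=(100, 2)),   # routine cleanup
    rng.normal(loc=[40, 3], scale=[4.0, 0.5], size=(100, 2)),  # scheduled purge jobs
])

# Learn what "normal" deletion behavior looks like.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(normal)

# Distance from each training point to its nearest centroid.
train_dist = kmeans.transform(normal).min(axis=1)
threshold = np.percentile(train_dist, 99)  # assumed cut-off, to be tuned on real data

def is_anomalous(samples):
    """Flag samples that lie farther from every centroid than the threshold."""
    return kmeans.transform(samples).min(axis=1) > threshold

# One ordinary window and one mass-deletion window.
new_events = np.array([[6.0, 2.0], [200.0, 50.0]])
print(is_anomalous(new_events))
```

The first event sits close to the routine-cleanup cluster and is treated as normal, while the mass deletion is far from both centroids and is flagged as an outlier.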

Unsupervised learning is often used in the field of anomaly detection, e.g. detecting security breaches, where labeled data is unavailable.

This technique identifies groups or clusters of similar data and can be used to identify anomalous events (outliers).

Find code here: https://github.com/govardhanmg/K-Means-Clustering-main

M G GOVARDHAN GOWDA

MLOPS internship trainee @LinuxWorld informatics Pvt. LTD. || Student @Dayananda Sagar University