Today’s topic of the day: What is k-Means??
Imagine you have 100 boxes, but you don’t know what is inside them - the boxes are not labelled! You still want to arrange them, though. So, you try to group similar items together by different characteristics, into groups (aka clusters).
What you just read is unsupervised learning. The data (in this case the boxes) is not labelled. Even if you say, “I think this box should be in cluster A”, there is no correct answer for you to know for sure! Hence, unsupervised.
Trying to group them together into clusters is called clustering (I think you knew that).
k-Means is a method to perform clustering - i.e. a way to split the boxes into groups.
How does k-Means actually work?
a) This is your unlabelled dataset (all green). Decide on the number of clusters to split the data into: k. In this case, we chose k=2.
b) Randomly place the k number of points in the dataset; in this case, two (one red cross, one blue cross). These are your ‘centroids’ - assign every green point to the centroid nearest to them (assignment).
c) This is the state after assigning each green point to their nearest centroid. Now, calculate the average of all the blue points, and move the blue centroid to this average point (updating). Do the same for the red centroid.
d) This is after shifting the blue and red centroids. Now, reassign each point to the color of their nearest centroid (assignment).
e) This is after re-assigning each point to their nearest centroid. Now, update the position of the centroids again.
The process of assigning and updating carries on until the assignment of the points do not change anymore. This is when the total sum of squared distance from each data point to tthe centroid is minimised.
Below is an interesting animation of k-Means in action:
The animations here are good too, and you see it happening step by step!
In fact, from there you might realise some of the limitations of k-Means:
Necessary to pick the number of clusters (k). How do we know what number to pick? We may sometimes imagine our own clusters.
The starting points of the centroids may affect the eventual results - different clusters are defined!
k-Means is sensitive to outliers - it may greatly skew the centroid. The centroid can be likened to the common ‘Mean’ (average) that we know, and the medoid to the ‘Median’.
Though there are limitations, people still use k-Means because it is simple and computationally efficient, and it is intuitive to understand. In that case, how do we best use k-Means?
1) Always visualise your data if possible. This may improve your choice of k before you start, and check that the classifications actually make sense after you are done!
2) Use metrics to check:
- Inertia: sum of squared errors for each cluster (low inertia = dense cluster)
- Silhouette Coefficient: measure of how close the data points are to their own cluster, than to their nearest neighbour (you want to be close to your own cluster, and far from your neighbours!). Ranges from -1 to 1: the closer the number is to 1, the better it is (the closer it is to its own cluster, and further from other clusters).
3) Check out alternative methods; there are many other methods out there to perform clustering - as a brief overview see the below, or check out this slideshow:
Partitional - k-Means, k-Medoids: divide into k-partitions, repartition to get better clustering. k-Medoids is less sensitive to outliers as compared to k-Means.
Hierarchical - Agglomerative, Divisive: bottom up - each point starts as its own cluster, and merges into pairs and bigger clusters as it moves up; top down - all point start in one cluster, and divides into smaller clusters progressively. Good in that it does not require the number of clusters to be defined at the start - decide only after the dendogram is produced.
Density-Based - DBSCAN): start randomly from one point and continue to grow a cluster as long as the density of the cluster exceeds a threshold. Here is a cool, interactive visualisation of DBSCAN in action. DBSCAN is good for non-linearly separable clusters.
Grid-Based - STING: partition the entire space into a number of cells with a grid structure. Most suitable for large datasets - fast processing time.