Today’s topic: what is kMeans?
Imagine you have 100 boxes, but you don’t know what is inside them - the boxes are not labelled! You still want to arrange them, though. So, you try to group similar items together, based on their characteristics, into groups (aka clusters).
What you just read is unsupervised learning. The data (in this case the boxes) is not labelled. Even if you say, “I think this box should be in cluster A”, there is no correct answer for you to know for sure! Hence, unsupervised.
Trying to group them together into clusters is called clustering (I think you knew that).
kMeans is a method to perform clustering - i.e. a way to split the boxes into groups.
How does kMeans actually work?
Credits: http://enddl22.net/wordpress/category/software/opencvsoftware
a) This is your unlabelled dataset (all green). Decide on the number of clusters to split the data into: k. In this case, we chose k=2.
b) Randomly place k points in the dataset; in this case, two (one red cross, one blue cross). These are your ‘centroids’ - assign every green point to the centroid nearest to it (assignment).
c) This is the state after assigning each green point to its nearest centroid. Now, calculate the average position of all the blue points, and move the blue centroid to this average (updating). Do the same for the red centroid.
d) This is after shifting the blue and red centroids. Now, reassign each point to the colour of its nearest centroid (assignment).
e) This is after reassigning each point to its nearest centroid. Now, update the positions of the centroids again.
The process of assigning and updating carries on until the assignment of the points does not change anymore. This is when the total sum of squared distances from each data point to its centroid is (locally) minimised.
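The assign-update loop above is short enough to sketch in plain Python. This is a minimal illustration (the 2D points are made up, and real libraries like scikit-learn add smarter initialisation and vectorisation), but it follows steps b) to e) exactly:

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two 2D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, k, iters=100, seed=0):
    """Minimal kMeans: alternate assignment and update until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # b) random initial centroids
    assignment = None
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: dist2(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:   # assignments unchanged -> converged
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, assignment

# Two obvious groups: three points near the origin, three near (9, 9).
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, labels = kmeans(points, k=2)
```

Run on the toy data above, the first three points end up in one cluster and the last three in the other, whichever points were picked as the initial centroids.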
Below is an interesting animation of kMeans in action:
Credits: http://shabal.in/visuals/kmeans/2.html
The animations here are good too, and you see it happening step by step!
In fact, from there you might realise some of the limitations of kMeans:

You need to pick the number of clusters (k) upfront. How do we know what number to pick? We may sometimes imagine clusters that are not really there.

The starting positions of the centroids may affect the eventual results - different runs can end up defining different clusters!

kMeans is sensitive to outliers - a single outlier may greatly skew a centroid. The centroid can be likened to the common ‘Mean’ (average) that we know, and the medoid to the ‘Median’.
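That mean-vs-median intuition is easy to see with toy numbers (made up purely for illustration): one outlier drags the mean a long way, while the median barely moves.

```python
values = [1, 2, 3, 4, 100]                  # 100 is an outlier

mean = sum(values) / len(values)            # the 'centroid' analogue
median = sorted(values)[len(values) // 2]   # the 'medoid' analogue

print(mean)    # dragged far towards the outlier
print(median)  # barely affected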
Credits: https://www.slideshare.net/anilyadav5055/15857cse422unsupervisedlearning
Best practices
Though there are limitations, people still use kMeans because it is simple, computationally efficient, and intuitive to understand. So, how do we best use kMeans?
1) Always visualise your data if possible. This may improve your choice of k before you start, and lets you check that the clusters actually make sense after you are done!
2) Use metrics to check:
- Inertia: the sum of squared distances from each point to its cluster’s centroid (low inertia = dense cluster)
- Silhouette Coefficient: a measure of how close each data point is to its own cluster, compared to the nearest neighbouring cluster (you want to be close to your own cluster, and far from your neighbours!). Ranges from -1 to 1: the closer the number is to 1, the better (the point is close to its own cluster, and far from other clusters).
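Inertia is simple enough to compute by hand. Here is a pure-Python sketch with hypothetical points, centroids, and labels (libraries such as scikit-learn expose the same quantity, e.g. as `inertia_` on a fitted `KMeans`):

```python
def inertia(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    total = 0.0
    for (x, y), c in zip(points, labels):
        cx, cy = centroids[c]
        total += (x - cx) ** 2 + (y - cy) ** 2
    return total

# Hypothetical clustering result: two points around (0, 1), one at (10, 10).
points = [(0, 0), (0, 2), (10, 10)]
centroids = [(0, 1), (10, 10)]
labels = [0, 0, 1]

print(inertia(points, centroids, labels))  # 1 + 1 + 0 = 2.0
```

Lower is denser - but note that inertia always shrinks as k grows, which is why it is usually plotted against k (the “elbow” plot) rather than read in isolation.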
3) Check out alternative methods; there are many other ways to perform clustering - for a brief overview see below, or check out this slideshow:

Partitional - kMeans, kMedoids: divide the data into k partitions, then repartition to get a better clustering. kMedoids is less sensitive to outliers than kMeans.

Hierarchical - Agglomerative, Divisive: bottom-up - each point starts as its own cluster, and merges into pairs and bigger clusters as it moves up; top-down - all points start in one cluster, which divides into smaller clusters progressively. Good in that it does not require the number of clusters to be defined at the start - you decide only after the dendrogram is produced.
Credits: http://www.saedsayad.com/clustering_hierarchical.htm 
Density-based - DBSCAN: start randomly from one point and continue to grow a cluster as long as the density of the cluster exceeds a threshold. Here is a cool, interactive visualisation of DBSCAN in action. DBSCAN is good for non-linearly separable clusters.

Grid-based - STING: partition the entire space into a number of cells with a grid structure. Most suitable for large datasets - fast processing time.
Others:
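The density-based idea above (DBSCAN) is also compact enough to sketch. This is a minimal pure-Python version - the `eps` (neighbourhood radius) and `min_pts` (density threshold) values and the points are made up, and real implementations use spatial indexes for speed:

```python
def region(points, i, eps):
    """Indices of all points within eps of point i (including itself)."""
    px, py = points[i]
    return [j for j, (x, y) in enumerate(points)
            if (x - px) ** 2 + (y - py) ** 2 <= eps ** 2]

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: grow clusters from dense 'core' points; -1 = noise."""
    UNVISITED, NOISE = -2, -1
    labels = [UNVISITED] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] != UNVISITED:
            continue
        neighbours = region(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE              # not dense enough - noise (for now)
            continue
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # border point adopted by the cluster
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            more = region(points, j, eps)
            if len(more) >= min_pts:       # j is also a core point: keep growing
                queue.extend(more)
        cluster += 1
    return labels

# Two dense groups plus one lone point, which DBSCAN flags as noise (-1).
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

On this toy data the first three points form cluster 0, the next three form cluster 1, and the far-away point is labelled noise - note that, unlike kMeans, no k was supplied.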