kmeans clustering
 Nonparametric and unsupervised
 $k$: the number of clusters
 Centroids
 No bias
 Algorithm steps
 A few important questions about kmeans
Nonparametric and unsupervised
kmeans clustering is a nonparametric method for grouping data into clusters. It is an unsupervised algorithm because the data points are unlabeled.
$k$: the number of clusters
The variable $k$ is the number of groups, or clusters, the algorithm will create. We choose $k$ ourselves, and it should be smaller than the number of data points $n$.
Centroids
Every cluster has associated data points. The mean data point of a cluster is called its centroid.
Centroids are recalculated at every iteration; only at the start are they initialized as randomly chosen data points.
You can think of a centroid as representing a specific combination of features.
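For instance (a toy illustration, not from the original notes), the centroid of a 2-D cluster is just the coordinate-wise mean of its points:

```python
# Toy example: the centroid of a cluster is the per-coordinate mean of its points.
points = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]  # three 2-D data points in one cluster

dim = len(points[0])
centroid = tuple(sum(p[d] for p in points) / len(points) for d in range(dim))
print(centroid)  # -> (3.0, 4.0)
```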
No bias
The kmeans algorithm uses no labels or predefined class structure: clusters emerge automatically from the data. (It does, however, implicitly assume clusters are roughly spherical and of similar size.)
Algorithm steps
There are three steps:

Data assignment: assign each data point to its nearest centroid; at the start, the $k$ centroids are initialized by randomly selecting $k$ data points.

Centroid update: recalculate each centroid as the mean of all data points assigned to its cluster.

Repeat the assignment and update steps until a stopping criterion is met, usually when the centroids no longer move.
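The steps above can be sketched in plain Python. This is an illustrative toy implementation, assuming 2-D points, squared Euclidean distance, and a fixed random seed; the function and variable names are made up:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal kmeans sketch: random-point init, assignment, centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: pick k data points as initial centroids
    for _ in range(iters):
        # step 2: assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        # step 3: recompute each centroid as the mean of its assigned points
        new_centroids = []
        for j, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(centroids[j])  # keep old centroid for an empty cluster
        if new_centroids == centroids:  # stop: centroids no longer move
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated groups of 2-D points
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))  # one centroid per group
```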
A few important questions about kmeans
How to choose $k$?
The best number of clusters $k$ is the one that yields the greatest separation (distance) between clusters, and it has to be determined from the data.
For different values of $k$ we compute the corresponding validation error and pick the $k$ with the smallest one.
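A common way to do this is often called the elbow method: run kmeans for several values of $k$ and watch how the total within-cluster error drops. Below is a rough 1-D sketch with made-up data and names, assuming squared Euclidean error:

```python
import random

def kmeans_error(points, k, iters=50, seed=0):
    """Run a tiny 1-D kmeans and return the total within-cluster squared error."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:
            clusters[min(range(k), key=lambda j: (x - centroids[j]) ** 2)].append(x)
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    # total error J: sum of squared distances to the nearest centroid
    return sum(min((x - c) ** 2 for c in centroids) for x in points)

# Made-up 1-D data with three natural groups
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 9.0, 9.1, 9.2]
for k in (1, 2, 3, 4):
    print(k, round(kmeans_error(points, k), 3))
```

The error keeps shrinking as $k$ grows, but the improvement flattens out once $k$ matches the natural number of groups; that bend in the curve is the value to pick.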
How to calculate the kmeans clustering error?
First we calculate the centroids:
\[c_1 = \frac{x_1 + x_2 + \cdots + x_n}{n}\]where $x_1, x_2, \ldots, x_n$ are the data points of the first cluster.
Then we calculate the errors per cluster based on centroids.
\[e_1 = \sum_{i=1}^{n} \| x_i - c_1 \|^2\]The total error is the sum of the errors from all $k$ clusters:
\[J = \sum_{j=1}^{k} \sum_{i=1}^{n} \| x_i^{(j)} - c_j \|^2\]
Hard vs. soft clustering difference
kmeans is what is known as hard clustering: each data point belongs to exactly one cluster.
Soft clustering is a probabilistic approach: each data point is assigned a probability of belonging to each cluster. Gaussian Mixture Models (GMMs) are an example of such clustering.
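To make the contrast concrete, here is a sketch of soft assignment in the style of a GMM, with two hypothetical 1-D Gaussian components whose parameters are made up for illustration (equal mixing weights assumed):

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Two hypothetical components with made-up parameters: (mean, variance)
components = [(0.0, 1.0), (5.0, 1.0)]

def responsibilities(x):
    """Soft assignment: probability that x belongs to each component,
    assuming equal mixing weights."""
    likelihoods = [gaussian_pdf(x, m, v) for m, v in components]
    total = sum(likelihoods)
    return [p / total for p in likelihoods]

print(responsibilities(0.0))  # almost entirely the first cluster
print(responsibilities(2.5))  # halfway between the means: [0.5, 0.5]
```

Unlike kmeans, which would force the point 2.5 into one cluster, the soft assignment keeps both memberships alive with equal probability.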
When will kmeans fail?
kmeans clustering will produce poor results when:
 the data contains outliers
 the data is high dimensional (distances between points become less informative)
…
tags: data analysis  clustering & category: machinelearning