k-means clustering

Non-parametric and unsupervised

k-means clustering is a non-parametric method for clustering data into groups. k-means clustering is an unsupervised algorithm because data points are unlabeled.

$k$ - the number of clusters

The variable $k$ represents the number of groups or clusters the algorithm will create. This value $k$ is something we define arbitrarily and it should be smaller than the number of data points $n$.

Centroids

Every cluster will have associated data points. A mean data point of a cluster is called a centroid.

A centroid is calculated every time except at the start when centroids are set as random data points.

You can think centroid represents a specific collection of features.

No bias

The k-means algorithm has no bias. The algorithm creates clusters automatically with no predefined assumptions.

Algorithm steps

There are three steps:

• data assignment: split anonymous data set into k clusters by randomly selecting k data points for the centroids.

• centroid update: recalculate centroids as the mean of all data points of a cluster.

• repeat assignment and update steps until some criteria is met, usually when centroids don’t move any more.

How to choose $k$?

The best number of clusters $k$ will have the greatest separation (distance) and we need to compute it from the data.

For different values of the number $k$ we determine the corresponding validation error. We pick the $k$ with the smallest validation error.

How to calculate the k-means clustering error?

First we calculate the centroids:

$c_1 = \frac{x_1+x_2+ \cdots + x_n}{n}$

where $x_1, x_2, \cdots ,x_n$ are data points.

Then we calculate the errors per cluster based on centroids.

$e_1 = \sum_{i=1}^n \mid \mid x_i - c_i \mid \mid$

The total error is the sum of errors from all the centroids:

$J = \sum_{j=1}^k \sum_{i=1}^n \mid \mid x_i^{(j)} - c_j \mid \mid ^2$

Hard vs. soft clustering difference

k-means is what is known as hard clustering. Each data point is part of a single cluster.

Soft clustering is a probability approach. Probability is assigned for the data point to belong to a certain cluster. GMM (Gaussian Mixture Models) is an example of such clustering.

When k-means will fail?

k-means clustering will provide bad results when:

• the data contains outliers
• data is low dimensional

tags: data analysis - clustering & category: machine-learning