K-Means Cluster Analysis

(Statistical Analysis > Cluster Analysis K-Means)

Description

K-Means Cluster Analysis partitions a numeric matrix of data points into k groups so that the sum of squared distances from each data point to its assigned cluster centre is minimised. At a minimum, every cluster centre is the mean of its Voronoi set (the set of data points that are nearer to that centre than to any other).
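As a sketch of the quantity being minimised, the objective can be written as follows. The symbols are assumptions introduced here for illustration: C_j is the set of data points allocated to cluster j, x_i is a data point, and mu_j is the centre of cluster j.

```latex
W(C_1, \dots, C_k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

The second expression states the condition from the paragraph above: at a minimum, each centre equals the mean of the points allocated to it.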

Inputs

Name – Cluster Analysis K-Means
Cluster Analysis K-Means Dataset Input – select a dataset that contains the variables of interest.
Cluster Analysis Variable List – the set of numeric variables on which the data points are clustered.
Cluster Analysis Algorithm – the algorithm to be used. This must be one of “Hartigan-Wong”, “Lloyd”, “Forgy” or “MacQueen”.
Cluster Analysis Centres – the number of clusters.
Cluster Analysis Nstart – the number of random sets of initial cluster centres to try.
Cluster Analysis Max. Iterations – the maximum number of iterations allowed.

Outputs

Output includes the following:
Clustering vector indices, classes : a vector of integers (from 1:k) indicating the cluster to which each data point is allocated.
Cluster means : a matrix of cluster centres (means).
Total cluster ss (sum of squares) : the total sum of squares.
Within cluster ss (sum of squares) by cluster : a vector of within-cluster sums of squares, one component per cluster.
Total within cluster ss (sum of squares) : the total within-cluster sum of squares.
Between cluster ss (sum of squares) : the between-cluster sum of squares.
K-Means clustering with # of sizes : the number of data points in each of # clusters.
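
The relationship between these outputs can be illustrated with a small Python sketch. It is not the tool's implementation: the function name cluster_summary and the dictionary keys are hypothetical, and it assumes every cluster contains at least one data point. It computes the sums of squares for a given allocation and shows that the between-cluster sum of squares is the total sum of squares minus the total within-cluster sum of squares.

```python
def cluster_summary(points, labels, k):
    """Sums of squares for a given allocation (illustrative sketch only).

    points: list of equal-length numeric tuples.
    labels: cluster index (1..k) for each point; every cluster is assumed non-empty.
    """
    dim = len(points[0])

    def mean(rows):
        return tuple(sum(r[d] for r in rows) / len(rows) for d in range(dim))

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    grand_mean = mean(points)
    members = {j: [p for p, l in zip(points, labels) if l == j]
               for j in range(1, k + 1)}
    centres = {j: mean(rows) for j, rows in members.items()}
    withinss = {j: sum(sq_dist(p, centres[j]) for p in rows)
                for j, rows in members.items()}
    totss = sum(sq_dist(p, grand_mean) for p in points)
    tot_withinss = sum(withinss.values())
    return {
        "sizes": {j: len(rows) for j, rows in members.items()},
        "centres": centres,
        "totss": totss,
        "withinss": withinss,
        "tot_withinss": tot_withinss,
        "betweenss": totss - tot_withinss,
    }


# Four points in two obvious clusters.
points = [(1.0, 1.0), (1.0, 3.0), (5.0, 1.0), (5.0, 3.0)]
labels = [1, 1, 2, 2]
summary = cluster_summary(points, labels, k=2)
```

For this example the total sum of squares is 20, each cluster has a within-cluster sum of squares of 2, and the between-cluster sum of squares is 16.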

Advanced
The following algorithm outlines the implementation of K-Means clustering.

• Randomly select K initial cluster centres
• Repeat
o Assign each data point to its nearest cluster centre, so that the sum of squares from data points to their assigned cluster centres is minimised
o Recompute each cluster centre as the mean of the data points assigned to it
• Until there is convergence (the assignments no longer change) or the maximum number of iterations has been reached
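
The iteration outlined above can be sketched in Python as follows. This is a minimal Lloyd-style illustration under stated assumptions, not the code used by the tool: the names lloyd_kmeans and sq_dist and the seed parameter are hypothetical, the centre update is taken to be the cluster mean (consistent with the Description), initial centres are drawn from the data points, and the other algorithm options (for example Hartigan-Wong) update centres differently. The nstart and max_iter arguments mirror the Nstart and Max. Iterations inputs.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_kmeans(points, k, max_iter=10, nstart=1, seed=0):
    """Lloyd-style K-Means sketch; returns (labels, centres, tot_withinss)."""
    rng = random.Random(seed)
    dim = len(points[0])
    best = None
    for _ in range(nstart):
        # Random start: pick k of the data points as initial centres.
        centres = rng.sample(points, k)
        labels = None
        for _ in range(max_iter):
            # Assignment step: each point goes to its nearest centre (1-based).
            new_labels = [
                min(range(k), key=lambda j: sq_dist(p, centres[j])) + 1
                for p in points
            ]
            if new_labels == labels:
                break  # converged: the allocation did not change
            labels = new_labels
            # Update step: move each centre to the mean of its points.
            for j in range(k):
                rows = [p for p, l in zip(points, labels) if l == j + 1]
                if rows:  # an empty cluster keeps its previous centre
                    centres[j] = tuple(
                        sum(r[d] for r in rows) / len(rows) for d in range(dim)
                    )
        tot = sum(sq_dist(p, centres[l - 1]) for p, l in zip(points, labels))
        # Keep the start with the smallest total within-cluster sum of squares.
        if best is None or tot < best[2]:
            best = (labels, list(centres), tot)
    return best


# Two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centres, tot_withinss = lloyd_kmeans(points, k=2, max_iter=10, nstart=5)
```

Running several random starts and keeping the one with the smallest total within-cluster sum of squares is one common way to reduce sensitivity to the initial centres; whether the tool selects among starts in exactly this way is an assumption of the sketch.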

References

(1) An Introduction to R (3.1.0 (2014-04-10)).
(2) Forgy, E. W. (1965) Cluster analysis of multivariate data: efficiency vs interpretability of classifications, Biometrics 21, 768–769.
(3) Hartigan, J. A. and M. A. Wong (1979) A K-means clustering algorithm, Applied Statistics 28, 100–108.
(4) Lloyd, S. P. (1957, 1982) Least squares quantization in PCM, Technical Note, Bell Laboratories. Published in 1982 in IEEE Transactions on Information Theory 28, 128–137.
(5) MacQueen, J. (1967) Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, eds L. M. Le Cam & J. Neyman, 1, pp. 281–297, Berkeley, CA: University of California Press.