Introduction

The goal of this method is to partition the points into $K$ groups such that the sum of squares from points to the assigned cluster centers is minimised.

The method works as follows [2]:

  1. Partition the items into $K$ randomly selected clusters.
  2. Proceed through the list of items, assigning each item to the cluster whose centroid is nearest. The distance is computed using the Euclidean metric

    $\displaystyle d(\mathbf{x},\mathbf{y})=\sqrt{\sum_i (x_i-y_i)^2}.$

    Recalculate the centroid for the cluster receiving the new item and for the cluster losing an item.
  3. Repeat step 2 until no more reassignments take place.
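
To make the procedure concrete, here is a minimal sketch in Python with NumPy (the function name `kmeans` and its signature are illustrative, not from the source). One simplification should be noted: the description above recalculates centroids after every individual reassignment, whereas this sketch recomputes them once per pass over the items, which is the common batch variant of the method.

```python
import numpy as np

def kmeans(X, K, rng=None):
    """Sketch of the K-means procedure described above.

    X   : (n, p) array whose rows are the items
    K   : number of clusters
    rng : seed or NumPy Generator for the random initial partition
    """
    rng = np.random.default_rng(rng)
    # Step 1: partition the items into K randomly selected clusters.
    labels = rng.integers(0, K, size=X.shape[0])
    while True:
        # Centroid of each cluster (this sketch assumes no cluster
        # becomes empty and does not guard against that case).
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 2: assign each item to the cluster whose centroid is
        # nearest in the Euclidean metric d(x, y).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 3: stop when no more reassignments take place.
        if np.array_equal(new_labels, labels):
            return labels, centroids
        labels = new_labels
```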

The within sum of squares for cluster $k$, $\mathrm{SS}_k$, is defined as follows:

$\displaystyle \mathrm{SS}_k=\sum_{i=1}^{n_k} \Vert\mathbf{x}_{ik}-\bar{\mathbf{x}}_k\Vert^2,$

where $\mathbf{x}_{ik}$ is the $i$th item assigned to cluster $k$, $\bar{\mathbf{x}}_k$ is the centroid of cluster $k$, and $n_k$ is the size of cluster $k$.
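
Continuing the sketch above, $\mathrm{SS}_k$ can be computed directly from this definition (the helper name `within_ss` is illustrative, not from the source):

```python
import numpy as np

def within_ss(X, labels, centroids):
    """Within sum of squares SS_k for each cluster k = 0, ..., K-1."""
    return np.array([
        ((X[labels == k] - centroids[k]) ** 2).sum()
        for k in range(centroids.shape[0])
    ])
```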

Note that to check the stability of the clustering it is desirable to rerun the method from a new initial configuration. Since the number of clusters is usually unknown, it is also a good idea to rerun the method with different values of $K$ (a sketch of such reruns follows the list below). According to [1] there are strong arguments for not fixing the number of clusters in advance:

  1. If two or more seed points inadvertently lie within a single cluster, their resulting clusters will be poorly differentiated.
  2. The existence of an outlier might produce at least one group with very dispersed items.
  3. Even if the population is known to consist of $K$ groups, the sampling method may be such that data from the rarest group do not appear in the sample. Forcing the data into $K$ groups would then lead to nonsensical clusters.
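
As a sketch of such reruns, the following builds on the `kmeans` and `within_ss` functions above. Keeping, for each $K$, the restart with the smallest total within sum of squares is a common heuristic, not something prescribed by the source:

```python
def rerun_kmeans(X, K_values, n_restarts=10, seed=0):
    """Rerun K-means over several values of K and several random initial
    configurations; keep, for each K, the run with the smallest total
    within sum of squares."""
    rng = np.random.default_rng(seed)
    best = {}
    for K in K_values:
        runs = [kmeans(X, K, rng) for _ in range(n_restarts)]
        best[K] = min(runs, key=lambda run: within_ss(X, *run).sum())
    return best
```

Comparing the retained totals $\sum_k \mathrm{SS}_k$ across values of $K$, e.g. with `rerun_kmeans(X, range(2, 8))`, then gives a rough indication of how many clusters the data support.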
