SciPy Clustering

K-means clustering is a method for finding clusters and cluster centers in a set of unlabeled data. Intuitively, we can think of a cluster as consisting of a set of data points whose distances to each other are small compared to the distances to points outside the cluster. Given an initial set of K centers, the K-means algorithm iterates the following two steps.

  • For each center, determine the subset of training points (its cluster) that is closer to it than to any other center.
  • Calculate the mean of each feature of the data points in each cluster; this mean vector becomes the new center of that cluster.

These two steps are repeated until the centers no longer move or the assignments no longer change. A new point x can then be assigned to the cluster whose center (prototype) is closest to it. The SciPy library provides an excellent implementation of the K-means algorithm through the cluster package. Let’s learn how to use it.
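The two-step iteration above can be sketched in plain NumPy. This is a minimal illustration of the idea, not the actual SciPy implementation; the function name `simple_kmeans` and the random initialization are our own choices.

```python
import numpy as np

def simple_kmeans(data, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # pick k random data points as the initial centers
    centers = data[rng.choice(len(data), k, replace=False)]
    for _ in range(n_iter):
        # step 1: assign each point to its nearest center
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 2: move each center to the mean of its assigned points
        new_centers = np.array([data[labels == j].mean(axis=0)
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # centers no longer move, so we have converged
        centers = new_centers
    return centers, labels
```

On well-separated data, this recovers the clusters after a handful of iterations; SciPy's `kmeans` adds better initialization and a distortion-based stopping rule on top of the same idea.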

K-Means Implementation in SciPy

We’ll learn how to implement K-Means in SciPy.

Importing K-Means

We start by importing the functions we will use: kmeans, vq, and whiten from the scipy.cluster.vq module.

from scipy.cluster.vq import kmeans, vq, whiten

Data Generation

We must simulate some data to explore clustering.

from numpy import vstack, array
from numpy.random import rand

# data generation with three features
data = vstack((rand(100, 3) + array([.5, .5, .5]), rand(100, 3)))

Now, let’s examine the data. The above program will produce the following output.

array([[1.48598868e+00, 8.17445796e-01, 1.00834051e+00],
       [8.45299768e-01, 1.35450732e+00, 8.66323621e-01],
       [1.27725864e+00, 1.00622682e+00, 8.43735610e-01],
       ...

Before running K-Means, it is helpful to whiten the observation set, i.e. normalize it on a per-feature basis: each feature is divided by its standard deviation across all observations, giving it unit variance.

Whitening the Data

We must whiten the data using the following code.

# whitening of data
data = whiten(data)
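To see what whiten does, the same result can be computed by hand. This sketch assumes the data-generation step from above; the variable names `raw` and `manual` are ours.

```python
import numpy as np
from scipy.cluster.vq import whiten

raw = np.vstack((np.random.rand(100, 3) + 0.5, np.random.rand(100, 3)))

# whiten divides each feature (column) by that column's
# standard deviation, computed across all observations
manual = raw / raw.std(axis=0)

print(np.allclose(manual, whiten(raw)))  # the two agree
print(whiten(raw).std(axis=0))           # each feature now has unit variance
```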

Computing K-Means with Three Clusters

Now let’s compute K-Means with three clusters using the following code.

# computing K-Means with K = 3 (3 clusters)
centroids, _ = kmeans(data, 3)

The above code performs K-Means on a set of observation vectors, forming K clusters. The K-Means algorithm adjusts the centroids until insufficient progress can be made, that is, until the change in distortion since the last iteration is below some threshold. Here, we can observe the cluster centroids by printing the centroids variable using the code given below.

print(centroids)

The above code will produce the following output.

[[2.26034702 1.43924335 1.3697022 ]
 [2.63788572 2.81446462 2.85163854]
 [0.73507256 1.30801855 1.44477558]]
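The second value returned by kmeans, discarded as `_` above, is the mean distortion: the average Euclidean distance between each observation and its nearest centroid. A small self-contained sketch (regenerating the data from the earlier steps):

```python
import numpy as np
from numpy import vstack, array
from numpy.random import rand
from scipy.cluster.vq import kmeans, whiten

data = whiten(vstack((rand(100, 3) + array([.5, .5, .5]), rand(100, 3))))

# kmeans returns (centroids, mean distortion); a lower distortion
# means the centroids fit the observations more tightly
centroids, distortion = kmeans(data, 3)
print(distortion)
```

Comparing the distortion for different values of K is a common way to choose the number of clusters.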

Assign each value to a cluster by using the following code.

# assign each sample to a cluster
clx, _ = vq(data, centroids)

vq compares each observation vector in the 'M' by 'N' obs array to the centroids and assigns each observation to the closest cluster. It returns the cluster index and the distortion for each observation, so we can also check the distortion. Let’s check the cluster assigned to each observation using the following code.

# check clusters of observation
print(clx)

The above code will produce the following output.

array([1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 2, 0, 2, 0, 1, 1, 1,
0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0,
0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 2, 2, 2, 2, 0, 0,
2, 2, 2, 1, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2 ...
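Once each observation has a cluster index, simple NumPy operations summarize the result. A minimal sketch, regenerating the data from the earlier steps and keeping vq's per-observation distances this time:

```python
import numpy as np
from numpy import vstack, array
from numpy.random import rand
from scipy.cluster.vq import kmeans, vq, whiten

data = whiten(vstack((rand(100, 3) + array([.5, .5, .5]), rand(100, 3))))
centroids, _ = kmeans(data, 3)

# vq returns the cluster index and the distance to the assigned
# centroid for every observation
clx, dist = vq(data, centroids)

# np.bincount counts how many observations fell into each cluster
print(np.bincount(clx))
print(dist.mean())  # average distance to the assigned centroid
```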
