
Understanding Output From Kmeans Clustering In Python

I have two distance matrices, each 232×232, where the column and row labels are identical. So this would be an abridged version of the two, where A, B, C and D are the names of the p…

Solution 1:

You have two issues here, and the recommendation of k-means probably was not very good...

  1. K-means expects a coordinate data matrix, not a distance matrix.

    In order to compute a centroid, it needs the original coordinates. If you don't have coordinates like this, you probably should not be using k-means.

  2. If you compute the difference of two distance matrices, small values correspond to pairs of points that have a similar distance in both. These points could still be very far away from each other! So if you use this difference matrix as a new "distance" matrix, you will get meaningless results. Consider points A and B, which have the maximum distance in both original graphs. After your procedure, they will have a difference of 0, and will thus be considered identical now (see the numeric sketch after this list).
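A minimal numeric sketch of this pitfall, using three made-up points (the matrices here are hypothetical, not from the question):

import numpy as np

# Two hypothetical distance matrices over the same three points.
# Points 0 and 2 are maximally far apart in BOTH matrices.
d1 = np.array([[0., 1., 9.],
               [1., 0., 9.],
               [9., 9., 0.]])
d2 = np.array([[0., 2., 9.],
               [2., 0., 9.],
               [9., 9., 0.]])

diff = np.abs(d1 - d2)
print(diff[0, 2])  # 0.0 -- the "difference distance" calls points 0 and 2
                   # identical, even though they are the farthest pair in both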

So you haven't understood the input of k-means; no wonder you do not understand its output.

I'd rather treat the difference matrix as a similarity matrix (try absolute values, positives only, negatives only). Then use hierarchical clustering. But you will need an implementation that accepts a similarity matrix; the usual implementations expect a distance matrix and will not work.
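One common workaround (a sketch, not necessarily the answerer's exact method) is to flip the similarity matrix into a distance-like matrix and hand the condensed form to SciPy's hierarchical clustering:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical symmetric similarity matrix, a stand-in for your 232x232 one
rng = np.random.default_rng(0)
sim = rng.random((6, 6))
sim = (sim + sim.T) / 2

# Flip similarity into distance: the most similar pairs get distance ~0
dist = sim.max() - sim
np.fill_diagonal(dist, 0.0)

Z = linkage(squareform(dist), method='average')  # linkage() wants the condensed form
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters
print(labels)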

Solution 2:

Disclaimer: below, I tried to answer your question about how to interpret what the functions return and how to get the points in a cluster from that. I agree with @Anony-Mousse in that if you have a distance / similarity matrix (as opposed to a feature matrix), you will want to use different techniques, such as spectral clustering.
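For completeness, a minimal sketch of that route, assuming you can build a non-negative symmetric similarity (affinity) matrix; scikit-learn's SpectralClustering accepts one directly via affinity='precomputed':

import numpy as np
from sklearn.cluster import SpectralClustering

# Hypothetical non-negative, symmetric affinity matrix (larger = more similar)
rng = np.random.default_rng(0)
aff = rng.random((6, 6))
aff = (aff + aff.T) / 2
np.fill_diagonal(aff, 1.0)

labels = SpectralClustering(n_clusters=3,
                            affinity='precomputed',
                            random_state=0).fit_predict(aff)
print(labels)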

Sorry for being blunt, I also hate "RTFM"-type answers, but the functions you used are well documented in the scikit-learn and SciPy references for sklearn.cluster.k_means() and scipy.cluster.vq.kmeans2().

In short,

  • sklearn.cluster.k_means() returns a tuple with three fields:
    • an array with the centroids (that should be 3×232 for you),
    • the label assignment for each point (i.e. a 232-long array with values 0-2),
    • and "inertia", a measure of how good the clustering is; there are several such measures, so you might be better off not paying too much attention to this;
  • scipy.cluster.vq.kmeans2() returns a tuple with two fields:
    • the cluster centroids (as above),
    • the label assignment (as above);
  • scipy.cluster.vq.kmeans() (note: not kmeans2) returns a "distortion" value instead of the label assignment, so I would definitely use kmeans2(). A sketch of unpacking both follows this list.
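A minimal sketch of unpacking both return values (the random difference_matrix here is only a placeholder for your real 232×232 array):

import numpy as np
from sklearn.cluster import k_means
from scipy.cluster.vq import kmeans2

difference_matrix = np.random.rand(232, 232)  # placeholder for your real matrix

# scikit-learn: (centroids, labels, inertia)
centroids, labels, inertia = k_means(difference_matrix, n_clusters=3)

# SciPy: (centroids, labels)
centroids2, labels2 = kmeans2(difference_matrix, 3)

print(centroids.shape, labels.shape)    # (3, 232) (232,)
print(centroids2.shape, labels2.shape)  # (3, 232) (232,)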

As for how to get the coordinates of the points in each cluster, you could do:

n_clusters = 3  # however many clusters you asked for
for cc in range(n_clusters):
    print('Points for cluster {}:\n{}'.format(cc, data[model[1] == cc]))

where model is the tuple returned by either sklearn.cluster.k_means or scipy.cluster.vq.kmeans2, and data is a points × coordinates array (difference_matrix in your case).
