Understanding Output From Kmeans Clustering In Python
Solution 1:
You have two issues where, and the recommendation of k-means probably was not very good...
K-means expects a coordinate data matrix, not a distance matrix.
In order to compute a centroid, it needs the original coordinates. If you don't have coordinates like this, you probably should not be using k-means.
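To make the point concrete, here is a minimal sketch of what k-means actually expects: an (n_samples, n_features) coordinate matrix, from which centroids can be computed in coordinate space. The data below is made up for illustration.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
# 60 points in 2-D coordinate space: two blobs around (0, 0) and (5, 5)
coords = np.vstack([rng.normal(0, 1, (30, 2)),
                    rng.normal(5, 1, (30, 2))])

centroids, labels = kmeans2(coords, k=2, minit='++', seed=0)
print(centroids.shape)  # (2, 2): one centroid per cluster, in coordinate space
print(labels.shape)     # (60,): one cluster index per point
```

Note that the centroids live in the same space as the input rows; that is exactly what a distance matrix cannot give you.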
If you compute the difference of two distance matrices, small values correspond to points that have a similar distance in both. These could still be very far away from each other! So if you use this matrix as a new "distance" matrix, you will get meaningless results. Consider points A and B, which have the maximum distance in both original graphs. After your procedure, they will have a difference of 0, and will thus be considered identical now.
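A toy illustration with made-up numbers: A and B are mutually farthest in both original distance matrices, yet their entry in the difference matrix is 0, so a clustering run on the difference would treat them as identical.

```python
import numpy as np

# Distances among three points (A, B, C) in two "graphs";
# the A-B distance (10) is the maximum in both matrices.
D1 = np.array([[0., 10., 3.],
               [10., 0., 4.],
               [3., 4., 0.]])
D2 = np.array([[0., 10., 5.],
               [10., 0., 2.],
               [5., 2., 0.]])

diff = D1 - D2
print(diff[0, 1])  # 0.0 -- A and B now look "identical", although they are far apart
```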
So if you haven't understood the input of k-means, it is no wonder you do not understand its output.
I'd rather treat the difference matrix as a similarity matrix (try absolute values, positives only, negatives only). Then use hierarchical clustering. But you will need an implementation that accepts a similarity matrix; the usual implementations, which expect a distance matrix, will not work.
Solution 2:
Disclaimer: below, I tried to answer your question about how to interpret what the functions return and how to get the points in a cluster from that. I agree with @Anony-Mousse in that if you have a distance / similarity matrix (as opposed to a feature matrix), you will want to use different techniques, such as spectral clustering.
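For the spectral-clustering route mentioned above, a minimal sketch: sklearn's SpectralClustering accepts a precomputed affinity (similarity) matrix directly. The matrix here is made up; entries must be symmetric and non-negative.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

S = np.array([[1.0, 0.9, 0.1, 0.0],
              [0.9, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.8],
              [0.0, 0.1, 0.8, 1.0]])  # affinity matrix (made up)

sc = SpectralClustering(n_clusters=2, affinity='precomputed', random_state=0)
labels = sc.fit_predict(S)
print(labels)
```

Given the block structure of S, points 0-1 should land in one cluster and points 2-3 in the other.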
Sorry for being blunt, I also hate "RTFM"-type answers, but the functions you used are well documented in the sklearn and scipy reference documentation.
In short,
- sklearn.cluster.k_means() returns a tuple with three fields:
  - an array with the centroids (that should be 3x232 for you),
  - the label assignment for each point (i.e. a 232-long array with values 0-2),
  - and "inertia", a measure of how good the clustering is; there are several such measures, so you might be better off not paying too much attention to this one;
- scipy.cluster.vq.kmeans2() returns a tuple with two fields:
  - the cluster centroids (as above),
  - the label assignment (as above);
- scipy.cluster.vq.kmeans() returns a "distortion" value instead of the label assignment, so I would definitely use kmeans2().
As for how to get to the coordinates of the points in each cluster, you could:

    for cc in range(n_clusters):
        print('Points for cluster {}:\n{}'.format(cc, data[model[1] == cc]))

where model is the tuple returned by either sklearn.cluster.k_means or scipy.cluster.vq.kmeans2 (model[1] is the label array in both), data is a points x coordinates array (difference_matrix in your case), and n_clusters is the number of clusters you asked for.