How To Cluster With K-means, When Number Of Clusters And Their Sizes Are Known

I'm clustering some data using scikit-learn. I have the easiest possible task: I know the number of clusters, and I know the size of each cluster. Is it possible to specify this i

Solution 1:

No. You need some type of constrained clustering algorithm to do this, and none are implemented in scikit-learn. (This is not "the easiest possible task"; I don't even know of a principled algorithm that does this, aside from heuristically moving samples from one cluster to another.)

Solution 2:

It won't be k-means anymore.

K-means is variance minimization, and it seems your objective is to produce partitions of a predefined size, not of minimum variance.

However, here is a tutorial that shows how to modify k-means to produce clusters of the same size. You can easily extend this to produce clusters of the desired sizes instead of the average size. It's fairly easy to modify k-means this way. But the results will be even more meaningless than k-means results on most data sets. K-means is often just as good as random convex partitions.
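As a rough illustration of the kind of modification described above (not the tutorial's method, just one possible heuristic), the sketch below replaces the k-means assignment step with a greedy pass: (point, cluster) pairs are visited from closest to farthest, and a point is given to a cluster only while that cluster still has room. The function names `assign_with_sizes` and `kmeans_fixed_sizes` are made up for this example, it uses only NumPy, and it assumes every requested size is positive and that the sizes sum to the number of samples.

```python
import numpy as np

def assign_with_sizes(X, centers, sizes):
    """Greedy assignment: cluster j receives exactly sizes[j] points."""
    # Squared Euclidean distance from every point to every center.
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = np.full(len(X), -1)
    remaining = np.array(sizes, dtype=int)
    # Visit (point, cluster) pairs from closest to farthest.
    for flat in np.argsort(dist, axis=None):
        i, j = divmod(flat, dist.shape[1])
        if labels[i] == -1 and remaining[j] > 0:
            labels[i] = j
            remaining[j] -= 1
    return labels

def kmeans_fixed_sizes(X, sizes, n_iter=20, seed=0):
    """Heuristic k-means variant with fixed cluster sizes (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    if sum(sizes) != len(X):
        raise ValueError("sizes must sum to the number of samples")
    rng = np.random.default_rng(seed)
    # Start from randomly chosen data points as centers.
    centers = X[rng.choice(len(X), size=len(sizes), replace=False)]
    labels = assign_with_sizes(X, centers, sizes)
    for _ in range(n_iter):
        # Usual k-means update: each center moves to the mean of its points.
        centers = np.array([X[labels == j].mean(axis=0) for j in range(len(sizes))])
        new_labels = assign_with_sizes(X, centers, sizes)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centers
```

Calling `kmeans_fixed_sizes(X, [3, 7])` returns labels in which cluster 0 has exactly 3 points and cluster 1 has exactly 7. The greedy pass guarantees the sizes but not a minimum-variance result, which is in line with the caveat above.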

Solution 3:

I can think only of a brute-force algorithm. If the clusters are well separated, you can try running the clustering several times with different random initializations, providing just the number of clusters as input. After each run, count the size of each cluster, sort the sizes, and compare them to the sorted list of known cluster sizes. If they don't match, rinse and repeat.
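The retry loop described above could look roughly like this with scikit-learn's `KMeans`. The helper name `kmeans_until_sizes_match` and the `max_tries` cap are assumptions added for the example; using the loop counter as `random_state` is just one simple way to get a different initialization on every try.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_until_sizes_match(X, sizes, max_tries=100):
    """Rerun k-means with new seeds until the cluster sizes match `sizes`."""
    target = sorted(sizes)
    for seed in range(max_tries):
        # n_init=1 so that each try is a single random initialization.
        km = KMeans(n_clusters=len(sizes), init="random", n_init=1,
                    random_state=seed)
        labels = km.fit_predict(X)
        # Count points per cluster and compare the sorted counts to the target.
        counts = sorted(np.bincount(labels, minlength=len(sizes)).tolist())
        if counts == target:
            return labels, seed
    # No run produced the requested sizes within max_tries.
    return None, None
```

If the clusters are not well separated, this may never find a match, so the caller should check for the `None` result.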
