Constrained K-Means implementation in Python

The following is just a Python implementation of the algorithm described in the following paper:

Bradley PS, Bennett KP, Demiriz A (2000) Constrained K-Means Clustering. Microsoft Research. Available: http://research.microsoft.com/pubs/69796/tr-2000-65.pdf

To cut the long story short, this algorithm allows to execute K-Means with a user-defined minimum number of points belonging to given clusters. The algorithm consists of regular K-Means implementation where cluster assignment is done by solving Minimum Cost Flow (MCF) problem, as opposed to just using clusters with smallest distance measure to the center. The MCF is formulated in such a way that data points are nodes with unitary supplies of flow, the cluster centers are nodes with user-defined flow demands, finally one additional node contains demand balancing the global sum of supply and demands to zero. The latter condition is necessary for feasibility of MCF solution. The data nodes are fully connected to center nodes with edge costs corresponding to Euclidean distances from data points to respective cluster centers. The center nodes are fully connected to the balance node with zero cost. The proof that the obtained solution is integer is included in the paper.

In my implementation I used the min_cost_flow() function from the NetworkX Python module to solve the MCF. Euclidean distances are multiplied by 1e9 and rounded down to nearest integer in order for min_cost_flow() to converge. Other than that it’s simply a K-Means implementation. The general syntax is the following:

(C, M, f) = constrained_kmeans(data, demand, maxiter=None, fixedprec=1e9)

where data is a vector of N-dimensional points (e.g. [[0, 0, 0], [1, 1, 1], [2,2,2]]), demand is a vector of minimum number of points for each cluster (implicitly it defines also the number of clusters), maxiter specifies the number of iterations after which to stop even if the algorithm didn’t converge, fixedprec allows to modify the above-mentioned distance multiplier. The return values are: C – cluster centers, M – assignments of point to clusters, f – last minimum cost flow solution. If demands specified are all equal to zero the algorithm will return a regular K-Means solution although it will be much slower because of the added calls to min_cost_flow().

In the file attached you will find also a usage example (100 3D points, 3 clusters of minimum 25 size each). The performance is far from optimal because of extremely poor interface between K-means and min_cost_flow() but it can be used for relatively small problems and was useful in my case. Hope it help you directly or get you started towards an implementation that works for you.

Download: constrained_kmeans.zip

10 Comments


Leave a Reply

Your email address will not be published. Required fields are marked *