# Constrained K-Means implementation in Python

The following is just a Python implementation of the algorithm described in the following paper:

Bradley PS, Bennett KP, Demiriz A (2000) Constrained K-Means Clustering. Microsoft Research. Available: http://research.microsoft.com/pubs/69796/tr-2000-65.pdf To cut the long story short, this algorithm allows to execute K-Means with a user-defined minimum number of points belonging to given clusters. The algorithm consists of regular K-Means implementation where cluster assignment is done by solving Minimum Cost Flow (MCF) problem, as opposed to just using clusters with smallest distance measure to the center. The MCF is formulated in such a way that data points are nodes with unitary supplies of flow, the cluster centers are nodes with user-defined flow demands, finally one additional node contains demand balancing the global sum of supply and demands to zero. The latter condition is necessary for feasibility of MCF solution. The data nodes are fully connected to center nodes with edge costs corresponding to Euclidean distances from data points to respective cluster centers. The center nodes are fully connected to the balance node with zero cost. The proof that the obtained solution is integer is included in the paper.

In my implementation I used the min_cost_flow() function from the NetworkX Python module to solve the MCF. Euclidean distances are multiplied by 1e9 and rounded down to nearest integer in order for min_cost_flow() to converge. Other than that it’s simply a K-Means implementation. The general syntax is the following:

```(C, M, f) = constrained_kmeans(data, demand, maxiter=None, fixedprec=1e9)
```

where data is a vector of N-dimensional points (e.g. [[0, 0, 0], [1, 1, 1], [2,2,2]]), demand is a vector of minimum number of points for each cluster (implicitly it defines also the number of clusters), maxiter specifies the number of iterations after which to stop even if the algorithm didn’t converge, fixedprec allows to modify the above-mentioned distance multiplier. The return values are: C – cluster centers, M – assignments of point to clusters, f – last minimum cost flow solution. If demands specified are all equal to zero the algorithm will return a regular K-Means solution although it will be much slower because of the added calls to min_cost_flow().

In the file attached you will find also a usage example (100 3D points, 3 clusters of minimum 25 size each). The performance is far from optimal because of extremely poor interface between K-means and min_cost_flow() but it can be used for relatively small problems and was useful in my case. Hope it help you directly or get you started towards an implementation that works for you.

• nicolas says:

I was looking for this , thank you !

• Nestor Rodriguez says:

That’s what I was looking for, many thanks ad good job!

• Alex says:

The code generates an error. Apparently ‘xrange’ is not defined (??)

• Alex says:

Hey, great code. Is there any way to use a MAXIMUM size constraint, instead of a minimum size?

• Sepideh Nikookar says:

That would be very helpful if you can send me your code.

Best Regards,
Sepideh Nikookar
Ph.D. student at NJIT

• Sepideh says: