Any good implementation of greedy set cover for large datasets?

Legend picture Legend · Oct 29, 2011 · Viewed 7.4k times · Source

This question follows from a related question of mine posted here. @mhum suggested that my problem falls into the covering problem domain. I tried encoding my question into a minimum set cover problem and currently I have a dataset in this form:

Set        Cost
(1,2)        1
(1)          1
(1,2,3)      2
(1)          2
(3,4)        2
(4)          3
(1,2)        3
(3,4)        4
(1,2,3,4)    4

The objective is to find a good set cover that covers all numbers and one that attempts to minimize the total cost. My dataset is big with at least 30000 sets (of size varying from 5-40 elements) like this. Are there any good greedy implementations to solve this or am I on my own to implement this? I am not an expert in LP but any LP-solvers (from numpy/scipy) that can solve this are also acceptable.

Answer

Antti Huima picture Antti Huima · Oct 29, 2011

There is a well-known greedy approximation algorithm for set cover that is also easy to implement in whatever language of your choice. The algorithm itself is described here:

http://en.wikipedia.org/wiki/Set_cover_problem#Greedy_algorithm

It is so simple that the easiest thing is just to write it from scratch.

Notably, it is also the best polynomial-time approximation algorithm known for set cover. That means that to get better worst-case performance (more compact result set) you would need to have non-polynomial running times (= slow algorithms for large sets).

Unfortunately the Wikipedia entry doesn't actually cover weighted set cover, which is the case here. The extension is simple, and is described e.g. here:

http://pages.cs.wisc.edu/~shuchi/courses/880-S07/scribe-notes/lecture03.pdf

Some more useful notes:

http://www.cs.ucr.edu/~neal/non_arxiv/Young08SetCover.pdf http://www.cs.uiuc.edu/class/sp08/cs473/Lectures/lec20.pdf