tags:

views:

60

answers:

3

My data is of people and tags, with a many-to-many relationship.

I need an algorithm that will find a minimal set of tags such that the union of the groups of people tagged with each tag in the solution will cover the entire population.

Any ideas?

+1  A: 

That is equivalent to NP-complete problem of vertex cover.

Simple sketch of a proof: If we can find a minimum vertex cover, we can find the smallest set of tags. Just create a graph with vertices = tags union persons and link them appropriately, plus connect all tags between each other. The smallest vertex cover corresponds to the minimal set of tags, which can be easily seen once we realize that for every vertex cover in such graph, there is a vertex cover of the same size consisting only of the "tag"-vertices.

The oposite (vertex cover can be reduced to minimal tags) can be shown by creating a tag and a person for each vertex and linking the tag for each vertex with person for the same vertex + its neighbors.

jpalecek
You are right but that is a terrible "proof"
SoapBox
+4  A: 

This is an NP-hard problem. It will take a LOT of processing to make sure that you do, in fact, have the absolute minimum number of tags required.

The pretty quick and easy solution that I would use is to

while there are users left in the pool:
    find the tag that represents the most users
    add that tag to the list
    remove all the users that that tag represents from the pool

If you want, you can then loop through and make sure there aren't any 
unnecessary tags //but that probably won't help much

There are, of course, a number of ways that tags could be laid out such that that's not the best way and won't find an optimal solution. However, I'm pretty sure it should get pretty close.

zebediah49
+1, this will select tags at most `ln(n)` more tags than optimal, while in practice it usually does a lot better (Looked this up in Skiena's Algorithm Design Manual, http://www.algorist.com/ )
Dimitris Andreou
+1  A: 

A many-to-many relationship between people and tags forms a bipartite graph, which is luckily quite different from the generic case. The problem, therefore, is not NP-complete, but can be solved in polynomial time. It seems to be equal to finding a maximum matching, for which Wikipedia provides several alternatives.

Ezku