My data is of people and tags, with a many-to-many relationship.
I need an algorithm that will find a minimal set of tags such that the union of the groups of people tagged with each tag in the solution will cover the entire population.
Any ideas?
My data is of people and tags, with a many-to-many relationship.
I need an algorithm that will find a minimal set of tags such that the union of the groups of people tagged with each tag in the solution will cover the entire population.
Any ideas?
That is equivalent to NP-complete problem of vertex cover.
Simple sketch of a proof: If we can find a minimum vertex cover, we can find the smallest set of tags. Just create a graph with vertices = tags union persons and link them appropriately, plus connect all tags between each other. The smallest vertex cover corresponds to the minimal set of tags, which can be easily seen once we realize that for every vertex cover in such graph, there is a vertex cover of the same size consisting only of the "tag"-vertices.
The oposite (vertex cover can be reduced to minimal tags) can be shown by creating a tag and a person for each vertex and linking the tag for each vertex with person for the same vertex + its neighbors.
This is an NP-hard problem. It will take a LOT of processing to make sure that you do, in fact, have the absolute minimum number of tags required.
The pretty quick and easy solution that I would use is to
while there are users left in the pool:
find the tag that represents the most users
add that tag to the list
remove all the users that that tag represents from the pool
If you want, you can then loop through and make sure there aren't any
unnecessary tags //but that probably won't help much
There are, of course, a number of ways that tags could be laid out such that that's not the best way and won't find an optimal solution. However, I'm pretty sure it should get pretty close.
A many-to-many relationship between people and tags forms a bipartite graph, which is luckily quite different from the generic case. The problem, therefore, is not NP-complete, but can be solved in polynomial time. It seems to be equal to finding a maximum matching, for which Wikipedia provides several alternatives.