I'm looking for an algorithm to solve the following in a reasonable amount of time.
Given a set of sets, find all such sets that are subsets of a given set.
For example, if you have a set of search terms like ["stack overflow", "foo bar", ...], then given a document D, find all search terms whose words all appear in D.
I have found two solutions that are adequate:
Use a list of bit vectors as an index. To query for a given superset, create a bit vector for it, and then iterate over the list performing a bitwise OR for each vector in the list. If the result is equal to the search vector, the search set is a superset of the set represented by the current vector. This algorithm is
O(n)
where n is the number of sets in the index, and bitwise OR is very fast. Insertion isO(1)
. Caveat: to support all words in the English language, the bit vectors will need to be several million bits long, and there will need to exist a total order for the words, with no gaps.Use a prefix tree (trie). Sort the sets before inserting them into the trie. When searching for a given set, sort it first. Iterate over the elements of the search set, activating nodes that match if they are either children of the root node or of a previously activated node. All paths, through activated nodes to a leaf, represent subsets of the search set. The complexity of this algorithm is
O(a log a + ab)
wherea
is the size of the search set andb
is the number of indexed sets.
What's your solution?