+2  Q: 

Superset Search

I'm looking for an algorithm to solve the following in a reasonable amount of time.

Given a set of sets, find all such sets that are subsets of a given set.

For example, if you have a set of search terms like ["stack overflow", "foo bar", ...], then given a document D, find all search terms whose words all appear in D.

I have found two solutions that are adequate:

  1. Use a list of bit vectors as an index. To query with a given document, build a bit vector for it, then iterate over the list, performing a bitwise OR with each indexed vector. If the result equals the document's vector, the document's set is a superset of the set represented by the current vector. This is O(n) per query, where n is the number of sets in the index, and bitwise OR is very fast; insertion is O(1). Caveat: to support every word in the English language, the bit vectors need to be several million bits long, and the words must be mapped to bit positions by a gap-free total order. (See the sketch after this list.)

  2. Use a prefix tree (trie). Sort each set before inserting it into the trie, and sort the search set before querying. Iterate over the elements of the search set, activating nodes that match if they are children of either the root node or a previously activated node. Every path through activated nodes to a leaf represents a subset of the search set. The complexity is O(a log a + ab), where a is the size of the search set and b is the number of indexed sets. (Also sketched below.)
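
For concreteness, here is a minimal Python sketch of the first approach. Python's arbitrary-precision integers stand in for the long bit vectors, and the vocabulary is grown lazily rather than fixed up front; the class and method names are my own:

class BitVectorIndex:
    def __init__(self):
        self.vocab = {}    # word -> bit position, assigned on first sight
        self.entries = []  # (bitmask, original set) pairs

    def _mask(self, words):
        m = 0
        for w in words:
            m |= 1 << self.vocab.setdefault(w, len(self.vocab))
        return m

    def insert(self, s):
        # O(1) append once the mask is built
        self.entries.append((self._mask(s), s))

    def query(self, document_words):
        # O(n) scan; indexed | search == search iff indexed is a subset.
        # Unknown document words get fresh bits that no index mask has,
        # so they cannot break the test.
        d = self._mask(document_words)
        return [s for m, s in self.entries if m | d == d]

And a sketch of the second approach, again with invented names; each set is stored under its sorted word sequence, and the search keeps a frontier of activated nodes:

def build_trie(sets):
    root = {}
    for s in sets:
        node = root
        for word in sorted(s):
            node = node.setdefault(word, {})
        node[None] = s  # leaf marker holding the indexed set
    return root

def find_subsets(root, query):
    active = [root]
    for word in sorted(query):
        # a node is activated only if its parent is already active
        active += [node[word] for node in list(active) if word in node]
    return [node[None] for node in active if None in node]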

What's your solution?

+1  A: 

The prefix trie sounds like something I'd try if the sets were sparse compared to the total vocabulary. Don't forget that if the suffix set of two different prefixes is the same, you can share the subgraph representing the suffix set (this can be achieved by hash-consing rather than arbitrary DFA minimization), giving a DAG rather than a tree. Try ordering your words least or most frequent first (I'll bet one or the other is better than some random or alphabetic order).
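
A minimal sketch of the hash-consing idea in Python (the function name and node representation are my own; sets are given as tuples of sorted words, and each node is an (accept, children) pair):

def build_dawg(sequences, _table={}):
    # _table interns structurally identical nodes, so equal suffix
    # sets collapse into one shared node and the trie becomes a DAG
    accept = () in sequences
    groups = {}
    for seq in sequences:
        if seq:
            groups.setdefault(seq[0], set()).add(seq[1:])
    children = {w: build_dawg(sfx, _table) for w, sfx in sorted(groups.items())}
    # child nodes are already interned, so their ids are canonical
    key = (accept, tuple((w, id(n)) for w, n in children.items()))
    if key not in _table:
        _table[key] = (accept, children)
    return _table[key]

The subset search is the same frontier walk as over the plain trie; only the child lookup changes, since children now live in node[1].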

For a variation on your first strategy, where you represent each set by a very large integer (bit vector), use a sparse ordered set/map of integers: a trie on the sequence of bits that skips runs of consecutive 0s. See http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.5452 (implemented in Scala as http://www.scala-lang.org/docu/files/api/scala/collection/immutable/IntMap.html).

If your reference set (of sets) is fixed, and you want to find for many of those sets which ones contain others, I'd compute the immediate containment relation (a directed acyclic graph with a path from a->b iff b is contained in a, and without the redundant arcs a->c where a->b and b->c). The branching factor is no more than the number of elements in a set. The vertices reachable from the given set are exactly those that are subsets of it.
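
A quadratic-time sketch of that construction in Python (names are mine; it does pairwise proper-subset tests and then drops the redundant arcs):

def containment_dag(sets):
    # edge i -> j iff sets[j] is a proper subset of sets[i]
    # with no set strictly in between
    sets = [frozenset(s) for s in sets]
    n = len(sets)
    below = [{j for j in range(n) if sets[j] < sets[i]} for i in range(n)]
    return sets, [{j for j in below[i]
                   if not any(j in below[k] for k in below[i])}
                  for i in range(n)]

def reachable_subsets(i, dag):
    # DFS from i: exactly the indices of the proper subsets of sets[i]
    seen, stack = set(), [i]
    while stack:
        for j in dag[stack.pop()]:
            if j not in seen:
                seen.add(j)
                stack.append(j)
    return seen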

wrang-wrang
A: 

First, I would construct two data structures, S and E.

S is an array holding the N candidate sets:

S[0] = set(element1, element2, ...)
S[1] = set(element1, element2, ...)
...
S[N-1] = set(element1, element2, ...)


E is a map of lists, keyed by element hash. Each list contains the S-indices of the sets in which that element appears.

// O( S_total_elements ) = O(n) operation
E[element1] = list(1, 6, ...)   // S-indices of the sets containing element1
E[element2] = list(3, 4, 8, ...)
...


Now, two new structures: a set L and an array C.

Store in L every element of D that also exists in E (an O(n) operation).
C is an array of counters, indexed by S-index and initialized to zeros.

// for each indexed set, count how many of its elements appear in D
foreach e in L:
    foreach idx in E[e]:
        C[idx] = C[idx] + 1

Finally,

foreach i in 0..N-1:
    if C[i] == S[i].Count():
        // S[i] is a subset of D
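
Put together as runnable Python, the whole procedure looks roughly like this (a sketch following the pseudocode above, with a defaultdict standing in for E):

from collections import defaultdict

def build_index(S):
    # E maps each element to the S-indices of the sets containing it
    E = defaultdict(list)
    for i, s in enumerate(S):
        for e in s:
            E[e].append(i)
    return E

def matching_subsets(S, E, D):
    L = set(D) & E.keys()  # elements of D that appear in the index
    C = [0] * len(S)       # one counter per candidate set
    for e in L:
        for i in E[e]:
            C[i] += 1
    # S[i] is contained in D iff every one of its elements was counted
    return [S[i] for i in range(len(S)) if C[i] == len(S[i])]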
Nick D
A: 

Can you build an index for your documents? i.e. a mapping from each word to those documents containing that word. Once you've built that, lookup should be pretty quick and you can just do set intersection to find the documents matching all words.

Here's the Wikipedia article on full-text search.

EDIT: Ok, I got that backwards.

You could convert your document to a set (if your language has a set datatype) and do the same with your searches. Then it becomes a simple matter of testing whether one is a subset of the other.

Behind the scenes, this is effectively the same idea: it would probably involve building a hash table for the document, hashing the queries, and checking each word in the query in turn. This would be O(nm) where n is the number of searches and m the average number of words in a search.
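
In Python, for example, the subset test is a one-liner (assuming the document and each search are whitespace-tokenized strings; the variable names are mine):

document_words = set(document.split())
matches = [q for q in queries if set(q.split()) <= document_words]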

John Fouhy
I'm doing the converse: given a million searches and one document, find all the searches that would match the document using your method.
Apocalisp