+2  Q: 

Superset Search

I'm looking for an algorithm to solve the following in a reasonable amount of time.

Given a set of sets, find all such sets that are subsets of a given set.

For example, if you have a set of search terms like ["stack overflow", "foo bar", ...], then given a document D, find all search terms whose words all appear in D.

I have found two solutions that are adequate:

  1. Use a list of bit vectors as an index. To query with a given document, build a bit vector for it, then iterate over the list, performing a bitwise OR with each indexed vector. If the result equals the document's vector, the document's set is a superset of the set represented by the current vector. This is O(n) per query, where n is the number of sets in the index, and bitwise OR is very fast; insertion is O(1). Caveat: to support every word in the English language, the bit vectors need to be several million bits long, and the words must be mapped to bit positions by a gap-free total order. (See the sketch after this list.)

  2. Use a prefix tree (trie). Sort each set before inserting it into the trie, and sort the search set before querying. Iterate over the elements of the search set, activating nodes that match if they are children of either the root node or a previously activated node. Every path through activated nodes to a leaf represents a subset of the search set. The complexity is O(a log a + ab), where a is the size of the search set and b is the number of indexed sets. (Also sketched below.)
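
For concreteness, here is a minimal Python sketch of the first approach. Python's arbitrary-precision integers stand in for the long bit vectors, and the vocabulary is grown lazily rather than fixed up front; the class and method names are my own:

class BitVectorIndex:
    def __init__(self):
        self.vocab = {}    # word -> bit position, assigned on first sight
        self.entries = []  # (bitmask, original set) pairs

    def _mask(self, words):
        m = 0
        for w in words:
            m |= 1 << self.vocab.setdefault(w, len(self.vocab))
        return m

    def insert(self, s):
        # O(1) append once the mask is built
        self.entries.append((self._mask(s), s))

    def query(self, document_words):
        # O(n) scan; indexed | search == search iff indexed is a subset.
        # Unknown document words get fresh bits that no index mask has,
        # so they cannot break the test.
        d = self._mask(document_words)
        return [s for m, s in self.entries if m | d == d]

And a sketch of the second approach, again with invented names; each set is stored under its sorted word sequence, and the search keeps a frontier of activated nodes:

def build_trie(sets):
    root = {}
    for s in sets:
        node = root
        for word in sorted(s):
            node = node.setdefault(word, {})
        node[None] = s  # leaf marker holding the indexed set
    return root

def find_subsets(root, query):
    active = [root]
    for word in sorted(query):
        # a node is activated only if its parent is already active
        active += [node[word] for node in list(active) if word in node]
    return [node[None] for node in active if None in node]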

What's your solution?

+1  A: 

The prefix trie sounds like something I'd try if the sets were sparse compared to the total vocabulary. Don't forget that if the suffix set of two different prefixes is the same, you can share the subgraph representing the suffix set (this can be achieved by hash-consing rather than arbitrary DFA minimization), giving a DAG rather than a tree. Try ordering your words least or most frequent first (I'll bet one or the other is better than some random or alphabetic order).
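
A minimal sketch of the hash-consing idea in Python (the function name and node representation are my own; sets are given as tuples of sorted words, and each node is an (accept, children) pair):

def build_dawg(sequences, _table={}):
    # _table interns structurally identical nodes, so equal suffix
    # sets collapse into one shared node and the trie becomes a DAG
    accept = () in sequences
    groups = {}
    for seq in sequences:
        if seq:
            groups.setdefault(seq[0], set()).add(seq[1:])
    children = {w: build_dawg(sfx, _table) for w, sfx in sorted(groups.items())}
    # child nodes are already interned, so their ids are canonical
    key = (accept, tuple((w, id(n)) for w, n in children.items()))
    if key not in _table:
        _table[key] = (accept, children)
    return _table[key]

The subset search is the same frontier walk as over the plain trie; only the child lookup changes, since children now live in node[1].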

For a variation on your first strategy, where you represent each set by a very large integer (bit vector), use a sparse ordered set/map of integers: a trie on the sequence of bits that skips runs of consecutive 0s. See http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.5452 (implemented in Scala as http://www.scala-lang.org/docu/files/api/scala/collection/immutable/IntMap.html).

If your reference set (of sets) is fixed, and you want to find for many of those sets which ones contain others, I'd compute the immediate containment relation (a directed acyclic graph with a path from a->b iff b is contained in a, and without the redundant arcs a->c where a->b and b->c). The branching factor is no more than the number of elements in a set. The vertices reachable from the given set are exactly those that are subsets of it.
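
A quadratic-time sketch of that construction in Python (names are mine; it does pairwise proper-subset tests and then drops the redundant arcs):

def containment_dag(sets):
    # edge i -> j iff sets[j] is a proper subset of sets[i]
    # with no set strictly in between
    sets = [frozenset(s) for s in sets]
    n = len(sets)
    below = [{j for j in range(n) if sets[j] < sets[i]} for i in range(n)]
    return sets, [{j for j in below[i]
                   if not any(j in below[k] for k in below[i])}
                  for i in range(n)]

def reachable_subsets(i, dag):
    # DFS from i: exactly the indices of the proper subsets of sets[i]
    seen, stack = set(), [i]
    while stack:
        for j in dag[stack.pop()]:
            if j not in seen:
                seen.add(j)
                stack.append(j)
    return seen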

wrang-wrang
A: 

First, I would construct two data structures, S and E.

S is an array holding the N candidate sets:

S[0] = set(element1, element2, ...)
S[1] = set(element1, element2, ...)
...
S[N-1] = set(element1, element2, ...)


E is a map of lists, keyed by element hash. Each list contains the S-indices of the sets in which that element appears.

// O( S_total_elements ) = O(n) operation
E[element1] = list(1, 6, ...)   // S-indices of the sets containing element1
E[element2] = list(3, 4, 8, ...)
...


Now, two new structures: a set L and an array C.

Store in L every element of D that also exists in E (an O(n) operation).
C is an array of counters, indexed by S-index and initialized to zeros.

// for each indexed set, count how many of its elements appear in D
foreach e in L:
    foreach idx in E[e]:
        C[idx] = C[idx] + 1

Finally,

foreach i in 0..N-1:
    if C[i] == S[i].Count():
        // S[i] is a subset of D
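
Put together as runnable Python, the whole procedure looks roughly like this (a sketch following the pseudocode above, with a defaultdict standing in for E):

from collections import defaultdict

def build_index(S):
    # E maps each element to the S-indices of the sets containing it
    E = defaultdict(list)
    for i, s in enumerate(S):
        for e in s:
            E[e].append(i)
    return E

def matching_subsets(S, E, D):
    L = set(D) & E.keys()  # elements of D that appear in the index
    C = [0] * len(S)       # one counter per candidate set
    for e in L:
        for i in E[e]:
            C[i] += 1
    # S[i] is contained in D iff every one of its elements was counted
    return [S[i] for i in range(len(S)) if C[i] == len(S[i])]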
Nick D
A: 

Can you build an index for your documents? i.e. a mapping from each word to those documents containing that word. Once you've built that, lookup should be pretty quick and you can just do set intersection to find the documents matching all words.

Here's the Wikipedia article on full-text search.

EDIT: Ok, I got that backwards.

You could convert your document to a set (if your language has a set datatype) and do the same with your searches. Then it becomes a simple matter of testing whether one is a subset of the other.

Behind the scenes, this is effectively the same idea: it would probably involve building a hash table for the document, hashing the queries, and checking each word in the query in turn. This would be O(nm) where n is the number of searches and m the average number of words in a search.
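
In Python, for example, the subset test is a one-liner (assuming the document and each search are whitespace-tokenized strings; the variable names are mine):

document_words = set(document.split())
matches = [q for q in queries if set(q.split()) <= document_words]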

John Fouhy
I'm doing the converse: given a million searches and one document, find all the searches that would match the document using your method.
Apocalisp