views:

43

answers:

5

We have many documents that consists of words. What is most appropriate way to index the documents. A search query should support the AND/OR operations.

The query runtime should be efficient as possible. Please describe the space required for the index.

The documents contain words only(excluding AND/OR) and the query contains words and the keywords AND/OR.

EDIT: what would be the algorithm if I would allow only 2 keywords and an operation (e.g. w1 AND w2)

A: 

Google Desktop springs to mind, but all the major O/S have similar features built in or easily added. I'd describe the space used by Spotlight on my Mac as 'reasonable'.

High Performance Mark
What I am looking for is how Google Desktop works behind the sense? :-)
MrOhad
A: 

With many documents, a dedicated package like Lucene becomes very attractive.

Wrikken
I had like to see a description of an algorithm..Thanks!
MrOhad
A: 

As @Wrikken said, use lucene.

As you are interested in the algorithms used there are many, you can find a starting point here and more information here.

Recurse
+3  A: 

In case only 2 keywords are allowed (e.g. "key1 AND key2")

This is the solution I found so far: 1) keyMap, HashMap, where the key is the keywords and the value is the LinkedList of documents. 2) docMap, HashMap, where the key is the document id and the value is an HashSet of keywords

Now on such a query ("key1 AND key2") I would:

LinkedList docs = keyMap.get(key1);
for each (HashSet doc:docs)
     if(doc.contains(keys))
            result.add(doc);
return result

What do you think? Is there any better way? How about 3 keywords?

MrOhad
I think you'll find my answer informative.
Dimitris Andreou
+5  A: 

The basic data structured needed is an inverted index. This maps each word to the set of documents that contain it. Lets say lookup is a function from words to document sets: lookup: W -> Pos(D) (where W is the set of words, D the set of documents, Pos(D) the power set of D).

Lets say you have an query expression of the form:

Expr ::= Word | AndExpression | OrExpression
AndExpression ::= Expr 'AND' Expr
OrExpression ::= Expr 'OR' Expr

So you get an abstract syntax tree representing your query. That's a tree with the following kind of nodes:

abstract class Expression { }

class Word extends Expression {
  String word
}

class AndExpression extends Expression {
  Expression left
  Expression right
}

class OrExpression extends Expression {
  Expression left
  Expression right
}

For example, foo AND (bar OR baz) would be translated to this tree:

       AndExpression
      /             \
     /               \
  Word('foo')         OrExpression
                    /             \
                   /               \
             Word('bar')       Word('baz')

To evaluate this tree, follow these simple rules, expressed in pseudocode:

Set<Document> evaluate(Expr e) {
   if (e is Word) 
      return lookup(e.word)
   else if (e is AndExpression) 
      return intersection(evaluate(e.left), evaluate(e.right))
   else if (e is OrExpression) 
      return union(evaluate(e.left), evaluate(e.right))

   //otherwise, throw assertion error: no other case remaining
}

//implemented by the inverted index, not shown
Set<Document> lookup(String word)

Thus, AND expressions are basically translated to set intersections, while OR expressions are translated to union expressions, all evaluated recursively. I'm sure if you stare long enough on the above, you'll see its beauty :)

You could represent each set (that lookup returns) as a HashSet. If you use Java, You could also use guava's lazy union and intersection implementations, that should be fun (especially if you study the code or use your imagination to see what 'lazy' really means in this context).

To the best of my knowledge, though, intersections are rarely computed by intersecting hashtables - instead, what usually happens is the following: assume are 3 sets to be intersected, we pick one (the smallest preferably) and assign a counter (equal to 1) to each of the documents. We then iterate the other sets, incrementing the counter of each found document. Finally, we report each document that its counter becomes 3 (that means that the document appeared in all sets, thus exists in their intersection).

Dimitris Andreou