ansaurus

Question

index many documents to enable queries that supports AND/OR operations

Answer 1

A:

Google Desktop springs to mind, but all the major O/S have similar features built in or easily added. I'd describe the space used by Spotlight on my Mac as 'reasonable'.

High Performance Mark 2010-07-22 20:20:29

What I am looking for is how Google Desktop works behind the sense? :-)

MrOhad 2010-07-22 20:36:24

Answer 2

A:

With many documents, a dedicated package like Lucene becomes very attractive.

Wrikken 2010-07-22 20:23:43

I had like to see a description of an algorithm..Thanks!

MrOhad 2010-07-22 20:31:51

Answer 3

A:

As @Wrikken said, use lucene.

As you are interested in the algorithms used there are many, you can find a starting point here and more information here.

Recurse 2010-07-23 01:56:42

Answer 4

+3 A:

In case only 2 keywords are allowed (e.g. "key1 AND key2")

This is the solution I found so far: 1) keyMap, HashMap, where the key is the keywords and the value is the LinkedList of documents. 2) docMap, HashMap, where the key is the document id and the value is an HashSet of keywords

Now on such a query ("key1 AND key2") I would:

LinkedList docs = keyMap.get(key1);
for each (HashSet doc:docs)
     if(doc.contains(keys))
            result.add(doc);
return result

What do you think? Is there any better way? How about 3 keywords?

MrOhad 2010-07-24 19:48:02

I think you'll find my answer informative.

Dimitris Andreou 2010-07-24 20:03:39

Answer 5

+5 A:

The basic data structured needed is an inverted index. This maps each word to the set of documents that contain it. Lets say lookup is a function from words to document sets: lookup: W -> Pos(D) (where W is the set of words, D the set of documents, Pos(D) the power set of D).

Lets say you have an query expression of the form:

Expr ::= Word | AndExpression | OrExpression
AndExpression ::= Expr 'AND' Expr
OrExpression ::= Expr 'OR' Expr

So you get an abstract syntax tree representing your query. That's a tree with the following kind of nodes:

abstract class Expression { }

class Word extends Expression {
  String word
}

class AndExpression extends Expression {
  Expression left
  Expression right
}

class OrExpression extends Expression {
  Expression left
  Expression right
}

For example, foo AND (bar OR baz) would be translated to this tree:

       AndExpression
      /             \
     /               \
  Word('foo')         OrExpression
                    /             \
                   /               \
             Word('bar')       Word('baz')

To evaluate this tree, follow these simple rules, expressed in pseudocode:

Set<Document> evaluate(Expr e) {
   if (e is Word) 
      return lookup(e.word)
   else if (e is AndExpression) 
      return intersection(evaluate(e.left), evaluate(e.right))
   else if (e is OrExpression) 
      return union(evaluate(e.left), evaluate(e.right))

   //otherwise, throw assertion error: no other case remaining
}

//implemented by the inverted index, not shown
Set<Document> lookup(String word)

Thus, AND expressions are basically translated to set intersections, while OR expressions are translated to union expressions, all evaluated recursively. I'm sure if you stare long enough on the above, you'll see its beauty :)

You could represent each set (that lookup returns) as a HashSet. If you use Java, You could also use guava's lazy union and intersection implementations, that should be fun (especially if you study the code or use your imagination to see what 'lazy' really means in this context).

To the best of my knowledge, though, intersections are rarely computed by intersecting hashtables - instead, what usually happens is the following: assume are 3 sets to be intersected, we pick one (the smallest preferably) and assign a counter (equal to 1) to each of the documents. We then iterate the other sets, incrementing the counter of each found document. Finally, we report each document that its counter becomes 3 (that means that the document appeared in all sets, thus exists in their intersection).

Dimitris Andreou 2010-07-24 20:02:10

ansaurus

tags:

views:

answers:

index many documents to enable queries that supports AND/OR operations

related questions