I'm looking for an existing module (or modules) that would let me write basic boolean queries for matching and searching texts, WITHOUT writing my own parser.

For example,

president AND (ronald OR (george NOT bush))

would match TRUE against "the president ronald ragen", "the president ronald ragen and bush", "max bush was not a president"

but FALSE against "george bush was a president", "i don't know how to spell ronald ragen"

(So far I have found Booleano, which seems like overkill but could do the job. However, its group is inactive, and I couldn't work out from the documentation how to use it.)

thanks

Edit: the exact style or grammar is not critical. My aim is to give non-technical users the ability to search certain texts with something a bit more powerful than keyword search.

+1  A: 

It would be pretty lucky to find a pre-existing library that happens to be ready to parse the example expression that you provided. I recommend making your expression format a bit more machine readable, while retaining all of its clarity. A Lisp S-expression (which uses prefix notation) is compact and clear:

(and "president" (or "ronald" "george" "sally"))

Writing a parser for this format is easier than for your format. Or you could just switch to Lisp and it will parse it natively. :)
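To illustrate how little code the prefix format needs, here is a minimal sketch in Python. The names tokenize, parse, evaluate and matches are made up for this example, and it assumes that a term matches by case-insensitive substring test (so "bush" would also match "bushes"); adjust that if you need whole-word matching.

```python
import re

def tokenize(query):
    # Split into parentheses, double-quoted strings, and bare words.
    return re.findall(r'\(|\)|"[^"]*"|[^\s()"]+', query)

def parse(tokens):
    # Consume tokens from the front and build nested lists.
    if not tokens:
        raise ValueError("unexpected end of query")
    token = tokens.pop(0)
    if token == "(":
        node = []
        while tokens and tokens[0] != ")":
            node.append(parse(tokens))
        if not tokens:
            raise ValueError("missing closing parenthesis")
        tokens.pop(0)  # discard the ")"
        return node
    if token == ")":
        raise ValueError("unexpected )")
    return token.strip('"')

def evaluate(node, text):
    # Assumption: a bare string is a case-insensitive substring test.
    if isinstance(node, str):
        return node.lower() in text.lower()
    op, *args = node
    if op == "and":
        return all(evaluate(a, text) for a in args)
    if op == "or":
        return any(evaluate(a, text) for a in args)
    if op == "not":
        return not any(evaluate(a, text) for a in args)
    raise ValueError("unknown operator: %r" % op)

def matches(query, text):
    return evaluate(parse(tokenize(query)), text)

query = '(and "president" (or "ronald" (and "george" (not "bush"))))'
print(matches(query, "the president ronald ragen"))   # True
print(matches(query, "george bush was a president"))  # False
```

Here "not" is a unary operator, so the original "george NOT bush" is written as (and "george" (not "bush")).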

Side note: I assume you didn't mean to make your "NOT" operator binary, right?

seanmac7577
+1  A: 

You might want to take a look at the simpleBool.py code on this page that uses the pyparsing module. Otherwise, here's some simple code I wrote.

This isn't a module, but it might get you in the right direction.

def found(s, searchstr):
    # True if searchstr occurs anywhere in s (plain substring test)
    return searchstr in s

def booltest1(s):
    # president AND (ronald OR (george NOT bush))
    tmp = found(s, 'george') and not found(s, 'bush')
    return found(s, 'president') and (found(s, 'ronald') or tmp)

print(booltest1('the president ronald reagan'))  # True
print(booltest1('george bush was a president'))  # False

You can test other strings the same way. I used tmp only because the return line was getting too long.
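If the query has to come from user input rather than being hard-coded, the same substring test can sit behind a small recursive-descent evaluator. The following is only a sketch under some assumptions of mine: compile_query is a made-up name, the operators must be written in upper case (AND, OR, NOT), terms match as case-insensitive substrings, and "a NOT b" is read as "a AND NOT b".

```python
import re

def compile_query(query):
    """Turn an infix boolean query into a function text -> bool."""
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def both(a, b):
        return lambda t: a(t) and b(t)

    def either(a, b):
        return lambda t: a(t) or b(t)

    def parse_or():          # lowest precedence
        left = parse_and()
        while peek() == "OR":
            advance()
            left = either(left, parse_and())
        return left

    def parse_and():
        left = parse_not()
        while peek() in ("AND", "NOT"):
            if peek() == "AND":
                advance()
            # a bare NOT is left in place, so "a NOT b" means "a AND NOT b"
            left = both(left, parse_not())
        return left

    def parse_not():
        if peek() == "NOT":
            advance()
            inner = parse_not()
            return lambda t: not inner(t)
        return parse_atom()

    def parse_atom():
        token = peek()
        if token is None or token in (")", "AND", "OR"):
            raise ValueError("unexpected token: %r" % token)
        advance()
        if token == "(":
            inner = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return inner
        word = token.lower()
        return lambda t: word in t.lower()

    predicate = parse_or()
    if peek() is not None:
        raise ValueError("unexpected token: %r" % peek())
    return predicate

match = compile_query("president AND (ronald OR (george NOT bush))")
print(match("the president ronald reagan"))   # True
print(match("george bush was a president"))   # False
```

The compiled function can be reused on any number of texts, so the query is parsed only once.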

Justin Peel
Thanks, but your example is not a general-purpose routine. simpleBool seems interesting, but it would take a lot of work to adapt it to the text domain.
Berry Tsakala
+1  A: 

I use Sphinx for full-text search from Python on my website. It has a simple query syntax that supports boolean matching, but with symbols rather than words as operators. For example, your query would be president (ronald | (george -bush)).

Lucene has the same feature.

THC4k