views:

281

answers:

5

Usually when we search, we have a list of stories, we provide a search string, and expect back a list of results where the given search strings matches the story.

What i am looking to do, is the opposite. Give a list of search strings, and one story and find out which search strings match to that story.

Now this could be done with re but the case here is i wanna use complex search queries as supported by solr. Full details of the query syntax here. Note: i wont use boost.

Basically i want to get some pointers for the doesitmatch function in the sample code below.

def doesitmatch(contents, searchstring):
    """
    returns result of searching contents for searchstring (True or False)
    """
    ???????
    ???????


story = "big chunk of story 200 to 1000 words long"
searchstrings = ['sajal' , 'sajal AND "is a jerk"' , 'sajal kayan' , 'sajal AND (kayan OR bangkok OR Thailand OR ( webmaster AND python))' , 'bangkok']

matches = [[searchstr] for searchstr in searchstrings if doesitmatch(story, searchstr) ]

Edit: Additionally would also be interested to know if any module exists to convert lucene query like 'sajal AND (kayan OR bangkok OR Thailand OR ( webmaster AND python) OR "is a jerk")' into regex

A: 

Probably slow, but easy solution:

Make a query on the story plus each string to the search engine. If it returns anything, then it matches.

Otherwise you need to implement the search syntax yourself. If that includes things like "title:" and stuff this can be rather complex. If it's only the AND and OR from your example, then it's a recursive function that isn't too hairy.

Lennart Regebro
The problem on using my search engine(solr) for this is that the in code above the list searchstrings would have 10,000s of phrases. It would not be ideal to hit the search server 10,000s of times per story. would be very expensive.Im not using any complex stuff, only: AND, OR, Quotes and Brackets Im thinking of writing a function to convert it to regex, but given my limited regex skills i thought to investigate if such a function already exists in python...
sajal
A: 

Some time ago I looked for a python implementaion of lucene and I came accross of Woosh which is a pure python text-based research engine. Maybe it will statisfy your needs.

You can also try pyLucene, but i did'nt investigate this one.

thomas
A: 

Here's a suggestion in pseudocode. I'm assuming you store a story identifier with the search terms in the index, so that you can retrieve it with the search results.

def search_strings_matching(story_id_to_match, search_strings):
    result = set()
    for s in search_strings:
        result_story_ids = query_index(s) # query_index returns an id iterable
        if story_id_to_match in result_story_ids:
            result.add(s)
    return result
Vinay Sajip
The problem is that my index is solr running on another server, and search_strings would have over 10,000+ terms in it. running so many queries would be expensive in terms of time and resources.
sajal
How often do the search strings change?
Vinay Sajip
several times a day (not yet fully decided its upcoming project) ... but > 1ce/hour
sajal
+1  A: 

After extensive googling, i realized what i am looking to do is a Boolean search.

Found the code that makes regex boolean aware : http://code.activestate.com/recipes/252526/

Issue looks solved for now.

sajal
A: 

This is probably less interesting to you now, since you've already solved your problem, but what you're describing sounds like Prospective Search, which is what you call it when you have the query first and you want to match it against documents as they come along.

Lucene's MemoryIndex is a class that was designed specifically for something like this, and in your case it might be efficient enough to run many queries against a single document.

This has nothing to do with Python, though. You'd probably be better off writing something like this in java.

itsadok
looks interesting, i am already using solr(lucene based) for some stuff, ill see if it can be made into using this.The reason id prefer it to be in python is because im using it within a django project. moreover i cant even write hello world in java :)
sajal