Hi Guys,

I am trying to use Lucene (actually PyLucene!) to find out how many documents contain my exact phrase. My code currently looks like this, but it runs rather slowly. Does anyone know a faster way to return document counts?

phraseList = ["some phrase 1", "some phrase 2"] #etc, a list of phrases...

countsearcher = IndexSearcher(SimpleFSDirectory(File(STORE_DIR)), True)
analyzer = StandardAnalyzer(Version.LUCENE_CURRENT)

for phrase in phraseList:
    query = QueryParser(Version.LUCENE_CURRENT, "contents", analyzer).parse("\"" + phrase + "\"")
    scoreDocs = countsearcher.search(query, 200).scoreDocs
    print "count is: " + str(len(scoreDocs))
A:

Typically, the fastest way to count hits is to write a custom hit collector, for example one that records matches in a bitset as illustrated in the Javadoc of Collector.

Another method is to request TopDocs with the number of results set to one.

TopDocs topDocs = searcher.search(query, filter, 1);

topDocs.totalHits will give you the total number of results. I'm not sure this is as fast, because it involves calculating scores, which the collector approach above skips.

These solutions are written for Java. You will have to check for the equivalent technique in Python.
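As a rough sketch of how the TopDocs approach might look from PyLucene: the helper name count_hits is hypothetical, and it assumes PyLucene exposes search(query, n) and the totalHits field in the same way as the Java API. Note that len(scoreDocs) with a limit of 200 can never report more than 200, whereas totalHits is not capped by the requested result size. Building the parser once outside the loop also avoids recreating it for every phrase.

```python
def count_hits(searcher, parser, phrase):
    """Return how many documents match `phrase` as an exact phrase.

    `searcher` is assumed to behave like a Lucene IndexSearcher and
    `parser` like a Lucene QueryParser (both created by the caller).
    """
    # Quote the phrase so the query parser builds a phrase query.
    # Assumes the phrase itself contains no double quotes.
    query = parser.parse('"%s"' % phrase)
    # Ask for a single top hit; totalHits still counts every match.
    top_docs = searcher.search(query, 1)
    return top_docs.totalHits


# Hypothetical usage, reusing the names from the question:
# parser = QueryParser(Version.LUCENE_CURRENT, "contents", analyzer)
# for phrase in phraseList:
#     print("count is: %d" % count_hits(countsearcher, parser, phrase))
```

Whether this is faster than a custom collector depends on the cost of scoring, as noted above, so it is worth timing both on your own index.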

Shashikant Kore