I have a large document with various sections. Each section has a list of keywords/phrases of interest. I have a master list of keywords/phrases stored as a String array. How can I use Solr or Lucene to search each section document for all keywords and report which keywords were found? I can't think of any straightforward way to implement this...

Thanks

+1  A: 

Start with the basics

Get a sample program running first; you will learn how Lucene builds its index, which will help you index and search documents containing fields.

Decide how your fields need to be stored, i.e. exact-value fields such as dates should be stored as Field.Index.NOT_ANALYZED instead of Field.Index.ANALYZED (see the sketch below).
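For instance, a minimal indexing sketch (Lucene 2.x-era API; the index path, the sectionText variable, and the field names "contents" and "date" are all illustrative):

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

// IndexWriter throws checked IO exceptions; wrap in try/catch in real code
IndexWriter writer = new IndexWriter(FSDirectory.open(new File("/path/to/index")),
        new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);

Document doc = new Document();
// free text: tokenize it so individual keywords can be searched
doc.add(new Field("contents", sectionText, Field.Store.YES, Field.Index.ANALYZED));
// exact values such as dates: index as a single untokenized term
doc.add(new Field("date", "20100321", Field.Store.YES, Field.Index.NOT_ANALYZED));
writer.addDocument(doc);
writer.close();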

The next step would be:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.MapFieldSelector;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// indexmap      ==> a HashMap holding your configuration
// keywordfields ==> your master list of keywords/phrases
// selectfields  ==> your document fields (contained in the Lucene index)
String[] keywordfields = indexmap.get("keywordfields").toString().split(",");
String[] selectFields = indexmap.get("indexfields").toString().split(",");

try {
    // create a boolean query; SHOULD clauses mean "match any of these"
    BooleanQuery bq = new BooleanQuery();
    // iterate the keyword fields, adding one clause per field
    for (int i = 0; i < keywordfields.length; i++) {
        // params holds the user's query string under your SEARCH_QUERYSTRING key
        bq.add(new BooleanClause(
                new TermQuery(new Term(keywordfields[i], (String) params.get(SEARCH_QUERYSTRING))),
                BooleanClause.Occur.SHOULD));
    }
    // pass the boolean query to the index searcher
    TopDocs topDocs = indexSearcher.search(bq, 1000);
    // get a reference to the matching ScoreDocs
    ScoreDoc[] hits = topDocs.scoreDocs;

    // iterate the hits, loading only the fields listed in selectFields
    Map<String, Object> resultMap = new HashMap<String, Object>();
    List<Map<String, String>> resultList = new ArrayList<Map<String, String>>();
    FieldSelector fieldselector = new MapFieldSelector(selectFields);
    for (ScoreDoc scoreDoc : hits) {
        int docid = scoreDoc.doc;
        Document doc = indexSearcher.doc(docid, fieldselector);

        Map<String, String> searchMap = new HashMap<String, String>();
        // collect all stored fields of the matched document
        List<Field> fields = doc.getFields();
        for (Field field : fields) {
            searchMap.put(field.name(), field.stringValue());
            System.out.println("Field Name:" + field.name());
            System.out.println("Field value:" + field.stringValue());
        }
        resultList.add(searchMap);
    }
    // TOTAL_RESULTS and RS are your own result-map keys
    resultMap.put(TOTAL_RESULTS, hits.length);
    resultMap.put(RS, resultList);
} catch (Exception e) {
    e.printStackTrace();
}

This is one possible implementation using Lucene =]

Narayan
Thanks. We have decided to go with Solr. Could someone kindly provide a Solr example of the same, using SolrJ perhaps? My keywords can be one word (e.g. Solr), two words (e.g. Apache Lucene), or up to five words (e.g. Apache Lucene Web Service Deploy).
**Can you start a new question, and mark the responses appropriately according to how they helped you? =]**
Narayan
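For reference, a minimal SolrJ sketch of the same idea, assuming a Solr 1.4-era setup at http://localhost:8983/solr and an indexed text field named "contents" (both are assumptions; adjust to your schema). Multi-word keywords are quoted so Solr matches them as phrases:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

// the URL and the "contents"/"id" field names are assumptions;
// the constructor and query() throw checked exceptions, handle them in real code
SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

String[] keywords = {"Solr", "Apache Lucene", "Apache Lucene Web Service Deploy"};
for (String keyword : keywords) {
    // quote multi-word keywords so Solr treats them as phrases
    SolrQuery query = new SolrQuery("contents:\"" + keyword + "\"");
    QueryResponse response = server.query(query);
    if (response.getResults().getNumFound() > 0) {
        System.out.println("Found keyword: " + keyword);
        for (SolrDocument doc : response.getResults()) {
            System.out.println("  in section: " + doc.getFieldValue("id"));
        }
    }
}

One query per keyword keeps the bookkeeping trivial: whichever queries return hits are the keywords that were found, and the matching documents tell you which sections contain them.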
A: 

It sounds like all you need is the analysis functionality of Lucene. At the heart of this functionality is the Analyzer class. From the documentation:

An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.

There are many Analyzer classes to choose from, but StandardAnalyzer usually does a good job:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// For each chapter...

Reader reader = ...; // You are responsible for opening a reader for each chapter
Analyzer analyzer = new StandardAnalyzer();
TokenStream tokenStream = analyzer.tokenStream("", reader);

Token token = new Token();
while ((token = tokenStream.next(token)) != null) {
    String keyword = token.term();
    // You can now do whatever you wish with this keyword
}
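For example, to answer the original question (which of your master keywords appear in a chapter), you could lower-case the master list into a set and check each extracted term against it. A minimal sketch continuing from the tokenStream above; keywordArray stands in for your master String[], and it handles single-word keywords only, since the analyzer splits phrases into separate tokens:

import java.util.HashSet;
import java.util.Set;

// keywordArray stands in for your master String[]; lower-case it once,
// because StandardAnalyzer lower-cases the tokens it produces
Set<String> masterKeywords = new HashSet<String>();
for (String kw : keywordArray) {
    masterKeywords.add(kw.toLowerCase());
}

Set<String> found = new HashSet<String>();
Token token = new Token();
while ((token = tokenStream.next(token)) != null) {
    if (masterKeywords.contains(token.term())) {
        found.add(token.term()); // this keyword occurs in the chapter
    }
}
System.out.println("Keywords found: " + found);

Multi-word phrases would need a different approach, such as comparing runs of consecutive tokens, or running PhraseQuery searches against an index instead.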

You may find that other analyzers will do a better job for your purposes.

Adam Paynter