Each document in my Lucene index is roughly similar to a Stack Overflow post, and I am trying to search the index (which contains millions of documents). Each user should be able to search only their own company's posts. I have no control over how the data is indexed; I only need to implement a simple, working search on top of it.
Here is my first draft:
String q = "mysql";
String companyId = "1001";
String[] fields = { "body", "subject", "number", "category", "tags"};
Float float10 = Float.valueOf(10f);
Float float5 = Float.valueOf(5f);
Map<String, Float> boost = new HashMap<String, Float>();
boost.put("body", float10);
boost.put("subject", float10);
boost.put("number", float5);
boost.put("category", float5);
boost.put("tags", float5);
MultiFieldQueryParser mfqp = new MultiFieldQueryParser(fields, new StandardAnalyzer(), boost);
mfqp.setAllowLeadingWildcard(true);
Query userQuery = mfqp.parse(q);
TermQuery companyQuery = new TermQuery(new Term("company_id", companyId));
BooleanQuery booleanQuery = new BooleanQuery();
BooleanQuery.setMaxClauseCount(50000);
booleanQuery.add(userQuery, BooleanClause.Occur.MUST);
booleanQuery.add(companyQuery, BooleanClause.Occur.MUST);
FSDirectory directory = FSDirectory.getDirectory(new File("/tmp/index"));
IndexSearcher searcher = SearcherManager.getIndexSearcherInstance(directory);
Hits hits = searcher.search(booleanQuery);
It's mostly working functionally, but I am seeing memory issues. I get an out-of-memory error every 4 to 5 days; I took a heap dump and saw that Lucene Term and TermInfo objects top the list. I am using a singleton instance of IndexSearcher, and I can see only one instance of it in the heap.
Any feedback on my approach? What am I doing wrong, and what could I do better in general?