I have two separate indexes holding different fields; together they contain all the searchable fields for a document. For example, the first index holds the indexed text for all documents, and the second holds the tags for every document.

Note the example below is a bit wonky as I've changed the names of the entities.

Index1: text, document-id

Index2: tag-name: "very important", user: "Fred's id"

I would like to keep the indexes separate as it seems wasteful to continually update a single index whenever a user adds/removes a tag.

So far I think I might need to process the two search results and merge them manually (in code). Any other suggestions?

I do not want to merge separate/sharded indexes.

+1  A: 
erickson
This just seems wrong, having to hope that document ids match in both cases. I would like to manage this properly.
mP
It's more than just a hope that the document IDs will match as documents are added to the index; IDs are simply a sequence number. What's not clear to me is whether Lucene will reassign document IDs to "compact" an index that has a high proportion of deleted records (remembering that an "update" in Lucene is a delete of the original record followed by an add of the "updated" record).
erickson
"sequence number" is closer to the true definition than "document id", but they are truly just an "offset". As an index is optimized, and deleted documents are physically removed from the underlying index files (sort of like de-fragmenting the index), these offsets will change, and there's no (easy) way to detect it. The most common solution to this problem that I've come across is to store your own unique id in an "id" field in your Lucene document.
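To illustrate joining on your own id rather than on Lucene's internal offsets, here is a minimal sketch in plain Python. It assumes each search has already been reduced to a list of the stored "id" values; the function name `intersect_by_id` and the sample ids are hypothetical, and no Lucene API is used.

```python
def intersect_by_id(text_hits, tag_hits):
    """Return the ids present in both hit lists, in the order of text_hits.

    Both arguments are lists of application-assigned ids read from a stored
    "id" field, never Lucene's internal document numbers (which can change
    when the index is optimized).
    """
    tag_ids = set(tag_hits)  # set lookup keeps the join linear in the hit count
    return [doc_id for doc_id in text_hits if doc_id in tag_ids]


# Hypothetical results: ids returned by the text index and by the tag index.
text_hits = ["doc-7", "doc-2", "doc-9"]
tag_hits = ["doc-9", "doc-4", "doc-7"]
print(intersect_by_id(text_hits, tag_hits))
```

This corresponds to an AND across the two indexes; an OR would be the union of the two id lists instead.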
ph0enix
A: 

Seems like you need to merge the indexes in code. If I understand correctly, when searching for a term, there can be either matches to document text or to tags, and each tag is indexed with its relevant document ids. You will then have two hit lists to merge. As tags and full text are very different entities, you will need some weighting (maybe as field boosts during retrieval) to reach a good ranking. Thus, you can merge the tag hit and full text hit for document k using a formula like:

score(k) = a*tagscore(k)+b*fulltextscore(k)

where a and b are empirically determined coefficients.
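As a rough sketch of that formula, assuming each search has been reduced to a dictionary mapping your own document id to its score: the function name `merge_scores` and the default weights are placeholders, not tuned values, and a document missing from one hit list is treated as scoring 0 there.

```python
def merge_scores(tag_scores, text_scores, a=0.7, b=0.3, top_n=20):
    """Combine two hit lists with score(k) = a*tagscore(k) + b*fulltextscore(k).

    tag_scores and text_scores map a document id to its score in that index.
    Returns the top_n (doc_id, score) pairs, best first; ties are broken by id
    so the ordering is deterministic.
    """
    doc_ids = set(tag_scores) | set(text_scores)
    combined = {
        doc_id: a * tag_scores.get(doc_id, 0.0) + b * text_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]
```

Raw scores from two different indexes are not necessarily on the same scale, so in practice you may want to normalize each hit list before combining.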

For a much more detailed discussion, see Grant Ingersoll's findability and debugging relevance issues in search papers.

Yuval F
Scoring is not an issue as merges will be on boolean query boundaries. The real question remains in terms of how to do the search.
mP
@mP: Please clarify. If you store a unique id per document in both indexes, I see no problem in the search. I do see a ranking/scoring problem: if you get 1000 hits from document text and 2000 hits from tags, you would probably want to display the top 20 or so. This is where scoring matters.
Yuval F
A: 

The main problem with this approach has to do with ranking of documents, because the default algorithm (and probably most custom algorithms, with a few exceptions) is based on term frequency and inverse document frequency.

In other words, the Scorer needs to know how many times a term appears within a document, as well as how many other documents contain the term. This information is stored for each term in the index, but not the aggregate across multiple indexes.

The common solution to this problem is a two-phased approach. First, the query is run against each index to determine how many documents contain each term. Next, the results are aggregated and the query is run again, but this time the inverse document frequency is sent along with it.

As you can imagine, this won't perform as well as running a query against a single index, but since nothing is free, I suppose that is the trade-off to storing documents across multiple indexes.
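The two-phase approach described above can be sketched roughly as follows. This is plain Python, not Lucene code: each index is modelled as a hypothetical dictionary with a document count and per-term document frequencies, the indexes are assumed to hold disjoint sets of documents, and the square-root tf and smoothed-log idf are an assumed TF-IDF variant chosen for illustration.

```python
import math


def global_idf(term, indexes):
    """Phase 1: aggregate statistics from every index into one IDF value."""
    total_docs = sum(ix["num_docs"] for ix in indexes)
    doc_freq = sum(ix["doc_freq"].get(term, 0) for ix in indexes)
    # Smoothed inverse document frequency (assumed formula).
    return 1.0 + math.log(total_docs / (doc_freq + 1))


def score(term, term_freq_in_doc, indexes):
    """Phase 2: score a document using the shared, aggregated IDF."""
    return math.sqrt(term_freq_in_doc) * global_idf(term, indexes)


# Hypothetical statistics for two indexes.
indexes = [
    {"num_docs": 10, "doc_freq": {"lucene": 1}},
    {"num_docs": 10, "doc_freq": {"lucene": 3}},
]
print(score("lucene", 4, indexes))
```

Because every index scores with the same aggregated IDF, scores from different indexes become comparable, at the cost of the extra round trip.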

ph0enix