My app needs to keep an index of files in which the files are known by tags and attributes, suggesting a Lucene (Java) document schema like:

tags: i s (indexed, stored)
attributes: i s
content: i
fileId: i s

(The actual file is looked up by id in sqlite.) However, while a file has only one set of tags/attributes, it may have multiple versions of its content (each identified by a versionId).

The only real solution, it seems, is a single document type with one document per version, so that the tags and attributes are repeated across many documents:

tags: i s
attributes: i s
content: i
versionId: i s
fileId: i s
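
As a rough illustration of this one-document-per-version layout, here is a minimal sketch in plain Java with no Lucene dependency. The class name `VersionDocs`, the method `flatten`, and the use of a `Map` to stand in for a document are all assumptions made for the example, not part of any Lucene API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: a Map<String, String> stands in for one index document.
final class VersionDocs {
    // Builds one field map per version; tags and attributes are repeated in each.
    static List<Map<String, String>> flatten(String fileId, String tags, String attributes,
                                             Map<String, String> contentByVersionId) {
        List<Map<String, String>> docs = new ArrayList<>();
        for (Map.Entry<String, String> version : contentByVersionId.entrySet()) {
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("tags", tags);
            doc.put("attributes", attributes);
            doc.put("content", version.getValue());
            doc.put("versionId", version.getKey());
            doc.put("fileId", fileId);
            docs.add(doc);
        }
        return docs;
    }
}
```

Each returned map would then be turned into one index document, which is where the redundancy in tags and attributes comes from.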

My concern about this schema is whether it will be performant enough and compact enough. So here are my questions:

  1. If I understand Lucene's indexing scheme correctly, when the same long string is indexed as a field in many documents, this doesn't really bulk out the index compared to if it were indexed just once. Correct?

  2. If I create a single Term object, make it stored, and then add it to many documents, does the full string data get duplicated for each document in the index? If this is the case, am I just best off putting the actual storage of the tags/attributes into sql?

  3. As far as I can tell, the only info that comes back in query results is the documents themselves ordered by score. To determine which fields satisfied the query for a matched document, must I do separate queries on the fields for each document, or what?

Understand that this is just a client-side app, so concurrent access is a non-issue, and index updates will be quite infrequent (every time the user retags or edits/creates a file). I'm mainly concerned about real-time response for a single user and to some extent about index size (though more for conserving memory rather than disk space).


MORE BACKGROUND

I considered some alternative document schemas, but rejected them. My initial instinct was to avoid data duplication by splitting documents into two types, one type for representing a file:

tags: i s
attributes: i s
fileId: i s

...but then one document type for representing the versions of files:

content: i
fileId: i s
versionId: i s

There are a number of problems with this:

First, it requires doing separate queries for content and for tags/attributes and then matching content results to files: for each version document in my results, I must take its fileId and look up the corresponding file document in a separate query. While this is a standard relational technique, my understanding is that it's a rather awkward and slow thing to do in Lucene.

Second, for a query requiring both "pizza" and "hot dog", I want to get back every file version that contains both terms, whether both appear in the tags/attributes, both appear in the content, or "hot dog" appears in one and "pizza" in the other. Splitting the tags/attributes from the content makes this very tricky (and likely expensive).
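
For contrast, when tags, attributes, and content live in the same document, the requirement reduces to a per-document check: every required phrase must occur in at least one of the searched fields. The sketch below models that check in plain Java. The class `AllTermsAnyField` is hypothetical, and case-insensitive substring matching stands in for real analysis and phrase queries. In Lucene itself this would roughly correspond to a `BooleanQuery` whose required clauses each accept a match in any of the fields.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical model: a Map<String, String> stands in for one index document.
final class AllTermsAnyField {
    // True if every required phrase occurs in at least one of the searched fields.
    static boolean matches(Map<String, String> doc, List<String> searchedFields,
                           List<String> requiredPhrases) {
        for (String phrase : requiredPhrases) {
            String needle = phrase.toLowerCase(Locale.ROOT);
            boolean found = false;
            for (String field : searchedFields) {
                String value = doc.get(field);
                if (value != null && value.toLowerCase(Locale.ROOT).contains(needle)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false; // this phrase is missing from every searched field
            }
        }
        return true;
    }
}
```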

So maybe I can just keep content and tags/attributes together by keeping multiple content fields:

tags: i s
attributes: i s
content: i  (multiple fields)
fileId: i s

The question is whether I can identify a content field so I can know which version content produced the hit. I could name each content field differently, corresponding to the version id:

tags: i s
attributes: i s
content {versionId}: i
content {versionId}: i
content {versionId}: i   # etc.
fileId: i s

Even if I could identify the content field(s) that caused the document to match the query, consolidating the versions messes up the scoring.

+2  A: 
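  1. Correct. Lucene stores a dictionary mapping strings to numerical identifiers, so each additional document costs only another copy of the identifier, not another copy of the string.
  2. Stored field values are written once per document, so the string data is duplicated, but I think you are safe storing the tags and attributes in Lucene.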
  3. You do not need separate queries - once you hold a Document object, you can use e.g. getField() to get the relevant field information. Since you are concerned about Lucene performance, I suggest you read Scaling Lucene and Solr, which covers lots of performance tips.
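
To make points 1 and 2 concrete, here is a toy inverted index in plain Java. It is only a model under simplifying assumptions: the class `ToyIndex` is hypothetical, tokenization is a whitespace split, and nothing is compressed, whereas real Lucene compresses both postings and stored fields. It shows that adding the same text again leaves the number of distinct terms unchanged and only adds posting entries, while the stored copies grow with every document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy model of an inverted index plus per-document stored values (not Lucene's API).
final class ToyIndex {
    // term -> ids of the documents containing it; each distinct term is kept once
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    // stored field values: one full copy per document
    private final List<String> stored = new ArrayList<>();

    int add(String text) {
        int docId = stored.size();
        stored.add(text);
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> ids = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // record each document at most once per term
            if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) {
                ids.add(docId);
            }
        }
        return docId;
    }

    int distinctTerms() {
        return postings.size();
    }

    int postingEntries() {
        int total = 0;
        for (List<Integer> ids : postings.values()) {
            total += ids.size();
        }
        return total;
    }

    int storedChars() {
        int total = 0;
        for (String value : stored) {
            total += value.length();
        }
        return total;
    }
}
```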
Yuval F
Thanks, Yuval. 3) I understand I can inspect the Document fields, but I want to get the set of fields in the Document that matched the query. My impression is that you're supposed to use filters and scoring in the query rather than sorting that stuff out after the fact.
Jegschemesch
This is a subtle issue. Lucene does not give this information as a default. I suggest you read: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Debugging-Relevance-Issues-Search and go on to explore Lucene explanations and highlighting. HTH
Yuval F
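
The per-field question from the comments above can be modelled outside Lucene as follows. This is a minimal sketch, not Lucene's mechanism: the class `MatchedFields` is hypothetical, a `Map` stands in for a document, and matching is a case-insensitive whole-token comparison. Inside Lucene, the explanation route mentioned above goes through `IndexSearcher.explain(Query, int)`, whose returned `Explanation` describes how a given document was scored.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical model: reports which fields of a matched document contain a query term.
final class MatchedFields {
    static Set<String> of(Map<String, String> doc, Collection<String> queryTerms) {
        Set<String> matched = new LinkedHashSet<>();
        for (Map.Entry<String, String> field : doc.entrySet()) {
            // whitespace tokenization stands in for a real analyzer
            Set<String> tokens = new HashSet<>(
                    Arrays.asList(field.getValue().toLowerCase(Locale.ROOT).split("\\s+")));
            for (String term : queryTerms) {
                if (tokens.contains(term.toLowerCase(Locale.ROOT))) {
                    matched.add(field.getKey());
                    break;
                }
            }
        }
        return matched;
    }
}
```

Running a check like this only on the top few hits keeps the extra cost small for a single-user client application.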