I am using Lucene to index and search a small number of large documents. Using the demo from the Lucene site, I have indexed the documents and am able to search them. However, each search result only points to the file that contains the match, which is not much help when the document is very large.

I am wondering if Lucene can index these very large documents and create an abstraction over them which provides much more fine-grained results.

An example might better explain what I mean. Consider a very large book, such as the Bible. One file contains the entire text of the Bible, so with the demo, a search for, say, 'Damascus' would simply point to that file. What I would like to do is keep the large document, but have searches return results pointing to a Book, a Chapter, or even something as precise as a Verse. So a search for 'Damascus' could return (among others) Book 23, Chapter 7, Verse 8.

Is this possible (and best-practice in Lucene usage), or should I instead attempt to section the large document into many small files to index?

If it makes any difference, I am using Java Lucene 2.9.0 and am indexing HTML files of approximately 1MB - 4MB each. That is not large in terms of file size, but it is large for a person to read through.


EDIT

I don't think I've explained this as well as I could. Here goes for another example.

Say I take my large HTML file, and (for argument's sake) the search term 'Damascus' appears 3 times: once on line 100 within a <div> tag, once on line 2000 within a <p> tag, and once on line 5000 within a <h1> tag. Is it possible to index with Lucene such that there will be 3 results, each pointing to the specific element the term was found in?

I don't think I want a separate result for every occurrence of the term. So if the term 'Damascus' appeared twice within a specific <div>, there would only be one match.

It appears from a comment from Kragen that what I would want to do is parse the HTML when Lucene is going through the indexing phase. Then I can decide the chunk I want to consider as one document from what is read in by the parser. So if I see a div with a certain class I can begin a new Lucene document and it will be returned as a separate hit when a word within the div content is searched on.

Does this sound like what I want to do, and is it possible?

A: 

One way to do this is to create several documents out of a single book. The documents could represent books, chapters or verses. As the text need not be unique, this is what I would do. This way, the first verse in the first chapter of the book of Genesis will be indexed four times: in the whole Bible, in the book of Genesis, in the first chapter and as the verse itself.
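As a rough illustration of that idea, the following sketch builds one indexable unit per granularity from a list of verses. It deliberately uses only plain Java, not the Lucene API; the class `GranularUnits`, the `Verse` type and the id format (`book:1`, `chapter:1.1`, ...) are all hypothetical names chosen for the example. In a real index, each entry of the resulting map would presumably become one Lucene document, with the id stored alongside the text.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GranularUnits {
    /** One verse of the source text (hypothetical type for this sketch). */
    static final class Verse {
        final int book;
        final int chapter;
        final int verse;
        final String text;

        Verse(int book, int chapter, int verse, String text) {
            this.book = book;
            this.chapter = chapter;
            this.verse = verse;
            this.text = text;
        }
    }

    /**
     * Returns a map from unit id to unit text. Every verse contributes its
     * text to four units: the whole work, its book, its chapter and itself.
     */
    static Map<String, String> buildUnits(List<Verse> verses) {
        Map<String, StringBuilder> acc = new LinkedHashMap<String, StringBuilder>();
        for (Verse v : verses) {
            String[] ids = {
                "all",
                "book:" + v.book,
                "chapter:" + v.book + "." + v.chapter,
                "verse:" + v.book + "." + v.chapter + "." + v.verse
            };
            for (String id : ids) {
                StringBuilder sb = acc.get(id);
                if (sb == null) {
                    sb = new StringBuilder();
                    acc.put(id, sb);
                } else {
                    sb.append(' '); // separate consecutive verses with a space
                }
                sb.append(v.text);
            }
        }
        Map<String, String> units = new LinkedHashMap<String, String>();
        for (Map.Entry<String, StringBuilder> e : acc.entrySet()) {
            units.put(e.getKey(), e.getValue().toString());
        }
        return units;
    }

    public static void main(String[] args) {
        List<Verse> verses = Arrays.asList(
            new Verse(1, 1, 1, "first verse text"),
            new Verse(1, 1, 2, "second verse text"));
        for (Map.Entry<String, String> e : buildUnits(verses).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

The trade-off is index size: the same text is stored at every level, which is acceptable here because the source files are only a few megabytes.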

A subtlety here is the exact goal of retrieval: Do you want just to display the search keywords in context to a user? In this case consider using a Lucene highlighter. If you need the retrieval to be further used (i.e. take the retrieved pointer to a chapter or verse and do some processing on this place in the text) I would go with the finer-grained documents as I described before.

Yuval F
The goal is to display the HTML in a Swing application; search results will allow the user to navigate to that part of the HTML. Search may also provide a preview. Just to be clear, when you say 'create several documents out of a single book', do you mean Lucene documents, or new files?
Grundlefleck
I mean Lucene documents.
Yuval F
+1  A: 

Yes - Lucene can record the character offsets of matching terms (for example, when a field is indexed with term vectors that store offsets), so these can be used to figure out where in the indexed content you need to look for matches.

There is a Lucene.Highlight add-on that does this exact task for you - try this article. There are also a couple of questions on StackOverflow concerning hit highlighting (many of these are tailored to web apps and so also do things like surrounding matching words with <b> tags).
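To make the idea of "offset to location" concrete, here is a minimal sketch in plain Java that finds the character offsets of a term and converts each offset into a line number. It is not how Lucene or its highlighter works internally: the matching is a naive case-insensitive substring search rather than analyzer-based tokenization, and the class and method names (`OffsetLocator`, `findOffsets`, `lineOf`) are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class OffsetLocator {
    /**
     * Returns the start offsets of every non-overlapping, case-insensitive
     * occurrence of term in text. Assumes text where lower-casing does not
     * change string length (true for ASCII).
     */
    static List<Integer> findOffsets(String text, String term) {
        List<Integer> offsets = new ArrayList<Integer>();
        if (term.isEmpty()) {
            return offsets;
        }
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = term.toLowerCase(Locale.ROOT);
        int from = 0;
        while (true) {
            int at = haystack.indexOf(needle, from);
            if (at < 0) {
                break;
            }
            offsets.add(at);
            from = at + needle.length();
        }
        return offsets;
    }

    /** Returns the 1-based line number that contains the given offset. */
    static int lineOf(String text, int offset) {
        int line = 1;
        for (int i = 0; i < offset && i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                line++;
            }
        }
        return line;
    }

    public static void main(String[] args) {
        String text = "one\nroad to Damascus\nthree\ndamascus again";
        for (int offset : findOffsets(text, "Damascus")) {
            System.out.println("offset " + offset + ", line " + lineOf(text, offset));
        }
    }
}
```

In a Swing viewer, an offset like this could be used to scroll to or select the matching region, provided the offsets refer to the same text that is being displayed.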

UPDATE: Depending on how you search your index, you might also find that it's a good idea to split your large documents into smaller sections (for example chapters). However, this is more a question of how you want to organise, prioritise and present your results to the end user.

For example, suppose a user searches for "foo" and there are 2 books containing that term. The first book (book A) contains 2 chapters that each have many references to "foo", but the term is barely mentioned in the rest of the book. The second book (book B) contains many references to "foo", but they are scattered around the whole book. If you index by book, you will probably find that book B is the first hit; if you index by chapter, you are likely to find that the 2 chapters from book A are the first 2 hits, followed by the chapters from book B.
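The effect can be reproduced with a toy score. The sketch below uses `sqrt(term frequency) / sqrt(number of tokens)`, which is loosely modeled on the term-frequency and length-normalization factors of Lucene's default scoring; it ignores everything else that contributes to a real score (inverse document frequency, boosts, query normalization), so treat it as an illustration only. The chapter sizes and hit counts are invented numbers, and `GranularityRanking` is a hypothetical class name.

```java
import java.util.ArrayList;
import java.util.List;

class GranularityRanking {
    /** Builds a chapter of the given length whose first `hits` tokens are "foo". */
    static List<String> chapter(int hits, int length) {
        List<String> tokens = new ArrayList<String>();
        for (int i = 0; i < length; i++) {
            tokens.add(i < hits ? "foo" : "filler");
        }
        return tokens;
    }

    /** Concatenates chapters into a single book-sized token list. */
    static List<String> book(List<List<String>> chapters) {
        List<String> tokens = new ArrayList<String>();
        for (List<String> c : chapters) {
            tokens.addAll(c);
        }
        return tokens;
    }

    /** Toy score: sqrt(frequency of term) / sqrt(number of tokens). */
    static double score(List<String> tokens, String term) {
        if (tokens.isEmpty()) {
            return 0.0;
        }
        int freq = 0;
        for (String t : tokens) {
            if (t.equals(term)) {
                freq++;
            }
        }
        return Math.sqrt(freq) / Math.sqrt(tokens.size());
    }

    /** Book A: two dense chapters (10 hits each), eight chapters with none. */
    static List<List<String>> bookA() {
        List<List<String>> chapters = new ArrayList<List<String>>();
        for (int i = 0; i < 10; i++) {
            chapters.add(chapter(i < 2 ? 10 : 0, 100));
        }
        return chapters;
    }

    /** Book B: ten chapters with 3 hits each, scattered evenly. */
    static List<List<String>> bookB() {
        List<List<String>> chapters = new ArrayList<List<String>>();
        for (int i = 0; i < 10; i++) {
            chapters.add(chapter(3, 100));
        }
        return chapters;
    }

    public static void main(String[] args) {
        System.out.println("book A:        " + score(book(bookA()), "foo"));
        System.out.println("book B:        " + score(book(bookB()), "foo"));
        System.out.println("A, chapter 1:  " + score(bookA().get(0), "foo"));
        System.out.println("B, chapter 1:  " + score(bookB().get(0), "foo"));
    }
}
```

With these made-up numbers, book B outscores book A when whole books are compared, while the dense chapters of book A outscore every chapter of book B when chapters are compared.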

Finally, the user will be presented with 1 hit per matching document in your index. If you want to present your users with a list of matching books, index by book; if a list of matching chapters is more appropriate, index by chapter.

Kragen
Using your example, can I keep book B as a single file, in my case an HTML file, and create several Lucene Documents from within that one file, so that all the results from the single file can be reported to the user as discrete hits? Is it possible to index by chapter when the chapters are in the same *file*? Thanks for your answer :)
Grundlefleck
You can index by chapter by giving Lucene only a subset of that file when you index - this will give you 1 hit per matching chapter. If you want to present the user with a hit per discrete match then you will need to go through and find all the occurrences for each matching document - there is no way to split a book up into enough Lucene documents so that each hit is guaranteed to correspond to exactly 1 occurrence of that word / phrase.
Kragen