I have to keep up with structured documents such as requests for proposals, government program reports, threat models and the like. They are written in what I would call techno-legalese: highly structured, with section numbering and 3, 4 or 5 levels of nesting. All of them are in English.

I need a more efficient way to locate the paragraphs or nuggets that matter to me. What I’d like is a kind of local document index/repository that would let me keep some standing queries and easily locate the sections in documents that talk about them. Here’s an example:

  • I’d like to load in 10 large PDF files, each of say 100 pages. Each PDF contains English text, formatted very nicely into paragraphs and sections.

  • I’d like to specify that I am interested in “blogging platforms”, “weaknesses in Ruby” and “localization and internationalization”.

  • Ideally I’d then look at a list showing each matching section of text, the name of its document, and other information that seems related to and/or includes the words and phrases I specified.

I am sure something like this exists. I would call it something like document indexing, document comprehension or structured searching.

A: 

Take a look at Lucene (http://lucene.apache.org/) and Solr (http://lucene.apache.org/solr/), which can do most of what you ask. They are not exactly featherweight though!

There is also this excellent book: http://www.amazon.com/Building-Search-Applications-Lucene-Lingpipe/dp/0615204252/
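To give a rough feel for what a section-level index with standing queries does, here is a minimal sketch in plain Python. It is not Lucene's or Solr's API, just a toy inverted index. It assumes the PDF text has already been extracted to plain text by some other tool, and that section headings start with a number like `2.1`. The names `split_sections`, `build_index` and `run_query` are made up for illustration.

```python
import re
from collections import defaultdict

# Assumption: a heading looks like "2.1 Weaknesses in Ruby".
# A body line that happens to start with a number would be misread as a heading.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(.*)$")


def split_sections(doc_name, text):
    """Split extracted text into (doc_name, section_number, section_text) tuples."""
    sections = []
    number, lines = None, []
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADING.match(line)
        if match:
            if number is not None:
                sections.append((doc_name, number, " ".join(lines)))
            # The heading title is kept as part of the section text.
            number, lines = match.group(1), [match.group(2)]
        elif number is not None and line:
            lines.append(line)
    if number is not None:
        sections.append((doc_name, number, " ".join(lines)))
    return sections


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(sections):
    """Map each term to the set of positions (in `sections`) that contain it."""
    index = defaultdict(set)
    for position, (_, _, body) in enumerate(sections):
        for term in tokenize(body):
            index[term].add(position)
    return index


def run_query(query, sections, index):
    """Return the sections that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return [sections[position] for position in sorted(hits)]


if __name__ == "__main__":
    sample = (
        "1 Introduction\n"
        "This report covers web tooling.\n"
        "1.1 Blogging platforms\n"
        "Several blogging platforms were reviewed.\n"
        "2 Security\n"
        "2.1 Weaknesses in Ruby\n"
        "Known weaknesses in Ruby deployments are listed here.\n"
    )
    sections = split_sections("report.pdf", sample)
    index = build_index(sections)
    standing_queries = ["blogging platforms", "weaknesses in Ruby"]
    for query in standing_queries:
        for doc, number, body in run_query(query, sections, index):
            print(f"{query!r}: {doc}, section {number}: {body}")
```

This only matches sections that contain all of the query words. Lucene and Solr add the parts that matter at scale, such as relevance ranking, stemming, phrase queries and persistent storage.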

johanbev
A: 

OpenGrok is another lightweight solution on top of Lucene: http://hub.opensolaris.org/bin/view/Project+opengrok/

Alternatively, you could have a look at http://www.alfresco.com, which is not a lightweight solution, but it is designed exactly for your purposes.

Moisei