Need a tool to search large structured text documents for words, phrases and related phrases

+2 Q:


I have to keep up with structured documents such as requests for proposals, government program reports, threat models and the like. They are written in what I would call techno-legalese: highly structured, with section numbering and 3, 4 or 5 levels of nesting. All of them are in English.

I need a more efficient way to locate the paragraphs or nuggets that matter to me. What I’d like is a kind of local document index/repository that would let me keep some standing queries and easily locate the sections in documents that discuss them. Here’s an example:

I’d like to load in 10 large PDF files of, say, 100 pages each. Each PDF contains English text, formatted very nicely into paragraphs and sections.
I’d like to specify that I am interested in “blogging platforms”, “weaknesses in Ruby”, and “localization and internationalization”.
Ideally, I would then see a list showing each matching section of text, the name of its document, and other information that seems related to, or includes, the words and phrases I specified.
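The workflow described above can be sketched roughly as follows. This is only a minimal illustration, not an existing tool: it assumes the PDF text has already been extracted to plain text by some other means, and the heading pattern, the `Section` class and the function names are all hypothetical. The heading detection is deliberately naive (any line that starts with a dotted number is treated as a section heading), so real documents would need something more careful.

```python
import re
from dataclasses import dataclass

# Assumed heading shape: a dotted section number ("3", "3.1", "3.1.4", ...)
# followed by a title. Naive: a body line starting with a number also matches.
HEADING = re.compile(r"^(\d+(?:\.\d+){0,4})\.?\s+(\S.*)$")


@dataclass
class Section:
    document: str  # name of the source document
    number: str    # section number, e.g. "1.1"
    title: str     # heading text
    text: str      # body text of the section


def split_sections(document, text):
    """Split already-extracted plain text into numbered sections."""
    sections = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        match = HEADING.match(stripped)
        if match:
            current = Section(document, match.group(1), match.group(2), "")
            sections.append(current)
        elif current is not None and stripped:
            current.text += stripped + " "
    return sections


def run_standing_queries(sections, queries):
    """Return (query, document, section number, title) for each phrase hit."""
    hits = []
    for section in sections:
        haystack = (section.title + " " + section.text).lower()
        for query in queries:
            if query.lower() in haystack:
                hits.append((query, section.document, section.number, section.title))
    return hits


# Made-up sample text standing in for one extracted document.
SAMPLE = """\
1 Scope
This report reviews candidate blogging platforms.
1.1 Risks
Known weaknesses in Ruby deployments are listed here.
2 Delivery
Localization and internationalization are out of scope.
"""

QUERIES = [
    "blogging platforms",
    "weaknesses in Ruby",
    "localization and internationalization",
]

sections = split_sections("report.pdf", SAMPLE)
hits = run_standing_queries(sections, QUERIES)
for query, document, number, title in hits:
    print(f"{query!r}: {document}, section {number} ({title})")
```

Exact phrase matching like this finds only literal occurrences; the "related phrases" part of the request is what a real search engine with ranking, stemming or synonyms would add.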

I am sure something like this exists. I would call it something like document indexing, document comprehension or structured searching.

A:

Take a look at Lucene (http://lucene.apache.org/) and Solr (http://lucene.apache.org/solr/), which can do most of what you ask. They are not exactly featherweight, though!
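To give a feel for what an index of this kind does conceptually, here is a toy inverted index with a simple TF-IDF-style ranking. It is a sketch only and is not Lucene's API or its scoring formula: the `TinyIndex` class, the section identifiers and the sample sentences are made up for illustration, and there is no stemming, stop-word removal or phrase handling.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index: term -> {section id: term count}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.section_count = 0

    def add(self, section_id, text):
        for term, count in Counter(tokenize(text)).items():
            self.postings[term][section_id] = count
        self.section_count += 1

    def search(self, query):
        """Rank sections by a sum of (term count * inverse document frequency)."""
        scores = defaultdict(float)
        for term in set(tokenize(query)):
            matches = self.postings.get(term, {})
            if not matches:
                continue
            # Terms that occur in fewer sections carry more weight.
            idf = math.log(1 + self.section_count / len(matches))
            for section_id, count in matches.items():
                scores[section_id] += count * idf
        # Highest score first; ties broken by section id.
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = TinyIndex()
index.add("rfp.pdf 2.1", "The vendor must support localization of the user interface.")
index.add("rfp.pdf 2.2", "Internationalization and localization testing is required.")
index.add("report.pdf 4", "Budget figures for the program are listed below.")

results = index.search("localization and internationalization")
for section_id, score in results:
    print(f"{score:.2f}  {section_id}")
```

Unlike an exact phrase match, a section that contains only some of the query words still shows up, just lower in the list, which is roughly the "related" behaviour asked about.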

There is also this excellent book: http://www.amazon.com/Building-Search-Applications-Lucene-Lingpipe/dp/0615204252/

johanbev 2010-06-05 14:10:32

A:

OpenGrok is another option, a lightweight solution built on top of Lucene: http://hub.opensolaris.org/bin/view/Project+opengrok/

Alternatively, you could have a look at http://www.alfresco.com, which is not a lightweight solution but is designed exactly for your purposes.

Moisei 2010-06-05 16:27:55
