We have a Perl-based web application whose data originates from a vast repository of flat text files. Those flat files are placed into a directory on our system, we parse them extensively, inserting bits of information into a MySQL database, and then move the files to their archived repository and permanent home (/www/website/archive/*.txt). Now, we don't parse every single bit of data from these flat files, and some of the more obscure items never make it into the database.

The current requirement is for users to be able to perform a full-text search of the entire flat-file repository from a Perl-generated web page and get back a list of hits they can then click on to open the text files for review.

What is the most elegant, efficient, and least CPU-intensive method to enable this search capability?

+3  A: 

I recommend using a dedicated search engine to do your indexing and searches.

I haven't looked at search engines recently, but I used ht://dig a few years ago, and was happy with the results.

Update: It looks like ht://dig is a zombie project at this point, so you may want to use another engine. Hyper Estraier, besides being unpronounceable, looks promising.

daotoad
+9  A: 

I'd recommend, in this order:

  1. Suck the whole of every document into a MySQL table and use MySQL's full-text search and indexing features. I've never done it, but MySQL has always been able to handle more than I can throw at it. (A rough sketch of this approach follows the list.)

  2. Swish-E (http://swish-e.org/) still exists and is designed for building full-text indexes and allowing ranked results. I've been running it for a few years and it works pretty well.

  3. You can use File::Find in your Perl code to chew through the repository like grep -r, but it will suck compared to one of the indexed options above. However, it will work, and might even surprise you :)

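A minimal sketch of what option 1 could look like, assuming a hypothetical archive_docs table with a FULLTEXT index on its body column (the table, column, and connection details here are made up for illustration):

use DBI;

# One-time setup (FULLTEXT requires MyISAM on older MySQL versions):
#   CREATE TABLE archive_docs (
#       id   INT AUTO_INCREMENT PRIMARY KEY,
#       path VARCHAR(255) NOT NULL,
#       body MEDIUMTEXT,
#       FULLTEXT (body)
#   ) ENGINE=MyISAM;

my $dbh = DBI->connect( 'dbi:mysql:database=website', 'user', 'password',
                        { RaiseError => 1 } );

my $sth = $dbh->prepare(q{
    SELECT path, MATCH (body) AGAINST (?) AS score
    FROM   archive_docs
    WHERE  MATCH (body) AGAINST (?)
    ORDER  BY score DESC
});

my $query = 'whatever the user typed into the search box';
$sth->execute( $query, $query );

while ( my ( $path, $score ) = $sth->fetchrow_array ) {
    print "$path ($score)\n";    # each hit links back to its file under /www/website/archive/
}

Loading the archive would then just be a matter of inserting each file's full contents into archive_docs as part of the existing parse-and-move step.
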
Nathan
Now that you mention it, I've heard good things about Swish-E. Great recommendation.
daotoad
I'll second the swish-e recommendation. It's a little bizarre at first (I found the terminology confusing) but once you get past that it works really, really well and really fast!
Joe Casadonte
Has anybody tried the MySQL option? I have wanted to mess with it since I noticed the section in the manual a version or two ago.
Nathan
I've found that Lucene (or even better, Solr) is the best way: just have a Tomcat server running and make requests to it from your server-side Perl. It's a piece of cake to set up and run.
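A rough sketch of the Perl side of that, assuming Solr is running on its default port with the stock /solr/select endpoint and that each indexed document carries an id field holding the file path (the endpoint, port, and field name are all assumptions):

use LWP::UserAgent;
use JSON;
use URI::Escape;

my $ua    = LWP::UserAgent->new;
my $terms = uri_escape('whatever the user typed');

# Ask Solr for JSON instead of its default XML response.
my $res = $ua->get(
    "http://localhost:8983/solr/select?q=$terms&wt=json&rows=20"
);
die 'Solr request failed: ' . $res->status_line unless $res->is_success;

my $data = decode_json( $res->content );
for my $doc ( @{ $data->{response}{docs} } ) {
    print "$doc->{id}\n";    # assumed to be the archived file's path
}
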
derby
+2  A: 

I second the recommendation to add an indexing engine. Consider Namazu from http://namazu.org. When I needed it, it looked easier to get started with than Swish-e or ht://dig, and I'm quite content with it.

If you don't want the overhead of an indexer, look at forking grep/egrep. Once the text volume runs to multiple megabytes, this will be significantly faster than scanning purely in Perl, e.g.:

# Let find/xargs/egrep do the heavy lifting and read their output through a pipe.
open GREP, "find $dirlist -name '$filepattern' | xargs egrep '$textpattern' |"
    or die "grep: $!";
while (<GREP>) {
    ...    # each line looks like "filename:matching text" -- collect the hits here
}
close GREP;

Bonus: use file-name conventions like dates/tags/etc. to reduce the set of files to grep. The clunky find ... | xargs ... is there to work around the shell's limit on command-line length when expanding wildcards, which you might hit with big archives.
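For instance, with a hypothetical date-stamped naming scheme, narrowing the pattern (and optionally the modification time) keeps egrep away from most of the archive:

# Only grep files from the requested month, e.g. report-YYYY-MM-*.txt,
# and skip anything not touched in the last 30 days.
my $filepattern = "report-$year-$month-*.txt";
open GREP, "find $dirlist -name '$filepattern' -mtime -30 | xargs egrep '$textpattern' |"
    or die "grep: $!";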

pklausner
A: 

I see someone recommended Lucene/Plucene. Check out KinoSearch; I have been using it for a year or more on a Catalyst-based project and am very happy with the performance and the ease of programming and maintenance.

The caveat on that page should be considered for your circumstance, but I can attest to the module's stability.

RET