We have a Perl-based web application whose data originates from a vast repository of flat text files. Those flat files are placed into a directory on our system, we parse them extensively, inserting bits of information into a MySQL database, and then move the files to their archived repository and permanent home (/www/website/archive/*.txt). Now, we don't parse every single bit of data from these flat files, and some of the more obscure items never make it into the database.

The current requirement is for users to be able to perform a full-text search of the entire flat-file repository from a Perl-generated web page and get back a list of hits they can then click on to open the text files for review.

What is the most elegant, efficient, and least CPU-intensive method to enable this search capability?

+3  A: 

I recommend using a dedicated search engine to do your indexing and searches.

I haven't looked at search engines recently, but I used ht://dig a few years ago, and was happy with the results.

Update: It looks like ht://dig is a zombie project at this point, so you may want to use another engine. Hyper Estraier, besides being unpronounceable, looks promising.

daotoad
+9  A: 

I'd recommend, in this order:

  1. Suck the whole of every document into a MySQL table and use MySQL's full-text search and indexing features. I've never done it, but MySQL has always been able to handle more than I can throw at it. (A rough sketch of this approach follows the list.)

  2. Swish-E (http://swish-e.org/) still exists and is designed for building full-text indexes and allowing ranked results. I've been running it for a few years and it works pretty well.

  3. You can use File::Find in your Perl code to chew through the repository like grep -r, but it will suck compared to one of the indexed options above. However, it will work, and might even surprise you :)

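A minimal sketch of what option 1 could look like, assuming a hypothetical archive_docs table with a FULLTEXT index on its body column (the table, column, and connection details here are made up for illustration):

use DBI;

# One-time setup (FULLTEXT requires MyISAM on older MySQL versions):
#   CREATE TABLE archive_docs (
#       id   INT AUTO_INCREMENT PRIMARY KEY,
#       path VARCHAR(255) NOT NULL,
#       body MEDIUMTEXT,
#       FULLTEXT (body)
#   ) ENGINE=MyISAM;

my $dbh = DBI->connect( 'dbi:mysql:database=website', 'user', 'password',
                        { RaiseError => 1 } );

my $sth = $dbh->prepare(q{
    SELECT path, MATCH (body) AGAINST (?) AS score
    FROM   archive_docs
    WHERE  MATCH (body) AGAINST (?)
    ORDER  BY score DESC
});

my $query = 'whatever the user typed into the search box';
$sth->execute( $query, $query );

while ( my ( $path, $score ) = $sth->fetchrow_array ) {
    print "$path ($score)\n";    # each hit links back to its file under /www/website/archive/
}

Loading the archive would then just be a matter of inserting each file's full contents into archive_docs as part of the existing parse-and-move step.
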
Nathan
Now that you mention it, I've heard good things about Swish-E. Great recommendation.
daotoad
I'll second the swish-e recommendation. It's a little bizarre at first (I found the terminology confusing) but once you get past that it works really, really well and really fast!
Joe Casadonte
Has anybody tried the MySQL option? I have wanted to mess with it since I noticed the section in the manual a version or two ago.
Nathan
I've found that Lucene (or even better, Solr) is the best way: just have a Tomcat server running and make requests to it from your server-side Perl. It's a piece of cake to set up and run.
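A rough sketch of the Perl side of that, assuming Solr is running on its default port with the stock /solr/select endpoint and that each indexed document carries an id field holding the file path (the endpoint, port, and field name are all assumptions):

use LWP::UserAgent;
use JSON;
use URI::Escape;

my $ua    = LWP::UserAgent->new;
my $terms = uri_escape('whatever the user typed');

# Ask Solr for JSON instead of its default XML response.
my $res = $ua->get(
    "http://localhost:8983/solr/select?q=$terms&wt=json&rows=20"
);
die 'Solr request failed: ' . $res->status_line unless $res->is_success;

my $data = decode_json( $res->content );
for my $doc ( @{ $data->{response}{docs} } ) {
    print "$doc->{id}\n";    # assumed to be the archived file's path
}
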
derby
+2  A: 

I second the recommendation to add an indexing engine. Consider Namazu from http://namazu.org. When I needed it, it looked easier to get started with than Swish-e or ht://dig, and I'm quite content with it.

If you don't want the overhead of an indexer, look at forking grep/egrep. Once the text volume runs to multiple megabytes, this will be significantly faster than scanning purely in Perl, e.g.:

# Let find/xargs/egrep do the heavy lifting and read their output through a pipe.
open GREP, "find $dirlist -name '$filepattern' | xargs egrep '$textpattern' |"
    or die "grep: $!";
while (<GREP>) {
    ...    # each line looks like "filename:matching text" -- collect the hits here
}
close GREP;

Bonus: use file-name conventions like dates/tags/etc. to reduce the set of files to grep. The clunky find ... | xargs ... is there to work around the shell's limit on command-line length when expanding wildcards, which you might hit with big archives.
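For instance, with a hypothetical date-stamped naming scheme, narrowing the pattern (and optionally the modification time) keeps egrep away from most of the archive:

# Only grep files from the requested month, e.g. report-YYYY-MM-*.txt,
# and skip anything not touched in the last 30 days.
my $filepattern = "report-$year-$month-*.txt";
open GREP, "find $dirlist -name '$filepattern' -mtime -30 | xargs egrep '$textpattern' |"
    or die "grep: $!";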

pklausner
A: 

I see someone recommended Lucene/Plucene. Check out KinoSearch; I have been using it for a year or more on a Catalyst-based project and am very happy with the performance and the ease of programming and maintenance.

The caveat on that page should be considered for your circumstance, but I can attest to the module's stability.

RET