tags:
views: 31
answers: 3

Hi community,

I have to search a huge number of text files, spread across a Unix server's disks, for a given string (there is no way around it). Given the time and resources this will take, just coming out with the list of files that contain the token in question feels like a meager result compared to the investment.

This feels wrong.

Considering that I will have to parse all these files anyway, wouldn't it be more profitable to build an index of this content, at least for statistics?

How can I do that? What tool?
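To give an idea of what I mean by "statistics": even a crude word-frequency table over all the files would be a start. A minimal sketch with standard tools, assuming GNU-style find/xargs/tr (the stock tools on an older Unix may need slightly different syntax), and with /data and the *.txt filter as placeholders:

find /data -type f -name '*.txt' -print0 | xargs -0 cat | tr -cs '[:alnum:]' '\n' | sort | uniq -c | sort -rn > /tmp/wordfreq.txt

But that only gives global counts and doesn't record which file contains what, hence the question about a real index.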

Any hints appreciated :)

A: 

If you only have to do a one-time search, setting up an indexer may be overkill; but if you plan to do more than one search, an interesting tool I have heard about is Strigi.

It is already packaged at least for Debian, Ubuntu and Gentoo, is OS- and desktop-environment-independent, and has both graphical and command-line interfaces.

enzotib
Strigi looks good, but there are no binaries for HP-UX. Compiling it sounds a bit tricky (it requires CMake and friends). Thanks for the lead!
ExpertNoob1
A: 

Will the files change often enough that maintaining the index will be an issue? If so, then consider whether you will use it often enough to justify the time and effort in keeping it up to date.

Personally, I'd just use find / -name \*.txt -exec grep -n "my search string" {} \; 2>/dev/null >/tmp/grep.out (adjust arguments as appropriate) and then sit back and listen to the disk chatter...
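If all you actually need is the list of files rather than the matching lines, grep's -l option prints just the file names and stops reading each file at the first match, which also saves some time:

find / -name \*.txt -exec grep -l "my search string" {} \; 2>/dev/null >/tmp/grep.out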

TMN
A: 

I used to use

find . -type f -print0 | xargs -0 grep -Pl "string"

but then I started using ack, which is way faster and skips things like backup files by default.
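For example, to get just the list of matching files (ack recurses into the given directory by default; the path is a placeholder):

ack -l "string" /path/to/search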

Making an index of everything is a huge task. I found that even Berkeley DB starts to slow down after a few hundred million entries.

Lucene (http://lucene.apache.org/) is a full-text indexing and search library, typically used for websites. I assume it could also be used to index a whole disk.
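If you want to try it, the Lucene distribution ships a small command-line demo (org.apache.lucene.demo.IndexFiles and org.apache.lucene.demo.SearchFiles) that can index a directory tree and then query it. Very roughly, and with the caveat that the exact jar names and arguments depend on the Lucene version, so treat this as a sketch rather than a recipe:

java -cp lucene-core.jar:lucene-demo.jar org.apache.lucene.demo.IndexFiles -index /tmp/lucene-index -docs /path/to/textfiles
java -cp lucene-core.jar:lucene-demo.jar org.apache.lucene.demo.SearchFiles -index /tmp/lucene-index -query "my search string"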

anttir