Hi all,

Can you suggest any CPAN modules for searching a large sorted file?

The file contains structured data, about 15 to 20 million lines, but I only need to find about 25,000 matching entries, so I don't want to load the whole file into a hash.

Thanks.

+5  A: 

Perl is well-suited to doing this, without the need for an external module (from CPAN or elsewhere).

Some code:

while (<STDIN>) {
    if (/regular expression/) {
        # process each matched line here, e.g. print it
        print;
    }
}

You'll need to come up with your own regular expression to specify which lines you want to match in your file. Once you match, you need your own code to process each matched line.

Put the above code in a script file and run it with your file redirected to stdin.
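
For example, assuming the script is saved as search.pl (the script and file names here are just placeholders):

perl search.pl < big_file.txt > matches.txt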

dave
I want to avoid a nested while loop over the 25-million-line file and the 25,000-line file, since that linear scan for each entry takes too long.
est
+2  A: 

When you process an input file with while ( <$filehandle> ), Perl reads the file one line at a time (one line per iteration of the loop), so you don't need to worry about it filling up your memory. Not so with a for loop, which slurps the whole file into memory at once. Use a regex or whatever else to find what you're looking for, and put the results in a variable/array/hash or write them out to a new file.
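
For illustration, a minimal sketch of the line-at-a-time pattern (the file name and pattern are placeholders):

open my $fh, '<', 'big_file.txt' or die "Can't open big_file.txt: $!";
while ( my $line = <$fh> ) {          # reads one line per iteration
    print $line if $line =~ /pattern/;
}
close $fh;

# By contrast, this form reads the entire file into memory first:
# for my $line (<$fh>) { ... }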

carillonator
+5  A: 

A scan over the whole file may be the fastest way. You can also try File::Sorted, which will do a binary search for a given record. Locating one record in a 25-million-line file should require about 25 seeks (roughly log2 of 25 million). This means that to search for 25,000 records you would only need around 0.6 million seeks/comparisons, compared to 25,000,000 line reads to naively examine each row.

Disk IO being what it is, you may want to try the easy way first, but File::Sorted is a theoretical win.
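
For illustration, a rough sketch of the same binary-search idea using the core Search::Dict module (this is not File::Sorted's own interface; the file name and keys are placeholders, and the file is assumed to be sorted stringwise on its leading field):

use strict;
use warnings;
use Search::Dict;                      # core module: binary search over a sorted file

open my $fh, '<', 'big_sorted_file.txt' or die "Can't open: $!";

for my $key (qw(alpha bravo charlie)) {
    # Position the filehandle at the first line ge $key (stringwise).
    my $pos = look $fh, $key;
    next if $pos < 0;
    my $line = <$fh>;
    print $line if defined $line && index($line, $key) == 0;
}
close $fh;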

jrockway
Your File::Sorted uses Moose, which is not currently installed in our production environment. I found File::SortedSeek on CPAN, which seems to do a binary search - does it serve the same purpose? My testing of its speed and memory footprint looks good.
est
+2  A: 

You don't want to search the file, so do what you can to avoid it. We don't know much about your problem, but here are some tricks I've used in previous problems, all of which try to do work ahead of time:

  • Break up the file into a database. That could be SQLite, even.
  • Pre-index the file based on the data that you want to search.
  • Cache the results from previous searches.
  • Run common searches ahead of time, automatically.

All of these trade storage space for speed. Some of these I would set up as overnight jobs so the results were ready for people when they came in to work.
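
As one example of the pre-indexing idea, here is a rough sketch that builds a DBM index of byte offsets keyed on each record's leading field (the file names, the tab-separated layout, and the availability of DB_File are assumptions):

use strict;
use warnings;
use DB_File;                           # on-disk key/value store for the index

# Build the index once, e.g. as an overnight job.
tie my %offset_for, 'DB_File', 'records.idx'
    or die "Can't tie records.idx: $!";

open my $fh, '<', 'big_sorted_file.txt' or die "Can't open: $!";
my $offset = tell $fh;
while (my $line = <$fh>) {
    my ($key) = split /\t/, $line, 2;  # assumes a tab-separated leading key
    $offset_for{$key} = $offset;
    $offset = tell $fh;
}

# Later, each lookup is a single seek instead of a scan.
for my $key (qw(alpha bravo charlie)) {
    next unless exists $offset_for{$key};
    seek $fh, $offset_for{$key}, 0;
    print scalar <$fh>;
}
close $fh;
untie %offset_for;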

You mention that you have structured data, but don't say any more. Is each line a complete record? How often does this file change?

brian d foy
Currently I'm quite happy with the performance of the binary search (as suggested by jrockway) rather than slurping the whole 20 million lines into a hash. The large file changes quarterly, so I will move this to SQLite or DBM.
est
+2  A: 

Sounds like you really want a database. Consider SQLite, using Perl's DBI and DBD::SQLite modules.
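
For illustration, a minimal sketch of that approach (the schema, file name, and tab-separated layout are assumptions):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=records.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

# One-time load: turn the flat file into an indexed table.
$dbh->do('CREATE TABLE IF NOT EXISTS records (rec_key TEXT PRIMARY KEY, rec_value TEXT)');

$dbh->begin_work;
my $insert = $dbh->prepare('INSERT OR REPLACE INTO records (rec_key, rec_value) VALUES (?, ?)');
open my $fh, '<', 'big_sorted_file.txt' or die "Can't open: $!";
while (my $line = <$fh>) {
    chomp $line;
    my ($key, $value) = split /\t/, $line, 2;   # assumes a tab-separated key and value
    $insert->execute($key, $value);
}
close $fh;
$dbh->commit;

# Each lookup now uses the primary-key index instead of scanning the file.
my $lookup = $dbh->prepare('SELECT rec_value FROM records WHERE rec_key = ?');
for my $key (qw(alpha bravo charlie)) {
    $lookup->execute($key);
    while (my ($value) = $lookup->fetchrow_array) {
        print "$key\t$value\n";
    }
}
$dbh->disconnect;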

Randal Schwartz