Hi all,

Can you suggest any CPAN modules for searching a large sorted file?

The file contains structured data, about 15 to 20 million lines, but I only need to find about 25,000 matching entries, so I don't want to load the whole file into a hash.

Thanks.

+5  A: 

Perl is well-suited to doing this, without the need for an external module (from CPAN or elsewhere).

Some code:

while (<STDIN>) {
    if (/regular expression/) {
        # process each matched line here, e.g. print it
        print;
    }
}

You'll need to come up with your own regular expression to specify which lines you want to match in your file. Once you match, you need your own code to process each matched line.

Put the above code in a script file and run it with your file redirected to stdin.
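
For example, assuming the script is saved as search.pl (the script and file names here are just placeholders):

perl search.pl < big_file.txt > matches.txt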

dave
I want to avoid a nested while loop over the 25-million-line file and the 25,000-line file, since that linear scan for each entry takes too long.
est
+2  A: 

When you process an input file with while ( <$filehandle> ), Perl reads the file one line at a time (one line per iteration of the loop), so you don't need to worry about it filling up your memory. Not so with a for loop, which slurps the whole file into memory at once. Use a regex or whatever else to find what you're looking for, and put the results in a variable/array/hash or write them out to a new file.
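
For illustration, a minimal sketch of the line-at-a-time pattern (the file name and pattern are placeholders):

open my $fh, '<', 'big_file.txt' or die "Can't open big_file.txt: $!";
while ( my $line = <$fh> ) {          # reads one line per iteration
    print $line if $line =~ /pattern/;
}
close $fh;

# By contrast, this form reads the entire file into memory first:
# for my $line (<$fh>) { ... }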

carillonator
+5  A: 

A scan over the whole file may be the fastest way. You can also try File::Sorted, which will do a binary search for a given record. Locating one record in a 25-million-line file should require about 25 seeks (roughly log2 of 25 million). This means that to search for 25,000 records you would only need around 0.6 million seeks/comparisons, compared to 25,000,000 line reads to naively examine each row.

Disk IO being what it is, you may want to try the easy way first, but File::Sorted is a theoretical win.
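
For illustration, a rough sketch of the same binary-search idea using the core Search::Dict module (this is not File::Sorted's own interface; the file name and keys are placeholders, and the file is assumed to be sorted stringwise on its leading field):

use strict;
use warnings;
use Search::Dict;                      # core module: binary search over a sorted file

open my $fh, '<', 'big_sorted_file.txt' or die "Can't open: $!";

for my $key (qw(alpha bravo charlie)) {
    # Position the filehandle at the first line ge $key (stringwise).
    my $pos = look $fh, $key;
    next if $pos < 0;
    my $line = <$fh>;
    print $line if defined $line && index($line, $key) == 0;
}
close $fh;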

jrockway
Your File::Sorted uses Moose, which is not currently installed in our production environment. I found File::SortedSeek on CPAN, which seems to do a binary search - does it serve the same purpose? My testing of its speed and memory footprint looks good.
est
+2  A: 

You don't want to search the file, so do what you can to avoid it. We don't know much about your problem, but here are some tricks I've used in previous problems, all of which try to do work ahead of time:

  • Break up the file into a database. That could be SQLite, even.
  • Pre-index the file based on the data that you want to search.
  • Cache the results from previous searches.
  • Run common searches ahead of time, automatically.

All of these trade storage space for speed. Some of these I would set up as overnight jobs so the results were ready for people when they came in to work.
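
As one example of the pre-indexing idea, here is a rough sketch that builds a DBM index of byte offsets keyed on each record's leading field (the file names, the tab-separated layout, and the availability of DB_File are assumptions):

use strict;
use warnings;
use DB_File;                           # on-disk key/value store for the index

# Build the index once, e.g. as an overnight job.
tie my %offset_for, 'DB_File', 'records.idx'
    or die "Can't tie records.idx: $!";

open my $fh, '<', 'big_sorted_file.txt' or die "Can't open: $!";
my $offset = tell $fh;
while (my $line = <$fh>) {
    my ($key) = split /\t/, $line, 2;  # assumes a tab-separated leading key
    $offset_for{$key} = $offset;
    $offset = tell $fh;
}

# Later, each lookup is a single seek instead of a scan.
for my $key (qw(alpha bravo charlie)) {
    next unless exists $offset_for{$key};
    seek $fh, $offset_for{$key}, 0;
    print scalar <$fh>;
}
close $fh;
untie %offset_for;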

You mention that you have structured data, but don't say any more. Is each line a complete record? How often does this file change?

brian d foy
Currently I'm quite happy with the performance of the binary search (as suggested by jrockway) rather than slurping the whole 20 million lines into a hash. The large file changes quarterly, so I will move this to SQLite or DBM.
est
+2  A: 

Sounds like you really want a database. Consider SQLite, using Perl's DBI and DBD::SQLite modules.
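
For illustration, a minimal sketch of that approach (the schema, file name, and tab-separated layout are assumptions):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=records.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

# One-time load: turn the flat file into an indexed table.
$dbh->do('CREATE TABLE IF NOT EXISTS records (rec_key TEXT PRIMARY KEY, rec_value TEXT)');

$dbh->begin_work;
my $insert = $dbh->prepare('INSERT OR REPLACE INTO records (rec_key, rec_value) VALUES (?, ?)');
open my $fh, '<', 'big_sorted_file.txt' or die "Can't open: $!";
while (my $line = <$fh>) {
    chomp $line;
    my ($key, $value) = split /\t/, $line, 2;   # assumes a tab-separated key and value
    $insert->execute($key, $value);
}
close $fh;
$dbh->commit;

# Each lookup now uses the primary-key index instead of scanning the file.
my $lookup = $dbh->prepare('SELECT rec_value FROM records WHERE rec_key = ?');
for my $key (qw(alpha bravo charlie)) {
    $lookup->execute($key);
    while (my ($value) = $lookup->fetchrow_array) {
        print "$key\t$value\n";
    }
}
$dbh->disconnect;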

Randal Schwartz