views: 468
answers: 5
I have a very large comma-delimited file coming out of a database report that looks something like this:

field1,field2,field3,metricA,value1
field1,field2,field3,metricB,value2

I want the new file to combine lines like these, so it would look something like this:

field1,field2,field3,value1,value2

I'm able to do this using a hash. In this example, the first three fields form the key, and I combine value1 and value2 in a certain order as the value. After I've read in the file, I just print out the hash table's keys and values into another file. Works fine.
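
In outline, what I'm doing now looks something like this (a trimmed-down sketch; the actual file names and the rule for ordering the two values are more involved):

use strict;
use warnings;

# 'report.csv' and 'combined.csv' are placeholder names for illustration.
my %merged;
open my $in, '<', 'report.csv' or die "Can't open report.csv: $!";
while (my $line = <$in>) {
    chomp $line;
    my ($f1, $f2, $f3, $metric, $value) = split /,/, $line;
    my $key = join ',', $f1, $f2, $f3;
    # Assumes metricA's value goes first and metricB's second.
    if ($metric eq 'metricA') {
        $merged{$key}[0] = $value;
    }
    else {
        $merged{$key}[1] = $value;
    }
}
close $in;

open my $out, '>', 'combined.csv' or die "Can't open combined.csv: $!";
for my $key (keys %merged) {
    print $out join(',', $key, @{ $merged{$key} }), "\n";
}
close $out;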

However, I have some concerns since my file is going to be very large. About 8 GB per file.

Would there be a more efficient way of doing this? I'm not thinking in terms of speed, but in terms of memory footprint. I'm concerned that this process could die due to memory issues. I'm just drawing a blank on a solution that would work without ultimately shoving everything into a very large hash.

For full disclosure, I'm using ActiveState Perl on Windows.

+3  A: 

Would it not be better to make another export directly from the database into your new file, instead of reworking the file you have already output? If this is an option, then I would go that route.

Chris Ballance
I would love to do just that, but I can't get direct access to the database.
geoffrobinson
So how do you get the first file?
Chris Ballance
I get the file via some http interface the database owner has provided. I specify which metrics I want and it puts a metric on each line. I wish it would just add columns to the result it outputs, but it isn't meant to be.
geoffrobinson
Sounds like more of a job for the database owner than for you. It would be far easier to do on his end.
Chris Ballance
+6  A: 

If your rows are sorted on the key, or if for some other reason equal values of field1,field2,field3 are adjacent, then a state machine will be much faster. Just read over the lines, and if the key fields are the same as the previous line's, emit both values.

Otherwise, at least, you can take advantage of the fact that you have exactly two values and delete the key from your hash when you find the second value -- this should substantially limit your memory usage.
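
A rough sketch of that second approach, assuming each key appears exactly twice and that metricA's value should come first:

use strict;
use warnings;

# Only rows still waiting for their partner stay in the hash.
my %pending;   # key => [ metric, value ]
while (my $line = <>) {
    chomp $line;
    my ($f1, $f2, $f3, $metric, $value) = split /,/, $line;
    my $key = join ',', $f1, $f2, $f3;
    if (my $first = delete $pending{$key}) {
        # Second row for this key: the delete above already freed the entry,
        # so just order the two values and print the combined row.
        my ($a, $b) = $first->[0] eq 'metricA'
                    ? ($first->[1], $value)
                    : ($value, $first->[1]);
        print "$key,$a,$b\n";
    }
    else {
        $pending{$key} = [ $metric, $value ];
    }
}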

Joel Hoffman
It isn't sorted in a manner that lets me do the first option. But I think I will delete the entry from the hash table once I determine it has been accessed twice. Seems like the best bet.
geoffrobinson
You can always sort it with the external sort program, or use Sort::External as suggested by Axeman. If a matching pair of rows can occur far apart in the file, then I'd almost guarantee this will be faster than using a hash on a file this size.
j_random_hacker
The hash method may be n^2 (looking n items up in an O(n) data structure), while the sort may be n log n, but on the other hand, if the hash fits in memory, the sort may involve a lot more disk I/O. Hard to say which would be faster.
Joel Hoffman
If pairs are very far apart in the list, the hash probably won't fit in memory. The worst case is if they're separated by half the file on average, but perhaps they're only separated by a much smaller fraction.
Joel Hoffman
@Joel: The hash is theoretically O(1) per lookup for O(n) overall -- the problem is that this will exhaust RAM on a typical PC for an 8 GB file, unless you get lucky and each pair of lines occurs one after the other.
j_random_hacker
Basically don't fully load either file into memory.
Brad Gilbert
+5  A: 

If you have Unix-like tools available (for example via Cygwin), you could sort the file beforehand using the sort command (which can cope with huge files). Or possibly you could get the database to output the data already sorted.

Once the file is sorted, doing this sort of merge is easy: iterate down a line at a time, keeping the last line and the next line in memory, and output whenever the key changes.
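
For example, something along these lines, assuming the sorted file puts each metricA row immediately before its matching metricB row:

use strict;
use warnings;

my ($prev_key, $prev_value);
while (my $line = <>) {
    chomp $line;
    my ($f1, $f2, $f3, $metric, $value) = split /,/, $line;
    my $key = join ',', $f1, $f2, $f3;
    if (defined $prev_key && $key eq $prev_key) {
        # Same key as the previous line: emit the combined row.
        print "$key,$prev_value,$value\n";
        ($prev_key, $prev_value) = (undef, undef);
    }
    else {
        ($prev_key, $prev_value) = ($key, $value);
    }
}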

Nick Fortescue
+2  A: 

You could try something with Sort::External. It reminds me of a mainframe sort that you can use right in the program logic. It's worked pretty well for what I've used it for.
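
Roughly, using the module's feed/finish/fetch interface (the 64 MB memory threshold here is just an arbitrary example):

use strict;
use warnings;
use Sort::External;

my $sortex = Sort::External->new( mem_threshold => 64 * 1024 * 1024 );
while (my $line = <>) {
    $sortex->feed($line);
}
$sortex->finish;

while (defined( my $line = $sortex->fetch )) {
    # Rows with the same field1,field2,field3 are now adjacent, so they
    # can be combined with the same single pass used for a pre-sorted file.
    print $line;
}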

Axeman
+4  A: 

If you don't think the data will fit in memory, you can always tie your hash to an on-disk database:

use BerkeleyDB;
tie my %data, 'BerkeleyDB::Hash',
    -Filename => 'data',
    -Flags    => DB_CREATE   # create the database file if it doesn't exist yet
    or die "Cannot tie hash to on-disk database: $BerkeleyDB::Error";

while(my $line = <>){
    chomp $line;
    my @columns = split /,/, $line; # or use Text::CSV_XS to parse this correctly

    my $key = join ',', @columns[0..2];
    my $a_key = "$key:metric_a";
    my $b_key = "$key:metric_b";

    if($columns[3] eq 'metricA'){   # metric name as it appears in the sample data
        $data{$a_key} = $columns[4];
    }
    elsif($columns[3] eq 'metricB'){
        $data{$b_key} = $columns[4];
    }

    if(exists $data{$a_key} && exists $data{$b_key}){
        my ($a, $b) = map { $data{$_} } ($a_key, $b_key);
        print "$key,$a,$b\n";
        # optionally delete the data here, if you don't plan to reuse the database
    }
}
jrockway
I voted this up because it comes at the problem from another angle. Think Different.
Brad Gilbert