views: 468
answers: 5
I have a very large comma-delimited file coming out of a database report that looks something like this:

field1,field2,field3,metricA,value1
field1,field2,field3,metricB,value2

I want the new file to combine lines like these, so it would look something like this:

field1,field2,field3,value1,value2

I'm able to do this using a hash. In this example, the first three fields form the key, and I combine value1 and value2 in a certain order as the value. After I've read in the file, I just print out the hash table's keys and values into another file. Works fine.
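
In outline, what I'm doing now looks something like this (a trimmed-down sketch; the actual file names and the rule for ordering the two values are more involved):

use strict;
use warnings;

# 'report.csv' and 'combined.csv' are placeholder names for illustration.
my %merged;
open my $in, '<', 'report.csv' or die "Can't open report.csv: $!";
while (my $line = <$in>) {
    chomp $line;
    my ($f1, $f2, $f3, $metric, $value) = split /,/, $line;
    my $key = join ',', $f1, $f2, $f3;
    # Assumes metricA's value goes first and metricB's second.
    if ($metric eq 'metricA') {
        $merged{$key}[0] = $value;
    }
    else {
        $merged{$key}[1] = $value;
    }
}
close $in;

open my $out, '>', 'combined.csv' or die "Can't open combined.csv: $!";
for my $key (keys %merged) {
    print $out join(',', $key, @{ $merged{$key} }), "\n";
}
close $out;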

However, I have some concerns since my file is going to be very large. About 8 GB per file.

Would there be a more efficient way of doing this? I'm not thinking in terms of speed, but in terms of memory footprint. I'm concerned that this process could die due to memory issues. I'm just drawing a blank on a solution that would work without ultimately shoving everything into a very large hash.

For full disclosure, I'm using ActiveState Perl on Windows.

+3  A: 

Would it not be better to make another export directly from the database into your new file, instead of reworking the file you have already output? If this is an option, then I would go that route.

Chris Ballance
I would love to do just that, but I can't get direct access to the database.
geoffrobinson
So how do you get the first file?
Chris Ballance
I get the file via some http interface the database owner has provided. I specify which metrics I want and it puts a metric on each line. I wish it would just add columns to the result it outputs, but it isn't meant to be.
geoffrobinson
Sounds like more of a job for the database owner than for you. It would be far easier to do on his end.
Chris Ballance
+6  A: 

If your rows are sorted on the key, or if for some other reason equal values of field1,field2,field3 are adjacent, then a state machine will be much faster. Just read over the lines, and if the key fields are the same as the previous line's, emit both values.

Otherwise, at least, you can take advantage of the fact that you have exactly two values and delete the key from your hash when you find the second value -- this should substantially limit your memory usage.
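
A rough sketch of that second approach, assuming each key appears exactly twice and that metricA's value should come first:

use strict;
use warnings;

# Only rows still waiting for their partner stay in the hash.
my %pending;   # key => [ metric, value ]
while (my $line = <>) {
    chomp $line;
    my ($f1, $f2, $f3, $metric, $value) = split /,/, $line;
    my $key = join ',', $f1, $f2, $f3;
    if (my $first = delete $pending{$key}) {
        # Second row for this key: the delete above already freed the entry,
        # so just order the two values and print the combined row.
        my ($a, $b) = $first->[0] eq 'metricA'
                    ? ($first->[1], $value)
                    : ($value, $first->[1]);
        print "$key,$a,$b\n";
    }
    else {
        $pending{$key} = [ $metric, $value ];
    }
}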

Joel Hoffman
It isn't sorted in a manner that lets me do the first option. But I think I will delete the entry from the hash table once I determine it has been accessed twice. Seems like the best bet.
geoffrobinson
You can always sort it with the external sort program, or use Sort::External as suggested by Axeman. If a matching pair of rows can occur far apart in the file, then I'd almost guarantee this will be faster than using a hash on a file this size.
j_random_hacker
The hash method may be n^2 (looking n items up in an O(n) data structure), while the sort may be n log n, but on the other hand, if the hash fits in memory, the sort may involve a lot more disk I/O. Hard to say which would be faster.
Joel Hoffman
If pairs are very far apart in the list, the hash probably won't fit in memory. The worst case is if they're separated by half the file on average, but perhaps they're only separated by a much smaller fraction.
Joel Hoffman
@Joel: The hash is theoretically O(1) per lookup for O(n) overall -- the problem is that this will exhaust RAM on a typical PC for an 8 GB file, unless you get lucky and each pair of lines occurs one after the other.
j_random_hacker
Basically don't fully load either file into memory.
Brad Gilbert
+5  A: 

If you have Unix-like tools available (for example via Cygwin), you could sort the file beforehand using the sort command (which can cope with huge files). Or possibly you could get the database to output the data already sorted.

Once the file is sorted, doing this sort of merge is easy: iterate down a line at a time, keeping the last line and the next line in memory, and output whenever the key changes.
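
For example, something along these lines, assuming the sorted file puts each metricA row immediately before its matching metricB row:

use strict;
use warnings;

my ($prev_key, $prev_value);
while (my $line = <>) {
    chomp $line;
    my ($f1, $f2, $f3, $metric, $value) = split /,/, $line;
    my $key = join ',', $f1, $f2, $f3;
    if (defined $prev_key && $key eq $prev_key) {
        # Same key as the previous line: emit the combined row.
        print "$key,$prev_value,$value\n";
        ($prev_key, $prev_value) = (undef, undef);
    }
    else {
        ($prev_key, $prev_value) = ($key, $value);
    }
}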

Nick Fortescue
+2  A: 

You could try something with Sort::External. It reminds me of a mainframe sort that you can use right in the program logic. It's worked pretty well for what I've used it for.
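
Roughly, using the module's feed/finish/fetch interface (the 64 MB memory threshold here is just an arbitrary example):

use strict;
use warnings;
use Sort::External;

my $sortex = Sort::External->new( mem_threshold => 64 * 1024 * 1024 );
while (my $line = <>) {
    $sortex->feed($line);
}
$sortex->finish;

while (defined( my $line = $sortex->fetch )) {
    # Rows with the same field1,field2,field3 are now adjacent, so they
    # can be combined with the same single pass used for a pre-sorted file.
    print $line;
}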

Axeman
+4  A: 

If you don't think the data will fit in memory, you can always tie your hash to an on-disk database:

use BerkeleyDB;
tie my %data, 'BerkeleyDB::Hash',
    -Filename => 'data',
    -Flags    => DB_CREATE   # create the database file if it doesn't exist yet
    or die "Cannot tie hash to on-disk database: $BerkeleyDB::Error";

while(my $line = <>){
    chomp $line;
    my @columns = split /,/, $line; # or use Text::CSV_XS to parse this correctly

    my $key = join ',', @columns[0..2];
    my $a_key = "$key:metric_a";
    my $b_key = "$key:metric_b";

    if($columns[3] eq 'metricA'){   # metric name as it appears in the sample data
        $data{$a_key} = $columns[4];
    }
    elsif($columns[3] eq 'metricB'){
        $data{$b_key} = $columns[4];
    }

    if(exists $data{$a_key} && exists $data{$b_key}){
        my ($a, $b) = map { $data{$_} } ($a_key, $b_key);
        print "$key,$a,$b\n";
        # optionally delete the data here, if you don't plan to reuse the database
    }
}
jrockway
I voted this up because it comes at the problem from another angle. Think Different.
Brad Gilbert