I have a couple of text files (A.txt and B.txt) which look like this (might have ~10000 rows each)

processa,id1=123,id2=5321
processa,id1=432,id2=3721
processa,id1=3,id2=521
processb,id1=9822,id2=521
processa,id1=213,id2=1
processc,id1=822,id2=521

I need to check if every row in file A.txt is present in B.txt as well (B.txt might have more too, that is okay).

The rows can be in any order in the two files, so I am thinking I will sort both files in some particular order in O(n log n) and then match each line in A.txt against the corresponding lines in B.txt in O(n). I could build a hash, but the files are big and this comparison happens only once, after which the files are regenerated, so I don't think that is a good idea.

What is the best way to sort the files in Perl? Any ordering would do, it just needs to be some ordering.

For example, in dictionary ordering, this would be

processa,id1=123,id2=5321
processa,id1=213,id2=1
processa,id1=3,id2=521
processa,id1=432,id2=3721
processb,id1=9822,id2=521
processc,id1=822,id2=521

As I mentioned before, any ordering would be just as fine, as long as Perl is fast at doing it.

I want to do it from within Perl code, after opening the file like so

open(my $fh, '<', 'A.txt') or die "Can't open A.txt: $!";

Any comments, ideas etc would be helpful.
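
For reference, a minimal sketch of the sort-then-scan idea described above, assuming both files fit in memory and that plain string ordering is acceptable:

use strict;
use warnings;

# Read a file's lines and return them in sorted order.
# Any consistent ordering works, so the default string sort is enough.
sub sorted_lines {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    my @lines = <$fh>;
    close $fh;
    return sort @lines;
}

my @a = sorted_lines('A.txt');
my @b = sorted_lines('B.txt');

# Walk both sorted lists once, advancing through B as needed.
my $j = 0;
my @missing;
for my $line (@a) {
    $j++ while $j < @b and $b[$j] lt $line;
    if ($j < @b and $b[$j] eq $line) {
        $j++;                     # matched, consume this line of B
    } else {
        push @missing, $line;     # line from A not found in B
    }
}

print @missing
    ? "Lines in A missing from B:\n" . join('', @missing)
    : "Every line of A is present in B\n";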

A: 

Well, I routinely parse very large (600 MB) daily Apache log files with Perl, and to store the information I use a hash. I also go through about 30 of these files in one script instance, using the same hash. It's not a big deal, assuming you have enough RAM.

recursive9
+5  A: 

To sort the file in your script, you will still have to load the entire thing into memory. If you're doing that, I'm not sure what the advantage of sorting it is over just loading it into a hash.

Something like this would work:

my %seen;
open(my $a_fh, '<', 'A.txt') or die "Can't read A: $!";
while (<$a_fh>) {
    $seen{$_} = 1;
}
close $a_fh;

open(my $b_fh, '<', 'B.txt') or die "Can't read B: $!";
while (<$b_fh>) {
    delete $seen{$_};
}
close $b_fh;

print "Lines found in A, missing in B:\n";
print join "\n", keys %seen;
zigdon
It isn't necessary to load the entire file into memory at once in order to sort it. That's why the merge sort was invented. But if you have enough memory, this is a good approach.
cjm
A: 

May I ask why you must do this in native Perl? If the cost of a system call or three is not an issue (e.g. you do this infrequently and not in a tight loop), why not simply do:

# Sort both files, then count the lines present in file1 but not in file2.
my $cmd = "sort $file1 > $file1.sorted";
$cmd .= "; sort $file2 > $file2.sorted";
$cmd .= "; comm -23 $file1.sorted $file2.sorted | wc -l";
my $count = `$cmd`;
$count =~ s/\s+//g;
if ($count != 0) {
    print "Stuff in A exists that isn't in B\n";
}

Please note that the comm parameters might need to be different, depending on what exactly you want.

DVK
@DVK: This comparison is only a small part of what my Perl script does. These files are generated, compared and processed. As to why I am doing the whole thing in Perl, I need things to be portable, and work on a variety of environments, not just Unix.
Lazer
Well, both of these (sort and comm, and probably wc) should be available on Windows as well, but yeah, this is obviously less portable; you're right. It is, however, much better suited to large files. If you need pure-Perl portability, the hash solution is your ticket.
DVK
A: 

Test if A.txt is a subset of B.txt

open my $b_fh, '<', 'B.txt' or die "Can't read B.txt: $!";
open my $a_fh, '<', 'A.txt' or die "Can't read A.txt: $!";

my %bFile;

while (<$b_fh>) {
    my ($process, $id1, $id2) = split /,/;
    $bFile{$process}{$id1}{$id2}++;
}

my $missingRows = 0;

while (<$a_fh>) {
    my ($process, $id1, $id2) = split /,/;
    next if $bFile{$process}{$id1}{$id2};
    $missingRows++;
    last;   # One miss means A is not a subset of B
}

my $is_Atxt_Subset_Btxt = $missingRows ? 0 : 1;

That will give you a test for all rows in A being in B, while only reading all of B into a hash and then checking each line of A against it as A is read.

HerbN
@HerbN: No, I do not add or delete any information to the files. I just rearrange what is already there. The idea is to check whether `A.txt` is a subset of `B.txt` or not.
Lazer
@Lazer: Updated to do just that.
HerbN
thanks @HerbN!
Lazer
A: 

As usual, CPAN has an answer for this. Either Sort::External or File::Sort looks like it would work. I've never had occasion to try either, so I don't know which would be better for you.
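
A rough sketch of what the Sort::External route might look like (the mem_threshold value and file names here are only illustrative; check the module's documentation for the exact options):

use strict;
use warnings;
use Sort::External;

# Feed lines to an external sorter that spills to temp files on disk
# once the in-memory buffer grows past the threshold.
my $sortex = Sort::External->new( mem_threshold => 16 * 1024 * 1024 );

open my $in, '<', 'A.txt' or die "Can't read A.txt: $!";
$sortex->feed($_) while <$in>;
close $in;

$sortex->finish;

# Fetch the lines back in sorted order.
open my $out, '>', 'A.sorted' or die "Can't write A.sorted: $!";
while ( defined( my $line = $sortex->fetch ) ) {
    print {$out} $line;
}
close $out;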

Another possibility would be to use AnyDBM_File to create a disk-based hash that can exceed available memory. Without trying it, I couldn't say whether using a DBM file would be faster or slower than the sort, but the code would probably be simpler.
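
And a rough sketch of the disk-based hash idea with AnyDBM_File (the 'seen.db' file name is just a placeholder; Perl will pick whichever DBM backend is available):

use strict;
use warnings;
use AnyDBM_File;
use Fcntl;

# Tie a hash to an on-disk DBM file so it can grow beyond available RAM.
my %seen;
tie %seen, 'AnyDBM_File', 'seen.db', O_RDWR | O_CREAT, 0640
    or die "Can't tie DBM file: $!";

open my $b_fh, '<', 'B.txt' or die "Can't read B.txt: $!";
$seen{$_} = 1 while <$b_fh>;
close $b_fh;

open my $a_fh, '<', 'A.txt' or die "Can't read A.txt: $!";
my @missing;
while (<$a_fh>) {
    push @missing, $_ unless exists $seen{$_};
}
close $a_fh;
untie %seen;

if (@missing) {
    print "Lines in A missing from B:\n", @missing;
} else {
    print "A is a subset of B\n";
}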

cjm
A: 

Here's another way to do it. The idea is to create a flexible data structure that allows you to answer many kinds of questions easily with grep.

use strict;
use warnings;

my ($fileA, $fileB) = @ARGV;

# Load all lines: $h{LINE}{FILE_NAME} = TALLY
my %h;
$h{$_}{$ARGV} ++ while <>;

# Do whatever you need.
my @all_lines = keys %h;
my @in_both   = grep {     keys %{$h{$_}} == 2       } keys %h;
my @in_A      = grep {     exists $h{$_}{$fileA}     } keys %h;
my @only_in_A = grep { not exists $h{$_}{$fileB}     } @in_A;
my @in_A_mult = grep {            $h{$_}{$fileA} > 1 } @in_A;
FM
What if a given line occurs twice in one file but zero in the other?
HerbN
@HerbN It works fine. The number of keys in `%{$h{LINE}}` tells you how many files a particular LINE occurred in. And the tallies are file-specific: each tally tells you the number of times a particular LINE occurred within a particular FILE.
FM