I've had to do this before on some very large files whose timestamps did not come in order, and I didn't want to hold everything in memory. I used a three-pass solution:
- Tag each input line with its timestamp and save it in a temp file
- Sort the temp file with a fast sorter, like sort(1)
- Turn the sorted file back into the starting format
This was fast enough for my task, where I could let it run while I went for a cup of coffee, but you might have to do something fancier if you need the results really quickly.
use strict;
use warnings;
use File::Temp qw(tempfile);
my( $temp_fh, $temp_filename ) = tempfile( UNLINK => 1 );
# read each line, tag with timestamp, and write to temp file
# will sort and undo later.
my $current_timestamp = '';
LINE: while( <DATA> )
{
    chomp;
    if( m|^\d\d/\d\d/\d\d \d\d:\d\d:\d\d:\d\d\d\d$| ) # timestamp line
    {
        $current_timestamp = $_;
        next LINE;
    }
    elsif( m|\S| ) # line with non-whitespace (not a "blank line")
    {
        print $temp_fh "[$current_timestamp] $_\n";
    }
    else # blank lines
    {
        next LINE;
    }
}
close $temp_fh;
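# sort the file by lines using some very fast sorter
# note: whole lines are compared, so lines sharing a timestamp
# are also ordered by their text
system( "sort", qw(-o sorted.txt), $temp_filename ) == 0
    or die "sort failed with status $?";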
# read the sorted file and turn back into starting format
open my($in), "<", 'sorted.txt' or die "Could not read sorted.txt: $!";
$current_timestamp = '';
while( <$in> )
{
    my( $timestamp, $line ) = m/\[(.*?)] (.*)/;
    if( $timestamp ne $current_timestamp )
    {
        $current_timestamp = $timestamp;
        print $/, $timestamp, $/;
    }
    print $line, $/;
}
close $in;
# clean up; the variable declared above is $temp_filename
unlink $temp_filename, 'sorted.txt';
__END__
01/01/70 12:00:00:0004
This is line 3
This is line 4
This is line 5
01/01/70 12:00:00:0001
This is line 1
This is line 2
01/01/70 12:00:00:0004
This is line 6
This is line 7
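
One caveat with the script above: sort(1) compares the tagged lines as plain text, so MM/DD/YY timestamps from different years come out in the wrong order, and lines that share a timestamp get reordered by their text. A possible workaround, sketched here under assumptions, is to prefix each record with a fixed-width key that does sort correctly as text. The helper name `sort_key`, the tab-separated record layout, and the rule that two-digit years below 70 mean 20YY are all my own choices for illustration, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "MM/DD/YY hh:mm:ss:ffff" plus an input line
# number into a fixed-width key that sorts chronologically as plain text.
# Assumption: two-digit years below 70 mean 20YY, all others mean 19YY.
sub sort_key {
    my( $timestamp, $line_number ) = @_;
    my( $mon, $day, $yy, $time ) =
        $timestamp =~ m|^(\d\d)/(\d\d)/(\d\d) (\d\d:\d\d:\d\d:\d{4})$|
        or die "Unrecognized timestamp: $timestamp";
    my $year = $yy < 70 ? 2000 + $yy : 1900 + $yy;
    # year, month, day first; the zero-padded line number breaks ties
    # so records with the same timestamp keep their input order
    return sprintf '%04d%s%s %s %012d', $year, $mon, $day, $time, $line_number;
}

# Demo: build "key<TAB>timestamp<TAB>text" records, sort them as strings,
# then drop the key again when printing.
my @input = (
    [ '01/01/00 00:00:00:0000', 'logged in 2000' ],
    [ '12/31/99 23:59:59:9999', 'zebra, logged first in 1999' ],
    [ '12/31/99 23:59:59:9999', 'apple, logged second in 1999' ],
);
my $n = 0;
my @tagged = map { join "\t", sort_key( $_->[0], ++$n ), @$_ } @input;
for my $record ( sort @tagged ) {
    my( undef, $timestamp, $text ) = split /\t/, $record, 3;
    print "$timestamp  $text\n";
}
```

In the three-pass script, the same idea would mean writing `join "\t", sort_key( $current_timestamp, $. ), $current_timestamp, $_` to the temp file and splitting on tabs with a limit of 3 when reading the sorted file back. The external sort still sees one record per line, so the memory profile stays the same.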