Can someone give some hints on how to delete the last n lines from a file in Perl? I have a very large file, around 400 MB, and I want to delete the last 125,000 lines or so.

+13  A: 
Svante
+1 I like the shell idea. That would have been my initial approach, especially if it's a one-off thing.
Chris Kloberdanz
Yeah, I just used wc and head and it seems to be working. :)
Alien01
Actually, I think that in this case the Perl script scales better, because it doesn't write the file anew.
Svante
Yes, but I cannot find tie; I am using Perl on Windows.
Alien01
No need to use wc: head -n -5 FILE > NEWFILE will give you FILE in NEWFILE, minus the last 5 lines.
grepsedawk
Install Tie::File from CPAN
Svante
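
For reference, a minimal Tie::File sketch of the approach these comments describe (the file name and line count are illustrative assumptions):

use strict;
use warnings;
use Tie::File;

my ($file, $n) = ('myHugeFile', 125_000);   # assumed inputs

tie my @lines, 'Tie::File', $file or die "Cannot tie $file: $!";
$#lines -= $n;    # shrinking the tied array truncates the file in place
untie @lines;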
+4  A: 

Do you know how many lines there are, or is there any other clue about this file? Do you have to do this over and over again, or is it just one time?

If I had to do this once, I'd load the file in vim, look at the last line number, then delete from the first line I don't want through the end:

:1234567,$d

The general programming way is to do it in two passes: one to determine the number of lines, and then one to get rid of the lines.

The simple way is to print the right number of lines to a new file. It's inefficient only in terms of cycles and maybe a bit of disk thrashing, but most people have plenty of those to spare. Some of the stuff in perlfaq5 should help. You get the job done and you get on with life.

open my $in,  '<', $old_file or die $!;
open my $out, '>', $new_file or die $!;

while( <$in> )
   {
   print $out $_;
   last if $. >= $last_line_I_want;
   }
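
A minimal sketch of the first pass, which computes the $last_line_I_want used above (the variable names are assumptions):

# Pass one: count the lines so we know where to stop copying.
open my $count_fh, '<', $old_file or die $!;
my $total = 0;
$total++ while <$count_fh>;
close $count_fh;

my $last_line_I_want = $total - 125_000;   # keep all but the last 125,000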

If this is something you have to do a lot, or the data size is too large to rewrite, you can create an index of lines and byte offsets and truncate() the file to the right size, as sketched below. Since you keep the index, you only have to discover the new line endings, because you already know where you left off. Some file-handling modules can handle all of that for you.
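
A minimal one-shot sketch of that idea (persisting the index between runs is left out, and the names are illustrative assumptions):

use strict;
use warnings;

my ($file, $n) = ('myHugeFile', 125_000);   # assumed inputs

open my $fh, '+<', $file or die $!;

# The index: byte offset where each line starts.
my @offset = (0);
push @offset, tell $fh while <$fh>;

my $keep = @offset - 1 - $n;                # number of lines to keep
die "fewer than $n lines\n" if $keep < 0;

# Chop the file off right where the unwanted tail begins.
truncate $fh, $offset[$keep] or die "Could not truncate: $!";
close $fh;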

brian d foy
+3  A: 
  1. go to the end of the file: fseek
  2. count backwards that many lines
  3. find out the file position: ftell
  4. truncate file to that position as length: ftruncate
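In Perl those C calls are spelled seek, tell, and truncate. A rough sketch of the backwards scan, assuming the file ends with a newline (the file name, count, and block size are illustrative assumptions):

#!/usr/bin/perl
use strict;
use warnings;

my ($file, $n) = ('myHugeFile', 125_000);   # assumed inputs

open my $fh, '+<', $file or die $!;
binmode $fh;            # we work in byte offsets, so no newline translation

my $pos    = -s $fh;    # start at the end of the file
my $target = $n + 1;    # the ($n+1)th newline from the end terminates
                        # the last line we want to keep
my $count  = 0;
my $cut;                # byte length to truncate to, once found

BLOCK: while ( $pos > 0 ) {
    my $len = $pos < 4096 ? $pos : 4096;
    $pos -= $len;
    seek $fh, $pos, 0 or die $!;
    read $fh, my $buf, $len;

    # Scan this block right to left, counting newlines.
    my $i = length $buf;
    while ( $i > 0 ) {
        $i = rindex $buf, "\n", $i - 1;
        last if $i < 0;
        if ( ++$count == $target ) {
            $cut = $pos + $i + 1;   # keep everything through that newline
            last BLOCK;
        }
    }
}

# If the file has fewer than $n+1 lines, $cut stays undef: leave the file alone.
if ( defined $cut ) {
    truncate $fh, $cut or die "Could not truncate: $!";
}
close $fh;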
yogman
+2  A: 

I would just use a shell script for this problem:

tac file | sed '1,125000d' | tac

(tac is like cat but prints lines in reverse order. By Jay Lepreau and David MacKenzie. Part of GNU coreutils.)

Norman Ramsey
You have to redirect it into a file at the end. Also, you don't need this tac hack; head does what you want (also from coreutils).
Svante
Sure, or you could use the overwrite script from the Kernighan and Pike book. Color me stupid, but how can head do this? (tac is not a hack; it's often useful. I've had my private version [called revlines] for years. I'm pleased to see it in coreutils.)
Norman Ramsey
@Norman - the trick is to supply a negative argument to the -n option. From head(1): "with the leading ‘-’, print all but the last N lines of each file"
converter42
Oh, sweet! head must have acquired -n when I wasn't looking. Because I'm such a dinosaur I still write stuff like 'head -3 */README'. I see that's not even mentioned on the man page any more. Thank you for teaching me something new.
Norman Ramsey
A: 

The most efficient way would be to seek to the end of the file, then incrementally read segments backwards while counting the number of newlines in each, and then use truncate (see perldoc -f truncate) to trim it down. There is also a module or two on CPAN for reading a file backwards.

Shlomi Fish
+5  A: 

As folks have suggested Tie::File already, which does the job well, I'll lay out the basic algorithm in case you want to do it by hand. There are sloppy, slow ways to do it that work well for small files; here's the efficient way to do it for large files.

  1. Find the position in the file just before the Nth line from the end.
  2. Truncate everything after that point (using truncate()).

Step 1 is the tricky part. We don't know how many lines there are in the file or where they are. One way is to count up all the lines and then go back to the Nth from the end. This means we have to scan the whole file every time. More efficient is to read backwards from the end of the file. You can do this with read(), but it's easier to use File::ReadBackwards, which can go backwards line by line (while still using efficient buffered reads).

This means you read just 125,000 lines rather than the whole file. truncate() should be O(1) and atomic and cost almost nothing no matter how large the file. It simply resets the size of the file.

#!/usr/bin/perl

use strict;
use warnings;

use File::ReadBackwards;

my $LINES = 10;     # Change to 125_000 or whatever
my $File = shift;   # file passed in as argument

my $rbw = File::ReadBackwards->new($File) or die $!;

# Count backwards $LINES lines, or stop when the beginning of the file is hit
my $line_count = 0;
until( $rbw->eof || $line_count == $LINES ) {
    $rbw->readline;
    $line_count++;
}

# Chop off everything from that point on.
truncate($File, $rbw->tell) or die "Could not truncate! $!";
Schwern
I would worry about scalability with tie. You cannot handle a file bigger than the available virtual memory, plus you need to read everything, which can be time-consuming on a large file. The fseek solution is both fast and scalable.
James Anderson
I think you're confusing tie with something else. File::ReadBackwards uses a tie to give you a filehandle interface, but it doesn't read the entire file into memory. It reads from the end of the file as you need it (using seek, etc.).
brian d foy
A: 

Schwern: Are the use Fcntl and $rbw->get_handle lines in your script necessary? Also, I'd recommend reporting truncate errors in case it doesn't return true.

-- Douglas Hunter (who would have commented on that post if he could have)

douglashunter
Those lines weren't necessary. I suspect Schwern first tried to truncate the filehandle directly before switching to truncating it by filename.
brian d foy
A: 

Try this code:

my $n = 125_000;    # how many lines to remove
my $i = 0;
# Shell out to sed to delete the last line of the file, $n times over.
`sed -i '\$d' filename` while $i++ < $n;

Neeraj
A: 

Try this (note that it removes only the last line, not the last n):

:|dd of=urfile seek=1 bs=$(($(stat -c%s urfile)-$(tail -1 urfile|wc -c)))
nighteblis
A: 

My suggestion, using ed (the range $-124999,$ spans exactly 125,000 lines):

printf '$-124999,$d\nw\nq\n' | ed -s myHugeFile
mouviciel
A: 

This example code keeps a rolling buffer of byte offsets as it scans the file, then uses the earliest offset in the buffer to truncate the file. This of course will only work if truncate works on your system.

#! /usr/bin/env perl
use strict;
use warnings;
use autodie;

my $drop = 10;                    # how many trailing lines to remove

open my $file, '+<', 'test.in';   # read-write so we can truncate
my @list;                         # rolling buffer of byte offsets
while(<$file>){
  # tell() after reading a line is the offset where the next line starts.
  push @list, tell $file;
  shift @list if @list > $drop + 1;
}

seek $file, 0, 0;
# $list[0] is the byte offset where the last $drop lines begin.
truncate $file, $list[0] if @list > $drop;
close $file;

This has the added benefit that it only uses enough memory for the last few offsets and the current line.

Brad Gilbert