I would like to know which sed pattern I can use to change the first line of huge files (~2 GB). I prefer sed only because I assume it must be faster than a Python or Perl script.

The files have the following structure:

field 1, field 2, ... field n
data

Since the field identifiers are likely to contain spaces, I need to replace every space with an underscore, like this:

**BEFORE** 
the first name,the second name,the first surname,a nickname, ...
data

**AFTER**
the_first_name,the_second_name,the_first_surname,a_nickname, ...
data

Any pointers to the right pattern to use, or another scripting solution would be great.

Thank you.

+4  A: 

You are unlikely to notice any speed difference between Perl, Python, and sed. Your script will spend most of its time waiting for IO.

If the lines are the same length, you can edit in-place, otherwise you will have to create a new file.

In Perl:

#!/usr/bin/env perl
use strict;

my $filename = shift;
open my $in_fh, '<', $filename
  or die "Cannot open $filename for reading: $!";
my $first_line = <$in_fh>;

open my $out_fh, '>', "$filename.tmp"
  or die "Cannot open $filename.tmp for writing: $!";

$first_line =~ s/some translation/goes here/;

print {$out_fh} $first_line;
print {$out_fh} $_ while <$in_fh>; # sysread/syswrite is probably better

close $in_fh;
close $out_fh;

# overwrite original with modified copy
rename "$filename.tmp", $filename
  or warn "Failed to move $filename.tmp to $filename: $!";
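
For comparison, the same copy-and-rename approach can be sketched in shell with sed. This is only a minimal sketch: the file name `sample.csv` and its contents are made up for illustration, and the `1` address restricts the substitution to the first line while every other line is copied unchanged.

```shell
#!/bin/sh
# Hypothetical sample file standing in for the real 2 GB file.
file=sample.csv
printf 'the first name,a nickname\nsome data 1\nsome data 2\n' > "$file"

# Rewrite only line 1 (address "1"); all other lines pass through untouched.
# The whole file is still read and written once.
sed '1s/ /_/g' "$file" > "$file.tmp" &&
  mv "$file.tmp" "$file"   # replace the original only if sed succeeded

cat "$file"
```

As with the Perl version, the cost is dominated by reading and writing the full file, not by the substitution itself.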
jrockway
+2  A: 

The change you mention (replacing every space with an underscore) doesn't change the line's length, so in theory it could be done in place.

warning!: untested!

head -n 1 yourfile | sed -e 's/ /_/g' > tmpfile
dd conv=nocreat,notrunc if=tmpfile of=yourfile

i'm not so sure about the conv=... parameters, but it seems that it should make dd overwrite the start of the original file with the transformed line.
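
The idea can be tried on a small throwaway file first. The sketch below makes some assumptions: `sample.csv` and `header.tmp` are made-up names, and only `conv=notrunc` is used, since that is the option that keeps the rest of the output file intact (`nocreat` merely stops dd from creating a missing output file and is not part of POSIX). It refuses to write if the new header has a different byte count.

```shell
#!/bin/sh
# Hypothetical sample file; try this on a copy, never on the only original.
file=sample.csv
printf 'the first name,a nickname\nsome data 1\nsome data 2\n' > "$file"

# Transform only the header line into a small temporary file.
head -n 1 "$file" | sed 's/ /_/g' > header.tmp

# Compare byte counts (newline included) before touching the original.
old_len=$(head -n 1 "$file" | wc -c)
new_len=$(wc -c < header.tmp)

if [ "$old_len" -eq "$new_len" ]; then
  # notrunc: overwrite the first bytes and leave the rest of the file alone.
  dd if=header.tmp of="$file" conv=notrunc 2>/dev/null
else
  echo "header length changed; a full copy is needed" >&2
fi
rm -f header.tmp
cat "$file"
```

Only the header bytes are rewritten, so the run time should not depend on the size of the file.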

Please note that if you want any other transformation that could alter the line's length, do not do this. You'd have to do a full copy, something like this:

head -n 1 yourfile | sed -e 's/ /_/g' > tmpfile
tail -n +2 yourfile | cat tmpfile - > transformedfile
Javier
+10  A: 

To edit the first 10 lines

sed -i -e '1,10s/ /_/g' yourfile

In Perl, you can use the flip-flop operator in scalar context:

perl -i -pe 's/ /_/g if 1 .. 10' yourfile
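
Since the question only concerns the header, the range can shrink to a single line. Here is a small sketch on made-up input that contrasts the address `1` with the range `1,2`. The file names are hypothetical, and the output goes to new files rather than through `-i`, so it is safe to try.

```shell
#!/bin/sh
# Three hypothetical lines of input.
printf 'a b,c d\ne f\ng h\n' > demo.txt

# Address "1": substitute on the first line only.
sed '1s/ /_/g' demo.txt > demo.out

# Range "1,2": substitute on lines 1 through 2, leave line 3 alone.
sed '1,2s/ /_/g' demo.txt > demo.range

cat demo.out demo.range
```

Note that sed still reads the whole input in both cases. The address only limits where the substitution is applied.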
Leon Timmermans
That re needs a `g` on the end to make it replace all the spaces in the line, not just the first.
jleedev
perl -i -pe 's/ /_/g if 1 .. 10' ??? Wow, I've never heard of this syntax in "if 1..10". Sometimes I get a bit annoyed with Perl. Why all these exceptions? Why not just use a simple if($. < 11)?
@leon: wow, very neat trick!, thank you.
Alex. S.
@dehmann it is a flip-flop operator, see http://perldoc.perl.org/perlop.html very handy
szabgab
+7  A: 

I don't think you want to use any solution that requires the data to be written to a new file.

If you're pretty sure that all you need is to change the spaces into underscores in the first line of the large text files, you only have to read the first line, swap the characters and write it back in place:

#!/usr/bin/env perl
use strict;

my $filename = shift;
open (FH, "+< $filename") || die "can't open $filename: $!";
my $line = <FH>;
$line =~ s/ /_/g;
seek FH, 0, 0; # go back to the start of the file
print FH $line; # print, not printf: a '%' in the header must not be treated as a format
close FH;

To use it, just pass the full path of the file to update:

# fixheader "/path/to/myfile.txt"
Renaud Bompuis
That open || die is incorrect; it evaluates to open FH, ("+< $filename" || die "can't open $filename: $!"); Either use "or", or put parentheses around the parameters of open, or both: open(FH, "+< $filename") or die "can't open $filename: $!";
szabgab
True, thanks for noticing the bug.
Renaud Bompuis
That was going to be my solution as well. +1
Axeman
A: 

This could be a solution :


use Tie::File;
tie my @array,"Tie::File","path_to_file";
$array[0] = "new text";
untie @array;

Tie::File is one of the modules I use the most, and it's very simple to use. Each element in the array is a line in the file. One of the downsides, however, would be that this loads the whole file in memory.

Geo
actually it won't load the file if it does not have to, so if you only change the first line and the number of characters does not change this does not have much of an overhead.
szabgab
I think it's pretty rare to have the same number of characters after a line modification.
Geo