I'm maintaining a script that can get its input from various sources, and works on it per line. Depending on the actual source used, linebreaks might be Unix-style, Windows-style or even, for some aggregated input, mixed(!).

When reading from a file it goes something like this:

@lines = <IN>;
process(\@lines);

...

sub process {
    my $lines = shift;    # array reference passed in by the caller
    foreach my $line (@{$lines}) {
        chomp $line;
        #Handle line by line
    }
}

So, what I need to do is replace the chomp with something that removes either Unix-style or Windows-style linebreaks. I'm coming up with way too many ways of solving this, one of the usual drawbacks of Perl :)

What's your opinion on the neatest way to chomp off generic linebreaks? What would be the most efficient?

Edit: A small clarification: the method 'process' gets a list of lines from somewhere, not necessarily read from a file. Each line might have

  • No trailing linebreaks
  • Unix-style linebreaks
  • Windows-style linebreaks
  • Just a carriage return (when the original data has Windows-style linebreaks and is read with $/ = "\n")
  • An aggregated set where lines have different styles
+4  A: 
$line =~ s/[\r\n]+//g;
dsm
+6  A: 

Reading perlport I'd suggest something like

$line =~ s/\015?\012?$//;

to be safe on whatever platform you're on and whatever linebreak style you're processing, because the actual values of \r and \n may differ between Perl ports.
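A minimal sketch of how that substitution behaves on the line styles from the question. The helper name strip_eol and the sample strings are made up for illustration; the point is that only a trailing delimiter is removed and interior linebreaks stay put.

```perl
use strict;
use warnings;

# Hypothetical helper: strip one trailing CR, LF or CRLF, written with
# octal escapes so the result does not depend on what \r and \n mean locally.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\015?\012?$//;
    return $line;
}

my %samples = (
    unix    => "text\012",
    windows => "text\015\012",
    cr_only => "text\015",
    none    => "text",
);

for my $name (sort keys %samples) {
    printf "%-8s -> [%s]\n", $name, strip_eol($samples{$name});
}

# An interior linebreak is left alone; only the trailing one goes away.
my $record = strip_eol("col1\012col2\012");
```

All four samples come out as plain "text", and $record keeps its interior \012.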

Olfan
Potential bugs: 1) No /g, so it won't work on multi-line strings. 2) $, so it will only match delimiters that occur directly before the end of the string. 3) Fixed \015 \012 order, so if the data has \012\015 it will only eat one of them.
Kent Fredric
1)+2) As I don't know what's inside the lines, I had to assume there may be newline characters inside that shouldn't be removed (e.g. database records with linebreaking data columns). My intention was to match chomp()'s behaviour as closely as possible. 3) I've seen old Macs use \015 only and Windows still uses \015\012, but I have yet to see a real-world system using \012\015, so I felt this order would be safe. ;)
Olfan
Have a look at my updated answer and what it emits. There are conditions that are *especially* prevalent in line-based reading that really aren't obvious until you try it, e.g. with local $/ = "\015" you suddenly have a lot of \012 appearing in the output.
Kent Fredric
Careful! Simply merging two lines will join the last "word" of line X with the "first" word on line X+1. Depending on context you might want to not remove, but replace with a SPACE (or other delimiter)
lexu
+5  A: 

After digging through the perlre docs a bit, I'll present my best suggestion so far, which seems to work pretty well. Perl 5.10 added the \R escape sequence as a generalized linebreak:

$line =~ s/\R//g;

It's the same as:

(?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}])
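A minimal sketch of the difference between stripping every \R and stripping only a trailing one, which is closer to what chomp does. The helper names chomp_any and strip_all are made up for illustration, and \R needs Perl 5.10 or later.

```perl
use strict;
use warnings;
use 5.010;    # \R is only available from Perl 5.10 on

# Hypothetical helper: remove a single trailing generic linebreak.
sub chomp_any {
    my ($line) = @_;
    $line =~ s/\R\z//;
    return $line;
}

# Hypothetical helper: remove every linebreak, interior ones included.
sub strip_all {
    my ($line) = @_;
    $line =~ s/\R//g;
    return $line;
}

my @inputs = ("unix\012", "windows\015\012", "cr\015", "none");
print "[", chomp_any($_), "]\n" for @inputs;

# The two helpers differ as soon as a string contains interior linebreaks.
my $trailing_only = chomp_any("a\015\012b\012");
my $everything    = strip_all("a\015\012b\012");
```

Here $trailing_only still contains the interior \015\012, while $everything is just "ab".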

I'll keep this question open a while yet, just to see if there's more nifty ways waiting to be suggested.

Christoffer
I encourage you to accept your own answer if it works for you. \R may not work as expected on some exotic platforms (which is why I suggested the hardwired approach earlier), but if you're not into writing portable code and just want to get the job done, you're done here. You might consider putting Kent Fredric's test files through your code first, because they really are a good test case.
Olfan
+1  A: 

Extending on your answer:

use File::Slurp ();
my $value = File::Slurp::slurp($filename);
$value =~ s/\R*//g;

File::Slurp abstracts away the File IO stuff and just returns a string for you.

NOTE

  1. It is important to note the addition of /g: without it, given a multi-line string, only the first offending sequence is replaced.

  2. Also note the removal of $, which is redundant for this purpose: we want to strip all linebreaks, not just the ones before whatever $ means on this OS.

  3. In a multi-line string, $ matches only at the end of the string, which would be problematic.

  4. Point 3 means that point 2 assumes you'd also want to use /m; otherwise $ would be basically meaningless for anything practical in a string with more than one line. The alternative is single-line processing on an OS which actually understands $ and manages to find the \R* that precedes the $.
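To make the behaviour of $ in the notes above concrete, here is a small sketch comparing $, $ with /m, and \z. The strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $single = "abc\n";
my $multi  = "first\nsecond\n";

# Without /m, $ matches at the very end or just before a final newline.
my $dollar_final   = ($single =~ /abc$/)    ? 1 : 0;
# \z matches only at the very end of the string.
my $z_final        = ($single =~ /abc\z/)   ? 1 : 0;
# Without /m, $ does not match before an interior newline.
my $dollar_inner   = ($multi  =~ /first$/)  ? 1 : 0;
# With /m, $ also matches before interior newlines.
my $dollar_inner_m = ($multi  =~ /first$/m) ? 1 : 0;

print "abc\$ on single line:     $dollar_final\n";
print "abc\\z on single line:    $z_final\n";
print "first\$ without /m:       $dollar_inner\n";
print "first\$ with /m:          $dollar_inner_m\n";
```

The four flags come out as 1, 0, 0 and 1, which is why a trailing-anchored substitution only ever touches the end of a multi-line string unless /m is added.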

Examples

while( my $line = <$foo> ){
      $line =~ $regex;
}

Given the above loop, on an OS which does not understand your file's '\n' or '\r' delimiters, and with $/ left at the OS's default delimiter, the whole file is read as one contiguous string (unless the data happens to contain the OS's own delimiter, in which case it is split there).

So in this case all of these regex are useless:

  • s/\R*$// : Will only erase the last sequence of \R in the file
  • s/\R*// : Will only erase the first sequence of \R in the file
  • s/\012?\015?// : Will only erase the first \012\015, \012, or \015 sequence; \015\012 will result in either \012 or \015 being emitted.

  • s/\R*$// : If there happen to be no byte sequences of '\015$OSDELIMITER' in the file, then NO linebreaks will be removed except for the OS's own ones.

It would appear nobody gets what I'm talking about, so here is example code, that is tested to NOT remove line feeds. Run it, you'll see that it leaves the linefeeds in.

#!/usr/bin/perl 

use strict;
use warnings;

my $fn = 'TestFile.txt';

my $LF = "\012";
my $CR = "\015";

my $UnixNL = $LF;
my $DOSNL  = $CR . $LF;
my $MacNL  = $CR;

sub generate { 
    my $filename = shift;
    my $lineDelimiter = shift;

    open my $fh, '>', $filename or die "Cannot write $filename: $!";
    for ( 0 .. 10 )
    {
        print $fh "{0}";
        print $fh join "", map { chr( int( rand(26) + 60 ) ) } 0 .. 20;
        print $fh "{1}";
        print $fh $lineDelimiter->();
        print $fh "{2}";
    }
    close $fh;
}

sub parse { 
    my $filename = shift;
    my $osDelimiter = shift;
    my $message = shift;
    print "Parsing $message File $filename : \n";

    local $/ = $osDelimiter;

    open my $fh, '<', $filename or die "Cannot read $filename: $!";
    while ( my $line = <$fh> )
    {

        $line =~ s/\R*$//;
        print ">|" . $line . "|<";

    }
    print "Done.\n\n";
}


my @all = ( $DOSNL,$MacNL,$UnixNL);
generate 'Windows.txt' , sub { $DOSNL }; 
generate 'Mac.txt' , sub { $MacNL };
generate 'Unix.txt', sub { $UnixNL };
generate 'Mixed.txt', sub {
    # Pick one of the three delimiters at random for each line
    return $all[ int( rand(@all) ) ];
};


for my $os ( ["$MacNL", "On Mac"], ["$DOSNL", "On Windows"], ["$UnixNL", "On Unix"]){
    for ( qw( Windows Mac Unix Mixed ) ){
        parse $_ . ".txt", @{ $os };
    }
}

For the CLEARLY Unprocessed output, see here: http://pastebin.com/f2c063d74

Note that certain combinations do of course work, but they are likely the ones you yourself naively tested.

Note that in this output, all results must be of the form >|$string|<>|$string|< with NO LINE FEEDS to be considered valid output.

Here $string is of the general form {0}$data{1}$delimiter{2}, where in all output sources there should be either:

  1. Nothing between {1} and {2}
  2. only |<>| between {1} and {2}
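A compact in-memory variant of the same test, as a sketch only: the helper name read_records and the sample data are made up, and it reads from a string instead of generated files. It shows the failing case directly, with Windows-style data read while $/ is a bare \015.

```perl
use strict;
use warnings;
use 5.010;    # \R needs Perl 5.10 or later

# Hypothetical helper: read records from a string using the given input
# record separator and apply the trailing-only substitution to each one.
sub read_records {
    my ($data, $separator) = @_;
    local $/ = $separator;
    open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
    my @records;
    while (my $line = <$fh>) {
        $line =~ s/\R*$//;
        push @records, $line;
    }
    close $fh;
    return @records;
}

my $dos_data = "one\015\012two\015\012three\015\012";

my @split_on_cr = read_records($dos_data, "\015");
my @split_on_lf = read_records($dos_data, "\012");

# Show leftover linefeeds visibly instead of printing them raw.
for my $rec (@split_on_cr) {
    (my $shown = $rec) =~ s/\012/\\n/g;
    print "CR-split: [$shown]\n";
}
print "LF-split: [$_]\n" for @split_on_lf;
```

With $/ set to \012 the records come out clean, but with $/ set to \015 every record after the first starts with a leftover \012 that the trailing substitution never sees.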
Kent Fredric
If you strip *every* newline before working on the content, how do you know where the line breaks were (say, for instance, that a line break constitutes a new record)?
Anon
The task is to remove *all* linefeeds, regardless of the current OS.
Kent Fredric
No, the task is to remove trailing linefeeds from a list of strings.
Christoffer
Your whole proposal is flawed then, because if your readline delimiter is \015 and \015\012 is seen, the \012 will *NEVER* be removed: it is at the *START* of the next string, *NOT* the *END*.
Kent Fredric
Well, actually running the code, using s/\R*$// removes \015, \015\012 and \012 from the lines.
Christoffer
Of course it does, because your $/ is still \012, not \015.
Kent Fredric
Ah, I see now. +1 for the great example.
Christoffer