ansaurus

Question

Neatest way to remove linebreaks in Perl

Answer 1

+4 A:

$line =~ s/[\r\n]+//g;

dsm 2009-05-19 09:41:59

Answer 2

+1 A:

See http://stackoverflow.com/questions/797993/how-do-i-chomp-a-string-if-i-have-perl-4

Alnitak 2009-05-19 10:20:14

Answer 3

+6 A:

Reading perlport I'd suggest something like

$line =~ s/\015?\012?$//;

to be safe on whatever platform you're on and whatever linefeed style you may be processing, because what \r and \n stand for may differ between Perl flavours.

Olfan 2009-05-19 10:37:45

Potential bugs: 1) No /g, so it won't work on multi-line strings. 2) The $ anchor means it will only match delimiters that occur directly before the end of the string. 3) The \015 \012 order is fixed, so if the input has \012\015 it will only eat one of them.

Kent Fredric 2009-05-19 17:36:37

1)+2) As I don't know what's inside the lines, I had to assume there may be newline characters inside that shouldn't be removed (e.g. database records with linebreaking data columns). My intention was to match chomp()'s behaviour as closely as possible. 3) I've seen old Macs use \015 only and Windows still uses \015\012, but I have yet to see a real world system using \012\015, so I felt this order would be safe. ;)

Olfan 2009-05-20 08:39:09
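
A minimal sketch of what the substitution discussed above does to each line-ending convention, including the reversed \012\015 order raised in the comments. The helper names strip_eol and visible are hypothetical and only serve the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): applies the substitution
# from the answer to a copy of the string and returns the result.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\015?\012?$//;
    return $line;
}

# Hypothetical helper: makes CR and LF visible when printing.
sub visible {
    my ($s) = @_;
    $s =~ s/\015/<CR>/g;
    $s =~ s/\012/<LF>/g;
    return $s;
}

# Unix, DOS, old Mac, reversed order, and a string with an embedded LF.
for my $input ( "abc\012", "abc\015\012", "abc\015", "abc\012\015", "a\012b\012" ) {
    printf "%-16s => %s\n", visible($input), visible( strip_eol($input) );
}
```

The first three inputs come back as plain "abc". With the reversed \012\015 order only the trailing \015 is removed, and the embedded \012 in the last input is kept, which is the chomp-like behaviour the answer was aiming for.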

Have a look at my updated answer and what it emits. There are conditions that are *especially* prevalent in line-based reading that really aren't obvious till you try it. For example, with local $/ = "\015" you suddenly have a lot of \012 appearing in the output.

Kent Fredric 2009-05-20 17:18:00

Careful! Simply merging two lines will join the last "word" of line X with the "first" word on line X+1. Depending on context you might want to not remove, but replace with a SPACE (or other delimiter)

lexu 2009-05-21 05:34:19
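
A short sketch of the replace-instead-of-remove idea from the comment above, assuming Perl 5.10 or later for \R. The helper name join_lines is hypothetical, and a single space as the delimiter is just one possible choice.

```perl
use strict;
use warnings;
use 5.010;    # \R needs Perl 5.10 or later

# Hypothetical helper: joins the lines of a multi-line string with a
# single space instead of deleting the breaks outright.
sub join_lines {
    my ($text) = @_;
    $text =~ s/\R+\z//;     # drop trailing line breaks entirely
    $text =~ s/\R+/ /g;     # each remaining run of breaks becomes one space
    return $text;
}

print join_lines("first line\015\012second line\012third line\012"), "\n";
```

Note that this collapses blank lines as well; whether that is wanted depends on the data.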

Answer 4

+5 A:

After digging through the perlre docs a bit, I'll present my best suggestion so far, which seems to work pretty well. Perl 5.10 added the \R escape as a generalized linebreak:

$line =~ s/\R//g;

It's the same as:

(?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}])

I'll keep this question open a while yet, just to see if there's more nifty ways waiting to be suggested.

Christoffer 2009-05-19 11:14:17
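
As a quick check of the equivalence stated in the answer above, the following sketch strips a mixed sample with both \R and the expanded pattern and compares the results. It assumes Perl 5.10 or later; the helper names and the sample string are made up for illustration.

```perl
use strict;
use warnings;
use 5.010;    # \R needs Perl 5.10 or later

# The expansion as quoted in the answer.
my $expanded = qr/(?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}])/;

# Hypothetical helpers: strip every line break using each form.
sub strip_with_R        { my ($s) = @_; $s =~ s/\R//g;        return $s }
sub strip_with_expanded { my ($s) = @_; $s =~ s/$expanded//g; return $s }

# LF, CRLF, lone CR and U+2028 LINE SEPARATOR in one string.
my $sample = "unix\x0Ados\x0D\x0Amac\x0Dunicode\x{2028}end";

print strip_with_R($sample) eq strip_with_expanded($sample)
    ? "both forms agree\n"
    : "forms differ\n";
```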

I encourage you to accept your own answer if it works for you. \R may not work as expected on some exotic platforms (which is why I suggested the hardwired approach earlier), but if you're not into writing portable code and just want to get the job done, you're done here. You might consider putting Kent Fredric's test files through your code first because they really are a good test case.

Olfan 2009-05-22 10:01:36

Answer 5

+1 A:

extending on your answer

use File::Slurp ();
my $value = File::Slurp::slurp($filename);
$value =~ s/\R*//g;

File::Slurp abstracts away the File IO stuff and just returns a string for you.
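
If installing File::Slurp is not an option, roughly the same thing can be sketched with core Perl only. This is an assumption-laden alternative, not part of the original answer: the helper name slurp_without_linebreaks is hypothetical, and it relies on Perl 5.10 or later for \R.

```perl
use strict;
use warnings;
use 5.010;    # \R needs Perl 5.10 or later

# Hypothetical helper: reads a whole file with core Perl only and
# removes every line break from the contents.
sub slurp_without_linebreaks {
    my ($filename) = @_;
    open my $fh, '<:raw', $filename or die "Cannot read $filename: $!";
    my $value = do { local $/; <$fh> };    # undef $/ means "read everything"
    close $fh;
    $value =~ s/\R+//g;
    return $value;
}
```

Because the file is read in one piece, the result does not depend on how $/ is set on the current platform.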

NOTE

1) It is important to note the addition of /g. Without it, given a multi-line string, it will only replace the first offending character.
2) Also note the removal of $, which is redundant for this purpose, as we want to strip all line breaks, not just line breaks before whatever is meant by $ on this OS.
3) In a multi-line string, $ matches the end of the string, and that would be problematic.
4) Point 3 means that point 2 is made with the assumption that you'd also want to use /m; otherwise $ would be basically meaningless for anything practical in a string with more than one line, or, when doing single-line processing, you would depend on an OS which actually understands $ and manages to find the \R* that precede the $.

Examples

while( my $line = <$foo> ){
      $line =~ $regex;
}

Given the above loop, on an OS whose default $/ does not match your file's '\n' or '\r' delimiters, the whole file will be read as one contiguous string (unless the file happens to contain the OS's own delimiter, in which case it will be split there).

So in this case all of these regexes are useless:

s/\R*$// : Will only erase the last sequence of \R in the file
s/\R*// : Will only erase the first sequence of \R in the file
s/\012?\015?// : Will only erase the first \012\015, \012, or \015 sequence; \015\012 will result in either \012 or \015 being emitted.
s/\R*$// : If there happen to be no byte sequences of '\015$OSDELIMITER' in the file, then NO linebreaks will be removed except for the OS's own ones.
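
The effect of a mismatched $/ can be reproduced without touching the filesystem. The sketch below reads DOS-style data from an in-memory filehandle, once with $/ set to \012 and once with \015, and applies s/\R*$// to each record. The helper names read_and_strip and visible are hypothetical.

```perl
use strict;
use warnings;
use 5.010;    # \R and say need Perl 5.10 or later

# Hypothetical helper: reads $data record by record using the given
# input record separator and applies s/\R*$// to each record.
sub read_and_strip {
    my ( $data, $separator ) = @_;
    local $/ = $separator;
    open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
    my @records;
    while ( my $line = <$fh> ) {
        $line =~ s/\R*$//;
        push @records, $line;
    }
    close $fh;
    return @records;
}

# Hypothetical helper: makes CR and LF visible when printing.
sub visible {
    my ($s) = @_;
    $s =~ s/\015/<CR>/g;
    $s =~ s/\012/<LF>/g;
    return $s;
}

my $dos_data = "one\015\012two\015\012";
say join ' | ', map { visible($_) } read_and_strip( $dos_data, "\012" );
say join ' | ', map { visible($_) } read_and_strip( $dos_data, "\015" );
```

With $/ set to \012 the records come out clean. With $/ set to \015 each record ends at the \015, so the \012 lands at the start of the next record, where a pattern anchored with $ never sees it.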

It would appear nobody gets what I'm talking about, so here is example code that is tested to NOT remove line feeds. Run it and you'll see that it leaves the linefeeds in.

#!/usr/bin/perl 

use strict;
use warnings;

my $fn = 'TestFile.txt';

my $LF = "\012";
my $CR = "\015";

my $UnixNL = $LF;
my $DOSNL  = $CR . $LF;
my $MacNL  = $CR;

sub generate { 
    my $filename = shift;
    my $lineDelimiter = shift;

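    open my $fh, '>', $filename or die "Cannot write $filename: $!";
    binmode $fh;    # write the delimiter bytes verbatim on every platform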
    for ( 0 .. 10 )
    {
        print $fh "{0}";
        print $fh join "", map { chr( int( rand(26) + 60 ) ) } 0 .. 20;
        print $fh "{1}";
        print $fh $lineDelimiter->();
        print $fh "{2}";
    }
    close $fh;
}

sub parse { 
    my $filename = shift;
    my $osDelimiter = shift;
    my $message = shift;
    print "Parsing $message File $filename : \n";

    local $/ = $osDelimiter;

    open my $fh, '<', $filename or die "Cannot read $filename: $!";
    binmode $fh;    # read the raw bytes, without CRLF translation
    while ( my $line = <$fh> )
    {

        $line =~ s/\R*$//;
        print ">|" . $line . "|<";

    }
    print "Done.\n\n";
}


my @all = ( $DOSNL,$MacNL,$UnixNL);
generate 'Windows.txt' , sub { $DOSNL }; 
generate 'Mac.txt' , sub { $MacNL };
generate 'Unix.txt', sub { $UnixNL };
generate 'Mixed.txt', sub {
    return $all[ int( rand(@all) ) ];    # pick any of the three delimiters
};


for my $os ( ["$MacNL", "On Mac"], ["$DOSNL", "On Windows"], ["$UnixNL", "On Unix"]){
    for ( qw( Windows Mac Unix Mixed ) ){
        parse $_ . ".txt", @{ $os };
    }
}

For the CLEARLY Unprocessed output, see here: http://pastebin.com/f2c063d74

Note that certain combinations do of course work, but they are likely the ones you yourself naïvely tested.

Note that in this output, all results must be of the form >|$string|<>|$string|< with NO LINE FEEDS to be considered valid output.

Here $string is of the general form {0}$data{1}$delimiter{2}, where in all output sources there should be either:

Nothing between {1} and {2}
only |<>| between {1} and {2}

Kent Fredric 2009-05-19 17:35:03

If you strip *every* new-line before working on its content, how do you know where the line breaks were (say, for instance, that a line break constitutes a new record)?

Anon 2009-05-19 21:26:57

The task is to remove *all* linefeeds regardless of the current OS.

Kent Fredric 2009-05-20 00:21:41

No, the task is to remove trailing linefeeds from a list of strings.

Christoffer 2009-05-20 08:05:57

Your whole proposal is flawed then, because if your readline delimiter is \015 and \015\012 is seen, the \012 will *NEVER* be removed because it is at the *START* of the string, *NOT* the *END*.

Kent Fredric 2009-05-20 09:14:36
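
One possible way to cover the case raised in the comment above, where a \012 is left at the start of the next record, is to trim line breaks at both ends of each record. This is only a sketch: the helper name trim_linebreaks is hypothetical, it assumes Perl 5.10 or later for \R, and it assumes that leading line breaks are never meaningful data.

```perl
use strict;
use warnings;
use 5.010;    # \R needs Perl 5.10 or later

# Hypothetical helper: removes line breaks from both ends of a record
# while leaving any breaks inside the record alone.
sub trim_linebreaks {
    my ($line) = @_;
    $line =~ s/\A\R+//;    # leftovers from the previous record's delimiter
    $line =~ s/\R+\z//;    # the record's own trailing delimiter
    return $line;
}

print trim_linebreaks("\012two\015"), "\n";
```

Breaks in the middle of a record are preserved, which keeps this closer to chomp() than to a global removal.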

Well, actually running the code, using s/\R*$// removes \015, \015\012 and \012 from the lines.

Christoffer 2009-05-20 10:48:52

Of course it does, because your $/ is still \012 , not \015

Kent Fredric 2009-05-20 16:33:24

Ah, I see now. +1 for the great example.

Christoffer 2009-05-21 07:28:37
