I have a <textarea> for user input, and, as they are invited to do, users liberally add line breaks in the browser and I save this data directly to the database.

Upon displaying this data back on a webpage, I need to convert the line breaks to <br> tags in a reliable way that takes into account \n, \r\n, and any other common line break sequences used by client systems.

What is the best way to do this in Perl without doing regex substitutions every time? I am hoping, naturally, for yet another awesome CPAN module recommendation... :)

+10  A: 

There's nothing wrong with using regexes here:

s/\r?\n/<br>/g;
Ether
+1  A: 

Looking into this, I found the following modules:

http://search.cpan.org/~rubykat/txt2html-2.51/lib/HTML/TextToHTML.pm / http://search.cpan.org/~rubykat/txt2html-2.51/scripts/txt2html

http://search.cpan.org/~cwest/HTML-FromText-2.05/lib/HTML/FromText.pm

Both have reviews that you can check on CPAN.

However, I cannot say much about them, or about how reliably they handle the different newline sequences.

Prix
+3  A: 

Actually, if you're having to deal with Mac users, or if there still happens to be some weird computer that uses form-feeds, you would probably have to use something like this:

$input =~ s/(\r\n|\n|\r|\f)/<br>/g;
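
As a hedged alternative to spelling out each sequence by hand: Perl 5.10 and later provide the `\R` escape, which matches a generic line break. The sketch below assumes Perl 5.10 or newer, and the helper name `breaks_to_br` is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;    # \R is only available from Perl 5.10 onwards

# Hypothetical helper: replace every generic line break with <br>.
sub breaks_to_br {
    my ($input) = @_;
    # \R matches \r\n as a single unit, as well as a lone \n, \r or \f,
    # plus other vertical whitespace such as vertical tab and the
    # Unicode line and paragraph separators.
    $input =~ s/\R/<br>/g;
    return $input;
}

print breaks_to_br("one\r\ntwo\nthree\rfour\ffive"), "\n";
# prints: one<br>two<br>three<br>four<br>five
```

Because `\R` treats `\r\n` as one break, a Windows-style line ending produces a single <br> rather than two.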
Bushman
Macs haven't used CRs for many, many years. Nowadays it's Windows vs. the rest of the world.
Ether
@Ether: All internet text protocols use the same \r\n system as Windows, so it's actually more a case of "Windows and Internet protocols versus Unix".
Kinopiko
+3  A: 
#!/usr/bin/perl

use strict; use warnings;

use Socket qw( :crlf );

my $text = "a${CR}b${CRLF}c${LF}";

$text =~ s/$LF|$CR$LF?/<br>/g;

print $text;

Following up on @daxim's comment, here is the modified version:

#!/usr/bin/perl

use strict; use warnings;
use charnames ':full';

my $text = "a\N{CR}b\N{CR}\N{LF}c\N{LF}";

$text =~ s/\N{LF}|\N{CR}\N{LF}?/<br>/g;

print $text;

Following up on @Marcus's comment, here is a contrived example:

#!/usr/bin/perl

use strict; use warnings;
use charnames ':full';

my $t = (my $s = "a\012\015\012b\012\012\015\015c");
$s =~ s/\r?\n/<br>/g;

$t =~ s/\N{LF}|\N{CR}\N{LF}?/<br>/g;

print "This is \$s: $s\nThis is \$t:$t\n";

This is a mishmash of carriage returns and line feeds (which, at some point in the past, I did encounter).

Here is the output of the script on Windows using ActiveState Perl:

C:\Temp> t | xxd
0000000: 5468 6973 2069 7320 2473 3a20 613c 6272  This is $s: a<br
0000010: 3e3c 6272 3e62 3c62 723e 3c62 723e 0d0d  ><br>b<br><br>..
0000020: 630d 0a54 6869 7320 6973 2024 743a 613c  c..This is $t:a<
0000030: 6272 3e3c 6272 3e62 3c62 723e 3c62 723e  br><br>b<br><br>
0000040: 3c62 723e 3c62 723e 630d 0a              <br><br>c..

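or, as text (the two unconverted carriage returns in $s send the cursor back to the start of the line, so the trailing c overwrites the T):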

chis is $s: a<br><br>b<br><br>
This is $t:a<br><br>b<br><br><br><br>c

Admittedly, you are not likely to end up with this input. However, if you want to cater for any unexpected oddities that might indicate a line ending, you might want to use

$s =~ s/\N{LF}|\N{CR}\N{LF}?/<br>/g;
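
A minimal sketch of wrapping that substitution in a reusable routine, so the pattern lives in one place. The name `nl2br` is hypothetical, and the octal escapes `\012` and `\015` are assumed to be equivalent to `\N{LF}` and `\N{CR}` (true on ASCII platforms):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper name. The pattern is the one shown above, written
# with octal escapes so it does not depend on what "\n" means on the
# local platform.
sub nl2br {
    my ($text) = @_;
    # A lone LF, or a CR optionally followed by LF, each become one <br>.
    $text =~ s/\012|\015\012?/<br>/g;
    return $text;
}

# Same contrived input as in the example above.
my $mixed = "a\012\015\012b\012\012\015\015c";
print nl2br($mixed), "\n";
# prints: a<br><br>b<br><br><br><br>c
```

Working on a copy of the argument leaves the stored text untouched, so the conversion can be applied at display time only.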

Also, for reference, CGI.pm canonicalizes line-endings this way:

# Define the CRLF sequence.  I can't use a simple "\r\n" because the meaning
# of "\n" is different on different OS's (sometimes it generates CRLF, sometimes LF
# and sometimes CR).  The most popular VMS web server
# doesn't accept CRLF -- instead it wants a LR.  EBCDIC machines don't
# use ASCII, so \015\012 means something different.  I find this all 
# really annoying.
$EBCDIC = "\t" ne "\011";
if ($OS eq 'VMS') {
  $CRLF = "\n";
} elsif ($EBCDIC) {
  $CRLF= "\r\n";
} else {
  $CRLF = "\015\012";
}
Sinan Ünür
Import from [`charnames`](http://perldoc.perl.org/charnames.html) is a better choice for named constants than `Socket`.
daxim
I am intrigued by this solution and am wondering what advantages it has over the simple regex above? Do the named constants account for a wider array of line break character possibilities?
Marcus
@Marcus The pattern itself also handles the Mac OS 9 style line breaks consisting simply of a carriage return as well. As for using the character codes rather than `\r` and `\n`, see the update to my post.
Sinan Ünür
A: 
Dave Sherohman