What is a cross platform regex for removal of line breaks?

Check whether your regex engine supports \R as a shorthand escape. If it does, you do not need to be concerned with the various Unicode newline / linefeed combinations: implemented correctly, \R matches all the various ASCII and Unicode line endings transparently.
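One way to do that check is to probe the engine at runtime. This is only a sketch in Python: the helper name `supports_backslash_R` and its `compile_fn` parameter are made up for illustration, and the probe assumes that an engine without \R either rejects the pattern or fails to match CRLF as a single unit.

```python
import re

def supports_backslash_R(compile_fn=re.compile) -> bool:
    """Hypothetical probe: True if compile_fn accepts \\R and it matches CRLF whole."""
    try:
        pattern = compile_fn(r"\R")
    except Exception:
        # Engines without \R typically reject the unknown escape at compile time.
        return False
    # A real \R must consume "\r\n" as one line break, not as two separate ones.
    return pattern.fullmatch("\r\n") is not None

if __name__ == "__main__":
    print(supports_backslash_R())
```

With CPython's standard `re` module the probe is expected to report no support, since that module rejects unknown letter escapes such as \R. In that case the explicit alternations are the fallback.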

In Unicode you also need to detect NEL (Next Line, the OS/390 line ending, \x85), LS (Line Separator, \x{2028}), and PS (Paragraph Separator, \x{2029}) if you want to be completely cross-platform these days.

It is debatable whether LS, NEL, and PS should be treated as line breaks, line endings, or white space. The XML 1.0 standard, for example, does not recognize NEL as a line break character. ECMAScript treats LS and PS as line breaks but NEL as whitespace. PCRE (Perl-compatible regexes), when set to recognize any Unicode newline, will treat LF, VT, FF, CR, CRLF, NEL, LS, and PS as line breaks for the purpose of the ^ and $ regex metacharacters.

The Unicode Standard's implementation guidelines (section 5.8 and table 5.3) are probably the best bet for a definitive treatment of what a "newline" is.

If you are only concerned with ASCII and the DOS/Windows, Unix, and classic Mac variants, the regex equivalent to \R is (?>\r\n|[\r\n])
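As a rough illustration of the ASCII-only case, here is a minimal Python sketch. The helper name `replace_ascii_breaks` is hypothetical, and the atomic group is dropped on the assumption that plain alternation with \r\n listed first behaves the same when nothing follows the group (older versions of Python's `re` do not accept `(?>...)`).

```python
import re

# ASCII-only line breaks: try CRLF first so the pair counts as ONE break,
# then fall back to a lone CR (classic Mac) or LF (Unix).
ASCII_BREAK = re.compile(r"\r\n|[\r\n]")

def replace_ascii_breaks(text: str, replacement: str = "") -> str:
    """Hypothetical helper: replace each ASCII line break with `replacement`."""
    return ASCII_BREAK.sub(replacement, text)

if __name__ == "__main__":
    # Mixed DOS, Unix, and classic Mac endings in one string.
    print(replace_ascii_breaks("one\r\ntwo\nthree\rfour", " "))
```

Passing an empty replacement removes the breaks outright; passing a delimiter swaps each break for it.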

In Unicode, the equivalent to \R is (?>\r\n|\n|\x0b|\f|\r|\x85|\x{2028}|\x{2029}) The \x0b in there is a vertical tab and the \f is a form feed; once again, these may or may not fit your definition of what a line break is, but the pattern does match the recommendation of the Unicode implementation guidelines.
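The Unicode alternation can be sketched the same way in Python. The names `replace_breaks` and `collapse` are invented for this example; the `collapse` flag is an assumption about what "replace with another delimiter" might want, namely turning a whole run of breaks into one token.

```python
import re

# CRLF first, then each single-character break:
# LF, VT, FF, CR, NEL (U+0085), LS (U+2028), PS (U+2029).
_BREAK = r"(?:\r\n|[\n\x0b\f\r\x85\u2028\u2029])"
ONE_BREAK = re.compile(_BREAK)        # matches exactly one line break
BREAK_RUN = re.compile(_BREAK + "+")  # matches a run of consecutive breaks

def replace_breaks(text: str, replacement: str = "", collapse: bool = False) -> str:
    """Hypothetical helper: replace each break, or each run of breaks if collapse=True."""
    pattern = BREAK_RUN if collapse else ONE_BREAK
    return pattern.sub(replacement, text)

if __name__ == "__main__":
    sample = "first\r\n\nsecond\u2028third\x85fourth"
    print(replace_breaks(sample, ", ", collapse=True))
```

If VT or FF should not count as line breaks for your data, dropping \x0b and \f from the character class is the only change needed.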

This will replace any number of line breaks with one replacement token.

Andreas Jansson 2010-07-10 11:40:47

@Andreas He wants to remove line breaks.

Amarghosh 2010-07-10 11:43:54

Well, he wants to replace them with another delimiter...

Tim Pietzcker 2010-07-10 15:30:07

"utf8" in your answer should be "Unicode". UTF-8 is merely one of the Unicode character encodings.

Alan Moore 2010-07-10 23:14:18

You are right, but the docs I was referring to (PCRE manual) had the same issue! Edit made...

drewk 2010-07-11 15:41:02
