I have the following regex to try to reduce groups of newlines:

s/(\n|\r\n|\n\r)(\n|\r\n|\n\r)(\n|\r\n|\n\r)+/\n\n/gmi;

It started out as:

s/\n\n(\n)+/\n\n/gmi

I want to limit consecutive newlines to a maximum of two in a row (I'm just doing some cleanup on files that I'm importing into an internal wiki). The data comes from Windows files, so it has runs of CRLF line endings spread throughout it. Yet the regex doesn't seem to work.

What am I doing wrong? Here is a sample where it is coming out wrong:

Starts off as:

added missing options for Menu and toolbar positioning</p>

</div>

</body>

</html>

I am stripping HTML tags, so it ends up like this:

added missing options for Menu and toolbar positioning





Then I apply the regex and it comes out as:

added missing options for Menu and toolbar positioning



+6  A: 

Try also matching any other whitespace left over around those newlines:

s/(\r?\n[ \t]*){2,}/\n\n/g;
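
A minimal sketch of why the extra [ \t]* can matter, assuming the "blank" lines in the data actually contain stray spaces or tabs. The sample string and the variable names ($text, $plain, $fixed) are made up for illustration and are not from the asker's files:

```perl
use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Useqq = 1;    # show \r, \n and \t as escapes

# Hypothetical sample: five CRLFs between two lines, with stray
# spaces and tabs sitting on some of the "blank" lines.
my $text = "first line\r\n \r\n\t\r\n\r\n  \r\nsecond line\r\n";

# Newlines only: the whitespace-only lines interrupt the run, so
# only the one spot with two adjacent CRLFs is collapsed.
(my $plain = $text) =~ s/(\r?\n){2,}/\n\n/g;

# Newlines plus trailing blanks: the whole run counts as one match.
(my $fixed = $text) =~ s/(\r?\n[ \t]*){2,}/\n\n/g;

print Dumper($plain, $fixed);
```

With this input, $fixed ends up as "first line\n\nsecond line\r\n", while $plain still contains the whitespace-only lines. One caveat worth checking against real data: because [ \t]* also matches after the last newline of a run, any leading indentation on the line that follows the run is removed as well.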
Aaron Digulla
beat me to it :P
Robert Greiner
Why not {3,} instead?
Michael Myers
This reduces a set of 6 to a set of 4... I'm not sure why it's not reducing it to just 2?
Daemonic
Useless use of /m; /m only affects what ^ and $ match.
ysth
@Daemonic: works for me: perl -we'$x="\r\n" x 6; $x=~s/(\r?\n){2,}/\n\n/gmi; use Data::Dumper; $Data::Dumper::Useqq=1; print Dumper $x' prints $VAR1 = "\n\n";
ysth
/i is also unnecessary :) Fixed.
Aaron Digulla
@Daemonic: You have whitespace in the lines. Try my latest version.
Aaron Digulla
The whitespace addition seemed to work. I also noticed that one of the extra enters was coming from a statement a bit further along (I thought I had checked all the modifying code, but I missed a section).
Daemonic
A: 

Did you try matching your multiple groups like this?

s/(\r\n){2,}/\n\n/g;
Robert Greiner
+1  A: 

Since you seem to be having trouble applying the answers given, maybe you could show us some of your actual data, with

use Data::Dumper;
$Data::Dumper::Useqq = 1;
print Dumper $slurped_file;

You may also want to try one pass removing any \r characters, and then your original newline-only substitution.
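
The two-pass idea could look roughly like this. It is only a sketch: the contents of $slurped_file here are a made-up stand-in for the real file, and \n{3,} is used as an equivalent spelling of the original \n\n(\n)+ pattern:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the slurped file: six CRLFs between two lines.
my $slurped_file = "one" . ("\r\n" x 6) . "two\r\n";

# Pass 1: remove every carriage return so only \n line endings remain.
$slurped_file =~ s/\r//g;

# Pass 2: collapse any run of three or more newlines down to two.
$slurped_file =~ s/\n{3,}/\n\n/g;

# $slurped_file is now "one\n\ntwo\n"
```

Note that this does nothing about spaces or tabs sitting on otherwise empty lines; if the data has those, the runs of newlines will still be broken up.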

ysth
Starts off as:<pre>added missing options for Menu and toolbar positioning</p></div></body></html></pre>I am stripping HTML tags, so it ends up like this:<pre>added missing options for Menu and toolbar positioning</pre>Then I apply the regex and it comes out as:<pre>added missing options for Menu and toolbar positioning</pre>
Daemonic
Hmm... <pre> tags don't work in comments. I'll add that to the original question.
Daemonic
A: 

Basically, what you need is:

s/(\r\n|\n\r|\n){3,}/\n\n/g;

Please note the order of the alternatives: you have to put the longest ones first.
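
A small sketch of how this behaves, assuming input that mixes CRLF and bare LF line endings (the sample string and the names $text and $out are invented for the example):

```perl
use strict;
use warnings;

# Hypothetical sample: four CRLFs, then an existing pair of \n,
# then four bare \n.
my $text = "a\r\n\r\n\r\n\r\nb\n\nc\n\n\n\nd";

# Runs of three or more line endings become exactly two \n;
# the existing two-newline gap between b and c is left alone.
(my $out = $text) =~ s/(\r\n|\n\r|\n){3,}/\n\n/g;

# $out is now "a\n\nb\n\nc\n\nd"
```

Using {3,} rather than {2,} means gaps that are already two newlines long are never touched, although a two-line gap made of CRLFs would then keep its \r characters.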

depesz
A: 

Do you mean /s instead of /m?

Alex Feinman