I have a lot of HTML files that contain unwanted line feeds. These break things like inline JavaScript and formatting within the pages. I want to strip out every line feed that does not appear directly after an HTML tag, e.g. </div>. Does anyone know of a regex and/or program that can achieve this?

A: 

You may be able to use Notepad++'s search/replace function, with a regular expression to catch most of this.

Something like:

([^>])\n(.+)

Replaced with:

\1 \2
DisgruntledGoat
Depending on the line endings used in the HTML file, you may need to use ([^>])\r\n(.+) or ([^>])\r(.+) instead.
Brian
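For anyone who would rather script this than run it in an editor, here is a minimal Python sketch of the same idea. It is a variation on the pattern above, not the exact Notepad++ expression: it accepts all three line-ending styles in one go, and it uses a lookahead instead of (.+) so that consecutive broken lines are joined in a single pass. The function name join_broken_lines is made up for illustration.

```python
import re

# A character that is neither '>' nor part of a line break, followed by one
# line break (CRLF, lone CR, or LF), followed by more content on the next line.
# The lookahead (?=...) checks the next character without consuming it, so the
# following line can itself be the start of the next match.
BROKEN_LINE = re.compile(r"([^>\r\n])(?:\r\n|\r|\n)(?=[^\r\n])")


def join_broken_lines(html: str) -> str:
    """Replace line breaks that do not directly follow '>' with one space."""
    return BROKEN_LINE.sub(r"\1 ", html)
```

As with the editor version, a break that directly follows ">" is kept, and blank lines are left alone because the lookahead requires real content on the next line.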
A: 

You can use a negative lookbehind to match only the line feeds that are not preceded by the tag:

<?php

$buffer = file_get_contents('test.html');

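// Remove every line break (CRLF, CR or LF) not directly preceded by </div>.
// The second lookbehind keeps the "\n" of a "</div>\r\n" pair, which a plain
// [\r\n] character class would strip, leaving a stray "\r" behind.
$buffer = preg_replace('|(?<!</div>)(?<!</div>\r)(?:\r\n?|\n)|', '', $buffer);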

file_put_contents('test.new.html', $buffer);
?>

See: http://www.regular-expressions.info/lookaround.html
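The PHP snippet handles a single file, while the question mentions a lot of them. Below is a rough Python sketch of how the same lookbehind could be applied to a whole folder. The function names, the *.html glob, and the UTF-8 encoding are all assumptions made for illustration. The pattern carries a second lookbehind so that the "\n" of a "</div>\r\n" pair is not removed on its own.

```python
import re
from pathlib import Path

# Line break (LF or CRLF) not directly preceded by "</div>".
# (?<!</div>\r) covers Windows line endings, where the "\n" is preceded
# by "</div>\r" rather than by "</div>" itself.
NOT_AFTER_DIV = re.compile(r"(?<!</div>)(?<!</div>\r)\r?\n")


def strip_line_feeds(html: str) -> str:
    """Remove line breaks that do not directly follow </div>."""
    return NOT_AFTER_DIV.sub("", html)


def process_directory(source_dir: str, target_dir: str) -> int:
    """Write a cleaned copy of each *.html file in source_dir to target_dir.

    Returns the number of files written. Assumes the files are UTF-8.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(Path(source_dir).glob("*.html")):
        # Read raw bytes so Python does not silently translate CRLF to LF.
        text = path.read_bytes().decode("utf-8")
        cleaned = strip_line_feeds(text)
        (target / path.name).write_bytes(cleaned.encode("utf-8"))
        count += 1
    return count
```

Writing to a separate target folder keeps the originals intact, so the result can be checked in a browser before anything is overwritten.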

Lance Rushing
Or just use the RE: (?<!</div>)[\r\n] in your favorite editor.
Lance Rushing
You may actually want something more like (?<!</[^>]+>)(\r?\n){2,}, i.e. a run of two or more line breaks (where the CR is optional) that does not follow a closing tag.
Neel
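One caveat with that last pattern: a lookbehind containing [^>]+ has no fixed width, and engines that require fixed-width lookbehind (Python's re module is one) will reject it. A possible workaround, sketched here in Python, is to match the closing tag as an optional group and decide in a callback whether to keep the run. This is only one reading of the comment's intent, and the name drop_blank_runs is invented for the example.

```python
import re

# An optional closing tag (group 1) followed by two or more line breaks.
# Matching the tag as part of the pattern avoids the variable-width lookbehind.
RUN_OF_BREAKS = re.compile(r"(</[^>]+>)?(?:\r?\n){2,}")


def drop_blank_runs(html: str) -> str:
    """Remove runs of 2+ line breaks unless a closing tag comes right before."""

    def replace(match: re.Match) -> str:
        # Group 1 is set only when a closing tag directly precedes the run;
        # in that case the whole match is put back unchanged.
        return match.group(0) if match.group(1) else ""

    return RUN_OF_BREAKS.sub(replace, html)
```

Single line breaks are never touched here, since the pattern only matches runs of at least two.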