I have a lot of HTML files that contain unwanted line feeds. These break things like inline JavaScript and formatting within the pages. I want a way to strip out every line feed that does not appear directly after an HTML tag, e.g. </div>. Does anyone know of a regex and/or program that can achieve this?
A:
You may be able to catch most of these with Notepad++'s search/replace function in regular-expression mode.
Search for something like:
([^>])\n(.+)
Replaced with:
\1 \2
DisgruntledGoat
2009-09-16 11:54:32
Depending on the line endings used in the HTML file, you may need to use ([^>])\r\n(.+) or ([^>])\r(.+) instead.
Brian
2009-09-16 13:07:30
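For anyone who would rather script this than use an editor, here is a rough Python sketch of the same search/replace idea that copes with LF, CRLF and lone CR line endings in one pattern. The function name join_broken_lines is made up for illustration, and a lookahead stands in for the (.+) group so that several short lines in a row are joined in a single pass. Treat it as a sketch under those assumptions, not a drop-in fix.

```python
import re

# Assumption: a line break is "stray" when the character before it is not '>'
# and the following line is not empty. It is replaced by a single space,
# loosely mirroring the "\1 \2" replacement above.
_JOIN = re.compile(r"([^>\r\n])(?:\r\n|\r|\n)(?=[^\r\n])")


def join_broken_lines(text: str) -> str:
    """Hypothetical helper: join lines that were split in the middle of content."""
    return _JOIN.sub(r"\1 ", text)


if __name__ == "__main__":
    sample = "var x =\r\n1;\r\n</div>\r\n<p>next</p>"
    print(repr(join_broken_lines(sample)))
```

Line breaks that directly follow a '>' are left alone, so the layout after tags is preserved.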
A:
You can use a negative lookbehind to match only the line feeds that do not directly follow a given tag:
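<?php
$buffer = file_get_contents('test.html');
// Remove every line break (CRLF, CR or LF) not directly preceded by </div>.
// The second lookbehind stops the "\n" of a "\r\n" pair after </div> from
// being removed on its own, which would leave a dangling "\r".
// "~" is the delimiter because the pattern itself contains "|".
$buffer = preg_replace('~(?<!</div>)(?<!</div>\r)(?:\r\n|\r|\n)~', '', $buffer);
file_put_contents('test.new.html', $buffer);
?>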
Lance Rushing
2009-09-16 18:23:31
You may actually want something more like (?<!</[^>]+>)(\r?\n){2,}, i.e. a run of two or more line breaks (where the CR is optional) that does not follow a closing tag.
Neel
2009-09-29 11:29:53
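A rough Python sketch of the lookbehind approach, widened from </div> to any tag. One caveat: Python's re module rejects variable-width lookbehind, so a pattern such as (?<!</[^>]+>) will not compile there. The sketch therefore only checks for the '>' that ends a tag, which is an approximation (a bare '>' inside text or script would also count). The name strip_stray_newlines is hypothetical.

```python
import re

# A line break (CRLF, lone CR, or LF) that is NOT directly preceded by '>'.
# The second lookbehind protects the "\n" of a "\r\n" pair that follows '>',
# so the pair is kept whole instead of being cut down to a bare "\r".
_STRAY_NEWLINE = re.compile(r"(?<!>)(?<!>\r)(?:\r\n|\r|\n)")


def strip_stray_newlines(html: str) -> str:
    """Hypothetical helper: delete line breaks that do not follow a tag."""
    return _STRAY_NEWLINE.sub("", html)


if __name__ == "__main__":
    page = "<p>a\nb</p>\n<script>x =\r\n1;</script>\r\n"
    print(repr(strip_stray_newlines(page)))
```

Unlike the search/replace answer, this deletes the break outright instead of inserting a space, so words split across lines are glued together. Swap the empty replacement for " " if that is not what you want.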