I'm trying to write a regular expression that replaces line feeds in certain areas of a text file, but only in plain-text content (i.e. excluding text inside HTML attribute values, like href). I'm not having much luck past the first part.

Example input:

AUTHOR: Me
DATE: Now
CONTENT:
This is an example. This is another example. <a href="http://www.stackoverflow/example-
link-that-breaks">This is an example.</a> This is an example. This is yet another
example.
END CONTENT
COMMENTS: 0

Example output:

AUTHOR: Me
DATE: Now
CONTENT:
This is an example. This is another example. <a href="http://www.stackoverflow/example-link-that-breaks">This is an example.</a> This is an example. This is yet another example.
END CONTENT
COMMENTS: 0

So ideally, line breaks are replaced with a space if they occur in plain text, but are removed without adding a space if they occur inside HTML attribute values (mostly href, and I'm fine if I have to limit it to that).

+1  A: 

Ideally you would use a real HTML parser (or an XML parser if it is XHTML) and replace the attribute contents with that.

However, the following may do the trick if the engine supports positive lookbehind of arbitrary length:

(?<=\<[^<>]+=\s*("[^"]*|'[^']*))[\r\n]+

Usage: Replace all occurrences of this regex with an empty string.

Lucero
Thanks, I'll give this a shot. Did you have any particular engine in mind?
Rod Boev
The .NET engine works well for this; Java doesn't (at least not the last time I tried), and I'm not sure about PCRE and others. Just try it. If it doesn't work, you can still rewrite the expression so that it also matches the text before the line break, and then put that text back in the replacement: use something like `(\<[^<>]+=\s*(?:"[^"]*|'[^']*))[\r\n]+` as the pattern and `$1` (or whatever the engine uses to reference a capture group) as the replacement.
Lucero
The only flavors that support unbounded lookbehind are .NET and JGSoft (EditPad Pro, PowerGrep). But you can use a lookahead instead; see my answer.
Alan Moore
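
For engines without unbounded lookbehind, such as PCRE in PHP, the capture-group pattern from Lucero's comment could be applied roughly as in the sketch below. The sample string and the `~` delimiters are assumptions made for illustration. Because each match consumes the tag text in front of the line break, a single pass removes at most one line break per tag, so the sketch repeats the replacement until nothing changes.

```php
<?php
// Pattern from the comment above, wrapped in ~ delimiters.
// Group 1 captures the tag text in front of the line break so it can be put back.
$pattern = '~(<[^<>]+=\s*(?:"[^"]*|\'[^\']*))[\r\n]+~';

// Assumed sample: two line breaks inside the href value, one in plain text.
$s = "<a href=\"http://example.com/a-\nvery-\nlong-link\">text\nhere</a>";

// One pass fixes at most one line break per tag, so loop until no match is left.
do {
    $s = preg_replace($pattern, '$1', $s, -1, $count);
} while ($count > 0);

echo $s, PHP_EOL;
```

The line break between "text" and "here" is left alone, since it is not inside a quoted attribute value.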
+1  A: 

This will remove newlines in attribute values, assuming the values are enclosed in double-quotes:

$s = preg_replace(
       '/[\r\n]+(?=[^<>"]*+"(?:[^<>"]*+"[^"<>]*+")*+[^<>"]*+>)/',
       '', $s);

The lookahead asserts that, between the current position (where the newline was found) and the next >, there's an odd number of double-quotes. This doesn't allow for single-quoted values, or for angle brackets inside the values; both can be accommodated if need be, but this is ugly enough already. ;)

After that, you can replace any remaining newlines with spaces:

$s = preg_replace('/[\r\n]+/', ' ', $s);
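
To make the effect of the two steps concrete, here is a small self-contained run on an assumed sample string (not the asker's file) with one line break inside an href value and one in plain text:

```php
<?php
// Assumed sample: first line break is inside the href value, second is in plain text.
$s = "See <a href=\"http://example.com/some-\nlink\">this\nlink</a> now.";

// Step 1: remove line breaks followed by an odd number of double quotes before the next '>'.
$s = preg_replace(
    '/[\r\n]+(?=[^<>"]*+"(?:[^<>"]*+"[^"<>]*+")*+[^<>"]*+>)/',
    '', $s);

// Step 2: replace each remaining run of line breaks with a single space.
$s = preg_replace('/[\r\n]+/', ' ', $s);

echo $s, PHP_EOL;
// prints: See <a href="http://example.com/some-link">this link</a> now.
```

For the first line break the lookahead sees one double quote before the next >, so it is removed. For the second, the lookahead hits < before finding any quote and fails, so that line break survives step 1 and becomes a space in step 2.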

See it in action on ideone.com.

Alan Moore
This is working great so far with my test files. The only issue is that it's removing all newlines instead of only the ones between "CONTENT" and "END CONTENT". Is it best to handle that limit manually in PHP or to build it into the regex?
Rod Boev
I'd do that separately. This regex is complicated enough already.
Alan Moore
Agreed, this regex is great. Thanks very much Alan!
Rod Boev
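
Handling the CONTENT limit separately in PHP, as suggested in the comments, might look like the following minimal sketch. It assumes that "CONTENT:" and "END CONTENT" each sit on a line of their own, as in the question's example; the function name `join_content_lines` is hypothetical. The outer regex only locates each block, and the two replacements from the answer are applied to the block body inside the callback.

```php
<?php
// Hypothetical helper: joins lines only between "CONTENT:" and "END CONTENT".
function join_content_lines(string $text): string
{
    $result = preg_replace_callback(
        // Group 1: the "CONTENT:" line, group 2: the body, group 3: the "END CONTENT" marker.
        '/^(CONTENT:\R)(.*?)(\REND CONTENT)(?=\R|\z)/ms',
        function (array $m): string {
            // Remove line breaks inside double-quoted attribute values.
            $body = preg_replace(
                '/[\r\n]+(?=[^<>"]*+"(?:[^<>"]*+"[^"<>]*+")*+[^<>"]*+>)/',
                '', $m[2]);
            // Replace the remaining line breaks in the body with spaces.
            $body = preg_replace('/[\r\n]+/', ' ', $body);
            return $m[1] . $body . $m[3];
        },
        $text
    );
    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $result ?? $text;
}

$in = "DATE: Now\nCONTENT:\nOne <a href=\"http://example.com/a-\nb\">two</a>\nthree.\nEND CONTENT\nCOMMENTS: 0\n";
echo join_content_lines($in);
```

Lines outside the block, such as DATE and COMMENTS, keep their line breaks because the callback only ever sees the text between the two markers.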