ansaurus

Question

Regular expression to replace line feeds with a space only if the break is not in the contents of an HTML attribute

Answer 1

+1 A:

Ideally you would use a real HTML parser (or an XML parser, if the markup is XHTML) and rewrite the attribute contents through that.

However, the following may do the trick if the engine supports positive lookbehind of arbitrary length:

(?<=\<[^<>]+=\s*("[^"]*|'[^']*))[\r\n]+

Usage: Replace all occurrences of this regex with an empty string.

Lucero 2010-10-21 23:33:28

Thanks, I'll give this a shot. Did you have any particular engine in mind?

Rod Boev 2010-10-22 00:10:03

The .NET engine works well for this; Java doesn't (at least not the last time I tried), and I'm not sure about PCRE and others. Just try it. If it doesn't work, you may still be able to convert the expression into a single match that captures everything before the trailing CR/LF characters, and use that capture as the replacement. Use something like `(\<[^<>]+=\s*(?:"[^"]*|'[^']*))[\r\n]+` as the pattern and `$1` (or whatever the engine uses to reference a capture group) as the replacement.

Lucero 2010-10-22 00:14:44
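
For engines without unbounded lookbehind, the capture-group fallback described in the comment above could look roughly like this in PHP (PCRE). It is a minimal sketch, not a tested solution from the thread: the function name `strip_attr_newlines` is made up for illustration, and the loop is an addition. The assumption behind the loop is that a single pass removes at most one run of line breaks per tag, because the greedy match backtracks to the last run in the tag and the search then resumes past it.

```php
<?php
// Hypothetical helper: removes CR/LF runs inside quoted attribute values
// by matching the tag prefix plus the line break and keeping only the prefix.
function strip_attr_newlines(string $html): string
{
    // Pattern from the comment: group 1 is everything from "<" up to the
    // line break inside a double- or single-quoted attribute value.
    $pattern = '/(\<[^<>]+=\s*(?:"[^"]*|\'[^\']*))[\r\n]+/';

    do {
        $result = preg_replace($pattern, '$1', $html, -1, $count);
        if ($result === null) {
            throw new RuntimeException('preg_replace failed: ' . preg_last_error());
        }
        $html = $result;
        // Repeat until a pass makes no replacement; every successful pass
        // removes at least one character, so the loop terminates.
    } while ($count > 0);

    return $html;
}
```

Line breaks outside of tags are left alone by this step, so they can still be turned into spaces afterwards.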

The only flavors that support unbounded lookbehind are .NET and JGSoft (EditPad Pro, PowerGrep). But you can use a lookahead instead; see my answer.

Alan Moore 2010-10-22 01:30:21

Answer 2

+1 A:

This will remove newlines in attribute values, assuming the values are enclosed in double-quotes:

$s = preg_replace(
       '/[\r\n]+(?=[^<>"]*+"(?:[^<>"]*+"[^"<>]*+")*+[^<>"]*+>)/',
       '', $s);

The lookahead asserts that, between the current position (where the newline was found) and the next >, there's an odd number of double-quotes. This doesn't allow for single-quoted values, or for angle brackets inside the values; both can be accommodated if need be, but this is ugly enough already. ;)

After that, you can replace any remaining newlines with spaces:

$s = preg_replace('/[\r\n]+/', ' ', $s);

See it in action on ideone.com.
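
For readers who want to trace the lookahead, here is a sketch of the same two steps with the first pattern written in PCRE's extended (`x`) mode so each part can carry a comment. The constant name `ATTR_NEWLINE`, the function name `collapse_newlines`, and the sample input are illustrative assumptions; the pattern itself is meant to be equivalent to the one above, only with a different delimiter and added whitespace.

```php
<?php
// Step 1 pattern in extended mode; whitespace and # comments are ignored by PCRE.
const ATTR_NEWLINE = <<<'RE'
~
[\r\n]+                            # the line break(s) to remove
(?=                                # only if what follows, up to the next >, is:
    [^<>"]*+ "                     #   one double quote (closing the current value)
    (?: [^<>"]*+ " [^"<>]*+ " )*+  #   then any number of complete quoted values
    [^<>"]*+ >                     #   then the end of the tag
)
~x
RE;

function collapse_newlines(string $html): string
{
    // Step 1: drop line breaks that sit inside a double-quoted attribute value.
    $html = preg_replace(ATTR_NEWLINE, '', $html);
    if ($html === null) {
        throw new RuntimeException('preg_replace failed: ' . preg_last_error());
    }
    // Step 2: every remaining run of line breaks becomes a single space.
    return preg_replace('/[\r\n]+/', ' ', $html);
}

$sample = "<p>one\ntwo <a href=\"http://exa\nmple.com/\"\nclass=\"x\">link</a>\nthree</p>";
echo collapse_newlines($sample), "\n";
// prints: <p>one two <a href="http://example.com/" class="x">link</a> three</p>
```

Note that the break between the two attributes is followed by an even number of quotes before the `>`, so step 1 leaves it alone and step 2 turns it into a space.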

Alan Moore 2010-10-22 01:27:43

This is working great so far with my test files. The only issue is that it's removing all newlines instead of only the ones between "CONTENT" and "END CONTENT". Is it best to manually process that limit in PHP or build it into the regex?

Rod Boev 2010-10-22 18:11:38

I'd do that separately. This regex is complicated enough already.

Alan Moore 2010-10-22 19:14:17
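
One way to do that separately in PHP is to cut out each marked region first and run the two replacements only on it. The sketch below assumes the markers are the literal strings "CONTENT" and "END CONTENT" (the thread does not show the actual file format), and the function names `collapse_newlines` and `collapse_between_markers` are hypothetical.

```php
<?php
// The two replacements from the answer, wrapped in a helper.
function collapse_newlines(string $s): string
{
    $s = preg_replace(
        '/[\r\n]+(?=[^<>"]*+"(?:[^<>"]*+"[^"<>]*+")*+[^<>"]*+>)/',
        '', $s);
    return preg_replace('/[\r\n]+/', ' ', (string) $s);
}

// Assumed marker format: literal "CONTENT" ... "END CONTENT".
function collapse_between_markers(string $doc): string
{
    return preg_replace_callback(
        // The lookbehind keeps the opening marker from matching the
        // "CONTENT" that is part of a stray "END CONTENT".
        '/(?<!END )CONTENT\b(.*?)\bEND CONTENT\b/s',
        function (array $m): string {
            // Only the text between the markers is rewritten.
            return 'CONTENT' . collapse_newlines($m[1]) . 'END CONTENT';
        },
        $doc
    );
}
```

Text before the first marker and after the last one passes through unchanged, so the regex from the answer does not need to grow any further.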

Agreed, this regex is great. Thanks very much Alan!

Rod Boev 2010-10-22 21:17:22
