ansaurus

Question

Regular Expression to escape HTML ampersands while respecting CDATA

Answer 1

A:

I've done something similar here:
http://stackoverflow.com/questions/157646/best-way-to-encode-text-data-for-xml

Fortunately, in my case CDATA wasn't a problem.

What is a problem is that you have to be careful that the expression isn't greedy or you'll end up with something like this:

.... <words> are < safe! >

Joel Coehoorn 2009-01-20 20:09:01

Answer 2

+1 A:

I seriously doubt that what you are trying to accomplish is something you can do using a regular expression alone. Regexps are notoriously bad at correctly handling nesting.

You will probably be better off using an XML parser and not escaping CDATA content.
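If pulling in a full XML parser is more than you need, the "don't escape CDATA content" half of this advice can be approximated by cutting the text at CDATA boundaries and escaping only the pieces in between. This is a rough sketch, not a parser: `CDATA_SECTION`, `BARE_AMP` and `escape_outside_cdata` are names invented for the example, and it assumes every `<![CDATA[` has a matching `]]>`.

```ruby
# Captures whole CDATA sections; /m lets . match newlines.
CDATA_SECTION = /(<!\[CDATA\[.*?\]\]>)/m

# An ampersand that does not already start a named or decimal entity.
BARE_AMP = /&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#\d+);)/

def escape_outside_cdata(html)
  # Because the pattern has a capture group, split keeps the CDATA
  # sections in the result: even indexes are ordinary text, odd
  # indexes are CDATA sections.
  html.split(CDATA_SECTION).each_with_index.map do |chunk, index|
    index.odd? ? chunk : chunk.gsub(BARE_AMP, '&amp;')
  end.join
end

puts escape_outside_cdata('a & b <![CDATA[ c & d ]]> e & f &amp; g')
# prints: a &amp; b <![CDATA[ c & d ]]> e &amp; f &amp; g
```

The CDATA section comes back untouched, and the ampersand that was already written as an entity is not escaped a second time.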

pilif 2009-01-20 22:00:56

Answer 3

+3 A:

Don't use regular expressions for this. It is a terrible, terrible idea. Instead, simply HTML encode anything that you're outputting that might have a special character in it. Like this:

require 'cgi'
# CGI.escapeHTML does HTML entity encoding; CGI.escape is URL encoding.
print CGI.escapeHTML("All of this is HTML encoded!")
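For anyone comparing the options: Ruby's cgi library has both `CGI.escape` (URL encoding) and `CGI.escapeHTML` (HTML entity encoding), and they produce very different output. The small sketch below runs both on a string that already contains one escaped ampersand; the constant names are made up for the example.

```ruby
require 'cgi'

SAMPLE_TEXT = 'a & b &amp; c'

# URL (form) encoding: spaces become +, & becomes %26, ; becomes %3B.
URL_ENCODED = CGI.escape(SAMPLE_TEXT)

# HTML entity encoding: every & becomes &amp;, including the one that
# already begins an entity.
HTML_ENCODED = CGI.escapeHTML(SAMPLE_TEXT)

puts URL_ENCODED   # prints: a+%26+b+%26amp%3B+c
puts HTML_ENCODED  # prints: a &amp; b &amp;amp; c
```

Note that `CGI.escapeHTML` escapes the existing `&amp;` a second time and knows nothing about CDATA sections, which is the situation the question is trying to avoid.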

Evan Fosmark 2009-01-21 03:28:30

Ben Blank 2009-01-22 03:47:14

Nick 2009-01-22 18:55:29

Whoops. I should have said unescape instead.

Evan Fosmark 2009-01-23 04:59:23

Answer 4

+4 A:

You asked for it! :D

/&(?!(?:[a-zA-Z][a-zA-Z0-9]*|\#\d+);)
 (?!(?>(?:(?!<!\[CDATA\[|\]\]>).)*)\]\]>)/xm

The first line is your original regex. The lookahead matches if there's a CDATA closing sequence ( ]]> ) up ahead, unless there's an opening sequence ( <![CDATA[ ) between here and there. Assuming the document is minimally well formed, that should mean the current position is inside a CDATA section.

Oops, I had that backward: by using positive lookahead I was matching "naked" ampersands only within CDATA sections. I changed it to a negative lookahead, so now it works right.

By the way, this regex works in RegexBuddy in Ruby mode, but not at the Rubular site. I suspect Rubular uses an older version of Ruby with less-powerful regex support; can anyone confirm that? (As you may have guessed, I'm not a Ruby programmer.)

EDIT: The problem at Rubular was that I used 's' as a modifier (to mean dot-matches-everything), but Ruby uses 'm' for that.
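For readers who want to see the pieces separately, here is the same pattern laid out with one comment per part. It is only an annotated sketch: `BARE_AMPERSAND` and `escape_bare_ampersands` are names made up for the example. The `#` is written as `\#` because under Ruby's /x flag an unescaped `#` starts a comment.

```ruby
BARE_AMPERSAND = /
  &                                        # a literal ampersand
  (?!                                      # not followed by an entity body:
    (?: [a-zA-Z][a-zA-Z0-9]* | \#\d+ ) ;   #   a name or decimal number, then a semicolon
  )
  (?!                                      # and not inside a CDATA section:
    (?>                                    #   atomic group, so no backtracking into it
      (?: (?! <!\[CDATA\[ | \]\]> ) . )*   #   skip ahead to the next CDATA delimiter
    )
    \]\]>                                  #   fail if that delimiter is a closing one
  )
/xm

def escape_bare_ampersands(text)
  text.gsub(BARE_AMPERSAND, '&amp;')
end

puts escape_bare_ampersands('a & b &amp; c &#38; d <![CDATA[ e & f ]]> g & h')
# prints: a &amp; b &amp; c &#38; d <![CDATA[ e & f ]]> g &amp; h
```

One caveat: the entity check only recognizes named and decimal references, so a hexadecimal reference such as `&#x26;` would be treated as a bare ampersand and escaped again.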

Alan Moore 2009-01-21 22:49:29

Nice solution. This took me quite a while to grok. Here is a detailed explanation if anyone else is interested: http://bitkickers.blogspot.com/2009/01/regular-expression-negative-lookahead_31.html

Chase Seibert 2009-01-31 16:46:15

"I think this is self-explanatory. See you next time!" :D

Alan Moore 2009-02-01 00:07:37

Answer 5

A:

That worked! At Rubular I had to change the options from /xs to /m (and I removed the whitespace that separates the two parts of the regex as you showed it above).

You can see this regular expression in action along with a sample string at http://www.rubular.com/regexes/5855.

In case that Rubular permalink isn't really permanent, here is what I entered for the regular expression:

/&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#\d+);)(?!(?>(?:(?!<!\[CDATA\[|\]\]>).)*)\]\]>)/m

And here is the test string:

<p>a & b</p>
<p>c &amp; d</p>
<script type="text/javascript">
  // <![CDATA[
  if (a && b) doSomething('a & b &amp; c');
  // ]]>
</script>
<p>a & b</p>
<p>c &amp; d</p>

Only two ampersands match -- the a & b at the top and the a & b at the bottom. Ampersands already escaped as &amp; and all ampersands (escaped or not) between <![CDATA[ and ]]> are left alone.

So, my final code is now this:

html.gsub(/&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#\d+);)(?!(?>(?:(?!<!\[CDATA\[|\]\]>).)*)\]\]>)/m, '&amp;')
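To double-check the "only two ampersands match" claim outside Rubular, the pattern can be run against the test string in a short script. This is just a sketch of such a check; the constant names are invented for it.

```ruby
# The single-line form of the pattern, with Ruby's /m flag.
AMP_OUTSIDE_CDATA = /&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#\d+);)(?!(?>(?:(?!<!\[CDATA\[|\]\]>).)*)\]\]>)/m

# The test string from this answer (quoted heredoc, so no interpolation).
SAMPLE_HTML = <<~'HTML'
  <p>a & b</p>
  <p>c &amp; d</p>
  <script type="text/javascript">
    // <![CDATA[
    if (a && b) doSomething('a & b &amp; c');
    // ]]>
  </script>
  <p>a & b</p>
  <p>c &amp; d</p>
HTML

# scan returns one element per match because the pattern has no capture groups.
MATCH_COUNT = SAMPLE_HTML.scan(AMP_OUTSIDE_CDATA).length
ESCAPED_HTML = SAMPLE_HTML.gsub(AMP_OUTSIDE_CDATA, '&amp;')

puts MATCH_COUNT  # prints: 2
puts ESCAPED_HTML
```

Be aware that the second lookahead scans forward from each candidate ampersand to the next CDATA delimiter or the end of the input, so very large documents with many ampersands may be slow.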

Thank you very much Alan. This is exactly what I needed.

Nick 2009-01-22 18:43:45

Ach! I keep forgetting about Ruby using the 'm' modifier to mean what everybody else uses 's' for. I'll fix that.

Alan Moore 2009-01-23 01:59:34
