A: 

I've done something similar here:
http://stackoverflow.com/questions/157646/best-way-to-encode-text-data-for-xml

Fortunately, in my case CDATA wasn't a problem.

The real problem is that you have to make sure the expression isn't greedy, or you'll end up with something like this:

.... <words> are < safe! >

Joel Coehoorn
+1  A: 

I seriously doubt that what you are trying to accomplish is something you can do using a regular expression alone. Regexps are notoriously bad at correctly handling nesting.

You will probably be better off using an XML parser and not escaping CDATA content.
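
For what it's worth, a middle ground between one big regex and a full parser is to split the text on CDATA sections and escape only the pieces in between. A minimal sketch, assuming the CDATA sections are properly closed and never nested; `escape_outside_cdata` and the two constants are hypothetical names, and this is a plain string split, not a real XML parser:

```ruby
# Hypothetical helper: escape bare ampersands everywhere except inside CDATA.
CDATA_SECTION  = /(<!\[CDATA\[.*?\]\]>)/m            # capture group keeps the sections
BARE_AMPERSAND = /&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#\d+);)/

def escape_outside_cdata(text)
  # String#split includes captured groups in its result, so the array
  # alternates between ordinary text and whole CDATA sections.
  text.split(CDATA_SECTION).map { |chunk|
    if chunk.start_with?('<![CDATA[')
      chunk                                   # leave CDATA content untouched
    else
      chunk.gsub(BARE_AMPERSAND, '&amp;')     # escape ampersands that are not entities
    end
  }.join
end

puts escape_outside_cdata("a & b <![CDATA[ x && y ]]> c &amp; d & e")
```

Each chunk is handled on its own, so no lookahead across the whole document is needed.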

pilif
+3  A: 

Don't use regular expressions for this. It is a terrible, terrible idea. Instead, simply HTML encode anything that you're outputting that might have a character in it. Like this:

require 'cgi'
# CGI.escapeHTML does HTML entity encoding; CGI.escape is URL (percent) encoding.
print CGI.escapeHTML("All of this is HTML encoded!")
Evan Fosmark
Ben Blank
Nick
Whoops. I should have said unescape instead.
Evan Fosmark
+4  A: 

You asked for it! :D

/&(?!(?:[a-zA-Z][a-zA-Z0-9]*|\#\d+);)
 (?!(?>(?:(?!<!\[CDATA\[|\]\]>).)*)\]\]>)/xm

The first line is your original regex. The lookahead matches if there's a CDATA closing sequence ( ]]> ) up ahead, unless there's an opening sequence ( <![CDATA[ ) between here and there. Assuming the document is minimally well formed, that should mean the current position is inside a CDATA section.

Oops, I had that backward: by using positive lookahead I was matching "naked" ampersands only within CDATA sections. I changed it to a negative lookahead, so now it works right.
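
If it helps to see the pieces, the same pattern can be spelled out in extended mode with comments. This is only a sketch: the constant name is made up, and the `#` is written as `\#` because an unescaped `#` starts a comment under Ruby's x modifier.

```ruby
# Hypothetical constant name; the pattern is the one discussed above.
BARE_AMP_OUTSIDE_CDATA = /
  &                                       # a literal ampersand
  (?!                                     # not already the start of an entity:
    (?: [a-zA-Z][a-zA-Z0-9]* | \#\d+ )    #   a named or decimal reference
    ;
  )
  (?!                                     # and not inside a CDATA section:
    (?>                                   #   atomic group, so no backtracking
      (?: (?! <!\[CDATA\[ | \]\]> ) . )*  #   skip ahead to the next delimiter
    )
    \]\]>                                 #   which must not be a closing one
  )
/xm

puts "a & b &lt; c".gsub(BARE_AMP_OUTSIDE_CDATA, '&amp;')
puts "<![CDATA[ a & b ]]>".gsub(BARE_AMP_OUTSIDE_CDATA, '&amp;')
```

The first call should escape only the bare ampersand; the second should leave the CDATA content as it is.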

By the way, this regex works in RegexBuddy in Ruby mode, but not at the Rubular site. I suspect Rubular uses an older version of Ruby with less-powerful regex support; can anyone confirm that? (As you may have guessed, I'm not a Ruby programmer.)

EDIT: The problem at Rubular was that I used 's' as a modifier (to mean dot-matches-everything), but Ruby uses 'm' for that.
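
A quick way to see the modifier difference for yourself (a small sketch, assuming Ruby 2.4 or later for `match?`; the sample string is made up):

```ruby
text = "a\nb"

# Without a modifier, "." does not match a newline in Ruby.
puts text.match?(/a.b/)

# With Ruby's m modifier, "." matches a newline too, which is what
# most other flavors spell as the s modifier.
puts text.match?(/a.b/m)
```

That matters here because the lookahead has to scan across line breaks to find the next CDATA delimiter.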

Alan Moore
Nice solution. This took me quite a while to grok. Here is a detailed explanation if anyone else is interested: http://bitkickers.blogspot.com/2009/01/regular-expression-negative-lookahead_31.html
Chase Seibert
"I think this is self-explanatory. See you next time!" :D
Alan Moore
A: 

That worked! At Rubular I had to change the options from /xs to /m (and I removed the whitespace that separates the two parts of the regex as you showed it above).

You can see this regular expression in action along with a sample string at http://www.rubular.com/regexes/5855.

In case that Rubular permalink isn't really permanent, here is what I entered for the regular expression:

/&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#\d+);)(?!(?>(?:(?!<!\[CDATA\[|\]\]>).)*)\]\]>)/m

And here is the test string:

<p>a & b</p>
<p>c &amp; d</p>
<script type="text/javascript">
  // <![CDATA[
  if (a && b) doSomething('a & b &amp; c');
  // ]]>
</script>
<p>a & b</p>
<p>c &amp; d</p>

Only two ampersands match -- the a & b at the top and the a & b at the bottom. Ampersands already escaped as &amp; and all ampersands (escaped or not) between <![CDATA[ and ]]> are left alone.

So, my final code is now this:

html.gsub(/&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#\d+);)(?!(?>(?:(?!<!\[CDATA\[|\]\]>).)*)\]\]>)/m, '&amp;')
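
For anyone who wants to try it outside Rubular, here is the same regex and test string as a standalone script. It's just a sketch: `AMP`, `html`, and `escaped` are names picked for the example, and the squiggly heredoc assumes Ruby 2.3 or later.

```ruby
# Same pattern as in the gsub call above, stored in a constant for reuse.
AMP = /&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#\d+);)(?!(?>(?:(?!<!\[CDATA\[|\]\]>).)*)\]\]>)/m

# The test string from this answer (single-quoted heredoc, so nothing is interpolated).
html = <<~'HTML'
  <p>a & b</p>
  <p>c &amp; d</p>
  <script type="text/javascript">
    // <![CDATA[
    if (a && b) doSomething('a & b &amp; c');
    // ]]>
  </script>
  <p>a & b</p>
  <p>c &amp; d</p>
HTML

escaped = html.gsub(AMP, '&amp;')

puts html.scan(AMP).length   # number of bare ampersands found outside CDATA
puts escaped
```

The count printed should be 2, and the line inside the CDATA section should come out unchanged.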

Thank you very much Alan. This is exactly what I needed.

Nick
Ach! I keep forgetting about Ruby using the 'm' modifier to mean what everybody else uses 's' for. I'll fix that.
Alan Moore