I am looking for a quick way to strip HTML tags out of a ColdFusion string. We are pulling in an RSS feed that could potentially have anything in it. We then do some manipulation of the information and spit it back out to another place. Currently we are doing this with a regular expression. Is there a better way to do this?

<cfloop from="1" to="#ArrayLen(myFeed.item)#" index="i">
  <cfset myFeed.item[i].description.value = 
   REReplaceNoCase(myFeed.item[i].description.value, '<(.|\n)*?>', '', 'ALL')>
</cfloop>

We are using ColdFusion 8.

+6  A: 
Tomalak
I've found <[^>]*> as a possible modified regex. What advantage does the 2nd half of yours provide?
Jason
As I said: It catches unclosed tags at the end of the string. "(?:>|$)" reads as "either a closing tag bracket, or the end of the string". The rest of the regex is equivalent to the alternative you've found. "[^>]*" is generally more recommendable than "(.|\n)*?", because it's more explicit and it's faster.
Tomalak
I'd recommend doing a second pass to replace < with &lt; and > with &gt;, because you might have some leftovers.
Kip
Agreed. Well, thinking about it, with this regex there will be no opening pointy brackets whatsoever left. Closing ones, maybe, if the input is really evil. These can be replaced with "&gt;". A pass for "<" will be unnecessary.
Tomalak
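For reference, the pattern discussed in these comments can be sketched as below. This is a minimal sketch reconstructed from the description above ("[^>]*" followed by "(?:>|$)"), not necessarily the exact code from the answer; the sample string and variable names are made up for illustration.

```cfm
<!--- Hypothetical sample input: well-formed tags plus one tag cut off at the end --->
<cfset sample = 'Hello <b>world</b>, see <a href="http://example.com'>

<!--- "<" starts a tag, "[^>]*" consumes everything up to the next ">",
      and "(?:>|$)" accepts either the closing ">" or the end of the string --->
<cfset stripped = REReplace(sample, '<[^>]*(?:>|$)', '', 'ALL')>

<!--- Second pass suggested in the comments: encode any stray ">" left over --->
<cfset stripped = Replace(stripped, '>', '&gt;', 'ALL')>

<cfoutput>#stripped#</cfoutput>
```

With this input, both the complete tags and the truncated `<a href=...` at the end are removed, which the shorter `<[^>]*>` pattern would not do for the unclosed tag.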
+2  A: 

The best way is usually to coerce < to &lt; and > to &gt;. This way you aren't making assumptions about the nature of the message. Somebody may be talking about <tags> or trying to be <<expressive>> or describing a keystroke <Ctrl>+C or using maths 1 < x > 3. Even smilies could trigger the regex <8P X>

<cfloop from="1" to="#ArrayLen(myFeed.item)#" index="i">
    <cfset myFeed.item[i].description.value = ReplaceList(myFeed.item[i].description.value, '<,>', '&lt;,&gt;')>
</cfloop>
SpliFF
@SpliFF: Regarding "Even smilies could trigger the regex" - no, they couldn't. They'd be encoded as "&lt;8P".
Tomalak
Yeah, he says his data is coming from an RSS feed. If the feed is proper, the only bare < and > would be from tags; the others would be &lt; and &gt;. It is quite possible the source feed could be malformed, but that would be the provider's problem (which would probably mess up any feed parser).
Kip
Is <![CDATA[ <this> ]]> not valid in an RSS feed description?
Peter Boughton
@Peter: Most RSS readers would interpret that as HTML, and since there is no "this" tag, your results might vary from reader to reader. It would be <![CDATA[ <b>bold</b>, x &lt; 5 ]]> using CDATA, or without using CDATA you'd actually have to double-encode, i.e.: &lt;b&gt;bold&lt;/b&gt;, x &amp;lt; 5
Kip
Kip, my point is that with CDATA you're again potentially dealing with non-XML HTML, which brings non-tag < and > back into the picture.
Peter Boughton
+2  A: 

HTML is not a Regular language, so using Regular expressions on (uncontrolled) HTML is something that should be done with great care (if at all).

Consider, for example, the following valid segment of HTML:

<img src="boat.jpg" alt="a boat" title="My boat is > everything! I <3 my boat!">

You'll note how the syntax highlighter is choking on that - as will the existing regex that has been offered.

Unless you can be certain that the string you are processing will not contain HTML code similar to the above, you should avoid the assumptions and compromises that a pure regex approach would force you to make.

(Note: The same problem applies to the suggested char-by-char method too.)
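To make the failure concrete, here is a small sketch that runs the `<[^>]*>` pattern from the comments over the segment above. The variable names are only illustrative.

```cfm
<!--- The valid HTML segment from above, held in a single-quoted CFML string --->
<cfset html = '<img src="boat.jpg" alt="a boat" title="My boat is > everything! I <3 my boat!">'>

<!--- The regex treats the ">" inside the title attribute as the end of the tag,
      then treats "<3 my boat!"" followed by ">" as a second tag --->
<cfset stripped = REReplace(html, '<[^>]*>', '', 'ALL')>

<!--- Part of the title attribute survives as if it were body text --->
<cfoutput>[#stripped#]</cfoutput>
```

The leftover is " everything! I ", even though the input contained a single tag and no text content at all.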


To solve your problem, you should use a DOM parser to parse your string into an HTML object, looping through each element and converting it to text.

If you have valid XHTML then you can use CF's XmlParse() to produce the object, which you can then loop through. If it might be non-XML HTML then there's no built-in option with CF8, so you'll have to investigate options in Java/etc.
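A minimal sketch of the XmlParse() route, assuming the description really is well-formed XML. The `root` wrapper element, the sample fragment and the variable names are made up for illustration. It also relies on XmlSearch() returning a plain string for an XPath `string()` expression, which as far as I know is supported from CF8 onwards; if not, you would have to walk XmlChildren yourself.

```cfm
<!--- Hypothetical sample fragment; it must be well-formed XML for XmlParse() to accept it --->
<cfset fragment = 'Hello <b>world</b>, 1 &lt; 2'>

<!--- Wrap the fragment in a single (made-up) root element so it parses as a document --->
<cfset doc = XmlParse('<root>' & fragment & '</root>')>

<!--- XPath string() concatenates all descendant text nodes in document order --->
<cfset plain = XmlSearch(doc, 'string(/root)')>

<!--- Re-encode on output, since entities such as &lt; have been decoded by the parser --->
<cfoutput>#HTMLEditFormat(plain)#</cfoutput>
```

Because a real parser decides what is a tag and what is text, nothing here depends on guessing where a tag ends. The price is that XmlParse() throws an error on anything that is not well-formed, so wrap it in cftry if the feed cannot be trusted.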

Peter Boughton
A: 

cflib is your friend: stripHTML

rip747