Hi, I have HTML in a CDATA element (the HTML is too crappy to be parsed) and I would like to remove the <a href> tags but keep the text inside them.

I've been looking into regex but still haven't found a good way to do that.

All advice is welcome!

+1  A: 

You could remove anything from a string that looks like an HTML link via regex. Results depend heavily on your input, but replacing </?a\b[^>]*> with the empty string could get you pretty far.
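As a minimal sketch of that replacement in Python (the function name strip_links and the sample string are made up for illustration, and the case-insensitive flag is an assumption to also catch <A HREF=...>):

```python
import re

# Matches opening and closing <a> tags, with or without attributes.
# The \b keeps the pattern from matching other tags such as <abbr>.
# re.IGNORECASE is an assumption, added to also catch <A HREF=...>.
A_TAG = re.compile(r"</?a\b[^>]*>", re.IGNORECASE)


def strip_links(html: str) -> str:
    """Remove <a ...> and </a> tags but keep the text between them."""
    return A_TAG.sub("", html)


sample = 'See <a href="http://example.com">this page</a> and <abbr>etc.</abbr>'
print(strip_links(sample))  # See this page and <abbr>etc.</abbr>
```

This will still trip over input such as a ">" inside an attribute value, which is the kind of case where results depend on your data.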

In any case, handling HTML with regular expressions is crappy and ad-hoc. If your input data set is limited and well known, and all you need is some throw-away, one-time conversion code, then crappy and ad-hoc may be enough and you could get away with it.

If you are developing code that is intended to be long-lived, you should definitely look into one of the available HTML parsers (BeautifulSoup for Python or the HTML Agility Pack for .NET come to mind) and not only handle your HTML in a structured way, but also fix it while you are at it.
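To give a rough idea of the structured approach, here is a sketch that uses Python's standard-library html.parser as a stand-in for the parsers named above, so it runs without extra packages. The names LinkStripper and strip_links_parsed are hypothetical. It re-emits everything it sees except <a> tags; a dedicated library would also repair broken markup, which this sketch does not attempt.

```python
from html.parser import HTMLParser


class LinkStripper(HTMLParser):
    """Re-emits the parsed document, dropping only <a> start and end tags."""

    def __init__(self):
        # convert_charrefs=False so entities are passed through unchanged.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            # get_starttag_text() returns the tag as written in the source.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>.
        if tag != "a":
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "a":
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")


def strip_links_parsed(html: str) -> str:
    parser = LinkStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(strip_links_parsed('<p>See <a href="x">this &amp; that</a></p>'))
```

Unlike the regex, the parser decides where a tag starts and ends, so attribute values and comments are not matched by accident.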

Tomalak
Thanks for your answer. In fact, this is just a one-shot export, so it's not a problem. I'm now looking for an XSLT 2 processor on Ubuntu that supports the replace function.
pvledoux
http://saxon.sourceforge.net/
Tomalak