ansaurus

Question

regex to escape non-html tags' angle brackets

Answer 1

+3 A:

I would suggest using HtmlCleaner.

If you look at its home page, the example there shows exactly how text is escaped.

<td><a href=index.html>1 -> Home Page</a>

is converted to

<td>
   <a href="index.html">1 -&gt; Home Page</a>
</td>

It will normalize your HTML to conform to standard XHTML. I have used it in the past and (IMHO) it's pretty solid and more reliable than jTidy & Co. (and of course it's better than using regex or replace strategies...)

al nik 2010-03-22 15:40:43

Answer 2

+1 A:

Please see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 and don't use regex to parse HTML. Use an SGML parser instead; a regex would fail too often, because HTML isn't a regular language.

neo 2010-03-22 15:43:05

Answer 3

A:

If it were not for CSS, JavaScript, and CDATA sections, it would be possible.

If you are only dealing with a subset of HTML, you could make the assumption that angle brackets not surrounded by valid element identifier characters can be encoded.

Something like "<(?=[^A-Za-z_:0-9/])" -> "&lt;" and "(?<=[^A-Za-z_:0-9/])>" -> "&gt;"
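To make the two substitutions concrete, here is a minimal sketch in Python using the standard `re` module. It assumes the intended replacements are the entities `&lt;` and `&gt;`; the function name `escape_stray_brackets` is made up for illustration. It inherits every limitation of the heuristic, so treat it as a demonstration rather than a robust solution.

```python
import re

# Character class from the answer: characters allowed directly next to
# an angle bracket in a real tag (name characters, ':' and '/').
_TAG_CHARS = r"A-Za-z_:0-9/"

# '<' followed by a character that cannot start a tag name.
_STRAY_LT = re.compile(r"<(?=[^" + _TAG_CHARS + r"])")
# '>' preceded by a character that cannot end a tag name.
_STRAY_GT = re.compile(r"(?<=[^" + _TAG_CHARS + r"])>")


def escape_stray_brackets(text):
    """Escape angle brackets that do not look like part of a tag."""
    text = _STRAY_LT.sub("&lt;", text)
    return _STRAY_GT.sub("&gt;", text)


if __name__ == "__main__":
    print(escape_stray_brackets("<td><a href=index.html>1 -> Home Page</a>"))
    print(escape_stray_brackets("a < b and c > d"))
```

Note that the heuristic misfires on quoted attributes: in `<a href="index.html">` the closing `>` follows a `"`, which is not in the character class, so it gets escaped as well. A bracket at the very start or end of the input is also left alone, because the lookaround needs a neighbouring character.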

But unless you are generating the HTML yourself and KNOW that it has no embedded CSS, JavaScript, CDATA, or object sections...

As fraido said, don't use regular expressions for non-regular languages.

Computer Linguist 2010-03-22 16:04:56
