ansaurus

Question

Is it possible to fix HTML that has unescaped < and > characters?

Answer 1

+1 A:

I'd suggest this:

1) Identify and map the locations of all known tags, like <div> and </a>.

2) Replace < and > with &lt; and &gt; everywhere outside the map you built in step 1.
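A minimal Python sketch of this idea, assuming a small hand-picked whitelist of tags (KNOWN_TAGS, escape_text and escape_outside_tags are hypothetical names, not from the answer). The regex match spans play the role of the "map" of tag locations, and everything between them is escaped:

```python
import re

# Hypothetical whitelist; extend it to cover the tags your documents use.
KNOWN_TAGS = ("a", "b", "i", "p", "div", "span", "table", "tr", "td")

# Matches an opening or closing known tag, with optional attributes.
TAG_RE = re.compile(
    r"</?(?:%s)(?:\s[^<>]*)?>" % "|".join(KNOWN_TAGS),
    re.IGNORECASE,
)


def escape_text(text: str) -> str:
    """Escape angle brackets in a run of plain text."""
    return text.replace("<", "&lt;").replace(">", "&gt;")


def escape_outside_tags(source: str) -> str:
    """Escape < and > everywhere except inside recognized tags."""
    out = []
    pos = 0
    # Step 1: the match spans are the map of known tag locations.
    for match in TAG_RE.finditer(source):
        # Step 2: escape the text between the previous tag and this one.
        out.append(escape_text(source[pos:match.start()]))
        out.append(match.group(0))  # keep the tag itself untouched
        pos = match.end()
    out.append(escape_text(source[pos:]))
    return "".join(out)


print(escape_outside_tags("<div>1 < 2 and 3 > 2</div>"))
```

This sketch will likely mishandle attribute values that themselves contain < or >, and any tag missing from the whitelist gets escaped as if it were text.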

Pavel Radzivilovsky 2009-12-20 20:22:45

Answer 2

A:

Using a "relaxed" HTML parser like the HTML Agility Pack for .NET would be a nice fit. You grab the tree as interpreted by the library and then, in each node value, replace < and > with their entity counterparts, &lt; and &gt;.

See here for an example: http://stackoverflow.com/questions/118654/iron-python-beautiful-soup-win32-app/170856#170856

Vinko Vrsalovic 2009-12-20 20:26:05

I don't think this will work: replacing < and > in each node would also replace the child nodes, and in the end I would have a single body containing one big string of escaped children.

Karim 2009-12-20 21:14:52

Nope, that won't happen, as the tree is built from the recognized tags, and the actual tags are not modified along with the node values. But feel free to use a more tedious and error-prone approach :)

Vinko Vrsalovic 2009-12-21 06:40:56

Answer 3

A:

A slow way to do it would be to treat each HTML file as an XML file. Then walk through each node of that XML file and call Server.HtmlEncode on the contents of the node. Since HTML is just a defined set of XML, this should work.

Avitus 2009-12-20 20:27:49

This won't be possible, since this HTML won't be considered valid XML. Even tools like the HTML Agility Pack won't help here, since they will treat each unescaped < as the start of a tag.
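The XML half of this objection is easy to check. As a small sketch, here Python's standard-library xml.etree.ElementTree stands in for whichever strict XML parser you would use (is_well_formed_xml is a hypothetical helper name):

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# The escaped form parses; the unescaped form is rejected outright,
# so there is no node tree to walk and encode.
print(is_well_formed_xml("<p>1 &lt; 2</p>"))
print(is_well_formed_xml("<p>1 < 2</p>"))
```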

Karim 2009-12-20 21:13:27

Answer 4

+1 A:

1) For all known HTML tags, replace < and > with some other characters, like {{{ and }}}. You can use a regex more or less like this:

Regex.Replace(source, "<(/?(b|a|i|table|td|all|other|known|html|tags)( [^>]*)?)>", "{{{$1}}}");

2) Replace < with &lt; and > with &gt;

3) Replace {{{ with < and }}} with >
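The three steps above can be sketched end to end. This is an illustrative Python version rather than the .NET call in the answer; KNOWN_TAGS and fix_with_placeholders are hypothetical names, and it assumes the input never contains a literal {{{ or }}}:

```python
import re

# Hypothetical whitelist; list every tag you want to preserve.
KNOWN_TAGS = ("a", "b", "i", "table", "td")

# Group 1 captures everything between < and >, including a leading slash
# for closing tags and any attributes.
TAG_RE = re.compile(r"<(/?(?:%s)(?: [^<>]*)?)>" % "|".join(KNOWN_TAGS))


def fix_with_placeholders(source: str) -> str:
    # Step 1: protect known tags by swapping <...> for {{{...}}}.
    protected = TAG_RE.sub(r"{{{\1}}}", source)
    # Step 2: escape every remaining angle bracket.
    escaped = protected.replace("<", "&lt;").replace(">", "&gt;")
    # Step 3: turn the placeholders back into angle brackets.
    return escaped.replace("{{{", "<").replace("}}}", ">")


print(fix_with_placeholders('<b>x < y</b> and <td class="c">1 > 0</td>'))
```

If the documents might already contain {{{ or }}}, a different placeholder would be needed, for example a character sequence known not to occur in the input.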

yu_sha 2009-12-20 20:29:20
