For example, if I have this HTML:

<div>this is a test < text</div>

The < after "test" is an error, and the correct HTML should be

<div>this is a test &lt; text</div>

But I have a lot of HTML files that, by mistake, were not encoded, and I need to fix this error so I can parse them later. The original source of the data is not available, so the only option is to fix the HTML I have.

The same applies to the > character and to text that has both < and > characters, like "<2000> - <2004>". I would like to hear ideas for algorithms or libraries that can help me. Thanks.

Note: the HTML above is only a sample; the real work has to be done on large HTML files.

+1  A: 

I'd suggest this:

1) Identify and map the locations of all known tags, like <div> and </a>.

2) Replace < and > with &lt; and &gt; everywhere outside the map you built in step 1.
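
A minimal sketch of that idea, in Python since the question does not name a language. The tag list KNOWN_TAGS and the function name escape_outside_tags are made up for illustration, and the regex is only an approximation of "a known tag": text that happens to look like one (for example "a <b and c>") would still be treated as markup.

```python
import re

# Illustrative whitelist only; a real run would list every tag the files use.
KNOWN_TAGS = ("div", "span", "p", "a", "b", "i", "table", "tr", "td")

# Matches an opening or closing tag whose name is in the whitelist.
TAG_RE = re.compile(
    r"</?(?:%s)\b[^<>]*>" % "|".join(KNOWN_TAGS),
    re.IGNORECASE,
)


def escape_angles(text):
    """Escape only < and >, leaving existing entities such as &amp; alone."""
    return text.replace("<", "&lt;").replace(">", "&gt;")


def escape_outside_tags(source):
    """Copy known tags verbatim and escape < and > in everything between them."""
    out = []
    pos = 0
    for match in TAG_RE.finditer(source):       # the "map" of tag locations
        out.append(escape_angles(source[pos:match.start()]))
        out.append(match.group(0))
        pos = match.end()
    out.append(escape_angles(source[pos:]))     # trailing text after the last tag
    return "".join(out)


print(escape_outside_tags("<div>this is a test < text</div>"))
```

Because the pass is a single left-to-right scan, it should cope with large files, though comments, scripts and attribute values containing > are not handled specially here.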

Pavel Radzivilovsky
A: 

Using a "relaxed" HTML parser like the HTML Agility Pack for .NET would be a nice fit. You grab the tree as interpreted by the library and then, in each node's text value, replace < and > with their entity counterparts, &lt; and &gt;.

See here for an example: http://stackoverflow.com/questions/118654/iron-python-beautiful-soup-win32-app/170856#170856
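
To illustrate the general approach without the HTML Agility Pack, here is a rough sketch that uses Python's standard html.parser module as a stand-in lenient parser. It is not the Agility Pack API, and other parsers may tokenize stray < characters differently. The class name EscapingRebuilder and the function fix_with_parser are hypothetical. The sketch re-emits what the parser recognises as markup and escapes what it reports as text.

```python
from html.parser import HTMLParser


def escape_angles(text):
    """Escape only < and > in text the parser reported as character data."""
    return text.replace("<", "&lt;").replace(">", "&gt;")


class EscapingRebuilder(HTMLParser):
    """Rebuild the document: markup is copied, text data is escaped."""

    def __init__(self):
        # convert_charrefs=False keeps existing entities as separate events.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())   # original tag text

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())   # e.g. <br/>

    def handle_endtag(self, tag):
        self.parts.append("</%s>" % tag)              # name comes back lowercased

    def handle_data(self, data):
        self.parts.append(escape_angles(data))

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)

    def handle_comment(self, data):
        self.parts.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.parts.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.parts.append("<?%s>" % data)


def fix_with_parser(source):
    parser = EscapingRebuilder()
    parser.feed(source)
    parser.close()                                    # flush buffered text
    return "".join(parser.parts)


print(fix_with_parser("<div>this is a test < text</div>"))
```

This sketch does not special-case script or style content, so files containing inline JavaScript or CSS with < or > would need extra handling.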

Vinko Vrsalovic
I don't think this will be possible, since replacing the < and > in each node will also replace the child nodes, and in the end I will have a single body containing one big string of escaped children.
Karim
Nope, that won't happen as the tree is built based on recognized tags, and the actual tags are not modified with the node values. But feel free to use a more tedious and error prone approach :)
Vinko Vrsalovic
A: 

A slow way to do it would be to treat each HTML file as an XML file. Then walk through each of the nodes of that XML file and call Server.HtmlEncode on the contents of the node. Since HTML is just a defined set of XML tags, this should work.

Avitus
This won't be possible, since this HTML won't be considered valid XML. Even with tools like the HTML Agility Pack it is not valid, since they will treat the unescaped < as the start of a tag.
Karim
+1  A: 

1) For all known html tags, replace <> with some other characters like {{{ and }}}. You can use regex more or less like this:

Regex.Replace(source, @"<(/?(?:b|a|i|table|td|all|other|known|html|tags)\b[^>]*)>", "{{{$1}}}", RegexOptions.IgnoreCase);

2) Replace < with &lt; and > with &gt;

3) Replace {{{ with < and }}} with >
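
The three steps above can be sketched end to end as follows, in Python rather than .NET. The tag list, the placeholder choice and the function name fix_with_placeholders are assumptions for illustration. The guard at the top is an addition of this sketch: if the placeholder strings already occur in the input, step 3 would turn them into angle brackets.

```python
import re

# Illustrative whitelist only; extend it with every tag the files use.
KNOWN_TAGS = ("div", "p", "a", "b", "i", "table", "tr", "td")

# Captures the inside of a known opening or closing tag, slash included.
TAG_RE = re.compile(
    r"<(/?(?:%s)\b[^<>]*)>" % "|".join(KNOWN_TAGS),
    re.IGNORECASE,
)

OPEN, CLOSE = "{{{", "}}}"


def fix_with_placeholders(source):
    if OPEN in source or CLOSE in source:
        raise ValueError("placeholder already occurs in the input")
    # Step 1: protect known tags by swapping their angle brackets.
    protected = TAG_RE.sub(lambda m: OPEN + m.group(1) + CLOSE, source)
    # Step 2: every remaining < or > is stray text, so escape it.
    escaped = protected.replace("<", "&lt;").replace(">", "&gt;")
    # Step 3: restore the protected tags.
    return escaped.replace(OPEN, "<").replace(CLOSE, ">")


print(fix_with_placeholders("<p><2000> - <2004></p>"))
```

Choosing a placeholder that cannot appear in the data (or checking for it, as here) is the main design decision; everything else is three plain string passes.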

yu_sha