tags:
views: 432
answers: 5

I'm writing some little applications that parse the source of a few web pages, extract some data, and save it into another format. Specifically, some of my banks don't provide downloads of transactions/statements but they do provide access to those statements on their websites.

I've done one fine, but another (HSBC UK) is proving a pain in the arse, since its source is not valid XHTML. For example, there is whitespace before the <?xml?> tag, and there are places where == is used instead of = between an attribute name and its value (e.g. <li class=="lastItem">).

Of course, when I pass this data into my XmlDocument, it throws a wobbly (more accurately an exception).

My question is: is it possible to relax the requirements for XML parsing in C#? I know it's far better to fix these problems at source - that's absolutely my attitude too - but there's roughly zero chance HSBC would change their website which already works in most browsers just for little old me.

+3  A: 

I don't believe you can relax the parsing, but you could run it through something like HTML Tidy first to let that deal with the mess.
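A rough sketch of that approach, piping the scraped markup through the tidy command-line tool. This assumes tidy is installed and on the PATH; the flags shown are standard Tidy options, but check your version:

```csharp
using System.Diagnostics;

class TidyRunner
{
    // Pipe the scraped markup through HTML Tidy and get XHTML back.
    // Assumes the 'tidy' executable is on the PATH.
    static string TidyToXhtml(string html)
    {
        var psi = new ProcessStartInfo("tidy", "-asxhtml -utf8 -quiet")
        {
            RedirectStandardInput = true,
            RedirectStandardOutput = true,
            UseShellExecute = false,
        };
        using (var p = Process.Start(psi))
        {
            p.StandardInput.Write(html);
            p.StandardInput.Close();
            string xhtml = p.StandardOutput.ReadToEnd();
            p.WaitForExit();
            return xhtml;
        }
    }
}
```

Tidy exits with a non-zero code when the input has errors it can't fix, so in practice you'd also want to check `p.ExitCode`.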

Jon Skeet
I gave HTML Tidy a go, but the HTML is so badly formed that it says it can't fix it without me fixing parts manually. Quite how HSBC ever employed a web developer capable of writing such a terrible website is beyond me.
Ben Hymers
A: 

If the page is not XHTML compliant, you cannot shove the HTML into an XmlDocument object, no matter how hard you try.

If this is low volume, you can use the WebBrowser control to create an empty HtmlDocument object, then use the Write() method of HtmlDocument to load the string you retrieved, and scrape from the resulting DOM.
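A minimal sketch of that idea (WinForms; the WebBrowser control needs an STA thread and a running message loop, and `scrapedHtml` is a placeholder for the string you fetched):

```csharp
using System;
using System.Windows.Forms;

// scrapedHtml is a placeholder for the page source you fetched.
var browser = new WebBrowser { ScriptErrorsSuppressed = true };
browser.DocumentText = "<html></html>";   // create an empty document
browser.Document.OpenNew(true);
browser.Document.Write(scrapedHtml);      // MSHTML parses forgivingly

// Walk the DOM that the browser's parser built.
// Note the quirk: attributes use DOM names, e.g. "className" for class.
foreach (HtmlElement li in browser.Document.GetElementsByTagName("li"))
    Console.WriteLine(li.GetAttribute("className"));
```

The timing can be fiddly: `DocumentText` loads asynchronously, so outside a simple event handler you may need to wait for `DocumentCompleted` before touching `Document`.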

Another option is mshtml.HTMLDocument, which is a bit of a pain to work with in .NET, as it is interop.

The most common type of screen scrape uses Regex, however. Once you determine the pattern you are looking for, you can scrape over and over again.
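For what it's worth, a toy example of that kind of scrape. The pattern is illustrative, not HSBC's real markup, and the caveat in the next answer about regexing complex HTML applies:

```csharp
using System;
using System.Text.RegularExpressions;

// Illustrative only: a regex is fine for a small, stable fragment.
string html = "<li class==\"lastItem\">-123.45</li>";
Match m = Regex.Match(html, @"<li[^>]*>(?<amount>-?\d+\.\d{2})</li>");
if (m.Success)
    Console.WriteLine(m.Groups["amount"].Value);  // prints -123.45
```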

Gregory A Beamer
+6  A: 

Take a look at the HTML Agility Pack. It allows you to extract elements of a non-XHTML-compliant web page through XPath, as if it were a well-formed XHTML document.
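A sketch of what that looks like, assuming the HtmlAgilityPack NuGet package; the XPath is invented for illustration, so adapt it to the real page:

```csharp
using System;
using HtmlAgilityPack;  // NuGet package: HtmlAgilityPack

// Fully qualified to avoid a clash with System.Windows.Forms.HtmlDocument.
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(pageSource);  // pageSource: the scraped string, warts and all

// SelectNodes returns null when nothing matches, so guard it.
var items = doc.DocumentNode.SelectNodes("//li[contains(@class, 'lastItem')]");
if (items != null)
    foreach (HtmlNode li in items)
        Console.WriteLine(li.InnerText.Trim());
```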

And for the love of Kleene, don't try to regex an HTML page of any complexity!

Pontus Gagge
+1. If the fools at HSBC are serving a file that isn't well-formed to browsers as text/html, it's a legacy HTML file you need to parse using an HTML parser, and not XHTML at all, even if it superficially looks like it.
bobince
A: 

Hi,

You could just parse the page to a string and then replace the offending syntax.

string source = //get the page source as a string...
// Replace returns a new string, so assign the result:
source = source.Replace(" ", "");
source = source.Replace("==", "=");

Just a thought..

Fraser
Not quite, the second line would ruin perfectly legal parts of the code like '==' in JavaScript, and the first would mangle the content and more than likely make the HTML even less valid :)
Ben Hymers
A: 
> For example there is whitespace before
> the <?xml?> tag, and there are places
> where == is used instead of = between
> an attribute name and its value (e.g.
> <li class=="lastItem">).

You can handle the first with a simple replace call. The second can be fixed on the same principle, but only if every == really needs to become =.
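A slightly safer sketch of that idea (`source` is the scraped string): trim only the leading whitespace rather than stripping all spaces, and restrict the == rewrite to text inside tags, so JavaScript comparisons elsewhere in the page survive:

```csharp
using System.Text.RegularExpressions;

// Drop the junk before the <?xml?> declaration.
string cleaned = source.TrimStart();

// Rewrite == to = only inside tags; script bodies sit between tags
// and are left alone. (Assumes no stray '<' or '>' inside scripts.)
cleaned = Regex.Replace(cleaned, "<[^>]+>",
    tag => tag.Value.Replace("==", "="));
```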

PoweRoy