tags:
views: 432
answers: 5

I'm writing some little applications that parse the source of a few web pages, extract some data, and save it into another format. Specifically, some of my banks don't provide downloads of transactions/statements but they do provide access to those statements on their websites.

I've done one fine, but another (HSBC UK) is proving a pain in the arse, since its source is not valid XHTML. For example, there is whitespace before the <?xml?> tag, and there are places where == is used instead of = between an attribute name and its value (e.g. <li class=="lastItem">).

Of course, when I pass this data into my XmlDocument, it throws a wobbly (more accurately an exception).

My question is: is it possible to relax the requirements for XML parsing in C#? I know it's far better to fix these problems at source - that's absolutely my attitude too - but there's roughly zero chance HSBC would change their website which already works in most browsers just for little old me.

+3  A: 

I don't believe you can relax the parsing, but you could run it through something like HTML Tidy first to let that deal with the mess.
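A rough sketch of that approach, piping the scraped markup through the tidy command-line tool. This assumes tidy is installed and on the PATH; the flags shown are standard Tidy options, but check your version:

```csharp
using System.Diagnostics;

class TidyRunner
{
    // Pipe the scraped markup through HTML Tidy and get XHTML back.
    // Assumes the 'tidy' executable is on the PATH.
    static string TidyToXhtml(string html)
    {
        var psi = new ProcessStartInfo("tidy", "-asxhtml -utf8 -quiet")
        {
            RedirectStandardInput = true,
            RedirectStandardOutput = true,
            UseShellExecute = false,
        };
        using (var p = Process.Start(psi))
        {
            p.StandardInput.Write(html);
            p.StandardInput.Close();
            string xhtml = p.StandardOutput.ReadToEnd();
            p.WaitForExit();
            return xhtml;
        }
    }
}
```

Tidy exits with a non-zero code when the input has errors it can't fix, so in practice you'd also want to check `p.ExitCode`.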

Jon Skeet
I gave HTML Tidy a go, but the HTML is so badly formed that it says it can't fix it without me fixing parts manually. Quite how HSBC ever employed a web developer capable of writing such a terrible website is beyond me.
Ben Hymers
A: 

If the page is not XHTML compliant, you cannot shove the HTML into an XmlDocument object, no matter how hard you try.

If this is low volume, you can use the WebBrowser control to create an empty HtmlDocument object, then use the Write() method of HtmlDocument to load the string you retrieved, and scrape from the resulting DOM.
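A minimal sketch of that idea (WinForms; the WebBrowser control needs an STA thread and a running message loop, and `scrapedHtml` is a placeholder for the string you fetched):

```csharp
using System;
using System.Windows.Forms;

// scrapedHtml is a placeholder for the page source you fetched.
var browser = new WebBrowser { ScriptErrorsSuppressed = true };
browser.DocumentText = "<html></html>";   // create an empty document
browser.Document.OpenNew(true);
browser.Document.Write(scrapedHtml);      // MSHTML parses forgivingly

// Walk the DOM that the browser's parser built.
// Note the quirk: attributes use DOM names, e.g. "className" for class.
foreach (HtmlElement li in browser.Document.GetElementsByTagName("li"))
    Console.WriteLine(li.GetAttribute("className"));
```

The timing can be fiddly: `DocumentText` loads asynchronously, so outside a simple event handler you may need to wait for `DocumentCompleted` before touching `Document`.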

Another option is mshtml.HTMLDocument, which is a bit of a pain to work with in .NET, as it is interop.

The most common type of screen scrape uses Regex, however. Once you determine the pattern you are looking for, you can scrape over and over again.
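For what it's worth, a toy example of that kind of scrape. The pattern is illustrative, not HSBC's real markup, and the caveat in the next answer about regexing complex HTML applies:

```csharp
using System;
using System.Text.RegularExpressions;

// Illustrative only: a regex is fine for a small, stable fragment.
string html = "<li class==\"lastItem\">-123.45</li>";
Match m = Regex.Match(html, @"<li[^>]*>(?<amount>-?\d+\.\d{2})</li>");
if (m.Success)
    Console.WriteLine(m.Groups["amount"].Value);  // prints -123.45
```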

Gregory A Beamer
+6  A: 

Take a look at the HTML Agility Pack. It allows you to extract elements of a non-XHTML-compliant web page through XPath, as if it were a well-formed XHTML document.
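A sketch of what that looks like, assuming the HtmlAgilityPack NuGet package; the XPath is invented for illustration, so adapt it to the real page:

```csharp
using System;
using HtmlAgilityPack;  // NuGet package: HtmlAgilityPack

// Fully qualified to avoid a clash with System.Windows.Forms.HtmlDocument.
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(pageSource);  // pageSource: the scraped string, warts and all

// SelectNodes returns null when nothing matches, so guard it.
var items = doc.DocumentNode.SelectNodes("//li[contains(@class, 'lastItem')]");
if (items != null)
    foreach (HtmlNode li in items)
        Console.WriteLine(li.InnerText.Trim());
```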

And for the love of Kleene, don't try to regex an HTML page of any complexity!

Pontus Gagge
+1. If the fools at HSBC are serving a file that isn't well-formed to browsers as text/html, it's a legacy HTML file you need to parse using an HTML parser, and not XHTML at all, even if it superficially looks like it.
bobince
A: 

Hi,

You could just parse the page to a string and then replace the offending syntax.

string source = //get the page source as a string...
// Replace returns a new string, so assign the result:
source = source.Replace(" ", "");
source = source.Replace("==", "=");

Just a thought..

Fraser
Not quite, the second line would ruin perfectly legal parts of the code like '==' in JavaScript, and the first would mangle the content and more than likely make the HTML even less valid :)
Ben Hymers
A: 
> For example there is whitespace before
> the <?xml?> tag, and there are places
> where == is used instead of = between
> an attribute name and its value (e.g.
> <li class=="lastItem">).

You can handle the first with a simple replace call. The second can be fixed on the same principle, but only if every == really needs to become =.
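A slightly safer sketch of that idea (`source` is the scraped string): trim only the leading whitespace rather than stripping all spaces, and restrict the == rewrite to text inside tags, so JavaScript comparisons elsewhere in the page survive:

```csharp
using System.Text.RegularExpressions;

// Drop the junk before the <?xml?> declaration.
string cleaned = source.TrimStart();

// Rewrite == to = only inside tags; script bodies sit between tags
// and are left alone. (Assumes no stray '<' or '>' inside scripts.)
cleaned = Regex.Replace(cleaned, "<[^>]+>",
    tag => tag.Value.Replace("==", "="));
```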

PoweRoy