views:

675

answers:

5

How do I repair malformed HTML using C#? A great answer would be an HTML Agility Pack sample!


I'm scraping a site (for legitimate use). The site's HTML is OK but there are some annoying problems.

One way to go would be regular expressions. I used Expression Web to analyse the problems and work out the regular expressions needed to correct them. So one option would be to use a tool such as RegexBuddy to generate C# code for these regular expressions.

However, the recommended tool for processing malformed HTML in C# is the HTML Agility Pack (HAP). Moreover, I've analysed only a handful of pages, and I'm afraid that future pages will contain patterns I've not yet solved; I would hate to enter the "find the errors in the next few pages and correct them" maintenance business. So, if HAP already has a solid, always-working solution, that would be great. The problem is that, apart from a few mentions here at SO, I could not find any how-to-use documentation for this tool beyond the object-by-object API help file.

So - before I spend $ and learning time on RegexBuddy (no free evaluation version), or break my teeth on HAP's API documentation - is there an easy way to do this? An HAP sample would help... :-)

+1  A: 

Regex can't be used for HTML cleaning. Does http://tidy.sourceforge.net/ help?
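
Tidy is a command-line tool, so one way to drive it from C# is to shell out to it. A rough sketch, assuming the tidy executable is installed and on the PATH (the file name is a placeholder):

    using System;
    using System.Diagnostics;

    class TidyRunner
    {
        static void Main()
        {
            var psi = new ProcessStartInfo
            {
                FileName = "tidy",
                // -q = quiet, -m = modify the file in place, -asxhtml = emit XHTML
                Arguments = "-q -m -asxhtml scraped.html",
                UseShellExecute = false,
                RedirectStandardError = true
            };
            using (var tidy = Process.Start(psi))
            {
                // Tidy writes its warnings and errors to stderr.
                Console.WriteLine(tidy.StandardError.ReadToEnd());
                tidy.WaitForExit();
            }
        }
    }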

Priyank Bolia
+1  A: 

If you're scraping a website you don't control, you'll always enter a maintenance mode where you have to fix your scraper every time the layout of the page you're scraping changes. It doesn't matter whether you're using the regex <td color="red">\d+</td> to get the big red number from a page, or using a DOM parser to get the 3rd cell in the 2nd row of the table with id "numbers" to get that same number. The regex breaks if the webmaster replaces the color attribute with a class attribute. The DOM parser breaks if the webmaster adds another row to the top of the table.
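
To make the comparison concrete, here is a rough sketch of both equally fragile approaches side by side; the HTML snippet is made up to match the description above:

    using System;
    using System.Text.RegularExpressions;
    using HtmlAgilityPack;

    class BrittlenessDemo
    {
        static void Main()
        {
            string html =
                "<table id=\"numbers\">" +
                "<tr><td>one</td><td>two</td><td>three</td></tr>" +
                "<tr><td>four</td><td>five</td><td color=\"red\">42</td></tr>" +
                "</table>";

            // Approach 1: regex - breaks when color= becomes class=.
            Match m = Regex.Match(html, "<td color=\"red\">(\\d+)</td>");
            Console.WriteLine(m.Groups[1].Value);

            // Approach 2: DOM/XPath - breaks when a row is added on top.
            var doc = new HtmlDocument();
            doc.LoadHtml(html);
            HtmlNode cell = doc.DocumentNode.SelectSingleNode(
                "//table[@id='numbers']/tr[2]/td[3]");
            Console.WriteLine(cell.InnerText);
        }
    }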

If you're scraping larger parts of a web page and want to embed them in your own web page, it may be easier to get over your desire for web standards compliance and just let the browser figure out how to display things.

Jan Goyvaerts
+1  A: 

Since you're using the HTML Agility Pack and know the problems that occur, and since you are limited to this known site, why not write your scraper to fix the problems once you've loaded the HtmlDocument?

i.e.: If you know a given element always appears right after another known element, insert the missing element into the first child position of the enclosing tag (see the sketch below).
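
A rough sketch of what such a fix-up could look like with HAP. The missing <meta> element in <head> is a made-up example of a "known problem", and the file names are placeholders:

    using HtmlAgilityPack;

    class KnownProblemFixer
    {
        static void Main()
        {
            var doc = new HtmlDocument();
            doc.Load("scraped.html");

            // Hypothetical known problem: the page is missing a <meta>
            // element that should be the first child of <head>.
            HtmlNode head = doc.DocumentNode.SelectSingleNode("//head");
            if (head != null && head.SelectSingleNode("meta") == null)
            {
                HtmlNode meta = HtmlNode.CreateNode(
                    "<meta http-equiv=\"Content-Type\" " +
                    "content=\"text/html; charset=utf-8\">");
                head.PrependChild(meta);
            }

            doc.Save("fixed.html");
        }
    }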

Pat
+2  A: 

Can you tell me what kind of annoying problems you are having?
You don't need regex to clean the HTML, though: HAP will let you access the elements of a malformed HTML document using XPath queries.
Basically, you need to learn XPath to know how to get the HTML elements you want.
It really depends on the kind of HTML you are parsing with HAP, but there are several ways to get at the elements:
by id, by class, or even by getting the element that follows another element that contains a given text, like "name:" for example (see the sketch below).
You can go to the W3Schools XPath Tutorial for a nice introduction.
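
A rough sketch of those selection styles with HAP; the ids, classes, and the "name:" label are made up for illustration:

    using System;
    using HtmlAgilityPack;

    class XPathSelectionDemo
    {
        static void Main()
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(
                "<div id='main'>" +
                "<span class='label'>name:</span><span>Avi</span>" +
                "</div>");

            // By id:
            HtmlNode byId = doc.DocumentNode.SelectSingleNode("//div[@id='main']");

            // By class:
            HtmlNode byClass = doc.DocumentNode.SelectSingleNode("//span[@class='label']");

            // The element that follows an element containing a given text:
            HtmlNode value = doc.DocumentNode.SelectSingleNode(
                "//span[contains(text(), 'name:')]/following-sibling::span[1]");

            Console.WriteLine(value.InnerText);  // prints "Avi"
        }
    }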

Karim
+1  A: 

What I took from the answers here:

  1. If you're scraping a website you don't control, you'll always enter a maintenance mode where you have to fix your scraper every time the layout of the page you're scraping changes.
  2. If you are limited to this known site, why not write your scraper to adjust for the problems?

So, if I have to go into maintenance mode, it should be as easy as possible. Therefore, my process is as follows:

  1. I use Webius's SWExplorerAutomation to detect scenes in web pages. The idea is that a scene is a collection of conditions you define for IE. When a web page is loaded, IE tries to see which set of conditions is met (e.g. the page title is "Account Login", and the page contains a "Login" text box and a "Password" text box). If a set of conditions corresponding to a scene is detected, IE reports that the scene has been detected. This model provides an abstraction layer: some changes in the web page translate only to changes in the scene file, saving the code from having to change. Additionally, it shields me from IE's event-driven model: I simply wait for the scene to be reported instead of handling IE's events. I'm evaluating this product, but I'm not yet sure I'll use it, mainly because the documentation is terrible. Another alternative is WatiN, and one more reason I haven't yet bought SWEA is this article accusing its author of spamming against WatiN.
  2. Once the web page has been acquired, I use Expression Web to run compatibility checks and identify errors.
  3. I use RegexMagic to remove and correct errors. I really love this tool. Sure, sometimes it makes you murderously angry because it doesn't let you do things that should be really easy, but it's a sweet, sweet tool, and the documentation is amazing.
  4. Finally, after all the errors I know about have been corrected, I use the HTML Agility Pack to convert to XHTML - crossing the t's and dotting the i's, so to speak: all lower case, quotes around attribute values, and so on (see the sketch below).
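
A minimal sketch of that last step, using two HAP options I believe handle the tag fixing and the XML-style output; the file names are placeholders:

    using HtmlAgilityPack;

    class XhtmlConverter
    {
        static void Main()
        {
            var doc = new HtmlDocument();
            doc.OptionFixNestedTags = true;  // repair mis-nested tags on load
            doc.OptionOutputAsXml = true;    // write well-formed, XML-style output
            doc.Load("corrected.html");
            doc.Save("final.xhtml");
        }
    }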

Hope this helps!

Avi

I've just gone to Wikipedia to look for info about WatiN. The page (http://en.wikipedia.org/wiki/WatiN) shows: "20:27, 17 August 2008 Alexf (talk | contribs) deleted "WatiN" (A7 (web): Web content which doesn't indicate its importance or significance)". There is an Alex Furman who is the creator of the SWExplorerAutomation tool. I really hope these two are not the same person!
Avi