I have written C# code that uses the HtmlAgilityPack library to scrape a page located at: World's Largest Urban Areas (Page 2). Unfortunately, the page's HTML is malformed.

I'm at an impasse on how to scrape this page. My current code (below) freezes while parsing the HTML:

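    // Select every row of the city table except the header row.
    HtmlNodeCollection cityRecords = _htmlDocument.DocumentNode.SelectNodes("//table[@class='boldtable']//tr[position() != 1]");
    // Collect every <td> cell found beneath those rows.
    CityNodes = (from node in cityRecords.Descendants()
                 where node.Name == "td"
                 select node).ToList();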

The goal is to parse every city listed on the page along with each of its data points; nothing more. I'm looking for recommendations on how to modify the code above, or for another freely available library to use instead.

Thanks!

+2  A: 

Run the content through HTML Tidy before parsing it.

http://tidy.sourceforge.net/
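
A minimal sketch of what that could look like from C#, assuming the `tidy` command-line executable is installed and reachable on the PATH. The `TidyRunner` class and its `Clean` method are hypothetical names made up for this illustration; only the `System.Diagnostics.Process` calls and the Tidy command-line options are real. The HTML is piped to Tidy on standard input and the cleaned XHTML is read back from standard output.

```csharp
using System;
using System.Diagnostics;
using System.Text;

// Hypothetical helper: shells out to the Tidy command-line tool.
static class TidyRunner
{
    // tidyPath is an assumption; point it at tidy.exe if it is not on the PATH.
    public static string Clean(string rawHtml, string tidyPath = "tidy")
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = tidyPath,
            // -asxhtml: emit well-formed XHTML; -numeric: numeric entities;
            // -utf8: UTF-8 in and out; --force-output: emit output even on errors.
            Arguments = "-quiet -asxhtml -numeric -utf8 --force-output yes",
            UseShellExecute = false,
            CreateNoWindow = true,
            RedirectStandardInput = true,
            RedirectStandardOutput = true,
            RedirectStandardError = true,
            StandardOutputEncoding = Encoding.UTF8
        };

        using (Process process = Process.Start(startInfo))
        {
            // Drain stderr and stdout asynchronously so that a full pipe
            // cannot block Tidy while we are still writing its input.
            process.ErrorDataReceived += (sender, e) => { };
            process.BeginErrorReadLine();
            var outputTask = process.StandardOutput.ReadToEndAsync();

            // Write raw UTF-8 bytes to stdin, then close it to signal end of input.
            byte[] input = Encoding.UTF8.GetBytes(rawHtml);
            process.StandardInput.BaseStream.Write(input, 0, input.Length);
            process.StandardInput.Close();

            string cleaned = outputTask.Result;
            process.WaitForExit();

            // Tidy exits with 0 (clean), 1 (warnings) or 2 (errors).
            if (process.ExitCode > 1)
            {
                Console.Error.WriteLine("Tidy reported errors; output may be incomplete.");
            }
            return cleaned;
        }
    }
}
```

The cleaned string could then be handed to HtmlAgilityPack with `_htmlDocument.LoadHtml(TidyRunner.Clean(rawHtml))` before running the XPath query from the question. Whether this cures the freeze on that particular page is untested.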

Joshua Drake
Thank you for the response so far. Do you know if there is an equivalent .NET library? I'd like the application to download an HTML page (not just the one I cite in my question), run HTML Tidy or an equivalent on it, and then process the result.
Joe
I'm not aware of a native one, but COM Interop should not be too difficult as long as speed is not a major issue. http://www.devx.com/dotnet/Article/20505/0/page/2 is one link.
Joshua Drake
I have found one, but I know almost nothing about it: http://sourceforge.net/projects/tidynet/
Joshua Drake