ansaurus

Question

Parsing Information out of a Scraped Screen (HTML)

Answer 1

+4 A:

Use the Html Agility Pack to parse the page. You can load the entire text of the page and then treat it as XML: write XPath expressions or crawl the DOM tree to get what you need.

This lets you avoid the problem of "scraping" altogether and approach the task as you would any other XML store. Here's a very basic intro to XPath. You could write something like myDoc.DocumentNode.SelectSingleNode("//div[@class='header']/h2").InnerText, which means "select the H2 element that is an immediate child of the DIV whose class is 'header'", and then get the inner text of that element.
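
To make the XPath expression above concrete, here is a minimal sketch using the Html Agility Pack (the HtmlAgilityPack NuGet package). The sample markup and the class name HeaderExample are made up for illustration; only the XPath expression comes from the answer.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

class HeaderExample
{
    static void Main()
    {
        // Sample markup, invented for this illustration.
        string html =
            "<html><body><div class='header'><h2>Hello</h2></div></body></html>";

        // Parse the raw HTML text into a queryable document.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath queries start from DocumentNode.
        // SelectSingleNode returns null when nothing matches.
        HtmlNode node =
            doc.DocumentNode.SelectSingleNode("//div[@class='header']/h2");

        Console.WriteLine(node != null ? node.InnerText : "(no match)");
    }
}
```

To copy the text inside a different tag, only the XPath string should need to change, for example "//title" or "//span[@id='price']" (both hypothetical).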

Rex M 2009-08-16 04:28:05

I'm very, VERY green to web scraping. How could I apply this to my particular problem? All I need it to do is copy the string inside an "X" HTML tag. Thank you!

Sergio Tapia 2009-08-16 04:36:34

@Papuccino see my revised answer.

Rex M 2009-08-16 04:41:31

I'll try out what you suggested. :)

Sergio Tapia 2009-08-16 04:48:11

Rex M - Curious how you'd initially retrieve the web page as an XML document so that an XmlDocument can be created?

Howiecamp 2010-01-25 23:53:12

@Howiecamp we would not create an XmlDocument from the webpage - rather we would load the entire response stream into the Html Agility Pack which creates an "XML-like" structure that behaves like XML, and can be converted to an XmlDocument.
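
A rough sketch of that loading step, assuming a placeholder URL and using WebClient to obtain the response stream (both are illustrative choices, not something specified in this thread):

```csharp
using System;
using System.IO;
using System.Net;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

class LoadExample
{
    static void Main()
    {
        // Placeholder URL, not taken from the thread.
        string url = "http://example.com/";

        var doc = new HtmlDocument();

        // Open the HTTP response as a stream and hand it to the parser.
        using (var client = new WebClient())
        using (Stream stream = client.OpenRead(url))
        {
            doc.Load(stream);
        }

        // The parsed document can now be queried with XPath.
        HtmlNode title = doc.DocumentNode.SelectSingleNode("//title");
        Console.WriteLine(title != null ? title.InnerText : "(no title)");
    }
}
```

The library's HtmlWeb class offers a shorter route: new HtmlWeb().Load(url) fetches and parses the page in one call.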

Rex M 2010-01-26 02:57:51

Thanks Rex.....

Howiecamp 2010-01-26 04:42:31

Answer 2

+1 A:

Have a look at Wikipedia's entry on Web Scraping. I do a lot of web scraping, and in my experience regular expressions are sufficient about 80% of the time. Beyond that, you need to parse the (X)HTML and traverse the DOM tree.
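
As an illustration of the regular-expression approach, here is a minimal sketch that pulls the text out of the first H2 element. The sample markup and the pattern are assumptions made for this example; a pattern like this only copes with simple, non-nested tags.

```csharp
using System;
using System.Text.RegularExpressions;

class RegexExample
{
    static void Main()
    {
        // Sample markup, invented for this illustration.
        string html = "<div class=\"header\"><h2>Quarterly Report</h2></div>";

        // Lazily capture everything between the opening and closing h2 tags.
        // Singleline lets '.' match newlines; IgnoreCase also matches <H2>.
        Match m = Regex.Match(
            html,
            @"<h2[^>]*>(.*?)</h2>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        Console.WriteLine(m.Success ? m.Groups[1].Value : "(no match)");
    }
}
```

When the target element can contain nested tags of the same name, or the markup is irregular, this is where a real parser and DOM traversal become the safer choice.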

Nick 2009-08-16 04:30:35
