views:

515

answers:

2

I am looking to develop a web scraper (in C# Windows Forms). The idea I am trying to accomplish is as follows.

  1. Get the URL from the user.
  2. Load the web page in the IE UI control (embedded browser) in WinForms.
  3. Allow the user to select a piece of text (contiguous, small, not exceeding 50 characters) from the loaded web page.
  4. When the user wishes to persist the location (the HTML DOM location), store it in the DB so that it can be used to fetch the data at that location during subsequent visits.

Assume that the loaded website is a price-listing site and the quoted rate keeps changing; the idea is to persist the DOM hierarchy so that we can traverse it next time.

I was able to do this when all the HTML elements had id attributes. When the id is null, I am not able to accomplish this.

Could someone suggest a valid approach for this (a bare-minimum code snippet if possible)?

It would be helpful even if you can share some online resources.

thanks,

vijay

+2  A: 

One approach is to build a stack of tags/styles/ids down to the element you want to select.

From the element you want, traverse up to the nearest element with an id. This way you get rid of most of the top header etc. Then build a sequence to look for.

Example:

<html>
  <body>
    <!-- lots of html -->
    <div id="main">
       <div>
          <span>
             <div class="pricearea">
                <table> <!-- with price data -->

For the example above you would store in your DB the sequence [id=main],div,span,div,table, or perhaps div[class=pricearea],table.

Styles/classes can also be used to build your path. It's your choice whether to look for a tag, an attribute of a tag, or a combination. You want it as accurate as possible with as few elements as possible, to make it robust.

If the layout seldom changes, this would let you navigate to the same location each time.
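The walk-up-to-the-nearest-id idea above can be sketched in a few lines. This is a minimal sketch using the mshtml interop types (the same ones the question's embedded IE control exposes); `BuildPath` is a hypothetical helper, not part of any library, and the path format simply mirrors the example above:

```csharp
using System.Collections.Generic;
using mshtml; // COM interop assembly for the IE DOM

static class DomPath
{
    // Hypothetical helper: build a selector-like path from an element up to
    // the nearest ancestor that has an id, e.g. "[id=main],div,span,div[class=pricearea]".
    public static string BuildPath(IHTMLElement element)
    {
        var segments = new List<string>();
        IHTMLElement current = element;

        // Walk upwards until we hit an element with an id (or run off the root).
        while (current != null && string.IsNullOrEmpty(current.id))
        {
            string segment = current.tagName.ToLowerInvariant();
            if (!string.IsNullOrEmpty(current.className))
                segment += "[class=" + current.className + "]";
            segments.Insert(0, segment);
            current = current.parentElement;
        }

        if (current != null)
            segments.Insert(0, "[id=" + current.id + "]");

        return string.Join(",", segments);
    }
}
```

The stored string can then be replayed on a later visit by walking the segments downwards from the id element.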

I would also suggest using the HTML Agility Pack or something similar for the DOM parsing, as the IE control is slow.

Screen scraping is fun, but it's difficult to get it 100% for all pages. Good luck!
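As a sketch of the Agility Pack suggestion (assuming the HtmlAgilityPack package is referenced; the URL and the XPath are illustrative only, made up to match the price-area example above):

```csharp
using System;
using HtmlAgilityPack;

class PriceScraper
{
    static void Main()
    {
        // Load and parse the page without involving the IE control.
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://example.com/prices");

        // A stored path like [id=main],div,span,div[class=pricearea],table
        // translates naturally into an XPath query.
        HtmlNode priceTable = doc.DocumentNode.SelectSingleNode(
            "//div[@id='main']/div/span/div[@class='pricearea']/table");

        if (priceTable != null)
            Console.WriteLine(priceTable.InnerText);
    }
}
```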

Mikael Svenson
@Mikael: Thanks for your lucid explanation. I had a similar thought, but while accessing the scraped element (traversing downwards), the siblings are fetched as well. For example, if there are two tables within the div 'pricearea', or two rows within the first table, the logic of persisting the nearest parent doesn't work.
vijaysylvester
You could add indexer logic to your path, e.g. div,table[class=something,no=2]. The same way you append tags/styles to a path, you can add indexers as well. If you paste some sample HTML and the element you want to grab, I can try to write an example.
Mikael Svenson
+1  A: 

After a bit of googling, I came across a fairly simple solution. Below is a sample snippet.

if (webBrowser.Document != null)
{
    // Get the underlying HTML DOM of the loaded page (requires a reference to mshtml).
    IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document.DomDocument;

    // The user's current selection, wrapped in a text range.
    IHTMLSelectionObject selection = htmlDoc.selection;
    IHTMLTxtRange range = (IHTMLTxtRange)selection.createRange();

    // The element that contains the selection.
    IHTMLElement parentElement = range.parentElement();
    targetSourceIndex = parentElement.sourceIndex;

    MessageBox.Show(range.text);
}

I used an embedded web browser in a WinForms application, which loads the HTML DOM of the current web page.

The IHTMLElement instance exposes a property named sourceIndex, which assigns a unique index to each HTML element in the document.

One can store this sourceIndex in the DB and query for the content at that location using the following code.

if (webBrowser.Document != null)
{
    IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document.DomDocument;
    IHTMLElement targetElement = null;

    // Walk every element in the document and match on the persisted
    // sourceIndex (read back here from an XML file).
    foreach (IHTMLElement domElement in htmlDoc.all)
    {
        if (domElement.sourceIndex == int.Parse(node.InnerText))
        {
            targetElement = domElement;
            break;
        }
    }

    if (targetElement != null)
        MessageBox.Show(targetElement.innerText);
}
vijaysylvester
This will work, but it is very error prone. One new tag, and the index changes. Using mshtml is also slow, and I would not recommend using it in server-side code.
Mikael Svenson