I came across this .NET library:

http://www.webzinc.com/online/faq.aspx

However, I was wondering whether there is a free alternative out there?

A: 

Building robots isn't that hard, and there are a number of books that describe the general algorithm for doing so (a simple Google search will turn up a number of algorithms).

The gist of it from a .NET perspective is to recursively:

  • Download pages - This is done through the HttpWebRequest/HttpWebResponse classes, or the WebClient class.

  • Parse the downloaded content - I use a customized version of HTML Tidy for .NET called TidyNet (located at http://sourceforge.net/projects/tidynet/). The HTML Agility Pack (found at http://www.codeplex.com/htmlagilitypack) is also very popular. Both of these will handle malformed HTML and give you a structured format to work with (a DOM, or something close to it).

  • Once you have the HTML in a structured format, scan the structure for the content that is relevant to you and process it.

  • Scan the structured format for external links and place them in a queue to be processed (subject to whatever constraints you want for your app; you aren't indexing the entire web, are you?).

  • Get the next item in the queue, and repeat the process.
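
The steps above are not specific to .NET, so here is a minimal sketch of the same download/parse/queue loop in Python, using only the standard library. It is an illustration under assumptions, not the answerer's code: the names `LinkExtractor`, `fetch`, `crawl`, `fetch_page` and `max_pages` are hypothetical, the "constraint" chosen is simply staying on the start URL's host, and the "processing" step only records the links found on each page.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags; html.parser tolerates malformed HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its body as text (hypothetical default fetcher)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl restricted to the start URL's host (an assumed constraint).

    Returns a dict mapping each visited URL to the absolute links found on it.
    """
    allowed_host = urlparse(start_url).netloc
    queue = deque([start_url])   # pages waiting to be processed
    seen = {start_url}           # URLs already queued, to avoid loops
    results = {}

    while queue and len(results) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download

        parser = LinkExtractor()
        parser.feed(html)

        links = []
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            links.append(absolute)
            if urlparse(absolute).netloc == allowed_host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        results[url] = links

    return results
```

Passing the fetcher in as a parameter keeps the loop testable without network access; in a real robot you would also want to honor robots.txt and rate-limit requests, which this sketch does not do.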

casperOne
A: 

Alternative in terms of what? Do you want a script, a library, or a piece of software, or do you want to hire somebody to do the screen scraping? Please clarify.

Bob
A: 

Nothing better than Automation Anywhere as a web data extraction tool.

Sharon