Hi, I want to create a desktop app (C#, probably) that scrapes or manipulates a form on a third-party web page. Basically, I enter my data in the form in the desktop app; it goes away to the third-party website and, using a script or whatever in the background, enters my data there (including my login) and clicks the submit button for me. I just want to avoid loading up the browser!

Not having done much (any!) work in this area, I was wondering: would a scripting language like Perl, Python, Ruby, etc. let me do this? Or should I simply do all the scraping using C# and .NET? Which one is best, in your opinion?

I was thinking of a script because I may need to hook into the same script from applications on different platforms (e.g. Symbian mobile, where I wouldn't be able to develop in C# as I would for the desktop version).

It's not a web app; otherwise I may as well use the original site. I realise it all sounds pointless, but automating this specific form would be a real time-saver for me.

+2  A: 

IMO, Perl's built-in regular expression functionality and ability to manipulate text make it a pretty good contender for screen scraping.

Galwegian
+1  A: 

PHP is a good contender due to its good Perl-Compatible Regex support and cURL library.

Ólafur Waage
+4  A: 

C# is more than suitable for your screen-scraping needs. .NET's Regex functionality is really nice. However, for such a simple task, you'll be hard-pressed to find a language that doesn't do what you want relatively easily. Considering you're already programming in C#, I'd say stick with that.

The built-in screen-scraping functionality is also top notch.
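
For concreteness, here's a minimal sketch of that approach: WebClient to download the page and a Regex to pull out a value. The URL and pattern are placeholders, not from the answer; for anything structural, a real HTML parser is safer than regexes.

    // Minimal sketch: fetch a page with WebClient and extract its <title>
    // with a regular expression. URL and pattern are placeholders.
    using System;
    using System.Net;
    using System.Text.RegularExpressions;

    class ScrapeSketch
    {
        static void Main()
        {
            using (var client = new WebClient())
            {
                string html = client.DownloadString("http://example.com/");
                Match m = Regex.Match(html, @"<title>\s*(.*?)\s*</title>",
                                      RegexOptions.IgnoreCase | RegexOptions.Singleline);
                Console.WriteLine(m.Success ? m.Groups[1].Value : "no title found");
            }
        }
    }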

Joey Robert
+18  A: 

Do not forget to look at BeautifulSoup; it comes highly recommended.

See, for example, options-for-html-scraping. If you need to select a programming language for this task, I'd say Python.

For a more direct solution to your question, see twill, a simple scripting language for Web browsing.

gimel
+5  A: 

I use C# for scraping. See the helpful HtmlAgilityPack package. For parsing pages, I use either XPath or regular expressions. .NET can also easily handle cookies if you need that.

I've written a small class that wraps all the details of creating a WebRequest, sending it, waiting for a response, saving the cookies, handling network errors and retransmitting, etc. The end result is that for most situations I can just call GetRequest/PostRequest and get an HtmlDocument back.
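
A rough sketch of what such a wrapper might look like (the class name and details here are hypothetical, not the answerer's actual code; the error handling and retries he mentions are omitted):

    using System.IO;
    using System.Net;
    using HtmlAgilityPack;

    // Hypothetical wrapper: shares a CookieContainer across requests and
    // returns a parsed HtmlAgilityPack HtmlDocument.
    class ScrapingSession
    {
        private readonly CookieContainer cookies = new CookieContainer();

        public HtmlDocument GetRequest(string url)
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.CookieContainer = cookies; // cookies persist across calls
            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                var doc = new HtmlDocument();
                doc.LoadHtml(reader.ReadToEnd());
                return doc;
            }
        }
    }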

Roy Peled
+4  A: 

You could try using the .NET HTML Agility Pack:

http://www.codeplex.com/htmlagilitypack

"This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams)."

A: 

Awesome, thanks all. Loads of info there.

A: 

Or stick with WebClient in C# and some string manipulation.
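
Since the goal is to fill in and submit a form, a minimal sketch of that with WebClient (the URL and field names are placeholders; inspect the real form's HTML for the actual action URL and input names):

    using System;
    using System.Collections.Specialized;
    using System.Net;
    using System.Text;

    class FormPostSketch
    {
        static void Main()
        {
            // Field names and URL are placeholders for the real form's.
            var fields = new NameValueCollection
            {
                { "username", "me" },
                { "password", "secret" },
                { "message",  "my data" }
            };
            using (var client = new WebClient())
            {
                // UploadValues sends an HTTP POST, like clicking Submit.
                byte[] reply = client.UploadValues("http://example.com/form", fields);
                Console.WriteLine(Encoding.UTF8.GetString(reply));
            }
        }
    }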

+1  A: 

Ruby is pretty great! Try its Hpricot/Mechanize libraries.

Vic
+2  A: 

Groovy is very good.

Example: http://froth-and-java.blogspot.com/2007/06/html-screen-scraping-with-groovy.html

Groovy and HtmlUnit are also a very good match: http://groovy.codehaus.org/Testing+Web+Applications

HtmlUnit will simulate a full browser with JavaScript support.

+1  A: 

HTML Agility Pack (c#)

  1. XPath is borked: the way the HTML is cleaned to make it XML-compliant drops tags, so you have to adjust your expressions to get them to work
  2. simple to use

Mozilla Parser (Java)

  1. Solid XPath support
  2. you have to set environment variables before it will work, which is a pain
  3. casting between org.dom4j.Node and org.w3c.dom.Node to get different properties is a real pain
  4. dies on non-standard HTML (0.3 fixes this)
  5. best solution for XPath
  6. problems accessing data on Nodes in a NodeList

    use a for (int i = 1; i <= list_size; i++) loop to get around that

Beautiful Soup (Python)

I don't have much experience with it, but here's what I've found:

  1. no XPath support
  2. nice interface for navigating HTML


I prefer Mozilla HTML Parser

Scott Cowan
A: 

I second the recommendation for Python (and Beautiful Soup). I'm currently in the middle of a small screen-scraping project using Python, and Python 3's automatic handling of things like cookie authentication (through CookieJar and urllib) is greatly simplifying things. Python supports all of the more advanced features you might need (like regexes), lets you knock out projects like this quickly (not too much overhead dealing with low-level stuff), and is relatively cross-platform.

Zxaos
+1  A: 

You could look into http://www.screen-scraper.com/

Jason Bellows
+2  A: 

We use Groovy with NekoHTML. (Also note that you can now run Groovy on Google App Engine.)

Here is some example runnable code on the Keplar blog:

Better competitive intelligence through scraping with Groovy

Alex Dean