Hi, I want to create a desktop app (C#, probably) that scrapes or manipulates a form on a third-party web page. Basically, I enter my data in the form in the desktop app; it goes away to the third-party website and, using a script or whatever in the background, enters my data there (including my login) and clicks the submit button for me. I just want to avoid loading up the browser!

Not having done much (any!) work in this area, I was wondering: would a scripting language like Perl, Python, Ruby, etc. let me do this? Or should I simply do all the scraping using C# and .NET? Which one is best, in your opinion?

I was thinking of a script because I may need to hook into the same script from applications on different platforms (e.g. Symbian mobile, where I wouldn't be able to develop in C# as I would for the desktop version).

It's not a web app; otherwise I may as well use the original site. I realise it all sounds pointless, but automating this specific form would be a real time-saver for me.

+2  A: 

IMO, Perl's built-in regular expression functionality and ability to manipulate text make it a pretty good contender for screen scraping.

Galwegian
+1  A: 

PHP is a good contender due to its good Perl-Compatible Regex support and cURL library.

Ólafur Waage
+4  A: 

C# is more than suitable for your screen-scraping needs. .NET's Regex functionality is really nice. However, for such a simple task, you'll be hard-pressed to find a language that doesn't do what you want relatively easily. Considering you're already programming in C#, I'd say stick with that.

The built-in screen-scraping functionality is also top notch.
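
For concreteness, here's a minimal sketch of that approach: WebClient to download the page and a Regex to pull out a value. The URL and pattern are placeholders, not from the answer; for anything structural, a real HTML parser is safer than regexes.

    // Minimal sketch: fetch a page with WebClient and extract its <title>
    // with a regular expression. URL and pattern are placeholders.
    using System;
    using System.Net;
    using System.Text.RegularExpressions;

    class ScrapeSketch
    {
        static void Main()
        {
            using (var client = new WebClient())
            {
                string html = client.DownloadString("http://example.com/");
                Match m = Regex.Match(html, @"<title>\s*(.*?)\s*</title>",
                                      RegexOptions.IgnoreCase | RegexOptions.Singleline);
                Console.WriteLine(m.Success ? m.Groups[1].Value : "no title found");
            }
        }
    }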

Joey Robert
+18  A: 

Do not forget to look at BeautifulSoup; it comes highly recommended.

See, for example, options-for-html-scraping. If you need to select a programming language for this task, I'd say Python.

For a more direct solution to your question, see twill, a simple scripting language for Web browsing.

gimel
+5  A: 

I use C# for scraping. See the helpful HtmlAgilityPack package. For parsing pages, I use either XPath or regular expressions. .NET can also easily handle cookies if you need that.

I've written a small class that wraps all the details of creating a WebRequest, sending it, waiting for a response, saving the cookies, handling network errors and retransmitting, etc. The end result is that for most situations I can just call GetRequest/PostRequest and get an HtmlDocument back.
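
A rough sketch of what such a wrapper might look like (the class name and details here are hypothetical, not the answerer's actual code; the error handling and retries he mentions are omitted):

    using System.IO;
    using System.Net;
    using HtmlAgilityPack;

    // Hypothetical wrapper: shares a CookieContainer across requests and
    // returns a parsed HtmlAgilityPack HtmlDocument.
    class ScrapingSession
    {
        private readonly CookieContainer cookies = new CookieContainer();

        public HtmlDocument GetRequest(string url)
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.CookieContainer = cookies; // cookies persist across calls
            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                var doc = new HtmlDocument();
                doc.LoadHtml(reader.ReadToEnd());
                return doc;
            }
        }
    }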

Roy Peled
+4  A: 

You could try using the .NET HTML Agility Pack:

http://www.codeplex.com/htmlagilitypack

"This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams)."

A: 

Awesome, thanks all. Loads of info there.

A: 

Or stick with WebClient in C# and some string manipulation.
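
Since the goal is to fill in and submit a form, a minimal sketch of that with WebClient (the URL and field names are placeholders; inspect the real form's HTML for the actual action URL and input names):

    using System;
    using System.Collections.Specialized;
    using System.Net;
    using System.Text;

    class FormPostSketch
    {
        static void Main()
        {
            // Field names and URL are placeholders for the real form's.
            var fields = new NameValueCollection
            {
                { "username", "me" },
                { "password", "secret" },
                { "message",  "my data" }
            };
            using (var client = new WebClient())
            {
                // UploadValues sends an HTTP POST, like clicking Submit.
                byte[] reply = client.UploadValues("http://example.com/form", fields);
                Console.WriteLine(Encoding.UTF8.GetString(reply));
            }
        }
    }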

+1  A: 

Ruby is pretty great! Try its Hpricot/Mechanize libraries.

Vic
+2  A: 

Groovy is very good.

Example: http://froth-and-java.blogspot.com/2007/06/html-screen-scraping-with-groovy.html

Groovy and HtmlUnit are also a very good match: http://groovy.codehaus.org/Testing+Web+Applications

HtmlUnit will simulate a full browser with JavaScript support.

+1  A: 

HTML Agility Pack (c#)

  1. XPath is borked: the way the HTML is cleaned to make it XML-compliant drops tags, so you have to adjust your expressions to get them to work
  2. simple to use

Mozilla Parser (Java)

  1. Solid XPath support
  2. you have to set environment variables before it will work, which is a pain
  3. casting between org.dom4j.Node and org.w3c.dom.Node to get different properties is a real pain
  4. dies on non-standard HTML (0.3 fixes this)
  5. best solution for XPath
  6. problems accessing data on Nodes in a NodeList

    use a for (int i = 1; i <= list_size; i++) loop to get around that

Beautiful Soup (Python)

I don't have much experience with it, but here's what I've found:

  1. no XPath support
  2. nice interface for navigating HTML


I prefer Mozilla HTML Parser

Scott Cowan
A: 

I second the recommendation for Python (and Beautiful Soup). I'm currently in the middle of a small screen-scraping project using Python, and Python 3's automatic handling of things like cookie authentication (through CookieJar and urllib) is greatly simplifying things. Python supports all of the more advanced features you might need (like regexes), lets you knock out projects like this quickly (not too much overhead dealing with low-level stuff), and is relatively cross-platform.

Zxaos
+1  A: 

You could look into http://www.screen-scraper.com/

Jason Bellows
+2  A: 

We use Groovy with NekoHTML. (Also note that you can now run Groovy on Google App Engine.)

Here is some example runnable code on the Keplar blog:

Better competitive intelligence through scraping with Groovy

Alex Dean