views:

1679

answers:

5

HTML scraping is pretty well documented from what I can see, and I understand the concept and implementation of it, but what is the best method for scraping content that is tucked away behind authentication forms? I'm referring to content that I legitimately have access to, so what I'm looking for is a method for automatically submitting login data.

All I can think of is setting up a proxy, capturing the traffic from a manual login, and then writing a script that replays that traffic as part of the scraping run. As far as language goes, it would likely be done in Perl.

Has anyone had experience with this, or any general thoughts?

Edit: This has been answered before, but with .NET. While that validates how I think it should be done, does anyone have a Perl script that does this?

A: 

Yes, if your language is something other than ASP.NET, you can use equivalent libraries for it.

For example, in Java you can use HttpClient or HttpUnit (which even handles some basic JavaScript).

Guido
+3  A: 

The LWP module in Perl should give you what you're after.

There's a good article here which talks about enabling cookies and other authentication methods to get an authorised login and let your screen scraper get behind the log-in wall.
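
In practice that just means giving LWP::UserAgent a cookie jar so the session cookie from the login sticks around. A minimal sketch (the cookie file name is just an example):

    use LWP::UserAgent;
    use HTTP::Cookies;

    # A file-backed jar keeps the session cookie between runs; a plain
    # in-memory jar (HTTP::Cookies->new with no arguments) also works.
    my $ua = LWP::UserAgent->new(
        cookie_jar => HTTP::Cookies->new(file => 'cookies.txt', autosave => 1),
    );

Once the jar is in place, whatever login you perform through $ua is remembered for the requests that follow.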

ConroyP
+2  A: 

There are two types of authentication in regular use: HTTP-based authentication and form-based authentication.

For a site that uses HTTP-based authentication, you basically send the username and password as part of every HTTP request you make to the server.
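
With LWP that amounts to registering the credentials with the user agent. A rough sketch, where the host, realm, username and password are placeholders for whatever the site actually uses:

    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    # The realm string must match the one the server sends back in its
    # WWW-Authenticate header.
    $ua->credentials('example.com:443', 'Members Area', 'myuser', 'mypass');

    my $resp = $ua->get('https://example.com/protected/page.html');
    print $resp->decoded_content if $resp->is_success;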

For a site that uses form-based authentication, you usually need to visit the login page, submit your credentials, accept and store the session cookie, and then send that cookie along with any subsequent HTTP requests you make.
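
In Perl that looks roughly like the following; the URLs and form field names are made up, so check the site's actual login form for the real ones:

    use LWP::UserAgent;
    use HTTP::Cookies;

    # The cookie jar holds the session cookie the login sets.
    my $ua = LWP::UserAgent->new(cookie_jar => HTTP::Cookies->new);

    my $login = $ua->post('https://example.com/login',
        { username => 'myuser', password => 'mypass' });
    # Many sites answer a successful login POST with a redirect.
    die 'Login failed: ', $login->status_line
        unless $login->is_success or $login->is_redirect;

    # The jar now sends the session cookie on every request, so protected
    # pages come back as if you had logged in by hand.
    my $page = $ua->get('https://example.com/members/report.html');
    print $page->decoded_content;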

Of course, there are also sites like Stack Overflow that use external authentication such as OpenID or SAML. These are more complex to deal with when scraping; usually you want to find a library to handle them.

Zoredache
+4  A: 

Check out the Perl WWW::Mechanize library - it builds on LWP to provide tools for doing exactly the kind of interaction you refer to, and it can maintain state with cookies while you're about it!

WWW::Mechanize, or Mech for short, helps you automate interaction with a website. It supports performing a sequence of page fetches including following links and submitting forms. Each fetched page is parsed and its links and forms are extracted. A link or a form can be selected, form fields can be filled and the next page can be fetched. Mech also stores a history of the URLs you've visited, which can be queried and revisited.
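
A login-and-fetch with Mech looks something like this; the URLs and form field names are invented, so inspect the real login form to see what to pass:

    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new(autocheck => 1);   # die on any HTTP error

    $mech->get('https://example.com/login');
    $mech->submit_form(
        form_number => 1,   # or form_name / form_id if the page names its form
        fields      => { username => 'myuser', password => 'mypass' },
    );

    # Mech keeps the session cookie automatically, so you stay logged in
    # for everything you fetch afterwards.
    $mech->get('https://example.com/members/export.csv');
    print $mech->content;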

Paul Dixon
A: 

I've tried to use WWW::Mechanize, and I think it's a GREAT library. It's very powerful and very easy to use.

The only problem is that it doesn't support JavaScript. Does anyone know of a similar solution that does? I'm trying to log in and scrape a site that requires JavaScript and just displays an error if it's disabled.

[I guess adding to this question is the way I should do this? This is my first post, and I remember Jeff talking about having some way to merge questions or something... if I should post a new question instead, let me know.]