ansaurus

Question

What is the best way to crawl a login based sites?

Answer 1

A:

Free Download Manager is great for crawling, and you could use wget.

Zach 2009-11-20 04:41:19

He is not asking for a software..

Shoban 2009-11-20 04:48:12

Answer 2

A:

Try a Selenium script, automated with Selenium Remote Control.

cxfx 2009-11-20 04:47:59

Answer 3

+4 A:

I use scrapy.org, it's a python library. It's quiet good actually. Easy to write spiders and it's very extensive in it's functionality. Scraping sites after login is available in the package.

Here is an example of a spider that would crawl a site after authentication.

class LoginSpider(BaseSpider):
    domain_name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return [FormRequest.from_response(response,
                formdata={'username': 'john', 'password': 'secret'},
                callback=self.after_login)]

    def after_login(self, response):
        # check login succeed before going on
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return

        # continue scraping with authenticated session...

tarasm 2009-11-20 04:49:53

What happens the url is emiting javascript like document.writeln to fill the browser document? Is Scrapy works in this case?

Vadi 2009-11-20 13:19:05

There are 2 scenarios that I can think of

tarasm 2009-11-20 14:17:58

1. All of the data is in the page when the page loads, but it's in js instead of html(this is pretty unlikely). But if this is the case, then I believe you can parse it and scrapy has some functionality for as vaguely indicated here: http://doc.scrapy.org/intro/overview.html?highlight=javascript#what-else

tarasm 2009-11-20 14:39:59

2. Data is loaded with ajax. This is a lot more likely. In this case, you can figure out what requests are made to query the data and simulate those requests directly without js.

tarasm 2009-11-20 14:40:59

Answer 4

+3 A:

I used mechanize for Python with success for a few things. It's easy to use and supports HTTP authentication, form handling, cookies, automatic HTTP redirection (30X), ... Basically the only thing missing is JavaScript, but if you need to rely on JS you're pretty much screwed anyway.

paprika 2009-11-20 11:27:50

ansaurus

tags:

views:

answers:

What is the best way to crawl a login based sites?

related questions