views: 938 · answers: 4

I want to scrape the user pages of SO to give the owners of my toolbar up-to-date information on their questions, answers, etc.

This means I need to do this in the background: parse the pages, extract the content, compare it with the last run, and then present the results either on the toolbar, on the status bar, or in a pop-up window of some kind. And all of this has to happen while the user goes about his business, without being interrupted and without even being on SO.

I've searched quite thoroughly, both on Google and on the Mozilla Wiki, for some kind of hint. I've even gone to the extent of downloading a few other extensions that I think do the same thing. Unfortunately I haven't had the time to go through all of them, and the ones I've looked at all use data APIs (services, web services, XML), not HTML scraping.

Old question text

I'm looking for a good place to learn how I can load a page inside a function called by the infamous setTimeout(), so I can do the screen scraping in the background.

My idea is to present the results of such scraping in a status bar extension, in case anything changed from the last run.

Is there a hidden overlay or some other subterfuge?

+1  A: 

From privileged JavaScript, i.e. JS in an extension, you are allowed to create hidden iframes; downloading the specified page is as simple as setting the location on this frame.

If you're pulling down a simple, static page that you own, setTimeout should be fine. But in that case, why not use XHR?

If you're pulling down arbitrary pages, ones with dynamic elements or lots of content, I'd recommend triggering your scrape from the document's load event handler instead. It's far more reliable, and you can get clever about scraping the page at the earliest moment you know the required content is there.
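
A minimal sketch of that approach; the URL and the extraction step are illustrative assumptions, not part of the original answer:

// Privileged extension JS: create a hidden content iframe and scrape it on load.
var frame = document.createElement("iframe");  // in a XUL overlay this yields a XUL iframe
frame.setAttribute("type", "content");         // load the page as untrusted web content
frame.collapsed = true;                        // keep it invisible to the user
document.documentElement.appendChild(frame);

frame.addEventListener("load", function(event) {
    var doc = event.originalTarget;            // the document of the loaded page
    // ... walk doc with the usual DOM APIs and extract what you need ...
}, true);

frame.setAttribute("src", "https://stackoverflow.com/users/12345"); // hypothetical URL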

I don't think there's a specific tutorial on this, but the Mozilla Developer Center, which I'm sure you've already found, is absolutely excellent - the best online technical documentation in my opinion!

Alabaster Codify
Does XHR in an extension permit access to other domains? And can I use Firefox's DOM facilities on HTML pulled in via XHR?
Gustavo Carreno
+3  A: 

I am not sure I understood the question completely, but I'll try to answer a few apparent alternative questions:

If you are looking to scrape static web pages, BeautifulSoup (Python) is one of the best and easiest tools.

If you are watching for changes in an Ajax-based page that updates over time, you will have to keep re-running the scrape. But do not poll the site too frequently: it may notice the bandwidth consumption and block your IP, so poll at a sensible interval.
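
In the asker's Firefox-extension setting, such a polling loop would look roughly like this; the function name and the interval are assumptions:

// Re-run the scrape at a generous, fixed interval rather than hammering the site.
var POLL_INTERVAL_MS = 15 * 60 * 1000;  // 15 minutes; pick something polite

function pollLoop() {
    checkForUpdates();                  // hypothetical scrape-and-compare routine
    setTimeout(pollLoop, POLL_INTERVAL_MS);
}
setTimeout(pollLoop, POLL_INTERVAL_MS); // schedule the first run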

If you are looking to scrape JavaScript-rendered tickers or the like, that cannot be done until the page has been rendered, so it is not possible with BeautifulSoup alone. You will have to use a headless browser such as Crowbar (from the SIMILE project, built on XULRunner), which renders the JavaScript content headlessly; that rendered output can then be fed to the BeautifulSoup scraper.

JV
I have to do this inside a toolbar that's a Firefox extension. Refer to my clarification above.
Gustavo Carreno
A: 

Have a look at XMLHttpRequest; it should get you started.
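
A minimal sketch (the URL is a placeholder); note that XHR issued from privileged extension code is not bound by the same-origin restriction that page scripts face:

var xhr = new XMLHttpRequest();
xhr.open("GET", "https://stackoverflow.com/users/12345", true); // hypothetical URL
xhr.onreadystatechange = function() {
    if (xhr.readyState === 4 && xhr.status === 200) {
        var html = xhr.responseText;   // raw page source, ready for parsing
        // ... compare against the previous run here ...
    }
};
xhr.send(null);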

Mat
+4  A: 

In the case of XUL/Firefox, what you need is the nsIIOService interface, which you can get like this:

var mIOS = Components.classes["@mozilla.org/network/io-service;1"].
   getService(Components.interfaces.nsIIOService);

Then you need to create a channel, and open an asynchronous link:

var channel = mIOS.newChannel(urlToOpen, 0, null);
channel.asyncOpen(new StreamListener(), channel);

The key here is the StreamListener object:

var StreamListener = function() {
    return {
        QueryInterface: function(aIID) {
            if (aIID.equals(Components.interfaces.nsIStreamListener) ||
                aIID.equals(Components.interfaces.nsISupportsWeakReference) ||
                aIID.equals(Components.interfaces.nsISupports))
                return this;
            throw Components.results.NS_NOINTERFACE;
        },

        // Called once, when the channel begins delivering data.
        onStartRequest: function(aRequest, aContext)
           { return 0; },

        // Called once, when the transfer finishes (or fails).
        onStopRequest: function(aRequest, aChannel /* aContext */, aStatusCode)
           { return 0; },

        // Called repeatedly as chunks of the page arrive.
        onDataAvailable: function(aRequest, aContext, aStream, aOffset, aCount)
           { return 0; }
    };
};

You have to fill in the details in the onStartRequest, onStopRequest, and onDataAvailable functions, but that should be enough to get you going. You can have a look at how I used this interface in my Firefox extension (it's called IdentFavIcon, and it can be found on the Mozilla add-ons site).
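
For instance, the listener methods could accumulate the page source like this; the sketch uses the standard nsIScriptableInputStream idiom and is not taken from IdentFavIcon itself:

var data = "";

function onStartRequest(aRequest, aContext) {
    data = "";                          // reset the buffer for this request
}

function onDataAvailable(aRequest, aContext, aStream, aOffset, aCount) {
    var sis = Components.classes["@mozilla.org/scriptableinputstream;1"]
        .createInstance(Components.interfaces.nsIScriptableInputStream);
    sis.init(aStream);
    data += sis.read(aCount);           // append this chunk of the page
}

function onStopRequest(aRequest, aContext, aStatusCode) {
    // 'data' now holds the complete HTML source; parse or diff it here
}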

The part I'm uncertain about is how to trigger this page request from time to time; setTimeout() should probably work, though.

Edit:

  1. See the example here (section "Downloading Images") for how to collect the downloaded data into a single variable; and
  2. See this page on how to convert an HTML source into a DOM tree.
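
On point 2: much later Firefox versions also let you parse the collected source directly, once DOMParser gained "text/html" support (not available at the time of writing; the selector below is hypothetical):

var parser = new DOMParser();
var doc = parser.parseFromString(data, "text/html");      // 'data' from the listener above
var links = doc.querySelectorAll("a.question-hyperlink"); // hypothetical selector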

HTH.

David Hanak
Could the resulting "page" then be used as a DOM object?
Gustavo Carreno
I tried to answer this in my edit.
David Hanak