I'm trying to find all instances of an advert on a website. The advert is in an iframe which is loaded by JavaScript (it doesn't appear at all if JavaScript is turned off). Detecting the advert itself is extremely simple: both the name of the Flash file and the target of the href always contain a certain string.

What would be the best "starting point" for achieving this? At the moment I'm considering an Adobe AIR app, which could crawl the site and examine the DOM to find the ad, and would run javascript and load the content of the iframe. The other option I can think of is using Firefox as the platform (using maybe GreaseMonkey or Selenium? I don't really know how to leverage Firefox like this).

Does anyone know of anything suitable to build this, or have any suggestions on using Firefox to do it?


Extra details:

Being CPU intensive isn't really an issue, nor is anything depending on a browser being open. This doesn't need to run on a headless server, it will be running on a powerful desktop box. OS is also not an issue. It would be advantageous if the crawler loaded each page multiple times, as the advert is in rotation. While the crawler does need to execute the javascript and load the content of the iframe, it does not need to be able to display flash files.

+1  A: 

If the ad is only displayed when JavaScript is enabled, you are going to have a problem, as most crawlers don't execute JavaScript and so won't be able to read the web page in that manner.

Is there something in the JavaScript code itself that could be a tip-off to where the ad is displayed? If so, maybe you can check that.

I've tried similar stuff before, and I used BeautifulSoup in Python, and it worked really well.

GSto
Unfortunately the javascript doesn't seem to provide any clues - it's like advert=348uyhy283tg4h8237. Hence my idea to script something in-browser.
ZoFreX
P.S. beautiful soup looks awesome. I've heard it mentioned before but never looked into it. I don't think it will help in this particular instance but I'm sure you've potentially saved me hours in future projects! Once I learn Python of course...
ZoFreX
+1  A: 

I don't think you want a crawler. You are going to run it on a single page and don't want it to follow links around the internet, right?

If so, you want to find something on the page with JavaScript turned on. Then you just have to use JavaScript.

You'll need:

  1. the site :)
  2. the correct rights to access its content - use Greasemonkey for FF or user scripts in Opera
  3. code similar to this jQuery sample:

finding stuff in iframes:

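$('iframe').each(function () {
  // contents() only works once the iframe has loaded
  $(this).contents().find('object').each(function () {
    // attr() returns undefined when the attribute is missing
    var name = $(this).attr('name') || '';
    if (name.match(/regex/)) {
      $(this).remove(); // or do whatever you want
    }
  });
});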

caution: accessing iframe contents differs between browsers and depends on when you run the script (the iframe has to have finished loading first)
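The question says both the Flash file name and the href target contain the string, so it may be worth checking both rather than only the name attribute. Here is a rough sketch in plain JavaScript without jQuery. The function names and the marker value are made up for illustration, and it assumes the Flash file shows up in an object's data attribute or an embed's src attribute, which may not match the real ad markup.

```javascript
// Returns true when the Flash file name or the link target contains the marker.
// Missing attributes (null or undefined) are treated as empty strings.
function containsAdMarker(flashFile, linkHref, marker) {
  const file = flashFile || '';
  const href = linkHref || '';
  return file.includes(marker) || href.includes(marker);
}

// Collects matching elements from every iframe in the given document.
// Assumption: the Flash file is referenced via <object data> or <embed src>.
function collectAdHits(doc, marker) {
  const hits = [];
  for (const frame of doc.querySelectorAll('iframe')) {
    // contentDocument is null when the frame is cross-origin
    const inner = frame.contentDocument;
    if (!inner) continue;
    for (const el of inner.querySelectorAll('object, embed, a')) {
      const file = el.getAttribute('data') || el.getAttribute('src');
      const href = el.getAttribute('href');
      if (containsAdMarker(file, href, marker)) hits.push(el);
    }
  }
  return hits;
}
```

In a user script you would call something like collectAdHits(document, 'your-string') after the page and its iframes have loaded. If the iframe comes from another domain, the script would have to run inside the iframe's own page instead.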

naugtur
This is halfway there but I do need it to crawl pages, albeit only on one website. Would that be possible with greasemonkey or user scripts?
ZoFreX
In this case the browser is used, so if you open a page in the browser the script does what it's supposed to. It doesn't if you don't open the page. If you are really interested in crawling the pages automatically, you might accomplish it with JavaScript via window.location.href, but when you leave a page the script's state is not saved. I don't know what you want to find the ads for - it's hard to guess ;)
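One way around losing state on navigation is to keep the list of pages still to visit in storage that survives a page load. A rough sketch, assuming all the pages are on the same origin so that window.sessionStorage is shared between them. The function name, the storage key and the URLs are made up for illustration. Each URL is queued several times because the advert is in rotation.

```javascript
// Returns the next URL to visit, or null when the queue is exhausted.
// `storage` is any object with getItem/setItem, e.g. window.sessionStorage.
function nextUrl(storage, seedUrls, visitsPerPage) {
  const KEY = 'adCrawlQueue'; // hypothetical storage key
  const raw = storage.getItem(KEY);
  let queue;
  if (raw) {
    // A previous page load already created the queue.
    queue = JSON.parse(raw);
  } else {
    // First run: queue every URL visitsPerPage times.
    queue = [];
    for (const url of seedUrls) {
      for (let i = 0; i < visitsPerPage; i++) queue.push(url);
    }
  }
  const next = queue.shift(); // undefined when nothing is left
  storage.setItem(KEY, JSON.stringify(queue));
  return next === undefined ? null : next;
}
```

The user script would run its ad check on each page, then do something like `var url = nextUrl(window.sessionStorage, ['/page1', '/page2'], 3); if (url) window.location.href = url;`. Any results found would need to be stored the same way, or sent somewhere, before navigating away.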
naugtur
+2  A: 

An alternative to using a "browser as a crawler" is HtmlUnit. As its page says:

HtmlUnit is a "GUI-Less browser for Java programs". It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc... just like you do in your "normal" browser.

It has fairly good JavaScript support (which is constantly improving) and is able to work even with quite complex AJAX libraries, simulating either Firefox or Internet Explorer depending on the configuration you want to use.

sw
This looks awesome. Pretty sure I can get it to do what I need. Thanks!
ZoFreX