What I am trying to accomplish:

  1. HTTP GET the contents of a site (say google.com)
  2. Then have some sort of hook or filter that will catch all resources that this page tries to load (for instance the CSS files, all JavaScript files, all images, all iframes, etc)

The first thing that comes to mind is to parse the downloaded page/code and extract all tags that might link to a resource. However, there are a lot of them and some are tricky, like an image background declared in CSS, for example:

body {background-image:url('paper.gif');} 
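Just to illustrate what I mean by parsing, a naive extraction pass might look something like this (a rough Python sketch; BeautifulSoup and a regex for `url()` are just what came to mind, not a requirement):

    # Rough sketch: naive static extraction of resource URLs.
    # Assumes Python with BeautifulSoup; misses anything JavaScript builds at runtime.
    import re
    import urllib.request
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    page_url = "http://google.com"
    html = urllib.request.urlopen(page_url).read().decode("utf-8", errors="replace")
    soup = BeautifulSoup(html, "html.parser")

    resources = set()

    # Tag attributes that commonly point at resources
    for tag, attr in (("img", "src"), ("script", "src"), ("iframe", "src"),
                      ("link", "href"), ("source", "src"), ("embed", "src")):
        for el in soup.find_all(tag):
            if el.get(attr):
                resources.add(urljoin(page_url, el[attr]))

    # url(...) references inside <style> blocks and style="" attributes
    css_text = " ".join(s.get_text() for s in soup.find_all("style"))
    css_text += " " + " ".join(el["style"] for el in soup.find_all(style=True))
    for match in re.findall(r"url\(['\"]?([^'\")]+)['\"]?\)", css_text):
        resources.add(urljoin(page_url, match))

    for url in sorted(resources):
        print(url)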

Also, I need to catch all resources that are intended to be loaded via JavaScript, for instance a JS function that generates a URL and then uses it to load the resource.

For this reason I think having some sort of hook or filter/monitor is what I need.

The programming language does not matter (although something that works on a Unix box would be nice).

UPDATE: This needs to be an automated solution.

Thank you.

+1  A: 

The simplest way to do this would be to write a Fiddler addon.

SLaks
A: 

You can always set up a proxy like Fiddler and look at the traffic; anything apart from the initial call for the page will be the additional resources being requested.
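On a Unix box, mitmproxy is a Fiddler-like alternative. A minimal addon sketch (assuming mitmproxy is installed and the browser is configured to use it as its proxy) would just log each request:

    # log_requests.py -- a minimal mitmproxy addon (sketch).
    # Run with:  mitmdump -s log_requests.py
    # and point the browser at the proxy (default 127.0.0.1:8080).
    from mitmproxy import http

    def request(flow: http.HTTPFlow) -> None:
        # Every resource the page loads -- CSS, JS, images, iframes,
        # XHR/JS-generated requests -- passes through here.
        print(flow.request.pretty_url)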

Oded
Doesn't solve the question of how to automate the fetching of the page, and the interpretation of the JavaScript.
Pekka
Well, the issue is that it's supposed to be an application; it's not that I will run it manually. I need to get the contents of the page and then, having this content, use it. If we are going with the run-it-in-the-browser path, I need something like JS to gather the links.
Alexandru Luchian
+1  A: 

I am assuming you are looking for a fully automated solution.

There are several approaches to parsing the file (in all major scripting languages, wget-based, and others), but none I know of that can actually interpret JavaScript (because that's what this would come down to).

I think the only option you have is to set up a Firefox (or other modern browser) instance on your Unix/Linux box, feed it a URL, and watch/block all outgoing connections it attempts to make. On a client PC, this is the contents of the "Net" tab in Firebug. Whether and to what extent this can be automated without actually rewriting parts of the browser, I don't know. Maybe Selenium RC or one of the other tools from the Selenium suite is a starting point.
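Something along these lines might work (a sketch using the modern Selenium WebDriver Python bindings rather than Selenium RC; it assumes geckodriver is installed and a capturing proxy such as the mitmproxy addon above is already listening on 127.0.0.1:8080):

    # Sketch: drive Firefox through a local capturing proxy with Selenium.
    import time
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    options.set_preference("network.proxy.type", 1)            # manual proxy config
    options.set_preference("network.proxy.http", "127.0.0.1")
    options.set_preference("network.proxy.http_port", 8080)
    options.set_preference("network.proxy.ssl", "127.0.0.1")
    options.set_preference("network.proxy.ssl_port", 8080)

    driver = webdriver.Firefox(options=options)
    try:
        driver.get("http://google.com")
        time.sleep(10)  # crude wait so JS-triggered requests have time to fire
    finally:
        driver.quit()
    # The list of fetched resources is whatever the proxy logged while the page was open.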

Pekka
I am not a big expert in this, but maybe it's better to use the Gecko or WebKit libraries/engines for this purpose?
Alexandru Luchian
@Heavy Bytes, probably yes (my own knowledge of browser internals doesn't go that deep). However, I'm pretty sure the JavaScript engine is a separate part from the rendering engine, which may cause problems when building an application.
Pekka