What I am trying to accomplish:
- HTTP GET the contents of a site (say google.com); a minimal sketch of this step follows below
- Then have some sort of hook or filter that will catch all resources that this page tries to load (for instance CSS files, JavaScript files, images, iframes, etc.)
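For step one, a plain HTTP GET is easy enough; here is a minimal Python sketch (the requests library is my own assumption, any HTTP client would do). Note that this fetches only the raw HTML and none of the sub-resources, which is exactly the problem:

    import requests

    # Fetch only the raw HTML; no CSS/JS/images are loaded by this.
    response = requests.get("https://google.com", timeout=10)
    html = response.text
    print(html[:200])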
The first thing that comes to mind is to parse the downloaded page and extract every tag that might link to a resource. However, there are many such tags and some are tricky, like an image background declared in CSS, for example:
body {background-image:url('paper.gif');}
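To illustrate why pure parsing gets messy, here is a rough sketch of that approach, using BeautifulSoup for the tags and a regex for CSS url(...) references (both the library choice and the regex are my own assumptions, and the regex will surely miss edge cases):

    import re
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://google.com", timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    urls = set()

    # Tags with a src attribute: img, script, iframe, ...
    for tag in soup.find_all(src=True):
        urls.add(tag["src"])

    # Stylesheets and other <link>'ed resources.
    for tag in soup.find_all("link", href=True):
        urls.add(tag["href"])

    # url(...) references inside inline <style> blocks.
    css_url = re.compile(r"url\(['\"]?([^'\")]+)['\"]?\)")
    for style in soup.find_all("style"):
        urls.update(css_url.findall(style.get_text()))

    print(urls)

Even this leaves out external stylesheets, style attributes, srcset, and anything built by JavaScript, which leads to the next point.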
Also, I need to catch all resources that are loaded via JavaScript, for instance a JS function that generates a URL at runtime and then fetches the resource from it.
For this reason I think some sort of hook, filter, or monitor is what I need; a sketch of what I have in mind follows below.
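A minimal sketch of what I mean, assuming a headless browser driven by Playwright (the tool is my own assumption; any browser automation with request interception should work):

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Fires for every request the page makes: CSS, JS, images,
        # iframes, and XHR/fetch calls generated by scripts.
        page.on("request", lambda req: print(req.resource_type, req.url))

        page.goto("https://google.com", wait_until="networkidle")
        browser.close()

Because the hook sits at the network layer of a real browser, it also catches URLs that JavaScript builds at runtime.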
The programming language does not matter (although something that works on a Unix box would be nice).
UPDATE: This needs to be an automated solution.
Thank you.