I have a web page with the following content (I've changed the URL in the src attribute for privacy purposes; otherwise the page source is identical):

<HTML>
<BODY>

<script type="text/javascript" src="http://localhost/servlet?publicKey=abcdefg12345678&amp"></script>

</BODY>
</HTML>

The resulting page displays an image when viewed in a browser, and I'm trying to scrape that image. After scraping, I index the images (see www.tineye.com for an example of an image search engine) and store them. If anybody knows how to scrape images from pages like this, please let me know.

Note: the src does not contain ANY information about the image... it only calls the given servlet with a public key as the parameter. What I've posted above is EXACTLY what I see when I click View->Page Source in my browser (Firefox). Of course I've changed the actual URL and the public key for privacy reasons; otherwise everything is identical.

I've seen similar techniques used for some banners: http://coldjava.hypermart.net/servlets/banner.htm

+1  A: 

The JavaScript is probably manipulating the DOM and adding an image. The image path (.jpg, .png or .gif) should therefore appear somewhere inside the JavaScript file, and should look something like this:

var image = new Image();
image.src = "/path/to/image.jpg";

You can use regular expressions to extract the path and filename from the JavaScript code.
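As a rough sketch of that idea, the snippet below pulls quoted strings that end in a common image extension out of a block of JavaScript text. The function name `extractImagePaths` and the sample script are made up for illustration; the real servlet response may build its URL differently (for example by concatenating variables), in which case a pattern like this would not be enough.

```javascript
// Hypothetical helper: pull image-looking paths out of JavaScript source text.
function extractImagePaths(jsSource) {
  // Match quoted strings that end in a common image extension,
  // optionally followed by a query string.
  const pattern = /["']([^"'\s]+\.(?:jpe?g|png|gif)(?:\?[^"'\s]*)?)["']/gi;
  const paths = [];
  for (const match of jsSource.matchAll(pattern)) {
    paths.push(match[1]);
  }
  return paths;
}

// Made-up script text standing in for whatever the servlet returns.
const sampleScript = `
  var image = new Image();
  image.src = "/path/to/image.jpg";
  document.write('<img src="/banners/ad.png?id=7">');
`;

console.log(extractImagePaths(sampleScript));
```

Relative paths found this way still have to be resolved against the address of the page or script before they can be requested.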

Luca Matteis
OK, I updated the post to reflect what's going on. When I'm in Firefox and I press View->Page Source, I'm shown the exact source code as shown above. I had originally modified the URL a bit too much in order to protect some private information, but I've now changed it to look more like the real one. There is nothing else in the page source; the 5 lines that you see above are all I see when I view the page source.
Lirik
Have you tried downloading the HTML file with a download manager (not Firefox) and looking at the source?
svens
@svens I have saved the page locally and viewed the source in Notepad++, and there is nothing different. It's identical to what I see in Firefox too.
Lirik
Use Firebug to inspect the DOM after the image is showing. If it's shown via HTML, you should see it there. Then it's a matter of writing some JS to find that DOM node. (If it's shown via Flash/ActiveX/etc., this approach won't work.)
Frank Schwieterman
@Frank thank you VERY MUCH! After opening up the source code in Firebug I was able to see the JavaScript code and figure out the variables required to get the image! Once I had the right tools, all the other comments and answers made sense! :)
Lirik
+1  A: 

Instead of saving a local copy of the HTML file, you should save a local copy of the JavaScript file to see how exactly it's adding the image to the HTML file's DOM. That should let you figure out how to construct requests to get the images you need.
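One possible way to do this outside the browser is sketched below: find the external script URLs in the page's HTML, then download each one as text so it can be read or searched. The names `findScriptUrls` and `downloadScripts` are hypothetical, the regular expression only handles simple quoted src attributes, and the download step assumes a runtime with a global `fetch` (such as Node.js 18+ or a modern browser).

```javascript
// Hypothetical helper: list the absolute URLs of external scripts in a page.
function findScriptUrls(html, pageUrl) {
  const pattern = /<script\b[^>]*\bsrc\s*=\s*["']([^"']+)["']/gi;
  const urls = [];
  for (const match of html.matchAll(pattern)) {
    // Undo the HTML escaping of "&" before resolving the URL.
    const src = match[1].replace(/&amp;/g, "&");
    // Resolve relative src values against the page's own address.
    urls.push(new URL(src, pageUrl).href);
  }
  return urls;
}

// Download each script as text so it can be inspected offline.
// Assumes a global fetch is available; no error handling is attempted.
async function downloadScripts(html, pageUrl) {
  const sources = [];
  for (const url of findScriptUrls(html, pageUrl)) {
    const response = await fetch(url);
    sources.push({ url, body: await response.text() });
  }
  return sources;
}
```

Calling `downloadScripts(pageHtml, pageUrl)` returns a promise for a list of `{ url, body }` objects; each `body` is the script text that shows how the image request is put together.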

Will McCutchen