For completely non-nefarious purposes (machine learning, specifically), I'd like to download a huge dataset of CAPTCHA images. However, CAPTCHAs are usually implemented with obfuscated JavaScript, which makes getting at the actual images without a browser a non-trivial task, at least for me as a JavaScript novice.

So, can anyone give me some helpful pointers on how to download the image of the obscured word using a script completely outside of a browser? And please don't point me to a dataset of already collected obscured words - I need to collect the images from a specific website for this particular experiment.

Thanks!

Edit: Another way to ask this question is very simple. When you click "view source" on a website with complicated JavaScript, you see the script references, but that's all you see. However, if you click "save webpage as..." (in Firefox) and then view the source of the saved webpage, the JavaScript has been resolved, and the new HTML and the images (at least in the case of ASIRRA and reCAPTCHA) are in the source. How can I mimic this "save webpage as..." behavior using a script? This is an important web coding question in general, so please stop questioning my motives! This is knowledge I can use from now on in all web development involving scripting, and I'm sure other Stack Overflow visitors can as well!

A: 

Why not just get CAPTCHA yourself and generate images? reCAPTCHA's free too. http://www.captcha.net/

Update: I see you want it from a specific site but if you get your own you can tweak it to give the same kind of images as the site you're targeting.

Nick Gotch
I already have my own server and website running reCAPTCHA, but the same problem remains. If I browse to my site, I can see the new obscured words, but if I use a terminal or a script, I can't find the location of the image to automate the download. So it's back to my original question: how do I get the image using a script directly, without a browser?
JoeCool
A: 

Get in contact with the people who run the site and ask for the dataset. If you try to download many images in any suspicious way, you'll end up on their kill list rather quickly, which means that you won't get anything from them anymore.

CAPTCHAs are meant to protect people against abuse and what you do will look like abuse from their point of view.

Aaron Digulla
A: 

While waiting for an answer here I kept digging and eventually figured out a somewhat hacky way of doing what I wanted.

First off, the reason this is a somewhat complicated problem (at least to a JavaScript novice like me) is that the images from ASIRRA are loaded onto the webpage via JavaScript, which is a client-side technology. This is a problem when you download the webpage using something like wget or curl, because those tools don't actually run the JavaScript; they just download the source HTML. Therefore, you don't get the images.
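To make that concrete, here is a minimal Python sketch of what a non-browser tool sees. The page content and the `example.com` script URL are made up for illustration (they are not ASIRRA's real markup), and `TagCollector` and `scan` are hypothetical helper names. It only parses a static HTML string and lists the `<img>` and `<script>` sources it finds:

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the src attributes of <img> and <script> tags in static HTML."""

    def __init__(self):
        super().__init__()
        self.images = []
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "script" and attrs.get("src"):
            self.scripts.append(attrs["src"])


# Hypothetical page source, i.e. what wget or curl would receive
# before any script has had a chance to run.
STATIC_HTML = """
<html><body>
  <div id="captcha"></div>
  <script src="http://example.com/captcha.js"></script>
</body></html>
"""


def scan(html):
    """Return (image sources, script sources) found in the given HTML text."""
    parser = TagCollector()
    parser.feed(html)
    return parser.images, parser.scripts


images, scripts = scan(STATIC_HTML)
print("images:", images)
print("scripts:", scripts)
```

The script reference is there, but there are no image tags to follow, because in this made-up page they would only be added once the script executes inside a browser.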

However, I realized that Firefox's "Save Page As..." did exactly what I needed. It ran the JavaScript, which loaded the images, and then saved everything into the well-known directory structure on my hard drive. That's exactly what I wanted to automate. So... I found a Firefox add-on called "iMacros" and wrote this macro:

VERSION BUILD=6240709 RECORDER=FX
TAB T=1
URL GOTO=http://www.asirra.com/examples/ExampleService.html
SAVEAS TYPE=CPL FOLDER=C:\Cat-Dog\Downloads  FILE=*

Set to loop 10,000 times, it worked perfectly. In fact, since it was always saving to the same folder, duplicate images were overwritten (which is what I wanted).
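The overwrite behavior is just what happens when a file is written to a path that already exists. A small Python sketch of the idea, assuming (as the overwriting suggests) that the same image always arrives under the same file name; the file names, byte contents, and the `save_image` helper are all made up for illustration:

```python
import tempfile
from pathlib import Path


def save_image(folder, filename, data):
    """Write image bytes to folder/filename, replacing any existing file."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / filename
    target.write_bytes(data)
    return target


# Hypothetical downloads from three loop iterations; "cat1.jpg" shows up twice.
downloads = [
    ("cat1.jpg", b"cat-bytes"),
    ("dog7.jpg", b"dog-bytes"),
    ("cat1.jpg", b"cat-bytes"),
]

with tempfile.TemporaryDirectory() as folder:
    for name, data in downloads:
        save_image(folder, name, data)
    # Only one copy per file name survives in the folder.
    unique_files = sorted(p.name for p in Path(folder).iterdir())

print(unique_files)
```

Note that this only deduplicates by file name; two different images that happened to share a name would also overwrite each other.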

JoeCool