views:

59

answers:

1

I am scripting in python for some web automation. I know i can not automate captchas but here is what i want to do:

I want to automate everything i can up to the captcha. When i open the page (usuing urllib2) and parse it to find that it contains a captcha, i want to open the captcha using Tkinter. Now i know that i will have to save the image to my harddrive first, then open it but there is an issue before that. The captcha image that is on screen is not directly in the source anywhere. There is a variable in the source, inside some javascript, that points to another page that has the link to the image, BUT if you load that middle page, the captcha picture for that link changes, so the image associated with that javascript variable is no longer valid. It may be impossible to gather the image using this method, so please enlighten me if you have any ideas on this.

Now if I use firebug to load the page, there is a "GET" that is a direct link to the current Captcha image that i am seeing, and i'm wondering if there is anyway to make python or ullib2 see the "GET"s that are going on when a page is loaded, because if that was possible, this would be simple.

Please let me know if you have any suggestions.

+2  A: 

Of course the captcha's served by a page which will serve a new one each time (if it was repeated, then once it was solved for one fake userid, a spammer could automatically make a million!). I think you need some "screenshot" functionality to capture the image you want -- there is no cross-platform way to invoke such functionality, but each platform (or desktop manager in the case of Linux, BSD, etc) tends to have one. Or, you could automate the browser (e.g. via SeleniumRC) to "screenshot" (e.g. "print to PDF") things at the right time. (I believe what you're seeing in firebug may be misleading you because it is "showing a snapshot"... just at the html source or DOM level rather than at a screen/bitmap level).

Alex Martelli
So ive realised that it may be possible doing it this way: When python loads the url, it only loads the source, and the javascript command to load the challenge page is not executed. So i believe that I can load the challenge page and it will view it as the first load, and thus the image it points to would be valid. It all seems to work except that in the POST request there is something to do with "psig" that my post is missing, and i cannot figure out where it is coming from.
Alex