Have any programming methods been used to defeat reCAPTCHA?

I'm interested in seeing evidence and potentially demonstrations that reCAPTCHA in particular has been made obsolete by completely automated, humanless methods.

To clarify, I'm not looking for reCAPTCHA-cheating solutions that involve humans in any way, whether teams in India/China, porn-seekers, or Mechanical Turk.

I'm also not looking for alternatives to reCAPTCHA, like picking the type of animal, or background fields or JavaScript trickery.

+12  A: 

The weakness of CAPTCHA systems is that people set up rooms full of people in China whose only job is to look at a CAPTCHA image and type in the result, which plugs into the automated system that's actually doing the spamming.

Not much you can do about that really.

It's also far cheaper than trying to do image recognition, OCR, etc. on the actual image (a human-typed response may cost under $0.01).

cletus
Or even better, they grab the captcha off your site, and show it to some wanker (literally) as a requirement to showing them some porn.
Paul Tomblin
Man... that's clever (credit where credit is due).
cletus
Not only that, but Amazon Web Services can enable such things: http://aws.amazon.com/mturk/
Jason S
Note that this doesn't make it an ineffective tool. It merely means that if your site is popular enough then this might happen. For the other 99.99% of the websites in the world, a simple captcha will do.
Robert P
Hell, CodingHorror's captcha doesn't even change, nor is it obfuscated, and it manages to do the job all right!
Robert P
@Robert, I wonder how that word was chosen.
BoltBait
Servers can and will ban your IP after too many account registrations. So a good sized and/or growing botnet is needed as well.
Zombies
@Paul: Spam and the like is pure evil, but that solution is so remarkably cool...if only they could be turned to the power of good!
Beska
@Paul, that is hilariously brilliant.
unforgiven3
Actually, that's not entirely true. Although there *are* examples of this, it is *FAR* cheaper to OCR-crack a CAPTCHA. Using sweatshops is usually *NOT* economically feasible for the spammers.
Jens Roland
+13  A: 

My favorite captcha is from Microsoft: http://research.microsoft.com/en-us/um/redmond/projects/asirra/

Asirra (Animal Species Image Recognition for Restricting Access) is a HIP that works by asking users to identify photographs of cats and dogs. This task is difficult for computers, but our user studies have shown that people can accomplish it quickly and accurately. Many even think it's fun!

It is a free service and they have example code to get you started.

I wonder how long it will be before it is cracked.

BoltBait
Unfortunately cletus's answer above shows how such a service will be ineffectual in the greater fight against spam.
Erik Forbes
I failed that one 2 out of 4 times; a badly lit picture of a Pomeranian can look like a cat :(
Tom Anderson
I took the test and it feels good to know that I am a human. :)
BoltBait
Actually the best captcha used to be HotCaptcha, but it was offline the last time I checked. Based on HotOrNot.com, it wasn't horribly effective, but VERY popular with the users :-)
AviD
The issue here is that it would be very easy to brute force due to a small key space. If you start adding more objects to name, then you get into ambiguity in naming (for example, is it a Kangaroo, a Joey, or a Baby Kangaroo?). You would need to make sure you had a one-to-many relation between the objects to be named and their possible names.
Oorang
+7  A: 

Before giving in to the pressure of using captcha, consider creative workarounds such as having a field labeled "Your Comments" that is hidden by CSS. If the field is filled in, the request is dropped by the server. Most bots will fall for it, even if there is still no good way to defeat the room full of underpaid laborers, which captcha does not help with anyways.
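A minimal sketch of the server-side check for such a hidden field, assuming the submitted form arrives as a plain mapping of field names to strings. The field name `your_comments` and both function names are hypothetical, chosen only for illustration.

```python
from typing import Mapping

# Hypothetical name of the CSS-hidden field; humans never see it, so they leave it empty.
HONEYPOT_FIELD = "your_comments"


def is_probable_bot(form: Mapping[str, str]) -> bool:
    """Return True when the hidden honeypot field came back non-empty."""
    return bool(form.get(HONEYPOT_FIELD, "").strip())


def handle_submission(form: Mapping[str, str]) -> str:
    """Drop the request when the honeypot is filled in, otherwise accept it."""
    if is_probable_bot(form):
        return "dropped"
    return "accepted"
```

A script that knows about the field can simply leave it empty, so this only filters bots that blindly fill in every input.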

UPDATE: Just read a case study where removing CAPTCHA increased conversion rates by almost 10%. That would indicate to me that it is rather broken if you are losing 10% of your leads just to filter out bots. Imagine what 10% means to most businesses.

DavGarcia
This is very smart but doesn't work if you're sufficiently popular. Yahoo or Google, for example, could never use this.
dreeves
The question here is whether your site is valuable enough to attack specifically. Most aren't, and having little idiosyncrasies will do some good.
David Thornley
I would +1 for the update re 10% loss - VERY important point. (but I can't +1 cuz of the hidden field suggestion - this is less than useless.)
AviD
Why is it less than useless?
metanaito
There are 2 problems: "targeted attack" and "random spam". Your solution might save your ass for random spam, but a targeted attack will flood your system within a day.
dr. evil
@webdtc, it's as Slough said. Useless because it's absolutely trivial for a script to get around this, less than that because of the false sense of security.
AviD
@dreeves: didn't Google just acquire reCAPTCHA?
pwee167
+4  A: 

There was a talk at Defcon last year that went into the problems with CAPTCHAs in general. One of the things they did was run multiple free OCR engines and have them vote on the best words. Doing this, they were able to achieve a somewhat decent chance of succeeding. For one kind of CAPTCHA it was 40% or so, though I don't think it was reCaptcha.
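The voting idea can be sketched roughly as below. This assumes plain majority voting over the strings each engine returned; the talk's actual scheme isn't described here, and `vote_on_word` and `min_votes` are hypothetical names. No real OCR library is called; the engine outputs are just passed in as strings.

```python
from collections import Counter
from typing import Optional, Sequence


def vote_on_word(candidates: Sequence[str], min_votes: int = 2) -> Optional[str]:
    """Return the most common reading if at least `min_votes` engines agree on it."""
    # Normalize case and whitespace so "Morning" and "morning " count as one reading.
    normalized = [c.strip().lower() for c in candidates if c.strip()]
    if not normalized:
        return None
    word, votes = Counter(normalized).most_common(1)[0]
    return word if votes >= min_votes else None
```

Returning `None` when no two engines agree lets the caller skip that image and request a fresh one instead of submitting a likely wrong guess.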

FryGuy
That's an important point: a spam bot doesn't have to break all CAPTCHAs. 1% would do if it can keep trying.
Martin Beckett
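To put a number on the comment above, here is a small sketch, assuming every attempt is independent and nothing (rate limiting, IP bans) stops the bot from retrying. The function name is hypothetical.

```python
def chance_of_success(per_try_rate: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    # The chance that every try fails is (1 - p) ** n; take the complement.
    return 1.0 - (1.0 - per_try_rate) ** attempts
```

With a 1% solve rate, 100 attempts already give roughly a 63% chance of at least one success, and 500 attempts give over 99%.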
+34  A: 

I notice that almost all the answers here relate to the ineffectiveness of the concept of CAPTCHA, in principle - and while I very much agree with them, in fact gave a talk at OWASP a few months ago explaining just that - the question is very specific, so I will provide a demonstration.
But first, I will reiterate that, demonstration aside, you should re-read the other comments, since it's true that CAPTCHA is pointless and unhelpful, irrespective of implementation....

But really, check out CAPTCHA Killer. You can upload a CAPTCHA image, and it will automatically, if not immediately, provide the OCR'd answer. It also provides an API (REST, I think, but maybe also SOAP). I personally tried numerous reCAPTCHA images, and they were actually some of the easiest (or at least quickest) to break.

And yeah, OCR is not the best way to break a CAPTCHA protected site - there are many other better ways.

AviD
I wonder how captcha killer works. Somehow it looks to me like it's using cheap labour and making money with the advertisement on the website. (And merchandising.)
Georg
I'm pretty sure it's OCR, but I could be wrong.
AviD
Useful answer about captchas in general, but the question was about reCAPTCHA specifically.
Mike Knowles
Just tried Captcha Killer with three reCAPTCHAs. All three expired without returning an answer.
lfaraone
@Mike, reCAPTCHA is not necessarily MORE broken than CAPTCHAs in general, but all that of course applies to it too... Also, as I mentioned, reCAPTCHA images were the quickest broken. @lfaraone, I find that odd; it's worked fine for me before, and as I've said, specifically reCAPTCHA images were the quickest broken... Though I haven't done it in quite a while, I'm going to check it out again.
AviD
+1  A: 

It seems at least that very few have had issues with reCAPTCHA or else they would have posted them. Have you tried asking reCAPTCHA directly (if you're not them ;)?

I've seen that there is a "scientific" article about reCAPTCHA, maybe you want to check that out:

reCAPTCHA: Human-Based Character Recognition via Web Security Measures, by Luis von Ahn, Benjamin Maurer, Colin McMillen, David Abraham, Manuel Blum

tharkun
+2  A: 

Not only has it been defeated, but a useful application has also been successfully built on top of it, becoming the most amazing tool to defeat all kinds of free-account protections on a long list of direct download sites (not only megaupload and rapidshare).

Jdownloader is open source and written in Java, so a peek at the source code can answer not only whether it is broken but also how.

Edit: Most direct download sites do not use reCaptcha, but a simpler Captcha method (3 capital letters in different colors). Nonetheless, Jdownloader and Cryptload (a program similar to Jdownloader) are the only working implementations I know of that have effectively broken a Captcha method. I have not heard of any implementation to crack reCaptcha.

Update: It seems that at least one implementation of reCaptcha (not whole reCaptcha itself) has been cracked too.

Fernando Miguélez
Do you know which of those filehosters use reCAPTCHA? Rapidshare and megaupload don't.
dr. evil
+1  A: 

AFAIK, in practice there is no tool to crack the reCAPTCHA implementation; however, I assume someone will eventually get it.

Funnily enough, if someone manages to do it then the whole reCAPTCHA project is pointless, because reCAPTCHA is designed to digitize books, which can't be done in an automated way.

BTW :

The weakness of CAPTCHA systems is that people set up rooms full of people in China whose only job is to look at a CAPTCHA image and type in the result, which plugs into the automated system that's actually doing the spamming.

You can't secure a system thinking like that. This is like saying "your web application is not secure enough if your host is not in an old military bunker, because now people can steal your machine".

dr. evil
Your sentiment is spot on, but the application of it is misplaced: The thinking (of the comment you quoted) is that CAPTCHA *does not solve the problem it intends to*. Or as I often say "CAPTCHA (in general) is a bad solution to the wrong problem." The problem CAPTCHA tries to solve (by definition) is: How do I know that the user is a person, not a computer? Whether or not CAPTCHA solves this (it doesn't), the REAL problem is: How can I prevent mass flooding of my service? CAPTCHA farms and proxies show the exact difference. It's why any security solution should start with the threats.
AviD
You're right, it all comes down to "Why are you using CAPTCHA?". For some systems it's just enough security; for others it's not even close. But just as key size in crypto protects something by making brute forcing take years (eventually they are going to crack it, but not in this lifetime or not in the next 10 years), CAPTCHA can provide enough security for some systems in the very same way. So, as you said, it all comes down to what you are using CAPTCHA for.
dr. evil
+8  A: 

reCAPTCHA isn't broken and it won't be for a very long time. The thing is, if you implement your own captcha and it gets broken, it will probably take a long time to fix it.

This is taken from the page about reCAPTCHA security:

reCAPTCHA is a Web service. That means that all the images are generated and graded by our servers. (…) this also provides an extra level of protection: our CAPTCHAs can be automatically updated whenever a security vulnerability is found.

For example, if somebody writes a program that can read our distorted images, we can add more distortions in very little time, and without Web masters having to change anything on their side.

I believe that, as they specialize in captchas, they have improved versions stored, ready to be deployed in little time if needed. (Why should they create stronger security when the weaker isn't broken yet?)

Georg
+3  A: 

Unlike regular CAPTCHAs, reCAPTCHA images are ones that two independent OCR applications found impossible to read. So it will only be cracked if at some point spammers develop OCR software far superior to the programs used by Carnegie Mellon, which I find highly unlikely.

Furthermore, as it is a webservice, they can react to any attack in real time without need of any intervention by reCAPTCHA users.

vartec
+11  A: 

You might be interested in this detailed report on how 4chan defeated reCAPTCHA, and used it to manipulate Time.com's annual TIME 100 Poll results.

Hacking Recaptcha (aka ‘The Penis Flood’)

The next tactic used was to see if they could find a flaw in the reCAPTCHA implementation. One thing they discovered about reCAPTCHA was that it always presents two words to a user for decoding - one word is a control word known by the reCAPTCHA system, while the other is an unknown word (reCAPTCHA uses the humans to help correct OCR errors).

Wikipedia describes the process: “Scanned text is subjected to analysis by two different optical character recognition programs; in cases where the programs disagree, the questionable word is converted into a CAPTCHA. The word is displayed along with a control word already known and is labeled by the human. Those words that are consistently given a single label by human judges are recycled as control words”.

What Anonymous realized was that if they always labeled the unknown scanned text with the same word - and if they did this thousands and thousands of times eventually a large percentage of the unknown words would be mislabeled with their word. All they had to do was look at the two words in the captcha, enter the proper label for the ‘easy’ one (presumably that would be the one that the two optical scanners would agree upon) and enter the word “penis” for the hard one. If they did this often enough, then soon a significant percentage of the images would be labeled as ‘penis’ and the ability to autovote would be restored (one side effect, that was not lost on Anonymous, was the notion that for years to come there would be a number of digital books with the word ‘penis’ randomly inserted throughout the text).

Update: I asked Ben Maurer, chief engineer of reCAPTCHA about this ‘penis flood’ attack, Ben says that they’ve anticipated this type of attack and they have numerous protections that will keep the penises from penetrating the reCAPTCHA barrier.

Optimizing reCAPTCHA

As appealing as the notion of sprinkling the word ‘penis’ into texts, the Anonymous team knew that the clock was ticking, and if they were going to restore the Message they didn’t have time to wait for the autovoters to come back online - they were going to have to vote manually, many, many times. And so they needed to be able to enter captcha’s as fast as they could. They developed a set of guidelines that allowed them to quickly decide which reCAPTCHA words they could skip. For example:

You will be given 2 words: 1 real, 1 fake.

For [REAL FAKE] or [FAKE REAL], you can just type in REAL and it should be accepted.

If it’s [LOOKSREAL LOOKSREAL] or [LOOKSFAKE LOOKSFAKE], it’s usually just quicker to just type in both words. Don’t waste precious time deciding which one of them is real.

Use both the appearance and the type of word to identify a fake word. Don’t rely on just one of them.

The whole ruleset is here: fake captcha.

Mathias Bynens
" Ben says that they’ve anticipated this type of attack and they have numerous protections that will keep the penises from penetrating the reCAPTCHA barrier." - That's my vote for the best line seen on SO ever.
T.E.D.
But is not the point of that story that they did not break reCAPTCHA? They instead succeeded by streamlining the manual voting process to allow determined volunteers to vote thousands of times each.
pdc
@pdc, just because they didn't OCR the images (though this could also have been done) doesn't mean they didn't break reCAPTCHA. Think about it like this: Is the purpose of reCAPTCHA to present undecipherable images? Or is it to prevent automated flooding? If it's the first, you might be able to argue that it was not broken (arguable, but I would not agree with you), but if it's the second, then you have empirical proof that reCAPTCHA does not work. I also think it should be quite clear that, aside from entertainment value, the SECOND purpose is the real one, and the only one that counts.
AviD
+3  A: 

reCAPTCHA has not been defeated. If it had been, then why did Google just buy it and announce they will be applying the technology within Google to increase fraud and spam protection for Google products?

From "Google Acquires reCAPTCHA", posted to the Google Blog on 9/16/09:

In this way, reCAPTCHA’s unique technology improves the process that converts scanned images into plain text, known as Optical Character Recognition (OCR). This technology also powers large scale text scanning projects like Google Books and Google News Archive Search. Having the text version of documents is important because plain text can be searched, easily rendered on mobile devices and displayed to visually impaired users. So we'll be applying the technology within Google not only to increase fraud and spam protection for Google products but also to improve our books and newspaper scanning process.

Mike Knowles
+3  A: 

Yes. See my paper - http://bitland.net/captcha.pdf

jwilkins
A: 

There are lots of methods that are used to crack reCAPTCHA. While it's hard to use neural-network-enabled programs to solve them automatically, it's possible to grab the image and have Amazon's Mechanical Turk or some equivalent service solve them.

http://codemagician.wordpress.com/2010/01/22/solving-recaptcha/

redstick
A: 

I'm seeing blog comments on a system protected by reCAPTCHA where the page loads and 1 second later the post was made successfully. The User-Agent was nonsense (in this particular case it claimed to be running Ubuntu 9.25/Firefox 3.8), the referrer was from a completely unrelated site with no link to us.

This is clearly automated.
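One way to flag posts like this from the server side is a simple timing check. This is only a sketch under assumptions: the server records when it rendered the form, both timestamps are seconds on the same clock, and the 3-second threshold and all names are hypothetical.

```python
# Hypothetical threshold: fewer seconds than a person plausibly needs to type a comment.
MIN_SECONDS_TO_FILL_FORM = 3.0


def looks_automated(rendered_at: float, submitted_at: float,
                    min_seconds: float = MIN_SECONDS_TO_FILL_FORM) -> bool:
    """Flag posts that arrive sooner after page render than a human could type them."""
    elapsed = submitted_at - rendered_at
    # A negative elapsed time (tampered or mismatched timestamps) is also flagged.
    return elapsed < min_seconds
```

A check like this only flags the symptom described above; it says nothing about how the CAPTCHA itself was passed.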

Benjamin Franz
A: 

Just read an article about this on Slashdot.

Looks like this guy downloaded thousands of CAPTCHAS off Facebook and used them to buy loads of tickets.

jonescb
A: 

Seems like some guy at TBN got it working: http://thebotnet.com/programming/35077-php-source-code-antirecaptcha/

davinci
A: 

The Free OCR site does not even require registration. It probably cannot replace the commercial ones if high quality is needed.

Skimmilk
A: 

On reCAPTCHA, the following research paper might be interesting: http://pdfcast.org/pdf/security-risks-for-online-services-by-relying-on-recaptcha

daVinci