CAPTCHAs that ask users to read distorted text are fine for sighted people, but a terrible barrier for those who are blind or have other disabilities. Audio alternatives are occasionally available but still don't help those who are both deaf and blind and can be hard to use with a screenreader (which is already reading words to you).

There exist a couple of solutions that use humans to solve the CAPTCHA on behalf of the user, such as WebVisium and Solona, but these rely on the availability of volunteer operators (for example, Solona apparently has just one volunteer so you have to hope he is awake when you want help).

It occurs to me that the volume of CAPTCHA solutions needed by blind people is very low - I'd guess fewer than a few hundred per day in a populous country like the UK. This means that, unlike the bad folks who want to perform an action many times in a short period, a CAPTCHA assistance service for blind people could afford to devote considerable computational resource - for example, a cloud of computers in Amazon EC2 - to identifying the presented text.

My question is this: assuming you don't care about speed very much, and you have lots of computers available, are there algorithms that let you solve the text-distortion CAPTCHAs that are common today, such as those used by reCaptcha? Or are these problems really intractable even with lots of resource and time?

A few notes:

  1. At this point, my question is just theoretical, but clearly any such service would have to carefully control access to keep spammers out. Perhaps only registered blind people would be allowed to use it.

  2. I am aware that an old Yahoo CAPTCHA was broken a few years ago using an algorithm that runs in seconds on a single computer. I am asking whether modern CAPTCHAs can be broken, perhaps more slowly and with more resource.

  3. I am aware that some new CAPTCHA types are appearing, which ask users to identify kittens or orient a picture. These aren't widespread yet, so I'm just asking about text-distortion for now.

+3  A: 

Basically solving a text distortion CAPTCHA consists of three individual steps:

  1. Find out where the interesting parts are
  2. Segment the text into individual letters
  3. Recognize the letters

The only problem that's left which is pretty hard for computers is the second one. The first usually isn't very hard, unless you happen to stumble upon the CAPTCHA from hell. And the third gets solved by computers with a much better success rate than by humans.
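
To make the segmentation step concrete, here is a minimal sketch of the simplest approach I can think of: cutting the image wherever a column contains no ink. The function name `split_columns` and the toy bitmaps are invented for illustration, and the image is assumed to be already binarised. It is not how any particular CAPTCHA breaker works, just a way to see where the difficulty comes from.

```python
def split_columns(bitmap):
    """Split a binary image (list of equal-length strings, '#' = ink)
    into (start, end) column ranges separated by ink-free columns."""
    width = len(bitmap[0])
    # A column "has ink" if any row has a '#' at that x position.
    has_ink = [any(row[x] == "#" for row in bitmap) for x in range(width)]
    segments, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                      # a new segment begins
        elif not ink and start is not None:
            segments.append((start, x))    # segment ends at an empty column
            start = None
    if start is not None:
        segments.append((start, width))
    return segments


# Two glyphs with a clear gap between them.
separated = [
    "#.#..##",
    "###..#.",
    "#.#..##",
]
# The same two glyphs pushed together so that they touch.
touching = [
    "#.#.##",
    "#####.",
    "#.#.##",
]

print(split_columns(separated))  # two segments
print(split_columns(touching))   # one merged segment
```

As soon as the letters touch or overlap, the empty columns disappear and this kind of cut returns a single blob, which is exactly the situation that distorted CAPTCHAs are designed to create.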

An interesting site for learning how CAPTCHAs are broken is the one by the OCR Research Team.

Joey
Thanks Johannes. Are there strategies for attacking problem 2 using multiple computers, perhaps not terribly quickly?
Douglas Squirrel
After some more web searching, it appears that Johannes is right that his problem 2 (known as "segmentation" - see http://en.wikipedia.org/wiki/Segmentation_(image_processing)) is indeed the hard part of this problem. It would be nice to understand better why segmentation is so difficult and (it seems) resists parallelisation, but this appears to be the most accurate answer of those I got. Thanks Johannes!
Douglas Squirrel
Thank you. I'm not terribly sure parallelizing will gain you much regarding segmentation. First of all, the images are usually small and don't contain too many potential regions. Secondly, while proper segmentation will theoretically lead to better character recognition, you have no sure way of knowing which segmentation is the proper one. Especially with warped and intersecting letters, you can "recognize" different letters depending on the segmentation, with no clear winner as to which reading is correct. Also, you rarely get more than one try for a single CAPTCHA, so trying multiple possible solutions often doesn't work.
Joey
A: 

Hey Douglas,

The introduction of CAPTCHA has certainly made the web less accessible to the visually impaired, and I agree with you in citing this as a significant problem that deserves more attention and concern. However, while CAPTCHA can be and has been inconsistently bypassed on popular web sites, I don't think this is a viable long-term solution for those in need. Indeed, the day that the CAPTCHA variants currently present on sites like Facebook, Google, MySpace etc. can be reliably and consistently broken is the day they will become obsolete and abandoned for either stronger variants of the same or an entirely new solution (as you implied, distinguishing cats from dogs in pictures has been a popular alternative trend).

When it comes to online accessibility, what I think those with disabilities need most right now is advocacy. The more people contact software companies, open source groups, and standards bodies and speak out about this need, the more awareness will be raised and that will (hopefully) lead to more action on behalf of the development community. Ultimately, it would be great to see sites like Google or Facebook offering alternative access methods just for their visually impaired users.

Idealism aside, I think it is productive to pursue other avenues like you mentioned with the CAPTCHA volunteer network, possibly even the development of something like OpenID for those with relevant disabilities as a universal form validation pass.

As for the technical aspect of your question, I don't think the availability of additional processing power alone will allow you to reliably and consistently break CAPTCHA. There is A LOT of money in spam, and you can be sure that shady SEO companies and spammers alike have a great number of servers at their disposal. As Johannes Rössel mentioned, if you want to learn more about how this is done and where the technical difficulty lies, research Optical Character Recognition (OCR) and look at the wide variety of number/letter skewing that occurs on high-traffic sites.

Mark Hammonds
Thanks Mark. The difference between this situation and that of a spammer or shady SEO person is that they have to break the CAPTCHA many times in a very short period, whereas a blind person (or a service working on their behalf) has a much lower volume and speed requirement. So if there were a relatively expensive algorithm (where expense is measured in time or CPUs or both) for solving a particular type of CAPTCHA, that wouldn't necessarily mean that big sites would abandon that type of CAPTCHA, nor could we necessarily conclude that shady users would already be using it.
Douglas Squirrel
I understand and agree with your logic on that, but my point above was that CPU power isn't the issue. The issue is OCR, which currently isn't advanced enough to accurately and consistently identify the more complex CAPTCHA variants. Unless OCR is robust enough to automatically identify purposely distorted characters the way the human brain is able to, the only alternative is to take educated guesses at brute-forcing a solution, which I believe is likely to fail not because of available CPU power but because of other network-related issues/constraints.
Mark Hammonds
Have you considered other methods of providing access for visually impaired users that involve working with service providers to bypass, rather than defeat, CAPTCHA?
Mark Hammonds
@Mark, thanks. Can you provide any evidence for the statement that OCR is not advanced enough to break CAPTCHAs? The answer I added below suggests that in at least some cases, multiple OCRs can co-operate to achieve this, but it would be helpful to learn why this is wrong. Also, of course it would be best if service providers allowed a bypass mechanism, but as blind users are such a small minority of users, it is nearly impossible to get their attention (witness the large number - 91% by some reports - of sites that are in part or in total unusable with screenreaders).
Douglas Squirrel
A: 

This related SO question has a number of good ideas in it, including a DEFCON talk that claims using multiple OCRs and voting breaks many simple CAPTCHAs. This suggests a candidate solution method: distribute the problem over several servers, each of which runs one or more OCR tools in parallel, collect the results, and take the most popular answer. Comments welcome.
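
To illustrate the collect-and-vote step, here is a minimal sketch. It assumes each OCR tool has already returned its reading as a plain string; the function name `vote` and the sample readings are made up, and running the OCR tools themselves is outside the sketch.

```python
from collections import Counter


def vote(candidates):
    """Return (answer, share) for the most common non-empty reading.

    `candidates` holds one string per OCR engine; empty strings stand
    for engines that failed to produce anything.
    """
    cleaned = [c.strip() for c in candidates if c and c.strip()]
    if not cleaned:
        return None, 0.0
    answer, count = Counter(cleaned).most_common(1)[0]
    return answer, count / len(cleaned)


# Hypothetical readings from five engines for the same image.
readings = ["morning", "rnorning", "morning", "", "morning"]
answer, share = vote(readings)
print(answer, share)
```

The share of agreeing engines could serve as a rough confidence value: since you usually get only one attempt per CAPTCHA, a service might decline to submit a low-agreement answer and request a fresh image instead.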

Douglas Squirrel
+1  A: 

My answer to your question "are these problems really intractable even with lots of resource and time?" is to point out that this is the very reason that CAPTCHAs work.

My understanding is that the purpose of a CAPTCHA is to prove that you are human rather than a spam bot. reCAPTCHAs are a novel take on this theme because they use images of text that could not be resolved by OCR (optical character recognition) engines. The difference between a person and a machine in this instance is that specialized algorithms have tried to interpret the image and failed, while a "normal" person has the intrinsic ability to interpret the text in a consistently human way. That being said, in the future we hope that someone will come up with better OCR engines so that less human intervention is needed in digitizing the world's information. We hope that someone will come up with a tractable solution to this particular problem.

From your point of view of trying to make CAPTCHAs more accessible to blind people -- who still need to prove that they're people rather than spam bots -- the community needs to become aware of this issue and find a way to identify people in a less vision-centric way.

pgwillia
You say that the images "cannot be resolved by OCR...engines". Do you have evidence for this? Simply saying that they are supposed to be uncrackable doesn't mean they are. (And they could still be effective at stopping malicious use even if they are crackable with a sufficiently inefficient algorithm. See comment on Mark Hammonds' answer.)
Douglas Squirrel
You might be interested in "reCAPTCHA: Human-Based Character Recognition via Web Security Measures" (http://recaptcha.net/reCAPTCHA_Science.pdf). One conclusion of the authors is "Thus, any program that can recognize these words with nonnegligible probability would represent an improvement over state of the art OCR programs." reCAPTCHAs are text images scanned by the Internet Archive (http://www.archive.org/index.php) and now Google Books (http://googleblog.blogspot.com/2009/09/teaching-computers-to-read-google.html) which represent the 20% of words that the OCR used can't resolve.
pgwillia
+1  A: 

CAPTCHA was created to prevent machines from reading the words. It's meant to be read by humans only. Making it more readable for blind/deaf people adds a risk that machines will be able to read it too, thus nullifying its effect.

Spammers did find a very effective way to break the more popular CAPTCHAs, though. They just hire cheap labourers to read them, in return for a few cents per working account. As a result, there's a small industry around breaking CAPTCHAs to create millions of accounts that can then be used to send more spam. Compared to the amount gained by the spammers, the cost is almost nothing. A similar solution could be used by blind/deaf people, who would send the CAPTCHA image to some cheap labourer in China or wherever, who would reply with the correct words so that the blind/deaf person can proceed. Unfortunately, blind people need this service only a few times while spammers need a continuous flow, so those labourers will prefer to work for spammers instead. (The pay is better.) Still, the best solution would be to send the CAPTCHA to a friend, let them read and/or decipher it and return the answer.

The ReCAPTCHA style also reads out the words. A simple speech recognition application might be able to recognise whatever is said, although speech recognition still needs more optimization. Still, you might want to work from that angle, getting the application to listen to the audio clip instead.

When it is possible to break CAPTCHAs, they will just think of better CAPTCHA-like methods. OCR techniques are still improving, so more work will be done to make CAPTCHAs harder. That is, until OCR has become as good as the human eye at recognizing words...

An algorithm could be created, although it would be slow. With 26 lowercase letters, 26 uppercase letters and 10 digits, it should not be too difficult to come up with one. With serif and sans-serif fonts, the number of combinations would need to be doubled, though. Still, if you curve every candidate letter in the same way as the letter in the CAPTCHA, you should be able to find the candidate that overlaps the CAPTCHA letter the most, and that would be the most likely one. You would still need to clear lines, dirt and other artefacts from the image, which the human eye has less trouble ignoring than a computer does. You'd need the following steps:

  1. Clean up the image.
  2. Detect the locations of the letters.
  3. For every letter:
     a. Determine the curve of the letter by checking its left side.
     b. Overlay every possible letter/digit to find the one that covers it best. (That's the most likely letter.)
  4. Once you've found the word, do a dictionary check to make sure it's a real word. (Unless the CAPTCHA doesn't use real words.)
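
The dictionary check in the last step could look roughly like this. It is only a sketch: the tiny `WORDS` list stands in for a real dictionary, `dictionary_check` is a made-up helper, and the similarity cutoff of 0.6 is simply the default of the standard library's `difflib.get_close_matches`.

```python
import difflib

# Stand-in for a real word list; a real service would load a full dictionary.
WORDS = ["morning", "garden", "window"]


def dictionary_check(reading, words=WORDS, cutoff=0.6):
    """Return the reading if it is a known word, else the closest known
    word above the similarity cutoff, else None."""
    if reading in words:
        return reading
    close = difflib.get_close_matches(reading, words, n=1, cutoff=cutoff)
    return close[0] if close else None


print(dictionary_check("morning"))   # already a real word
print(dictionary_check("rnorning"))  # typical OCR confusion of "m" with "rn"
print(dictionary_check("xqzt"))      # nothing close enough
```

As the step itself notes, this only helps when the CAPTCHA uses real words. For random strings the check has to be skipped.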

Even though they can twist the letters in the CAPTCHAs, it should be possible to detect the twist or rotation that they used simply by looking at the left side of every letter and then applying the same curve to every candidate letter. (52 combinations, plus 10 digits, if digits are also used.) Basically, you'd put a box around every letter, then check which candidate leaves the least white space. That's the most likely letter.

The main reason why this isn't often used for OCR is basically the need for speed. Step 3a/b tends to be slow, especially if you have to take font style into consideration.


Making this answer bigger but in reply to one of the comments:

There are several ways to clean up an image. You'd need some colour filtering, noise reduction and an algorithm that can recognise the noise lines drawn through the image. The DEFCON slideshow that you pointed to shows a few simple techniques for filtering away some of the noise. It shows that a basic image processing tool can already make an image a lot clearer for a machine to read. A simple blur will clean up random dots and thin lines, while colour filters remove the noisy colours.

The next step would be to try to put a box around every letter in the CAPTCHA, hoping the system is able to recognise their locations. I don't know of any practical algorithms for this, but there should be ways to do it. There's software that can create vector images from bitmaps, so there should be software that can calculate a box around a letter. It is likely that this box won't have right-angled corners, so you would have to distort all 52 letters to fit the same box. Fortunately, there are algorithms that can transform a box into any other figure with four corners. Italic or bold shouldn't make much of a difference, since these styles are just additional distortions. Serif versus sans-serif does make a difference, though: serif fonts tend to have a few more spikes and ornaments.
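
As a rough illustration of the clean-up idea (a blur that removes random dots), here is a small sketch of a 3x3 majority filter on an already-binarised image. This is my own simplification, not the technique from the DEFCON slides; `despeckle` and the sample bitmap are invented, and 1 is assumed to mean ink.

```python
def despeckle(bitmap):
    """3x3 majority filter on a binary image (1 = ink, 0 = background).

    A pixel stays ink only if more than half of the pixels in its
    3x3 neighbourhood (clipped at the image border) are ink.
    """
    h, w = len(bitmap), len(bitmap[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ink = total = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        total += 1
                        ink += bitmap[ny][nx]
            out[y][x] = 1 if 2 * ink > total else 0
    return out


# A solid 3x3 blob of ink plus one stray noise dot in the top-right corner.
noisy = [
    [0, 0, 0, 0, 0, 1],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
clean = despeckle(noisy)
for row in clean:
    print("".join("#" if p else "." for p in row))
```

The stray dot disappears, but the corners of the blob are eaten away as well. The same filter that removes thin noise lines also damages thin letter strokes, so clean-up is a trade-off rather than a free step.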

Regular OCR applications will assume that letters are mostly straight and will just check a few hotspots to find a match. Thus, they sometimes get it wrong because of noise. To crack CAPTCHA, you would need a more sensitive match, preferably "XOR-ing" the CAPTCHA letter image with an image of one of the 52 letters, then counting the number of black and white spots to calculate the ratio. Assuming white=1 and black=0, the result of the XOR should be almost black for the best match.
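
The XOR comparison can be sketched as follows, assuming the letter has already been cut out and scaled to the same size as the templates (which is the hard part). The 3x3 "font" in `TEMPLATES` and the function names are toy inventions for illustration only.

```python
def xor_mismatch(glyph, template):
    """Fraction of pixels that differ between two same-sized binary images."""
    total = differing = 0
    for row_g, row_t in zip(glyph, template):
        for a, b in zip(row_g, row_t):
            total += 1
            differing += a ^ b   # XOR is 1 exactly where the pixels differ
    return differing / total


def best_match(glyph, templates):
    """Name of the template with the fewest differing pixels."""
    return min(templates, key=lambda name: xor_mismatch(glyph, templates[name]))


# Toy 3x3 templates; a real matcher would hold one image per letter and font.
TEMPLATES = {
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
    "O": [[1, 1, 1], [1, 0, 1], [1, 1, 1]],
}

# An "L" with one pixel flipped by noise.
noisy_l = [[1, 0, 0], [1, 0, 1], [1, 1, 1]]

print(best_match(noisy_l, TEMPLATES))
print({name: xor_mismatch(noisy_l, t) for name, t in TEMPLATES.items()})
```

Because XOR only counts disagreements, it does not matter whether ink is stored as 0 or 1, as long as the glyph and the templates use the same convention.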

I think several spammers have already found some useful algorithms to crack CAPTCHAs, but keeping these algorithms secret is what keeps them in business.


Another comment, more text. :-)

Segmentation would be a problem, but it's not impossible to solve. It's just extremely complex. Once you've cleaned the image, it should be possible to calculate two lines: one that touches the bottom of every letter and a second that touches the top. Good CAPTCHAs no longer put their letters on the same lines, but the not-so-good ones could be cracked just by following the lines. (Guess what? ReCAPTCHA puts letters between two lines!) With two lines, you know the first letter starts at the left, so you can try overlaying all 52 possibilities there until you find a match. When you've found one, move to the right for the second one, and so on until you've read all the letters. With two lines to guide you, you don't need a complete box.

Letters tend to use a constant ratio between width and height. With two lines, you can calculate the height of the complete letter and thus get a good estimation of the matching width.
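
A very simplified sketch of the two-lines idea: it assumes the lines are horizontal (real CAPTCHAs often curve or tilt them), that the image is clean and binary, and that letters have a width-to-height ratio of about 0.6. That ratio is a guess for illustration, not a measured value, and the function names are made up.

```python
def text_band(bitmap):
    """Return (top, bottom) row indices of the first and last rows with ink,
    or None if the image is empty. 1 = ink, 0 = background."""
    rows = [y for y, row in enumerate(bitmap) if any(row)]
    return (rows[0], rows[-1]) if rows else None


def estimate_letters(bitmap, aspect=0.6):
    """Estimate (letter_width, letter_count) from the height of the text band.

    `aspect` is the assumed width/height ratio of a single letter.
    """
    band = text_band(bitmap)
    if band is None:
        return 0.0, 0
    top, bottom = band
    height = bottom - top + 1
    cols = [x for x in range(len(bitmap[0])) if any(row[x] for row in bitmap)]
    ink_width = cols[-1] - cols[0] + 1
    letter_width = height * aspect
    return letter_width, round(ink_width / letter_width)


# A 10x14 image with a solid band of "text" in rows 2..6 and columns 1..12.
bitmap = [[0] * 14 for _ in range(10)]
for y in range(2, 7):
    for x in range(1, 13):
        bitmap[y][x] = 1

print(text_band(bitmap))
print(estimate_letters(bitmap))
```

With an estimated width in hand, the overlay could be tried at the left edge first and then shifted right by roughly one letter width at a time, instead of needing a complete box around each letter.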

Still, working out the correct algorithm to calculate this all is a bit too much for my poor math skills. You'd need an expert mathematician to crack this algorithm.

Workshop Alex
Thanks Alex. Are there existing implementations of your proposed algorithm? Seems complex to implement, and not clear that it would work - e.g. I'm not sure that your step 1 ("cleanup") is simple, and Johannes points out in his answer that your step 2 ("segmentation") is actually the hardest bit.
Douglas Squirrel
"there should be software that's able to calculate a box around a letter" - unfortunately this appears from my research to be incorrect. This "segmentation" problem is actually the showstopper, as Johannes points out in his answer.
Douglas Squirrel