I'm fascinated by the CAPTCHA system used on SO... I would like to know more about the "many factors" which make reCAPTCHA work. The developers, understandably given the potential for abuse, keep rather quiet about the exact inner workings of their system... But the behavior is well-documented, and so perhaps my curiosity can still be sated:

If I were to design a clone of reCAPTCHA, how might I go about it?


reCAPTCHA allows:

  1. a typing mistake
  2. at a place where people tend to make them. This suggests to me that you need historical data about errors, and then an algorithm built on that data.

Detecting typing mistakes seems to require two databases: one for words from the books being digitized and one for words whose correct reading is already known.
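To make "allows a typing mistake" concrete, here is a minimal sketch that accepts an answer when it is within one edit (insertion, deletion or substitution) of a known word. It uses plain Levenshtein distance; the names `edit_distance`, `accepts` and `max_typos` are my own, and the fixed limit of one typo is an assumption, since the real tolerance is said to be tuned dynamically.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only two rows of the DP table."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def accepts(answer: str, known_word: str, max_typos: int = 1) -> bool:
    """Hypothetical rule: pass if the answer is within max_typos edits."""
    return edit_distance(answer.strip().lower(), known_word.lower()) <= max_typos
```

For example, `accepts("morming", "morning")` passes while `accepts("mxrmxng", "morning")` does not. Weighting errors by where people actually make them would need the historical data mentioned above; this sketch treats every position the same.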

Known technical details

  1. two databases: one for known words and the other for unknown words
  2. a subsequent database for combinations of words
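The two-database idea can be sketched as follows: each challenge pairs one known word (the control) with one unknown word, the user is judged only on the control, and the answer for the unknown word is stored as a vote. This is a sketch under my own assumptions, not reCAPTCHA's code: the class `ChallengeStore`, the image ids, and the `votes_needed` threshold are all hypothetical.

```python
import random
from collections import Counter, defaultdict


class ChallengeStore:
    """Hypothetical store: known words verify users, unknown words collect votes."""

    def __init__(self, known, unknown_ids, votes_needed=3, rng=None):
        self.known = dict(known)           # image id -> verified text
        self.unknown = list(unknown_ids)   # image ids with no verified text yet
        self.votes = defaultdict(Counter)  # image id -> Counter of guesses
        self.votes_needed = votes_needed   # assumed promotion threshold
        self.rng = rng or random.Random()

    def new_challenge(self):
        """Pick one control and one unknown word; shuffle their display order."""
        control_id = self.rng.choice(list(self.known))
        unknown_id = self.rng.choice(self.unknown)
        display_order = [control_id, unknown_id]
        self.rng.shuffle(display_order)    # user cannot tell which is which
        return control_id, unknown_id, display_order

    def submit(self, control_id, unknown_id, control_answer, unknown_answer):
        """Pass or fail on the control word only; record a vote on success."""
        if control_answer.strip().lower() != self.known[control_id].lower():
            return False                   # failed control: discard both answers
        if unknown_id in self.unknown:
            guess = unknown_answer.strip().lower()
            self.votes[unknown_id][guess] += 1
            if self.votes[unknown_id][guess] >= self.votes_needed:
                # Enough agreeing votes: move the word to the known set.
                self.known[unknown_id] = guess
                self.unknown.remove(unknown_id)
        return True
```

The server never needs to read the unknown word itself. It trusts a guess only after several users who passed the control word agree on it.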

Unknown technical details

  1. How can the words be separated on the fly so that you see a combination of words from different databases? This is a signal-processing problem.
  2. How is the data from the two databases presented to the user?
  3. What is the initial form of the data in the two separate databases? PDF?
  4. What form does the data take once data from the two databases is combined? PDF?
  5. How can the data from two PDF files be combined into one?
  6. How can you effectively rotate images?
  7. Which algorithms are used to separate the images from the book?

Related topics

  1. signal processing
  2. calculus: series and transforms such as Fourier and Laplace for word-detection algorithms.
  3. probability theory: to have a "computer-human" coefficient, which is passed only if it holds at, for instance, a 95% confidence level
  4. Perhaps number theory: we need to be efficient in storing and comparing the data
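One hedged way to put a number on the "95" in item 3 is to count successes out of trials (for example, answers that agree with each other) and require that the lower end of a 95% Wilson score interval clears a threshold. Nothing documented says reCAPTCHA works this way; `wilson_lower_bound`, `is_confident` and the 0.95 threshold are my own illustrative choices.

```python
import math


def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval (z = 1.96 is about 95%)."""
    if trials == 0:
        return 0.0  # no evidence yet
    p = successes / trials
    denom = 1 + z * z / trials
    centre = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (centre - margin) / denom


def is_confident(successes: int, trials: int, threshold: float = 0.95) -> bool:
    """Hypothetical pass rule: the pessimistic estimate must clear the threshold."""
    return wilson_lower_bound(successes, trials) >= threshold
```

The point of using the lower bound rather than the raw ratio is that 3 out of 3 is much weaker evidence than 100 out of 100, even though both ratios are 1.0.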
+4  A: 

reCaptcha

Ólafur Waage
I read the pages. However, they do not answer my question. They do not say how CAPTCHA really works. How many typing mistakes does CAPTCHA allow? If CAPTCHA is unsure about the correct word, how does it decide whether the user's letter is correct or not? -- Your link mentions that the words are the ones computers cannot read. => If a computer cannot read the words, how does it know whether the user gives the right answer?
Masi
It's on their wiki page under FAQ: http://wiki.recaptcha.net/index.php/FAQ#reCAPTCHA_is_accepting_incorrect_words
Ólafur Waage
@Waage: It seems that they keep the API hidden: "This is tuned dynamically based on many factors."
Masi