We are using JCaptcha for a captcha tool in a small app that my team is writing. However, just during development time (on a small team - 4 of us), we've run across a number of curse words and other potentially offensive words for the actual captchas. Is there a way to filter out potentially offensive words so that they are not presented to the user?

+3  A: 

I spent time downloading JCaptcha and looking at the source. Basically JCaptcha works like every single captcha out there besides ReCaptcha. Hence what you want to do is trivial.

JCaptcha uses the very simple concept of a WordGenerator, which is an interface:

public interface WordGenerator {
    String getWord(Integer length);
    String getWord(Integer length, Locale locale);
}

Let us ignore localization.

Typical use is like this:

WordGenerator words = ...
WordToImage word2image = new SimpleWordToImage();
ImageCaptchaFactory factory = new GimpyFactory(words, word2image);
pixCaptcha = factory.getImageCaptcha();

In their unit tests we can see, for testing purposes:

    WordGenerator words = new DummyWordGenerator("TESTING");
    WordToImage word2image = new SimpleWordToImage();
    ImageCaptchaFactory factory = new GimpyFactory(words, word2image);
    pixCaptcha = factory.getImageCaptcha();

Note that we have ENTIRE control over the "WordGenerator" used.

Here's one (working, fully functional) word generator I just wrote:

private static final Random r = new Random( System.currentTimeMillis() );

public String getWord( final Integer length ) {
    final StringBuilder sb = new StringBuilder();
    for (int i = 0; i < length; i++) {
        final int rnd = r.nextInt( 52 );
        final char c = (char) (rnd < 26 ? 'a' + rnd : 'A' + (rnd-26));
        sb.append( c );
    }
    return sb.toString();
}

It generates random "words" like these:

fqXVxId
cdVWBSZ
zXeJFaY
aeoSeEb
OuBfzvL
unYewjG
EhbzRup
GkXkTyQ
yDGnHmh
mRFgHWM
FFBkTLF
DvCHIIT
fDmjqLH
XMWSOpa
muukLLN
jUedgYK
FlbWARe
WohMMgZ
lmeLHau
djHRqlc

Note that if you prefer "real words" (like reCaptcha, though reCaptcha uses real words for another purpose altogether: it helps with scanning/OCRing books), that's not an issue: simply change getWord(...) to pick words at random out of a dictionary.
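To make the dictionary idea concrete, here is a minimal sketch. The class name DictionaryWordPicker and the in-memory word list are my own assumptions, not anything shipped with JCaptcha; the getWord(Integer) method simply mirrors the shape of the interface shown above, so the same logic could sit inside a real WordGenerator implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of JCaptcha: picks real words from an in-memory list.
class DictionaryWordPicker {
    private final List<String> words;
    private final Random random;

    DictionaryWordPicker(List<String> words, Random random) {
        if (words.isEmpty()) {
            throw new IllegalArgumentException("dictionary must not be empty");
        }
        this.words = new ArrayList<>(words); // defensive copy
        this.random = random;
    }

    // Same shape as WordGenerator.getWord(Integer): a random word of the requested length.
    String getWord(Integer length) {
        List<String> candidates = new ArrayList<>();
        for (String w : words) {
            if (w.length() == length) {
                candidates.add(w);
            }
        }
        if (candidates.isEmpty()) {
            throw new IllegalStateException("no word of length " + length);
        }
        return candidates.get(random.nextInt(candidates.size()));
    }
}
```

With a dictionary you could also simply leave the offensive words out of the list to begin with, in which case no filtering at pick time is needed.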

Now how do you prevent insulting words from being picked? This is trivial. Here is just one example (please, no arguing about the code; it's really just one example that shows how it could be done):

private static final Set<String> s = new HashSet<String>();

static {
    s.add( "fuck" );
    s.add( "suck" );
    s.add( "dick" );
}

private static final Random r = new Random( System.currentTimeMillis() );

public String getWord( Integer length ) {
    String cand = getRandomWord( length );
    while ( isSwearWord(cand) ) {
        cand = getRandomWord( length );
    }
    return cand;
}

private boolean isSwearWord( final String w ) {
    return s.contains( w.toLowerCase() );
}

public String getRandomWord( final Integer length ) {
    final StringBuilder sb = new StringBuilder();
    for (int i = 0; i < length; i++) {
        final int rnd = r.nextInt( 52 );
        final char c = (char) (rnd < 26 ? 'a' + rnd : 'A' + (rnd-26));
        sb.append( c );
    }
    return sb.toString();
}

Now if you want to prevent swear words, you probably also want to prevent words that are close to swear words (e.g. "fvck", "dikk", etc.). This is once again trivial:

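private boolean isSwearWord( final String w ) {
    // An exact match is at edit distance zero, so check it first.
    if ( s.contains( w.toLowerCase() ) ) {
        return true;
    }
    // Then check every string that is one edit away from the candidate.
    List<String> ls = generateAllPermutationsWithLevenhsteinEditDistanceOne(w);
    for ( final String cand : ls ) {
        if ( s.contains( cand.toLowerCase()) ) {
            return true;
        }
    }
    return false;
}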

Writing the method "generateAllPermutationsWithLevenhsteinEditDistanceOne(w)" is left as an exercise for the reader.
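For anyone who wants a starting point, here is one possible sketch of that exercise. It is only an illustration under a few assumptions of mine: the class and method names (EditDistanceOne, neighbours) are hypothetical, the alphabet is limited to a-z, the input is lower-cased first, and the result is a Set rather than a List so duplicates collapse. It also includes the word itself (distance zero), so a single lookup loop covers exact matches too.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: enumerates every string within one edit of a word.
class EditDistanceOne {
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";

    // Returns the lower-cased word plus every string reachable from it by one
    // insertion, one substitution, or one deletion of a letter a-z.
    static Set<String> neighbours(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        Set<String> result = new LinkedHashSet<>();
        result.add(w); // distance zero
        for (int i = 0; i <= w.length(); i++) {
            String head = w.substring(0, i);
            String tail = w.substring(i);
            for (char c : ALPHABET.toCharArray()) {
                result.add(head + c + tail); // insert c at position i
                if (i < w.length()) {
                    result.add(head + c + tail.substring(1)); // replace char i with c
                }
            }
            if (i < w.length()) {
                result.add(head + tail.substring(1)); // delete char i
            }
        }
        return result;
    }
}
```

A candidate is then rejected if any element of neighbours(candidate) is in the swear-word set. For a 7-letter candidate this is only a few hundred strings, so the check is cheap.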

Webinator
quod erat demonstrandum
Webinator
@chris_l: your problem lies in your mind's failure to realize that when it comes to security there's an asymmetry: the "defender" has more information available than the attacker. Your entire rambling is *exactly* identical to someone who would say *"PKCS ain't working because you can't multiply two huge prime numbers because you can't factor two huge prime numbers"*. Which is a completely circular argument that is *precisely* missing the whole point of why PKCS works and why Captcha works. There **MUST** be a way to verify that an answer is correct (or incorrect). We have the words info on the server.
Webinator
btw generating all the permutations that have a Levenshtein edit distance of one is easier than computing the edit distance itself.
Webinator
elduff
@elduff: I decided to delete the flamewar in the other post. I agree, Wizard answered your question well (+1).
Chris Lercher