tags:

views:

139

answers:

5

I'm a school teacher who spent the summer writing a vocab training program in python that uses text available from wikipedia and gutenberg. Now all I have to do is figure out a way to filter out curse words so that I can distribute to students.

Normally I would just have an array (list) of those curse words and do a simple filter. The problem though is that the py file itself it openable by these students, seeing those words. If I put it in a separate file, somehow encrypted, then they could just delete that file and the output wouldn't be filtered.

Any ideas for workarounds?

+6  A: 

What you could do is hash the words you want to search for. It makes the filtering a little harder, since you must break the input into words, hash each word, then see if you have a match for that hash.

Take a look a the documentation for md5()

Your source code will then just contain hashed words, and there is no way to reverse that into a list of words (however, the more creative student might have a few sound guesses!)

Paul Dixon
Thats a good suggestion.
Chris
Then again, if you give some kids access to the source code of a program that says `profanity = [ a list of MD5 hashes ]`, I guarantee that they'll figure out what it is, and waste a bunch of time trying every profane word they know to see what's in the list. :-)
Ken
I did suggest this very scenario :)
Paul Dixon
they are many MD5 online decryption tools, consider using something like sha1
leoluk
You suggested they might have a few guesses. I tried to suggest they'd waste more energy on this guessing than the amusement of finding profanity in their vocabulary trainer. :-)
Ken
@leoluk: "MD5 decryption tools" depend on the fact that someone tried to look for that word before, so it doesn't really "decrypt", but instead look up the MD5 hash in a database of known-plaintext. This system works exactly the same in SHA-1 (and all other hashing algorithms), so switching to that won't prevent this.
Joachim Sauer
Also: adding a [salt](http://en.wikipedia.org/wiki/Salt_(cryptography%29) might reduce the problem.
Joachim Sauer
+6  A: 

If all the students are running the same version of Python (e.g. at a computer lab), you can distribute pyc files. This is just obfuscation, but it will deter casual users.

Matthew Flaschen
This is really the best solution. In 4th grade I'm sure my droogs and I looked through the school dictionary for whatever anatomical words we could find. That wasn't the fault of the dictionary, any more so than dumping a program would be the fault of the teacher. Good luck generating the stop list such that it classifies the titillating words from those that aren't and remember the http://en.wikipedia.org/wiki/Scunthorpe_problem
msw
A very good solution indeed. The only reason I won't be doing this is because students are going to be putting the program on their own personal laptops and I'd rather avoid the whole "You must download and install this version of Python" thing.
Adam Morris
You can use py2exe, this includes the interpreter with the compiled file.
leoluk
+1  A: 

Include the words in your distributed file, but encrypt them somehow to make it impossible to easily find the plaintext list. Then, compile the script to .exe using py2exe. This will stop most students from reverse-engineering the program and finding the encrypting algorithm.

If a student finds the decryption routine, it doesn't matters if it's a strong encryption or not, so rot13 or base64 should be enough.

(w.decode('rot13') for w in ['sbeovqqra', 'sbeovqqra gbb', 'rira zber sbeovqqra'])

For making the list, just use encode on the real words.

Hashes, like suggested above, will work too, of course.

leoluk
A: 

Why not distribute the compiled python .*pyc files instead? They could still lookup the strings if they wanted, but it will likely deter casual browsing of the file, which may be sufficient for your needs.

ars
A: 
def avoid_paperwork(word):
    result = ''
    for letter in word:
        if letter == 'z':
            result += '_'
        else:
            result += chr(ord(letter) + 1)
    return result
Adam Morris
This will give me a usable variable name as well.
Adam Morris