Hi All,

My basic question is how to prevent spam and dirty words in a comment posting system written in Python (Django).

I have a collection of phrases (approximately 3000 phrases) to be blocked.

What I want to do is like this:

If a comment contains a dirty word when the user clicks the post button, the site should pop up a warning message and ask the user to correct the comment and submit it again. The goal is simply to prevent people from submitting rude or spam comments.

Question (1): is there an existing open source Python (or Django) package/module/plugin that can handle this job? I know there is one called Akismet, but from what I understand it will not solve my problem. Akismet is a web service that filters using a word dictionary defined by Akismet, whereas I have my own collection of words. Please correct me if I am wrong.

Question (2): if there is no such open source package, how do I create my own? The only thing I can think of is to use a regular expression and join all the phrases with 'or' (|). But I have 3000 phrases, and I suspect it won't perform well when it has to filter every comment post. Any suggestions on where I should start?
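For reference, the "join everything with 'or'" idea can be sketched roughly as below. This is only a minimal illustration, not a benchmarked solution: the function names and the sample phrases are hypothetical placeholders, and whether a 3000-phrase alternation is fast enough is something that would need to be measured on real comments. The pattern is compiled once and reused, each phrase is escaped so it is treated as literal text, and lookarounds are used so that a phrase only matches when it is not embedded in a longer word.

```python
import re

def build_blocklist_pattern(phrases):
    """Compile one case-insensitive regex that matches any blocked phrase."""
    # Drop empty entries; an empty alternative would match everywhere.
    cleaned = {p for p in phrases if p}
    if not cleaned:
        return re.compile(r"(?!)")  # a pattern that never matches
    # Longest first, so a longer phrase is tried before a shorter one
    # that happens to be its prefix.
    ordered = sorted(cleaned, key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in ordered)
    # (?<!\w) and (?!\w) require that the match is not part of a larger word.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)

def find_blocked(pattern, comment):
    """Return the blocked phrases found in the comment, as they were typed."""
    return [m.group(0) for m in pattern.finditer(comment)]

# Hypothetical sample data standing in for the real 3000-phrase collection.
BLOCKED = ["badword", "buy cheap pills", "spamlink"]
PATTERN = build_blocklist_pattern(BLOCKED)
```

A view or form could call find_blocked(PATTERN, comment_text) on submit and show the warning whenever the returned list is not empty. In Python 3 the \w class is Unicode-aware for str patterns, which matters here because not all of the phrases are English.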

Thank you very much for your help and time.

+2  A: 

You may want to check out Django's PROFANITIES_LIST setting; it looks like you can use it with validators.

Although, with that many phrases (3000, really? you must be fun at parties) you may want to rethink things. You shouldn't filter spam. You should throw it away. Just my opinion. If the comment has spam in it, why keep it at all? Is there any value added from such a comment?

jobscry
Sorry that I didn't express myself clearly in my first post. I don't save them. If a comment contains a dirty word when the user clicks the post button, the site should pop up a warning message and ask the user to correct the comment and submit it again.
Are the 3000 words and phrases mostly profanities? I'm now morbidly curious at the thought of expanding my own vocabulary.
jobscry
^_^ Not really. (1) Not all of those words are in English; some are, some are not. That's why the collection is about 3000. (2) The words can be profanities, advertisement spam, pornography, etc. (3) By the way, I am really new to Python and Django. Can the method you mentioned solve my problem in terms of performance? Thanks a lot for your reply and help. ^_^
After searching through the code I can only find one place where PROFANITIES_LIST is used: http://code.djangoproject.com/browser/django/tags/releases/1.2/django/contrib/comments/forms.py#L164 You can probably use that as an example. The code searches for the "bad words" one at a time, and 3000 is a lot. This is why pawning this off to a third party (like Akismet) could help.
jobscry
Thanks for the reply. I checked the code. It builds a list of the bad words that the comment contains, using 'in' to check whether each bad word is in the comment string. I think this is VERY inefficient and will lead to bad performance. Even a regexp would be better than this for my goal. The goal of Akismet is not the same as my goal either. I haven't looked at Akismet in detail; maybe I will check it later if no solution can be found. I still need to keep googling for a way to solve my problem... Thanks. I may end up using a regexp...
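For anyone following along, the one-word-at-a-time check described in the comments above amounts to something like the sketch below. It is a plain-Python paraphrase of the idea, not Django's actual code: the function names, the ValueError, and the sample words are assumptions made for illustration. Each bad word triggers a separate substring search over the comment, so the work grows with the size of the list, and a word also matches when it is buried inside a longer, innocent word.

```python
def find_bad_words_naive(comment, bad_words):
    """Return every bad word that occurs anywhere in the comment."""
    lowered = comment.lower()
    # One substring search per bad word: with 3000 entries this means
    # 3000 scans of the comment text for every submission.
    return [w for w in bad_words if w.lower() in lowered]

def validate_comment(comment, bad_words):
    """Return the comment unchanged, or raise if it contains bad words."""
    found = find_bad_words_naive(comment, bad_words)
    if found:
        # The message is what the user would see before resubmitting.
        raise ValueError("Please remove these words and resubmit: "
                         + ", ".join(found))
    return comment

# Hypothetical sample list standing in for the real collection.
SAMPLE_BAD_WORDS = ["badword", "spamlink"]
```

Whether this is actually too slow for 3000 phrases is worth timing before replacing it, since substring search on short comments is cheap in practice.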