Hi there, I'm running a dating site with a field where people enter their profile text. I already have a bad-words filter, but now people are entering profiles that are just garbage characters, such as "aaaaaaaaaaaaaaaaaaaa" or "--------------". I'm looking for an effective way of filtering out long words made of repeated characters. Thanks in advance.
A:
Maybe you need something like a Bayesian spam filter for that kind of input. Here is how such a filter is usually described:
Particular words have particular probabilities of occurring in spam email and in legitimate email. For instance, most email users will frequently encounter the word "Viagra" in spam email, but will seldom see it in other email. The filter doesn't know these probabilities in advance, and must first be trained so it can build them up. To train the filter, the user must manually indicate whether a new email is spam or not. ...
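To make the idea concrete, here is a rough sketch of a word-level naive Bayes filter with add-one smoothing. The class name `TinyBayesFilter`, the labels "junk" and "ok", and the whitespace tokenizer are all made up for illustration; a real filter would need far more training data and probably a better tokenizer, and it has to see at least one example of each label before it can classify anything.

```python
import math
from collections import Counter


class TinyBayesFilter:
    """Toy word-level naive Bayes classifier (hypothetical, for illustration)."""

    LABELS = ("junk", "ok")

    def __init__(self):
        # Per-label word frequencies and number of training texts seen.
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = {label: 0 for label in self.LABELS}

    def train(self, text, label):
        """Record one manually labeled text ('junk' or 'ok')."""
        self.word_counts[label].update(text.lower().split())
        self.doc_counts[label] += 1

    def _log_score(self, words, label):
        counts = self.word_counts[label]
        total = sum(counts.values())
        # Vocabulary size across both labels, used for add-one smoothing.
        vocab = len(set(self.word_counts["junk"]) | set(self.word_counts["ok"]))
        # Log prior: fraction of training texts with this label.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in words:
            score += math.log((counts[word] + 1) / (total + vocab))
        return score

    def is_junk(self, text):
        """Return True if 'junk' is the more probable label for the text."""
        words = text.lower().split()
        return self._log_score(words, "junk") > self._log_score(words, "ok")


# Training by hand, as the description says: someone labels each example.
bayes = TinyBayesFilter()
bayes.train("buy cheap pills now", "junk")
bayes.train("cheap pills cheap pills", "junk")
bayes.train("i like hiking and cooking", "ok")
bayes.train("looking for someone who likes hiking", "ok")
```

With that toy training set, `bayes.is_junk("cheap pills")` comes out as junk and `bayes.is_junk("i like hiking")` does not. Note that words the filter has never seen get the same smoothed weight for both labels, so for pure keyboard mashing you would likely want to train on character n-grams instead of whole words.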
The MYYN
2010-07-15 09:29:13
A:
You could use a word list and flag each message that has long words (e.g. 5+ characters) that are not on the list. If the field contains five 8-letter words, none of which are in a dictionary, it's likely not meaningful data.
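A minimal sketch of that check might look like the following. The function name `looks_like_gibberish`, the `known_words` set, and the 50% threshold are assumptions chosen for the example, not anything standard; in practice you would load a real dictionary file and tune the numbers against your own data.

```python
import re


def looks_like_gibberish(text, known_words, min_len=5, max_unknown_ratio=0.5):
    """Flag text when most of its long words are missing from the word list.

    known_words: a set of lowercase dictionary words (hypothetical input).
    min_len: only words at least this long are checked.
    max_unknown_ratio: flag when the share of unknown long words exceeds this.
    """
    # Runs of letters only; digits, punctuation and underscores split words.
    words = re.findall(r"[^\W\d_]+", text.lower())
    long_words = [w for w in words if len(w) >= min_len]
    if not long_words:
        # Nothing to judge by this heuristic (e.g. the text is all dashes).
        return False
    unknown = sum(1 for w in long_words if w not in known_words)
    return unknown / len(long_words) > max_unknown_ratio


# Tiny stand-in for a real dictionary.
known = {"hiking", "cooking", "travel", "music", "people"}
```

Here `looks_like_gibberish("aaaaaaaaaaaa qwertyuiop zxcvbnmasd", known)` is flagged, while `looks_like_gibberish("I enjoy hiking, cooking and music", known)` is not, because only one of its four long words is unknown. A profile made only of punctuation such as "--------------" contains no letter words at all, so this check alone would not catch it and it would need a separate rule.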
Piskvor
2010-07-15 09:37:18