tags:

views:

55

answers:

1

Many people are tired of obtrusive words with no value, like these:

  • f**king
  • Id|ot
  • <|>
  • whaaaat????!!!!???

I plan to detect suspicious records and then to verify them manually. In other words, to find rules which detect that something is most likely obtrusive. Is there any reasonable solution? I am thinking about these REGEX rules:

\w\W+\w
\D{3,}

Can you recommend some other rules or links to read?

Edit

Because of "Hello, my friend" the first rule could be

\w[^\w\s]+\w
+1  A: 

I would use Bayesian filtering featurizing misspellings that are combinations of alphas and other characters (e.g. all of the examples you've provided). This has the decided benefit that it "learns" over time, but needs to be fed an initial training set before it can produce useful results. To fit your needs you would set the threshold for matching low so you'd get false positives that you'd have to allow (and hopefully the algorithm would not allow through too many false negatives).

Toby Segaran's Programming Collective Intelligence provides a good explanation and Python code for making this work.

orangepips
This seems promising. But according to other reasonable comments, I will try to avoid autodetects. Thank you anyway.
Jan Turoň
I think other comments are making the valid point it's always possible to defeat a detection algorithm. That noted, implementing something like this will make it much easier to detect the majority of offending items.
orangepips