ansaurus

Question

Algorithm for separating nonsense text from meaningful text

Answer 1

+2 A:

Look up Claude Shannon and Markov models. These lead to a statistical technique for assessing probabilities that letter combinations come from a specified language source.

Here are some relevant course notes from Princeton University.
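As a rough illustration of the statistical idea, here is a minimal sketch that scores text by smoothed character-bigram frequencies learned from a sample of the target language. It is a simplified relative of a Markov model (it uses joint bigram frequencies rather than conditional transition probabilities), and the function names and the add-one smoothing are assumptions made for this example, not something taken from the course notes.

```python
import math
from collections import Counter

def train_bigrams(corpus):
    """Count adjacent character pairs in a sample of the target language."""
    text = corpus.lower()
    counts = Counter(zip(text, text[1:]))
    return counts, len(set(text))

def score(text, counts, alphabet_size):
    """Average log-probability per bigram; higher means more language-like."""
    text = text.lower()
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return float("-inf")
    total = sum(counts.values())
    # Add-one smoothing; one extra symbol is reserved for unseen characters.
    possible_pairs = (alphabet_size + 1) ** 2
    log_sum = 0.0
    for pair in pairs:
        log_sum += math.log((counts[pair] + 1) / (total + possible_pairs))
    return log_sum / len(pairs)
```

In practice you would train on a large body of real text and pick a cutoff score by looking at how known-good and known-bad feedback score.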

joel.neely 2009-02-01 22:08:26

Answer 2

+10 A:

How about just using an existing implementation of a Bayesian spam filter instead of implementing your own? I have had good results with DSpam.

John Nilsson 2009-02-01 22:10:01

Answer 3

+5 A:

If you're only expecting (or care about) English comments, then why not simply count the number of valid words (with respect to some dictionary) in the uploaded feedback? If the count passes some threshold, accept the feedback. If not, trash it. This simple heuristic could be extended to other languages by adding their dictionaries.
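A minimal sketch of that heuristic might look like the following. It uses the fraction of valid words rather than a raw count so that short and long comments are treated alike; that choice, the function names, and the 0.6 threshold are all assumptions for illustration. The dictionary is any set of lowercase words you supply.

```python
import re

def valid_word_ratio(text, dictionary):
    """Fraction of the words in text that appear in the dictionary."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(1 for w in words if w in dictionary) / len(words)

def looks_meaningful(text, dictionary, threshold=0.6):
    """Accept the feedback when enough of its words are real words."""
    return valid_word_ratio(text, dictionary) >= threshold
```

Note that this only catches gibberish; spam written in real words passes the check.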

Andy 2009-02-01 22:10:16

Viagra! Cheap cheap Viagra!

masfenix 2009-02-07 21:55:06

Answer 4

+10 A:

A slightly different approach would be to set up a system to email the feedback messages to an account and use standard spam filtering. You could send them through gmail and let their filtering take a shot at it. Not perfect, but not too much effort to implement either.

Rob Walker 2009-02-01 22:11:23

Oooh, quick and dirty, hackish and somehow thoroughly disgusting...I love it! :D

Rob 2009-02-01 22:26:06

Upvoted for the uniqueness :)

Ross 2009-02-01 22:32:40

+1 for piggybacking off Gmail -- that's probably what I'd do, too; their spam filtering is excellent and as a quick (and quite easy) fix it's definitely worth trying as a first effort. Nice practical and uncomplicated suggestion.

Christian Nunciato 2009-02-02 05:16:53

+1 from me too. That's the programmer spirit, right there :P

Alex Fort 2009-02-02 14:19:36

But would Gmail really filter out a message that says "qwerty"? Even if so, they also look at the sender, subject, server it's mailed from etc, which would all be the same for his application (they are all sent from this one form to the Gmail account).

2009-02-07 22:01:11

If the 'from' address in this scheme is always the same, there's a danger of Gmail just deciding that *that address is a spammer* since it sends so much spam.

Darius Bacon 2009-10-29 18:29:36

Answer 5

+5 A:

I had a spamming problem in a guestbook function on one of my sites a (quite long) while ago. My solution was simply to add a little captcha-like Q&A field asking the user "Are you a spamming robot?" Any answer containing the word "no" (letting through "no, i'm not", "nope" and "not at all" too, just for fun...) permitted the user to post...

The reason I chose not to use captcha was simply that my users wanted a more "cozy" feel to the site, and a captcha felt too formal. This was more personal =)
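The check described above can be sketched as a plain substring test. The function name is hypothetical, and the original site's code is not shown in this thread, so treat this as one possible reading.

```python
def passes_robot_question(answer):
    """Allow the post if the answer contains "no" anywhere, ignoring case."""
    return "no" in answer.lower()
```

A substring test is what lets "nope" and "not at all" through, but it also accepts any answer that happens to contain those two letters, such as "I don't know".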

Tomas Lycken 2009-02-01 22:11:35

Answer 6

+2 A:

The simplest method would be to count the occurrence of each letter. E is the most common letter in English, so it should be used the most. You could also check for word and digraph frequency. Have a look here to get the list of most frequently used anything in English
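Here is a minimal sketch of the letter-counting idea. The frequency table holds approximate, commonly quoted percentages for English letters, and the distance measure (sum of absolute differences) is an assumption chosen for simplicity; the cutoff you compare against would have to be tuned on your own data.

```python
from collections import Counter

# Approximate letter frequencies in English text, in percent.
ENGLISH_FREQ = {
    "e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "i": 7.0, "n": 6.7, "s": 6.3,
    "h": 6.1, "r": 6.0, "d": 4.3, "l": 4.0, "c": 2.8, "u": 2.8, "m": 2.4,
    "w": 2.4, "f": 2.2, "g": 2.0, "y": 2.0, "p": 1.9, "b": 1.5, "v": 1.0,
    "k": 0.8, "j": 0.15, "x": 0.15, "q": 0.10, "z": 0.07,
}

def letter_distance(text):
    """Distance between the text's letter distribution and typical English.

    Smaller values mean the text looks more like English.
    """
    letters = [c for c in text.lower() if c in ENGLISH_FREQ]
    if not letters:
        return float("inf")  # no letters at all, nothing to judge
    counts = Counter(letters)
    n = len(letters)
    return sum(abs(counts[c] / n - ENGLISH_FREQ[c] / 100) for c in ENGLISH_FREQ)
```

The result is only reliable when the text is at least a sentence or two long.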

Marius 2009-02-01 22:13:34

This would be good for detecting the language and filtering out unwanted languages. But unfortunately it would not filter nonsense text.

0xA3 2009-02-01 22:25:39

It would filter nonsense text, because nonsense text does not have the right statistics. If you randomly hit the keyboard, then E won't be the most typed letter.

Marius 2009-02-01 23:29:52

Statistically, this works for long strings, but not always for short strings. (Note the previous sentence doesn't contain an "E", but that doesn't mean you should mark it as spam.)

RexE 2009-02-01 23:37:19

That is right, but it contains a lot more t's and i's than q's and z's. As long as you have at least a sentence or two, it should work.

Marius 2009-02-02 11:58:45

Answer 7

+6 A:

You might try the Bayesian algorithm used by many spam filters.

Better Bayesian Filtering

Wikipedia explanation

Some open Source
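To give a feel for what such a filter does, here is a minimal word-based naive Bayes sketch with add-one smoothing. It is a toy under stated assumptions (the class name, the "good"/"junk" labels, and whitespace tokenization are invented for this example) and is not the algorithm from any of the linked resources.

```python
import math
from collections import Counter

class NaiveBayes:
    def __init__(self):
        self.word_counts = {"good": Counter(), "junk": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        """Record the words of one hand-labelled example."""
        self.word_counts[label].update(text.lower().split())
        self.doc_counts[label] += 1

    def classify(self, text):
        """Return the label with the higher smoothed log-probability."""
        vocab = set(self.word_counts["good"]) | set(self.word_counts["junk"])
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label, counts in self.word_counts.items():
            # Class prior, smoothed so an untrained class is still possible.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            # Add-one smoothing, with one extra slot for unseen words.
            denom = sum(counts.values()) + len(vocab) + 1
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

You would call train() on a few hundred hand-labelled feedback messages and then classify() each new one.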

Greg Ogle 2009-02-01 22:18:49

Answer 8

A:

The preceding answers about hooking up a Bayesian-inspired spam-filter classifier are a good idea. For your application, since you seem to get a lot of long nonsense words, it would be best to turn on an option in your parser to train on bigrams and trigrams; otherwise, many of the nonsense words will just be treated as "never seen before", which is not the most useful parse in your case.
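If your filter has no such option, the features can be produced by hand. The sketch below reads the suggestion as character bigrams and trigrams, which is an assumption (word n-grams are another possible reading), and the function name is hypothetical.

```python
def char_ngrams(text, sizes=(2, 3)):
    """Split text into overlapping character n-grams of the given sizes."""
    text = text.lower()
    return [text[i:i + n] for n in sizes for i in range(len(text) - n + 1)]
```

For example, char_ngrams("abcd") yields "ab", "bc", "cd", "abc" and "bcd". Feeding these to the classifier instead of whole words means a never-seen nonsense word still breaks down into pieces the classifier has statistics for.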

Liudvikas Bukys 2009-02-02 14:10:30

Answer 9

A:

Fidelis Assis and I have been adapting the spam filter OSBF-Lua so that it can easily be adapted to other applications including web applications. This spam filter won the TREC spam contest three years running. (I don't mind bragging because the algorithm is Fidelis's, not mine.)

If you want to try things out, we have "nearly beta" code at

git clone http://www.cs.tufts.edu/~nr/osbf-lua-temp

We are still a long way from having a tidy release, but the code should build provided you install automake 1.9. Either of us would be happy to advise you on how to use it to clean your database and to integrate it into your application.

Norman Ramsey 2009-02-07 21:40:33

Answer 10

A:

Yes, as people pointed out, you could look at spam filters or Markov models.

Something simpler would be to just count the different words in each response and sort by frequency. If words like the following are not at the top then it's probably not valid text:

the, a, in, of, and, or, ...

They are the most frequently used words in any ordinary English text.
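A minimal sketch of this check, assuming the short word list above and a "top five" cutoff (both arbitrary choices for illustration, as are the function names):

```python
import re
from collections import Counter

COMMON_WORDS = {"the", "a", "in", "of", "and", "or"}

def top_words(text, n=5):
    """The n most frequent words in the text, most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word, _ in Counter(words).most_common(n)]

def has_common_words_on_top(text, n=5):
    """True if at least one very common English word is among the top n."""
    return any(word in COMMON_WORDS for word in top_words(text, n))
```

Very short responses may legitimately contain none of these words, so this works better on responses of a few sentences or more.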

2009-02-07 21:58:24

Answer 11

A:

Just store comments in a pending state, pass them through Akismet or Defensio, and use the response to mark them as potential spam or mark them active.

http://akismet.com/

http://defensio.com/

I personally prefer Defensio's API but they both work fantastically well.

Jarin Udom 2009-02-07 22:03:14
