The winner of a recent Wikipedia vandalism detection competition suggests that detection could be improved by "detecting random keyboard hits considering QWERTY keyboard layout".

Example: woijf qoeoifwjf oiiwjf oiwj pfowjfoiwjfo oiwjfoewoh

Is there any software that does this already (preferably free and open source)?

If not, is there an active FOSS project whose goal is to achieve this?

If not, how would you suggest implementing such software?

+2  A: 

Most keyboard mashing tends to be on the home row in my experience. It would be reasonably simple to check to see if a high proportion of the characters used are asdfjkl;.
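
A minimal sketch of that check, in Python. The names `home_row_ratio` and `looks_mashed` and the 0.8 threshold are assumptions made for illustration; the key set is exactly the one listed above.

```python
# Keys named in the answer above (the home-row resting positions).
HOME_ROW_KEYS = set("asdfjkl;")

def home_row_ratio(text):
    """Fraction of non-whitespace characters that are home-row keys."""
    chars = [c for c in text.lower() if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in HOME_ROW_KEYS for c in chars) / len(chars)

def looks_mashed(text, threshold=0.8):
    """Flag text whose home-row share reaches the (arbitrary) threshold."""
    return home_row_ratio(text) >= threshold
```

The threshold would need tuning on real edits, since ordinary English also uses these keys fairly often.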

fredley
wow I never noticed that, but that's so true about my random mashing!
Blindy
+3  A: 

If the two letters of a bigram in the analyzed text are close together on a QWERTY keyboard but the bigram has near-zero statistical frequency in English (like "fg" or "cd"), then there is a chance that random keyboard hits are involved. The more such bigrams are found, the greater that chance.

If you want to account for mashing with both hands, test letters that are separated by one other letter for QWERTY closeness, and test the two bigrams (or the trigram) that span them for frequency. For example, in the text "flsjf" you would check F and S for QWERTY distance, but the bigrams FL and LS (or the trigram FLS) for frequency.
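
Here is a rough Python sketch of both variants. Everything named here is an assumption for illustration: the keyboard is modelled as an unstaggered grid, `COMMON_BIGRAMS` is a small hand-picked placeholder rather than a measured frequency table, and `gap` and `max_dist` are made-up parameters.

```python
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

# Map each letter to (row, column), ignoring the physical stagger of the rows.
KEY_POS = {ch: (r, c) for r, row in enumerate(QWERTY_ROWS) for c, ch in enumerate(row)}

# Placeholder list of frequent English bigrams; a real table should come from a corpus.
COMMON_BIGRAMS = {
    "th", "he", "in", "er", "an", "re", "on", "at", "en", "nd", "ti", "es", "or",
    "te", "of", "ed", "is", "it", "al", "ar", "st", "to", "nt", "ng", "se", "ha",
    "as", "ou", "io", "le", "ve", "co", "me", "de", "hi", "ri", "ro", "ic", "ne",
    "ea", "ra", "ce", "li", "ch", "ll", "be", "ma", "si", "om", "ur",
}

def key_distance(a, b):
    """Chebyshev distance between two keys on the grid, or None for unknown keys."""
    if a not in KEY_POS or b not in KEY_POS:
        return None
    (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
    return max(abs(r1 - r2), abs(c1 - c2))

def suspicious_score(text, gap=1, max_dist=1):
    """Share of letter pairs `gap` apart that are close on the keyboard
    while every bigram spanning them is absent from COMMON_BIGRAMS."""
    hits = total = 0
    for word in text.lower().split():
        for i in range(len(word) - gap):
            total += 1
            d = key_distance(word[i], word[i + gap])
            close = d is not None and 0 < d <= max_dist
            spanning = [word[j:j + 2] for j in range(i, i + gap)]
            if close and all(bg not in COMMON_BIGRAMS for bg in spanning):
                hits += 1
    return hits / total if total else 0.0
```

With `gap=1` this checks plain bigrams; `gap=2` checks letters separated by one other letter, as in the two-handed case. On this grid F and S are two keys apart, so the "flsjf" example only registers with `max_dist=2`.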

Dialecticus
+1 this sounds good, but first the list of bigrams common in gibberish needs to be extracted; otherwise the end result would be based on guesstimates (guessing which bigrams or trigrams are characteristic of gibberish).
Unreason
Maybe it should be stated for the OP that bigram matching is a common algorithm found in spell checkers.
Unreason
Accepted. For reference, I would like to add that repetition of an unusual bigram is a quasi-sure sign.
Nicolas Raoul
A: 

Fredley's answer can be extended to a grammar that would construct words from nearby letters.

For example asasasasasdf could be generated with a grammar that connects as, sa, sd and df.

Such a grammar, expanded to cover all letters on the keyboard (connecting letters that are next to each other), could, after parsing, give you a measure of how much of a text can be generated by this 'gibberish' grammar.

Caveat: of course, any text discussing such a grammar and listing examples of 'gibberish' text would score significantly higher than a regular spell-checked text.

Do note that the example approach would not catch vandalism in the form of 'h4x0r rulezzzzz!!!!!'.

Another approach here (which can be integrated with the above method) would be to statistically analyze a corpus of vandalized text and try to extract the words that are common in vandalized texts.
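
As a sketch of what such a measure might look like in Python: the grammar below only connects horizontally adjacent keys (which covers as, sa, sd and df), and "parsing" is reduced to finding runs in which every step follows the grammar. The name `grammar_coverage` and the `min_run` cutoff are assumptions for illustration.

```python
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

# The 'gibberish' grammar: every ordered pair of horizontally adjacent keys.
GRAMMAR = set()
for row in ROWS:
    for a, b in zip(row, row[1:]):
        GRAMMAR.add(a + b)
        GRAMMAR.add(b + a)

def grammar_coverage(text, min_run=4):
    """Fraction of characters lying in runs of at least `min_run` characters
    where each consecutive pair is connected by the grammar."""
    text = text.lower()
    covered = 0
    run = 1  # length of the current grammar-connected run
    for prev, cur in zip(text, text[1:]):
        if prev + cur in GRAMMAR:
            run += 1
        else:
            if run >= min_run:
                covered += run
            run = 1
    if run >= min_run:
        covered += run
    return covered / len(text) if text else 0.0
```

The `min_run` cutoff is there because short adjacent sequences such as "er", "as" or "we" occur in ordinary English words.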

EDIT:
Since you are assuming QWERTY, I guess we could assume English, too?

What about KISS: run the text through an English spell checker and, if it fails miserably, conclude that it is probably gibberish. (The question is, why would you want to distinguish quickly typed gibberish from random nonsense, or for that matter from very badly spelled text?)

Alternatively, if other keyboard layouts (Dvorak, anyone?) and languages are to be considered, then maybe run the text through all available language spell checkers and then proceed (this would give language autodetection, too).

This would not be a very efficient method, but it could be used as a baseline test.
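
A bare-bones Python sketch of that baseline, assuming a "spell checker" is nothing more than a set of known words per language. The function names and the shape of the `dictionaries` argument are made up for illustration; real word lists would have to be supplied by the caller.

```python
import re

def unknown_word_ratio(text, known_words):
    """Fraction of word tokens that are not in the given word set."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(t not in known_words for t in tokens) / len(tokens)

def detect_language(text, dictionaries):
    """Return (language, ratio) for the word set that rejects the fewest tokens.

    `dictionaries` maps a language label to a set of lowercase words.
    """
    scored = ((lang, unknown_word_ratio(text, words)) for lang, words in dictionaries.items())
    return min(scored, key=lambda pair: pair[1])
```

If even the best-matching language rejects most tokens, the text is a candidate for gibberish. The token pattern only handles unaccented Latin letters, so it would need adjusting for other languages.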

Note:
In the long run I imagine that vandals would adapt and start vandalizing with, for example, excerpts from other Wikipedia pages, which would be very hard to detect automatically as vandalism (OK, existing texts could be checksummed and a flag raised on duplicates, but if the text came from some other source it would be very hard).

Unreason
About your "Do note" paragraph: Indeed, the 'h4x0r rulezzzzz!!!!!' case is not targeted here, and it is actually taken care of by other means, which the winner's paper talks about. In brief: Character repetition of "zzzzz" and excessive punctuation would already mark it as probable vandalism.
Nicolas Raoul
+1  A: 

Consider the empirical distribution of sequences of two letters, i.e. "the probability of seeing letter a given that it follows letter b". All these probabilities fill a table of size 27x27 (counting space as a letter).

Now compare this with historical data from a bunch of English/French/whatever texts. Use the Kullback-Leibler divergence for the comparison.
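
A small Python sketch of this comparison. As a simplification it uses the joint distribution over the 27x27 letter pairs instead of the conditional table, with add-`alpha` smoothing so that no cell is zero; the function names and the smoothing constant are assumptions for illustration.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 26 letters plus space

def bigram_distribution(text, alpha=1.0):
    """Smoothed probability of each of the 27x27 letter pairs in `text`."""
    # Anything outside the alphabet is treated as a space.
    cleaned = "".join(c if c in ALPHABET else " " for c in text.lower())
    counts = Counter(zip(cleaned, cleaned[1:]))
    total = sum(counts.values()) + alpha * len(ALPHABET) ** 2
    return {pair: (counts[pair] + alpha) / total
            for pair in product(ALPHABET, repeat=2)}

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) over the shared set of pairs."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p)
```

In use, `q` would be built once from a large reference corpus and `p` from the edit under test; a divergence well above what normal edits produce would be the signal. For very short texts the smoothing dominates, so `alpha` and the decision threshold would need tuning.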

Alexandre C.
Am I right that to implement your solution I need a corpus of "mashed text"?
Nicolas Raoul
You need a corpus of standard English text (like Wikipedia articles).
Alexandre C.
I think only considering the last version of the article (unless it's really short) is likely to work for the Wikipedia example.
Matthieu M.