For the purpose of identifying [possible] bot-generated usernames.

Suppose you have a username like "bilbomoothof". It may be nonsense, but it still contains pronounceable sounds and so appears human-generated.

I accept that it could have been randomly generated from a dictionary of syllables, or word parts, but let's assume for a moment that the bot in question is a bit rubbish.

  1. Suppose you have a username like "sdfgbhm342r3f". To a human this is clearly a random string, but can this be identified programmatically?
  2. Are there any algorithms available (similar to Soundex, etc..) that can identify pronounceable sounds within a string like this?

Solutions applicable in PHP/MySQL most appreciated.

+2  A: 
Artem Barger
+2  A: 

Off the top of my head, you could look for syllables, making use of soundex. That's the direction I would explore, based on the assumption that a pronounceable word has at least one syllable.

EDIT: Here's a function for counting syllables:

function count_syllables($word) {

    // Patterns that each subtract one syllable when they match
    $subsyl = array(
        'cial', 'tia', 'cius', 'cious', 'giu',
        'ion', 'iou', 'sia$', '.ely$',
    );

    // Patterns that each add one syllable when they match
    $addsyl = array(
        'ia', 'riet', 'dien', 'iu', 'io', 'ii',
        '[aeiouym]bl$', '[aeiou]{3}', '^mc', 'ism$',
        '([^aeiouy])\1l$', '[^l]lien', '^coa[dglx].',
        '[^gq]ua[^auieo]', 'dnt$',
    );

    // Based on Greg Fast's Perl module Lingua::EN::Syllables
    // Keep letters only, then split on runs of consonants to get vowel groups
    $word = preg_replace('/[^a-z]/is', '', strtolower($word));
    $word_parts = preg_split('/[^aeiouy]+/', $word);

    // Initialise the array so count() below is safe when the word has no vowels
    $valid_word_parts = array();
    foreach ($word_parts as $key => $value) {
        if ($value !== '') {
            $valid_word_parts[] = $value;
        }
    }

    $syllables = 0;
    // Thanks to Joe Kovar for correcting a bug in the following lines
    foreach ($subsyl as $syl) {
        $syllables -= preg_match('~'.$syl.'~', $word);
    }
    foreach ($addsyl as $syl) {
        $syllables += preg_match('~'.$syl.'~', $word);
    }
    if (strlen($word) == 1) {
        $syllables++;
    }
    $syllables += count($valid_word_parts);

    // Never report fewer than one syllable
    $syllables = ($syllables == 0) ? 1 : $syllables;
    return $syllables;
}

From this very interesting link:

http://www.addedbytes.com/php/flesch-kincaid-function/

karim79
Nice, but then you need to produce a dictionary in order to be able to use it. And even after that you still can miss some cases.
Artem Barger
@Artem - Nothing is going to be a 100% effective solution for this problem
karim79
@karim - like I've said ;)
Artem Barger
@artem @karim a 100% solution is not expected. This test would be just one indicator of spam; other behaviour analysis will be performed.
Tim Whitlock
@Tim, your question is whether it's possible to determine programmatically that a given string was generated or not. So the clear answer is no. If you are looking for approximations and heuristics, you need to specify that in your question.
Artem Barger
+13  A: 

I guess you could think of something like that if you could restrict yourself to pronounceable sounds in English. For me (I am French), words like szczepan or wawrzyniec are unpronounceable and certainly have a certain randomness.

But they are actually Polish first names (meaning Steven and Lawrence)...

Mac
A: 

I don't know of existing algorithms for this problem, but I think it can be attacked in any one of the following ways:

  • your bot may be rubbish, but you can keep a list of syllables, or more specifically, phonemes, that you can try finding in your given string. But this sounds a bit difficult because you would need to segment the string in different places etc.
  • there are 5 vowels in the English alphabet, and 21 other letters. If the characters were generated uniformly at random, you would expect approximately 5/26*W of them to be vowels (where W is the word length), and significant deviations from this could be suspicious. (If other characters are allowed, the denominator grows: 5/31 with five extra characters, and so on.) You can try building on this idea by looking at doubletons (pairs of letters) and checking whether each doubleton occurs with the same probability etc.
  • further, you can try to segment your input string around vowels, for example three letters before a vowel and three letters after a vowel, and try to find out if it makes a recognizable sound by comparing with phonemes.
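
The vowel-ratio idea in the second bullet could be sketched roughly as follows. This is only a minimal illustration: the function names `vowel_ratio` and `looks_unpronounceable` are made up for this example, and the 0.20 to 0.60 band is an assumed starting point that would need tuning against real usernames, not a measured threshold.

```php
<?php
// Hypothetical helper: fraction of the ASCII letters in $name that are vowels.
// Returns null when the name contains no letters at all.
function vowel_ratio(string $name): ?float {
    $letters = preg_replace('/[^a-z]/', '', strtolower($name));
    if ($letters === '') {
        return null;
    }
    // With no $matches argument, preg_match_all() just returns the match count.
    $vowels = preg_match_all('/[aeiou]/', $letters);
    return $vowels / strlen($letters);
}

// Hypothetical check: flag names whose vowel ratio falls outside a band.
// The default band is an assumption for illustration, not a tuned value.
function looks_unpronounceable(string $name, float $min = 0.20, float $max = 0.60): bool {
    $ratio = vowel_ratio($name);
    return $ratio === null || $ratio < $min || $ratio > $max;
}
```

With these defaults, "sdfgbhm342r3f" is flagged (it has no vowels) while "bilbomoothof" is not (5 of its 12 letters are vowels). A name like "qwerty" is flagged too, which is the kind of false positive other answers in this thread warn about.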
This is true for words, but not for user names, which can mean nothing, or be acronyms, etc.
Clement Herreman
Re bullet #1. This is similar to my thinking, except that some letters are more common ("e" vs "x"), so a more sophisticated formula would be required. It is true that usernames could mean nothing, but this is a somewhat academic exercise.
Tim Whitlock
A: 

In Russian, we have forbidden syllables, such as ГЙ, or a Ъ or Ь after a vowel, and so on.

However, spam bots just use a names database; that's why my spam inbox is full of strange names you can only meet in history books.

I expect English to have syllable distribution histograms too (like ETAOIN SHRDLU, but for two-letter or even three-letter syllables), and having a critical density of low-frequency syllables in one name is certainly a sign.

Quassnoi
There are several hundred common trigrams in the English language. The length of the average nickname is just a few letters. There is not enough data there to get a reliable measure of normality using this model.
Markus Koivisto
@Markus: if we have name like `gfwx`, we have two trigrams: `gfw` and `fwx`, which I think are never met in English corpus. That is, we have `2` zero-probability trigrams in one name, which certainly rings a bell.
Quassnoi
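
A rough sketch of the unseen n-gram check discussed above might look like this. Everything here is illustrative: the function names are invented for the example, and the tiny `$reference` word list is only a stand-in for a real English corpus or a list of known-good usernames, which you would have to supply yourself.

```php
<?php
// Split a string into overlapping n-grams (ASCII letters only, lower-cased).
function ngrams(string $text, int $n): array {
    $text = preg_replace('/[^a-z]/', '', strtolower($text));
    $grams = array();
    for ($i = 0; $i + $n <= strlen($text); $i++) {
        $grams[] = substr($text, $i, $n);
    }
    return $grams;
}

// Build a lookup table of every n-gram seen in a reference word list.
function build_ngram_set(array $words, int $n): array {
    $seen = array();
    foreach ($words as $word) {
        foreach (ngrams($word, $n) as $gram) {
            $seen[$gram] = true;
        }
    }
    return $seen;
}

// Fraction of the name's n-grams that never occur in the reference set.
// Returns 0.0 for names too short to contain any n-gram.
function unseen_ngram_fraction(string $name, array $seen, int $n): float {
    $grams = ngrams($name, $n);
    if (count($grams) === 0) {
        return 0.0;
    }
    $unseen = 0;
    foreach ($grams as $gram) {
        if (!isset($seen[$gram])) {
            $unseen++;
        }
    }
    return $unseen / count($grams);
}

// Toy reference list, for illustration only (an assumption, not real data).
$reference = array('bilbo', 'smooth', 'hoof', 'moth', 'tim', 'thomas');
$seen = build_ngram_set($reference, 2);
```

Against this toy list, `unseen_ngram_fraction('gfwx', $seen, 2)` comes out at 1.0 because none of its bigrams were seen. With a realistic corpus you could switch `$n` to 3 and treat a high unseen fraction as one more weighted signal, keeping in mind the objection above that short names give very little data.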
+7  A: 

I agree with Mac. But more than that, people sometimes have user names that aren't pronounceable, like qwerty or rtfmorleave.

Why bother with that?

< obsolete and false, but i don't delete because of comments >

But more than that, no bots use 'zetztzgsd' as a user name; they have dictionaries of real names, possible nicknames, etc., so I think this would be a waste of time for you.

< / obsolete and false, but i don't delete because of comments>

Clement Herreman
@clement not true. A lot of bot usernames on Twitter have very poor auto-generated names, equally as poor as "zetztzgsd". Regarding people with unpronounceable usernames: this is fine, as the test is only an indicator. It won't be relied upon 100%; other tests on behaviour will be performed.
Tim Whitlock
It's just another thing that can be added to an overall weighting as to whether a user is genuine - it wouldn't be the only indicator used.
Mr. Matt
@Tim really? I thought bot designers would be more imaginative. You are both right, it can't be 100% accurate but it can help.
Clement Herreman
I have a page ranked high on Google which collects data from a form that is not CAPTCHA protected. Here are some sample names from bots: `asdfsdaff`, `Rihanna nude` (and lots of other artist names), `kvsdpeqoqby`, `ygwyss`, `tbjoezlonzu`. The majority of them are of the "x nude" variety, though. The e-mails are always garbled however, e.g., `[email protected]`, `[email protected]`.
Blixt
Thank you all for these clarifications, I'll be more careful in future.
Clement Herreman
I'd like to add that while the first name is definitely from a bot (based on other content it submitted), the name is also definitely entered by a human. Note how it only makes use of the first four characters on the second row of letters on a QWERTY keyboard. You could make an algorithm for detecting human typed random names that makes the name more likely to belong to a human (although in this case it was a bot after all, so it might work against you as well.)
Blixt
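
The keyboard-row observation above could be tested with something like the following sketch. The function name `is_single_keyboard_row` and the minimum length of 5 are assumptions made for this example; only a plain QWERTY layout is considered.

```php
<?php
// Hypothetical helper: true when every letter of $name sits on a single row
// of a QWERTY keyboard. The minimum length of 5 is an assumed cut-off so that
// short real words such as "dad" or "pop" are not flagged.
function is_single_keyboard_row(string $name, int $minLength = 5): bool {
    $letters = preg_replace('/[^a-z]/', '', strtolower($name));
    if (strlen($letters) < $minLength) {
        return false;
    }
    $rows = array('qwertyuiop', 'asdfghjkl', 'zxcvbnm');
    foreach ($rows as $row) {
        // strspn() returns the length of the leading run that uses only
        // characters from $row; equal to the full length means all of them do.
        if (strspn($letters, $row) === strlen($letters)) {
            return true;
        }
    }
    return false;
}
```

This flags `asdfsdaff` and `qwerty` but not `kvsdpeqoqby`. It also flags the real word "typewriter", so as noted above it can work against you and is best treated as a weak signal.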
The question states to "assume for a moment that the bot in question is a bit rubbish"
T Pops
+4  A: 

Just use CAPTCHA as a part of the registration process.

You can never distinguish real usernames from bot-created usernames without severely annoying your users.

You will block users with bizarre or non-English names, which will irritate them, and the bots will just keep trying until they hit on a good username (from a dictionary, or other sources - this is a very nice one, by the way!).

EDIT : Looking for prevention rather than after-the-fact analysis?

The solution is letting somebody else manage users' identities for you. For instance, you can use a small list of OpenID providers (like SO), or Facebook Connect, or both. You'll have much better assurance that the users are real, and that they have solved at least one CAPTCHA.

EDIT: Another Idea

Search for the string in Google, and check the number of matches found. It shouldn't be your only tool, but it is a good indicator, too. Randomized strings, of course, should have few or no matches.

Adam Matan
Thanks for the response, but this is after-the-fact analysis, not prevention.
Tim Whitlock
Updated my answer.
Adam Matan
+1  A: 

You could use a neural network to evaluate whether the nickname looks like a natural-language nickname.

Assemble two data sets: one of valid nicknames, and one of bogus generated ones. Train a simple backpropagating neural network with a single hidden layer, using the character values as inputs. The neural network will learn to discriminate between strings like "zrgssgbt" and "zargbyt", since the latter has consonants and vowels intermingled.

It is important to use real-world examples to get a good discriminator.

Markus Koivisto
+2  A: 

Look up n-gram analysis. It is successfully used to automatically detect text language and works surprisingly well even on very short texts.

The online demo recognized 'bilbomoothof' as English and 'sdfgbhm342r3f' as Nepali. It probably always returns the best match, even if it's a very poor one. I think you could train it to discern between 'pronounceable' and 'random'.

Rafał Dowgird
A: 

Note that many large sites suggest usernames like [first init][middle init][last name][number]. The users then carry these usernames over to other sites, and the first three letters are definitely not pronounceable.

smackfu
A: 

Note that you may inadvertently block hilarious web comic authors from your site!

"xkcd is not an acronym, and Munroe attaches no meaning to the name, except in a joking manner within the comic. He claims that the name was originally a screen name, which he selected as a combination of letters that would be meaningless, as well as phonetically unpronounceable."

;)

-Matt

Matt Baker