views:

126

answers:

5

Hi folks,

this is correlated to a question I asked earlier (question)

I have a list of manually created strings such as:

lucy87

gordan_king

fancy_unicorn77

joplucky_kanga90

base_belong_to_narwhals

and a list of randomized strings:

johnkdf

pancake90kgjd

fancy_jagookfk

manhattanljg


What gives away that the last set of strings are randomized is that sequences such as 'kjg', 'jgf', 'lkd', ... .

Any clever way I could separate strings that contain these apparently randomized strings from the crowd?

I guess that this plays a lot on the fact that certain characters are more likely to be placed next to others (e.g. 'co', 'ka', 'ja', ...).


Any ideas on this one? Kylotan mentioned Reverend, but I am not sure if it can be used fr such purpose.

Help would be much appreciated!

+4  A: 

This is just a thought. I've never tried it myself...

Build a bloom filter from hashing every (overlapping) 4-letter sequence found in a dictionary. Test a string by counting how many 4-letter sequences in the string don't hit the filter. The more misses, the more likely it is that the word contains random junk.

Try tuning the size of the bloom filter and the number of letters per sequence.

Also note (thanks @MihaiD) that you should include a dictionary of names, preferably from multiple languages, in the bloom filter to minimise false positives.

Marcelo Cantos
Sounds good to me. At the end of the day, there's no objective definition for "random" - you have to decide what you think isn't random, and then look for it. This sounds like as good a way as any..
fenomas
This is probably the best approach available. However usernames are often not like normal words and people tend to use all sort of quirky abbreviations. You will also have problems if usernames contain words from more than one language, since your classifier is trained on English words (training it on multiple languages will make it harder and harder to detect 'random' words).
MihaiD
@MihaiD, thanks for pointing that out. I will amend the answer.
Marcelo Cantos
+1  A: 

Try using a vanilla bayes classifier. Should be enough for the general case.

piotr
+1  A: 

It seems to me like you are trying to write code to recognize a certian particular set of tiny stuff some spammer does to a string to get past your filters. What I don't see is what is stopping them from, after all your hard work, making a 10-second tweak to their algorithm and defeating your new filter.

T.E.D.
@T.E.D. it is actually the most common occurance there is among spam entries.
RadiantHex
That's not my point. There might come a day where all your spammers start ending all their usernames with "omgponies". That doesn't mean that disallowing "omgponies" in a user name will slow them down one bit.
T.E.D.
+1  A: 

Some time ago I read a short article about random name generation, where they did the following: They built up a table that contains the information you already pointed at: "I guess that this plays a lot on the fact that certain characters are more likely to be placed next to others".

So what they did was they read a whole dictionary and determined which letters were placed more likely to each others. I do not know, how much letters in a row they considered. Maybe you should try more than just two consecutive letters, let's say something between 3 and 6.

Now I suggest you bild up such a table (maybe in a better data structural representation), that contains all "valid" consecutive letter combinations (and maybe their likelihood) and look if your name to be checked contains (almost) only such "valid" consecutive letters.

phimuemue
+2  A: 

What scores do you get if you run the strings through something like textcat? (I've seen a few different implementations of TextCat; maybe there's a Python one already out there; if not it's not a hard algorithm -- it's the data that's important.)

I'm thinking that if you strip the numbers out, the first set of strings will be closer to the "English" result in TextCat than the ones with random stuff in them.

How much closer and whether you might be able to use the TextCat data -- which is fundamentally based on which letters tend to be next to each other in particular languages -- to "pass" or "fail" a string is going to need some experimentation, though...

Matt Gibson