I have an English-language forum site written in Perl that is continually bombarded with spam in Russian. Is there a way, using Perl and a regex, to detect Russian text so I can block it?

+8  A: 

You can use the following to detect Cyrillic characters (used in Russian). In a Perl pattern, code points are written as \x{...}:

[\x{0400}-\x{04FF}]+

If you really just want Russian characters, take a look at the Unicode code chart for the Cyrillic block, which gives the exact range of the basic Russian alphabet: [\x{0410}-\x{044F}]. Of course, you would also need to consider the Cyrillic extension characters that Russian uses outside that range, such as U+0401 and U+0451 (the letter Ё), which are also listed in the chart.
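
To make the matching concrete, here is a minimal sketch in Perl. It assumes the post arrives as UTF-8 bytes and is decoded before matching; the helper names contains_cyrillic and contains_russian_letter are made up for illustration. The \p{Cyrillic} property is Perl's built-in way to match any character of the Cyrillic script, and it can be used in place of the explicit block range.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the decoded string contains any
# character belonging to the Cyrillic script.
sub contains_cyrillic {
    my ($text) = @_;
    return $text =~ /\p{Cyrillic}/ ? 1 : 0;
}

# Hypothetical helper: true if the string contains a letter from the
# basic Russian alphabet (U+0410 to U+044F) or the letter IO
# (U+0401, U+0451), which sits outside that range.
sub contains_russian_letter {
    my ($text) = @_;
    return $text =~ /[\x{0410}-\x{044F}\x{0401}\x{0451}]/ ? 1 : 0;
}

# Assumption: form input arrives as UTF-8 bytes, so decode it first.
# These bytes spell a six-letter Russian word.
my $bytes = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";
my $post  = decode('UTF-8', $bytes);

print contains_cyrillic($post)          ? "cyrillic\n" : "no cyrillic\n";
print contains_cyrillic("Hello, world") ? "cyrillic\n" : "no cyrillic\n";
```

Matching against the raw, undecoded bytes would not work, because the pattern compares code points rather than the individual bytes of the encoding.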

JG
+3  A: 

Using the Unicode Cyrillic range as suggested by JG is fine if everything is encoded as Unicode. However, this is spam, and for the most part it is not. Additionally, spammers very often mix charsets within a single message, which further undermines this approach.

I find that the best way (or at least the preliminary step in the process) of detecting Russian spam is to grep for the most commonly used charsets:

koi8-r
windows-1251
iso-8859-5
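
Grepping for these labels could look roughly like the sketch below. It assumes you have the raw message text (for example, mail headers or a Content-Type line) as a string; find_russian_charset is a hypothetical helper name, and the sample header is invented for illustration.

```perl
use strict;
use warnings;

# The three charset labels listed above, matched case-insensitively.
my $RUSSIAN_CHARSET = qr/\b(?:koi8-r|windows-1251|iso-8859-5)\b/i;

# Hypothetical helper: returns the first matching charset label
# (lowercased), or undef if none appears in the raw text.
sub find_russian_charset {
    my ($raw) = @_;
    return $raw =~ /($RUSSIAN_CHARSET)/ ? lc($1) : undef;
}

# Invented sample header, for illustration only.
my $header = 'Content-Type: text/plain; charset="KOI8-R"';
my $hit    = find_russian_charset($header);
print defined $hit ? "flagged: $hit\n" : "clean\n";
```

This only catches messages that declare their charset; text with a missing or false declaration passes through to the next step.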

The next step after that would be to try some language detection algorithms on what remains. If it's a big enough problem, use a paid service such as Google Translate (which also detects the language) or Xerox. In my opinion, these services provide the best language detection around.
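
Before reaching for a full language detector or a paid service, a crude script-ratio check can act as a cheap stand-in. The sketch below is not a real language detection algorithm: it only measures what fraction of the letters in an already-decoded string are Cyrillic. The helper name cyrillic_ratio and the 0.3 threshold are assumptions chosen for illustration and would need tuning on real posts.

```perl
use strict;
use warnings;

# Assumed cut-off, for illustration only; tune it on real data.
my $THRESHOLD = 0.3;

# Hypothetical helper: fraction of the letters in a decoded string
# that belong to the Cyrillic script (0 if there are no letters).
sub cyrillic_ratio {
    my ($text) = @_;
    my @letters  = $text =~ /(\p{L})/g;
    my $cyrillic = grep { /\p{Cyrillic}/ } @letters;
    return @letters ? $cyrillic / @letters : 0;
}

# Mixed sample: eight Latin letters followed by five Cyrillic ones.
my $post  = "Buy cheap \x{0442}\x{043E}\x{0432}\x{0430}\x{0440}";
my $ratio = cyrillic_ratio($post);

printf "ratio %.2f: %s\n", $ratio,
    $ratio >= $THRESHOLD ? "probably Russian" : "probably not";
```

A ratio is more forgiving than a bare match: a legitimate English post that quotes a single Russian word stays under the threshold, while a mostly Cyrillic post does not.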

mehmet el kasid
I understand your point, but since it's an English forum, detecting whether a post contains Cyrillic characters may suffice to determine that it is spam.
JG
Hmm, I *was* thinking the original poster was talking about email spam... If that's not the case, and the spam is being entered via the site itself (e.g. on a forum), then I would agree with what you're saying.
mehmet el kasid