Does anyone have a good regex for stripping all symbols (';.,_\$@!% the carriage return etc) from a string, without damaging any foreign characters (é 多 فا etc)? Non-regex would be even better, I suppose, but I don't see any Ruby or Rails methods that do this.

+2  A: 

The good way to do this would be to use the Unicode character classes in regex, such as \P{L} to match anything that is not a letter (in any language) according to Unicode. Ruby 1.8's regex engine doesn't support these, but the engine in 1.9 does, provided the string is in a Unicode encoding.

Alternatively, the 1.9 regex parser may be smart enough not to match the individual bytes that make up multibyte Unicode characters, so simply enumerating all the characters to strip could also work. That assumes you really can enumerate every character you wish to filter out, which might be a lot more than the symbols in ASCII: logical not, aeroplane, etc.
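As a minimal sketch of the Unicode-property approach, assuming UTF-8 strings and a Ruby whose regex engine understands `\p{...}` (1.9 or later): the helper names `letters_only` and `strip_symbols` are made up for illustration. Note that `\P{L}` on its own would also remove combining marks, so the sketch keeps `\p{M}` as well.

```ruby
# Hypothetical helper (not from the thread): delete every character that
# Unicode does not classify as a letter (L) or a combining mark (M).
def letters_only(text)
  text.gsub(/[^\p{L}\p{M}]/, '')
end

# Hypothetical variant that also keeps digits (N) and plain spaces,
# so words stay separated and carriage returns/newlines are dropped.
def strip_symbols(text)
  text.gsub(/[^\p{L}\p{M}\p{N} ]/, '')
end

puts letters_only('é 多 فا; $5!')
puts strip_symbols("é 多 فا; $5!\r\n")
```

Which categories to keep is a judgment call; currency signs, for instance, fall under `\p{Sc}` and are removed here.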

calmh
+3  A: 

What is a symbol? This seems like a fuzzy requirement. Is & a symbol, even though it's just shorthand for the word "and"? Is ! a symbol, even though it's used as an alphabetic character in transliterating some African languages? If $ is a symbol, does that mean 円 is as well? I think answering this question will go a long way to suggesting a course of action.

I think the closest you are likely to get with a regexp is /[^\w\s]/. Ruby 1.9's Regexp engine is meant to be Unicode-aware enough to know which characters count as "word" characters, so this will leave those and whitespace in place. In my tests, this correctly removes punctuation from English, Japanese and German sentences while leaving the surrounding characters. But dollars to doughnuts there will be edge cases that trip up just about any solution. Dealing with the huge variety of languages in the world (some of which don't even have words as we know them) is an incredibly complex task.
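A minimal sketch of this "keep word characters and whitespace" idea follows. One assumption to flag: later Ruby versions document `\w` as matching only `[a-zA-Z0-9_]`, so the sketch swaps in the POSIX bracket `[[:word:]]`, which is Unicode-aware, rather than relying on `\w`. The helper name `strip_punctuation` and its `keep_all_whitespace` option are hypothetical.

```ruby
# Hypothetical helper (not from the thread). [[:word:]] covers Unicode
# letters, marks, numbers and connector punctuation such as "_".
def strip_punctuation(text, keep_all_whitespace: true)
  # With keep_all_whitespace false, only the plain space survives, so
  # carriage returns, newlines and tabs are stripped too.
  pattern = keep_all_whitespace ? /[^[:word:][:space:]]/ : /[^[:word:] ]/
  text.gsub(pattern, '')
end

puts strip_punctuation("Grüße, Welt!\r\n", keep_all_whitespace: false)
```

Like `\w`, `[[:word:]]` treats the underscore as a word character, so `_` would need to be removed separately if it counts as a symbol.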

Chuck
WIlliam Jones
@williamjones: Actually, what I meant about "円" is that it's used both as the equivalent to the $ symbol and as an ordinary character to form words (for example, shibunen, meaning a quadrant). The regexp I suggested in my changes will always leave it in, incidentally.
Chuck
Thanks, that seems to work really well in my preliminary tests. The only change I made was to use /[^\w ]/ instead, to also get rid of whitespace characters other than the space, like carriage returns.
WIlliam Jones