Hey,

I am trying to come up with a regex to remove all special characters except some. For example, I have a string:

str = "subscripción gustaría♥"

I want the output to be "subscripción gustaría".

My approach was to match anything that is neither an ASCII character (00 - 7F) nor one of the special characters I want to keep, and replace it with an empty string.

str.gsub(/(=?[^\x00-\x7F])(=?^\xC3\xB3)(=?^\xC3\xA1)/,'') 

This doesn't work. The last special character is not removed.

Can someone help? (This is ruby 1.8)

Update: To make the question a little clearer: the string is UTF-8 encoded, and I am trying to whitelist the ASCII characters plus ó and í and blacklist everything else.

A: 
str.split('').find_all {|c| (0x00..0x7f).include? c.ord }.join('')
Adrian
No, this removed all the special characters. I want only ♥ to be removed and not ó and í
maheshmurthy
A: 

The question is a bit vague: it says nothing about the encoding of the string. Also, do you want to whitelist characters or blacklist them, and which ones? In any case, decide what you want and then use the proper ranges, as others here have already proposed. Some examples: if str = "subscripción gustaría♥" is UTF-8, you can blacklist every character above this range (excluding whitespace):

     str.gsub(/[^\x{0021}-\x{017E}\s]/,'')

If the string is in the ISO-8859-1 code page, you can try to match the quirky characters, such as the "heart", at the beginning of the ASCII range:

    str.gsub(/[\x01-\x1F]/,'')

The problem here is with the regex and has nothing to do with Ruby. You will probably need to experiment more.

c64ification
Yeah, my bad, I should have mentioned it is utf-8 encoded. I see your point. I am trying to whitelist just 6 special characters. So, what I am trying to get to is "if not in the range 00-7F and not \xC3\xB3 and not \xC3\xA1", then replace it with blank. I get a syntax error when I tried your solution above. It doesn't like the curly braces.
maheshmurthy
Blacklisting is a bad idea. Who knows what could be out there? You're much better off saying exactly what you'll accept; that way there are no surprises.
Paul Rubel
Yes, my bad too. I was thinking in PHP, so sorry for my bad regex. Look at Mark Wilkins's answer; I tested it and it worked for this very example.
c64ification
A: 

It is not completely clear which characters you want to keep and which you want to delete. The example string's character is some Unicode character that, in my browser, displays as a heart symbol. But it seems you are dealing with 8-bit ASCII characters (since you are using ruby 1.8 and your regular expressions point that way).

Nonetheless, you should be able to do it in one of two ways; either specify the characters you want to keep or, alternatively, specify the characters you want to delete. For example, the following specifies that all characters 0x00-0x7F and 0xC0-0xF6 should be kept (remove everything that is not in that group):

puts str.gsub(/[^\x00-\x7F\xC0-\xF6]/,'') 

This next example specifies that characters 0xA1 and 0xC3 should be deleted.

puts str.gsub(/[\xA1\xC3]/,'') 
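
One caveat if the string is UTF-8 and the regex is applied byte by byte: each non-ASCII character is stored as several bytes, so a byte range can match only part of a character. The sketch below lists the bytes behind the characters from the question. It assumes Ruby 1.9 or later (not the 1.8 used in the question), and utf8_hex is a made-up helper name.

```ruby
# encoding: utf-8
# Sketch, assuming Ruby 1.9 or later: show the UTF-8 bytes of a character.
# utf8_hex is a hypothetical helper, not part of any library.
def utf8_hex(char)
  char.bytes.map { |b| format('%02X', b) }.join(' ')
end

%w[ó í ♥].each do |c|
  puts "#{c} => #{utf8_hex(c)}"
end
# ó => C3 B3
# í => C3 AD
# ♥ => E2 99 A5
```

The second byte of ó (0xB3) and of í (0xAD) lies outside both 0x00-0x7F and 0xC0-0xF6, so it is worth checking which bytes a given range really keeps.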
Mark Wilkins
+1  A: 

Oniguruma has support for all the characters you care about without having to deal with codepoints. You can just add the unicode characters inside the character class you're whitelisting, followed by the 'u' option.

ruby-1.8.7-p248 > str = "subscripción gustaría♥"
 => "subscripci\303\263n gustar\303\255a\342\231\245" 
ruby-1.8.7-p248 > puts str.gsub(/[^a-zA-Z\sáéíóúÁÉÍÓÚ]/u,'')
subscripción gustaría
 => nil 
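
For anyone on Ruby 1.9 or later, the same whitelist idea can be sketched without the 'u' option, assuming the source file is UTF-8 so the regex matches whole characters. DISALLOWED and strip_disallowed are names made up for this example, and the list of accented letters is only an illustration of a whitelist.

```ruby
# encoding: utf-8
# Sketch, assuming Ruby 1.9 or later and a UTF-8 source file.
# DISALLOWED matches anything that is neither ASCII nor one of the
# accented letters listed here (a hypothetical whitelist).
DISALLOWED = /[^\x00-\x7FÁáÉéÍíÑñÓóÚúÜü]/

# Hypothetical helper: drop every character that is not whitelisted.
def strip_disallowed(text)
  text.gsub(DISALLOWED, '')
end

puts strip_disallowed("subscripción gustaría♥")
# subscripción gustaría
```

Because the class is negated, anything you forget to list gets removed, which is the safer failure mode for a whitelist.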
Marcos Toledo
A: 

I ended up doing this: str.gsub(/[^\x00-\x7FÁáÉéÍíÑñÓóÚúÜü]/,''). It doesn't work on my Mac, but it does work on Linux.

maheshmurthy
Then you should check out my answer, it works on my Mac and doesn't match bytes, which could end up wrong for you.
Marcos Toledo