I have a Ruby CGI (not Rails) that picks up photos and captions from a web form. My users are very keen on smart quotes and ligatures, which they paste in from other sources. My web app does not deal well with these non-ASCII characters. Is there a quick Ruby string manipulation routine that can get rid of non-ASCII chars?

A: 

A quick Google search turned up this discussion, which suggests the following method:

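class String
  # Replaces every non-ASCII character in place with `replacement`
  # and returns the modified string. Assumes a validly encoded string.
  def remove_nonascii(replacement)
    chars = self.chars.to_a   # split into characters, not bytes
    self.clear
    chars.each { |c|
      # ord is the character's code point; ASCII covers 0..127.
      # (The original test, b[0].to_i < 33, also replaced spaces.)
      if c.ord > 127 then
        self.concat(replacement)
      else
        self.concat(c)
      end
    }
    self
  end
end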
Joe
Yes, I found that, but it does not deal with Unicode double-byte chars, right? Well, I will test this one. Thanks for the help!
A: 

No, there isn't, short of removing all characters besides the basic ones (as recommended above). The best solution would be to handle these names properly, since most filesystems today have no problem with Unicode names. If your users paste in ligatures, they sure as hell will want to get them back too. If the filesystem is your problem, abstract it away and set the filename to an MD5 hash (this also lets you easily shard uploads into buckets that scan very quickly, since no bucket ever has too many entries).
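
As a rough illustration of the MD5 idea, here is a minimal sketch. The helper name `storage_path_for`, the `uploads` root, and the choice to hash the file contents and use the first two hex digits as the bucket are all assumptions made for the example, not something specified in this answer.

```ruby
require 'digest/md5'

# Hypothetical helper: build an ASCII-only storage path for an upload.
# The original (possibly Unicode) name would be kept elsewhere, e.g. in a
# database, so it can be shown back to the user unchanged.
def storage_path_for(original_name, data, root = "uploads")
  digest = Digest::MD5.hexdigest(data)        # 32 hex characters
  bucket = digest[0, 2]                       # first two hex digits: 256 buckets
  ext = File.extname(original_name).downcase
  ext = "" unless ext =~ /\A\.[a-z0-9]+\z/    # keep only plain ASCII extensions
  File.join(root, bucket, digest + ext)
end

# The caption-like name contains curly quotes; the resulting path does not.
puts storage_path_for("\u201CSummer\u201D.JPG", "raw image bytes")
```

Hashing the contents rather than the name means two different files called `photo.jpg` do not collide, at the cost of identical files sharing one path.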

Julik
+2  A: 

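class String
  # Returns a copy with every non-ASCII character replaced.
  # The negated class works on whole characters in Ruby 1.9+;
  # a literal /[\x80-\xff]/ is rejected in UTF-8 source files.
  def remove_non_ascii(replacement="")
    self.gsub(/[^\x00-\x7F]/, replacement)
  end
end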
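
Since the users in question paste smart quotes and ligatures, one possible variation is to map the common ones to ASCII stand-ins before dropping the rest. This is only a sketch: the `ASCII_FALLBACKS` table and the `asciify` name are made up for the example, and the table is far from exhaustive.

```ruby
# Hypothetical mapping of common typographic characters to ASCII stand-ins.
ASCII_FALLBACKS = {
  "\u2018" => "'",  "\u2019" => "'",    # single curly quotes
  "\u201C" => '"',  "\u201D" => '"',    # double curly quotes
  "\u2013" => "-",  "\u2014" => "-",    # en and em dashes
  "\u2026" => "...",                    # horizontal ellipsis
  "\uFB01" => "fi", "\uFB02" => "fl"    # fi and fl ligatures
}.freeze

# Replaces each non-ASCII character with its fallback if one is listed,
# otherwise with `replacement`. Assumes `text` is valid UTF-8.
def asciify(text, replacement = "")
  text.gsub(/[^\x00-\x7F]/) { |ch| ASCII_FALLBACKS.fetch(ch, replacement) }
end

puts asciify("\u201CThe \uFB01rst caf\u00E9\u201D")
```

Characters without an entry in the table (the accented e above, for instance) are still removed, so the result is always plain ASCII.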
klochner
+1  A: 

Here's my suggestion using Iconv.

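require 'iconv'  # removed from the standard library in Ruby 2.0; available as the iconv gem

class String
  # Returns a copy converted from UTF-8 to ASCII, dropping
  # any character that has no ASCII equivalent.
  def remove_non_ascii
    Iconv.conv('ASCII//IGNORE', 'UTF-8', self)
  end
end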
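
Iconv has not been part of Ruby's standard library since Ruby 2.0, so on newer Rubies a similar effect can be had with the built-in `String#encode`. The sketch below assumes the input is UTF-8; the method name `strip_non_ascii` is made up for the example.

```ruby
# Transcodes to US-ASCII, substituting `replacement` for characters that have
# no ASCII equivalent (undef) and for malformed byte sequences (invalid).
def strip_non_ascii(text, replacement = "")
  text.encode("US-ASCII", invalid: :replace, undef: :replace, replace: replacement)
end

puts strip_non_ascii("caf\u00E9 \u201Cau lait\u201D")
puts strip_non_ascii("caf\u00E9 \u201Cau lait\u201D", "?")
```

Unlike the Iconv version, this returns a new string tagged as US-ASCII and does not need any extra library.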
Scott