I have a Rails application that allows users to import information from various sources, such as RSS feeds. The default encoding on my database is UTF-8, and I've been getting a lot of exceptions caused by non-UTF-8 data that comes through the system and crashes once it hits the database.

I'm able to detect the non-UTF-8 data by calling the is_utf8? method on the attributes before a save is done, but I haven't come up with a way to handle it. I've seen that iconv can convert it, but that appears to require knowing which encoding I'm converting from.

Is there a simple way to do a guess conversion or possibly just strip out the non-UTF8 characters and then do the save into the database?

Thanks!

A: 

Iconv is your friend when it comes to switching encodings. To detect encodings, there's a little gem available: rchardet. We have used it to detect Asian encodings in an attempt to block spam, and it worked fine.
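For reference, here is a minimal sketch of the "guess, then convert or strip" idea. It swaps in Ruby's built-in String#encode and String#scrub (Ruby 2.1 or later) in place of Iconv, which was dropped from the standard library in Ruby 2.0. The helper names to_utf8 and strip_non_utf8 are hypothetical, and the Windows-1252 fallback is only an assumption; a detector such as rchardet could supply a better guess.

```ruby
# Hypothetical helper: return a valid UTF-8 copy of raw.
# fallback is an assumed source encoding, used only when the bytes
# are not already valid UTF-8.
def to_utf8(raw, fallback: Encoding::WINDOWS_1252)
  text = raw.dup.force_encoding(Encoding::UTF_8)
  return text if text.valid_encoding?

  # Reinterpret the same bytes in the guessed encoding, then transcode.
  # Bytes the guessed encoding cannot map are replaced with "?".
  raw.dup.force_encoding(fallback)
     .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "?")
end

# Hypothetical helper: drop any byte sequences that are not valid UTF-8.
def strip_non_utf8(raw)
  raw.dup.force_encoding(Encoding::UTF_8).scrub("")
end

to_utf8("caf\xE9".b)         # => "café"
strip_non_utf8("ab\xFFc".b)  # => "abc"
```

Converting with a guessed encoding keeps more of the text than stripping does, but a wrong guess produces wrong characters rather than an error, so stripping may be the safer default when nothing is known about the source.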

pantulis
+1  A: 

How is non-UTF-8 data making it into the system? Make sure all your pages are served as Content-Type text/html;charset=utf-8 and browsers will always submit UTF-8 data to your forms.

(Of course, that still leaves things like mail and uploaded files, but many of those contexts give you an encoding to go on.)
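As a rough illustration of using an encoding the context gives you, the sketch below reads the charset parameter from a Content-Type header value (as a feed's HTTP response might carry) and transcodes the body to UTF-8. Both helper names are hypothetical, and falling back to UTF-8 when no charset is declared is an assumption, not a rule.

```ruby
# Hypothetical helper: extract the declared charset from a Content-Type
# value such as "text/xml; charset=ISO-8859-1". Returns an Encoding or nil.
def charset_from_content_type(header)
  name = header.to_s[/charset\s*=\s*"?([^\s;"]+)/i, 1]
  name && Encoding.find(name)
rescue ArgumentError
  nil # Ruby does not know this encoding name
end

# Hypothetical helper: interpret body in its declared encoding and
# return valid UTF-8. Assumes UTF-8 when nothing usable is declared.
def decode_with_declared_charset(body, content_type)
  declared = charset_from_content_type(content_type) || Encoding::UTF_8
  body.dup.force_encoding(declared)
      .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
      .scrub # guards the case where the declared encoding is already UTF-8
end

decode_with_declared_charset("caf\xE9".b, "text/xml; charset=ISO-8859-1")
# => "café"
```

Declared charsets are sometimes wrong, so the result is only as good as the header it came from.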

bobince