I'm currently playing around a bit with CouchDB.
I'm trying to migrate some blog data from Redis (a key-value store) to CouchDB (a document store).
Seeing as I've probably migrated this data a gazillion times between different blogging engines (everybody has got to have a hobby :) ), there seem to be some encoding snafus.
I'm using CouchREST to access CouchDB from Ruby and I'm getting this:

<JSON::GeneratorError: source sequence is illegal/malformed>

The problem seems to be the body_html part of the object:

<Post:0x00000000e9ee18 @body_html="[.....]Wie Sie bereits wissen, m\xF6chte EUserv k\xFCnftig seine  [...]

Those are supposed to be umlauts ("möchte" and "künftig").
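A quick way to confirm this kind of problem is to ask Ruby what it thinks the string is. The snippet below is only a sketch: the sample string is a made-up stand-in for the real body_html, built from single-byte umlauts and then labeled as UTF-8.

```ruby
# Hypothetical sample data: single-byte (ISO-8859-1 style) umlauts that
# have been labeled as UTF-8, mimicking the body_html shown above.
body_html = "m\xF6chte k\xFCnftig".dup.force_encoding("UTF-8")

puts body_html.encoding         # UTF-8 (the label Ruby holds for the string)
puts body_html.valid_encoding?  # false (0xF6 and 0xFC are not valid UTF-8 here)
```

If valid_encoding? is false, the label and the bytes disagree, which is the sort of input a JSON generator will typically refuse.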

Any idea how to get rid of those problems? I tried some conversions using the Ruby 1.9 encoding features or iconv before inserting, but haven't had any luck yet :(

If I try to e.g. convert that stuff to ISO-8859-1 using the .encode() method of ruby 1.9, this is what happens (different text, same problem):

#<Encoding::UndefinedConversionError: "\xC6\x92" from UTF-8 to ISO-8859-1>
+4  A: 

I try to e.g. convert that stuff to ISO-8859-1

Close. You actually want to do it the other way around: you've got ISO-8859-1(*), you want UTF-8(**). So str.encode('utf-8', 'iso-8859-1') would be more likely to do the trick.

*: actually you might well have Windows code page 1252, which is like ISO-8859-1, but with extra smart-quotes and things in the range 0x80-0x9F which ISO-8859-1 uses for control codes. If so, use 'cp1252' instead.

**: well, you probably do. Working with UTF-8 is the best way forward so you can store all possible characters. If you really want to keep working in ISO-8859-1/cp1252, then presumably the problem is just that Ruby has mis-guessed the character set in use and you can fix it by calling str.force_encoding('iso-8859-1').
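To make the difference between these calls concrete, here is a small sketch. The sample strings are invented for illustration; only String#encode and String#force_encoding come from Ruby itself.

```ruby
# Hypothetical sample: ISO-8859-1 bytes that Ruby believes are UTF-8.
raw = "m\xF6chte k\xFCnftig".dup.force_encoding("UTF-8")

# Transcode: read the bytes as ISO-8859-1, write them out as UTF-8.
utf8 = raw.encode("UTF-8", "ISO-8859-1")
puts utf8                  # möchte künftig
puts utf8.valid_encoding?  # true

# cp1252 only differs in the range 0x80-0x9F, e.g. the smart quotes.
smart = "\x93quoted\x94".b
as_cp1252 = smart.encode("UTF-8", "CP1252")      # curly quotes U+201C / U+201D
as_latin1 = smart.encode("UTF-8", "ISO-8859-1")  # control codes U+0093 / U+0094

# force_encoding changes only the label; the bytes stay exactly the same.
relabeled = raw.dup.force_encoding("ISO-8859-1")
puts relabeled.bytes == raw.bytes  # true
```

So encode rewrites the bytes, while force_encoding only tells Ruby how to interpret the bytes that are already there.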

bobince
Thanks! I always mix the encoding stuff up :( This probably was ISO-8859-1, but somewhere along the way it got declared as UTF-8. This helped: post.body_html.force_encoding('iso-8859-1').encode("utf-8")
Marc Seeger
Cool! Yep, that would do the same thing.
bobince
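For a migration where only some posts are mislabeled, the fix from the comments could be wrapped in a guard so that strings which are already valid UTF-8 are left alone. This is a sketch under assumptions: repair_utf8 is a hypothetical helper (not part of CouchREST or Ruby), and ISO-8859-1 as the real source encoding is a guess that may need to be 'cp1252' instead.

```ruby
# Hypothetical helper, not part of CouchREST or Ruby's standard library.
# Assumes that any string which is not valid UTF-8 is really in
# assumed_source (ISO-8859-1 by default; pass "CP1252" if that fits better).
def repair_utf8(str, assumed_source = "ISO-8859-1")
  # Already valid UTF-8: return it unchanged.
  return str if str.encoding == Encoding::UTF_8 && str.valid_encoding?

  # Relabel a copy with the assumed real encoding, then transcode to UTF-8.
  str.dup.force_encoding(assumed_source).encode("UTF-8")
end

broken = "m\xF6chte".dup.force_encoding("UTF-8")  # mislabeled sample
fine   = "m\u00F6chte"                            # genuine UTF-8 sample

puts repair_utf8(broken)  # möchte
puts repair_utf8(fine)    # möchte
```

The guard matters because running the conversion on text that is already correct UTF-8 would double-encode it.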