Sometimes when copying stuff into PostgreSQL I get errors about invalid byte sequences.

Is there an easy way, using vim or other utilities, to detect the byte sequences that cause errors such as: invalid byte sequence for encoding "UTF8": 0xde70 and whatnot, and possibly an easy way to do a conversion?

Edit:

What my workflow is:

  1. Dumped sqlite3 database (from trac)
  2. Trying to replay it in PostgreSQL

Perhaps there's an easier way?

More Edit:

Also tried these:

  1. Ran enca to detect the encoding of the file

    • It told me it was ASCII
  2. Tried iconv to convert from ASCII to UTF8. Got an error

What did work was deleting the couple of erroneous lines that it complained about. But that didn't really solve the underlying problem.

+4  A: 

Based on one short sentence, it sounds like you have text in one encoding (e.g. ANSI/ASCII) and you are telling PostgreSQL that it's actually in another encoding (Unicode UTF8). All the different tools involved: PostgreSQL, Bash, one or more programming languages, other data from somewhere else, the text editor, the IDE, etc., have default encodings which may differ, and somewhere along the way the proper conversions are not being done. I would check the flow of data where it crosses these kinds of boundaries, to ensure that either the encodings line up, or the encodings are properly detected and the text is properly converted.
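To narrow down where the bad bytes are, a couple of standard command-line checks can help (a minimal sketch; dump.sql is a stand-in for the actual dump file):

# report the byte position where the file first stops being valid UTF-8
iconv -f utf-8 -t utf-8 dump.sql > /dev/null

# print the numbers of lines containing bytes invalid in the current locale
# (requires GNU grep and a UTF-8 locale)
grep -naxv '.*' dump.sql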

Justice
Just tried using iconv to try converting from ASCII to UTF8. I think it's already in UTF8 because I got errors from using ASCII. Tried iconv -f UTF8 -t UTF8 and didn't get any errors. Perhaps something is being escaped in the dump file that shouldn't be.
mikelikespie
It can't be in UTF8 already if PostgreSQL won't accept it as UTF8. There are plenty of other possible encodings it could be in (7-bit ASCII vs 8-bit ANSI, or maybe UTF16, or any of the hundreds of other encodings in existence). There is nothing being "escaped" at the encoding level.
jalf
ASCII itself is 7-bit, and perfectly valid UTF-8. Perhaps something is using an 8-bit ASCII extension. There are plenty of them out there, and they aren't valid UTF-8.
David Thornley
+1  A: 

If you know the encoding of the dump file, you can convert it to utf-8 by using recode. For example, if it is encoded in latin-1:

recode latin-1..utf-8 < dump_file > new_dump_file
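If recode isn't installed, iconv should do an equivalent conversion (still assuming latin-1 really is the source encoding):

iconv -f latin1 -t utf-8 < dump_file > new_dump_file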

If you are not sure about the encoding, you should see how sqlite was configured, or maybe try some trial-and-error.

Roberto Bonvallet
+1  A: 

I figured it out. It wasn't really an encoding issue.

SQLite's output escaped strings differently than Postgres expects. There were some cases where 'asdf\xd\foo' was output. I believe the '\x' was causing it to interpret the following characters as an escaped byte sequence.
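For what it's worth, that matches Postgres' old backslash handling: with standard_conforming_strings turned off (the historical default), \x inside an ordinary string literal is read as a hex escape, so raw bytes end up in the string and can fail UTF-8 validation. A rough psql illustration with a made-up literal (exact warning/error text varies by server version):

SET standard_conforming_strings = off;
-- '\xde' becomes the single byte 0xDE, which is not valid UTF-8 on its own,
-- so a UTF8-encoded database rejects the literal
SELECT 'asdf\xde70';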

The solution is to dump each table individually in CSV mode from sqlite3.

First, load the schema:

sqlite3 db/trac.db .schema | psql

Now, this does the trick, for the most part, for copying the data back in:

for table in `sqlite3 db/trac.db .schema | grep TABLE | sed 's/.*TABLE \(.*\) (/\1/'`
do
    # printf (rather than echo) so the \n is interpreted consistently across shells
    printf '.mode csv\nselect * from %s;\n' "$table" | sqlite3 db/trac.db | psql -c "copy $table from stdin with csv"
done

Yeah, kind of a hack, but it works.

mikelikespie