Sometimes when copying stuff into PostgreSQL I get errors about invalid byte sequences.

Is there an easy way, using vim or other utilities, to detect the byte sequences that cause errors such as: invalid byte sequence for encoding "UTF8": 0xde70 and whatnot, and possibly an easy way to do a conversion?

Edit:

What my workflow is:

  1. Dumped sqlite3 database (from trac)
  2. Trying to replay it in PostgreSQL

Perhaps there's an easier way?

More Edit:

Also tried these:

  1. Ran enca to detect the encoding of the file

    • It told me it was ASCII
  2. Tried iconv to convert from ASCII to UTF8. Got an error

What did work was deleting the couple of erroneous lines that it complained about. But that didn't really solve the underlying problem.

+4  A: 

Based on one short sentence, it sounds like you have text in one encoding (e.g. ANSI/ASCII) and you are telling PostgreSQL that it's actually in another encoding (Unicode UTF8). All the different tools involved: PostgreSQL, Bash, one or more programming languages, other data from somewhere else, the text editor, the IDE, etc., have default encodings which may differ, and somewhere along the way the proper conversions are not being done. I would check the flow of data where it crosses these kinds of boundaries, to ensure that either the encodings line up, or the encodings are properly detected and the text is properly converted.
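To narrow down where the bad bytes are, a couple of standard command-line checks can help (a minimal sketch; dump.sql is a stand-in for the actual dump file):

# report the byte position where the file first stops being valid UTF-8
iconv -f utf-8 -t utf-8 dump.sql > /dev/null

# print the numbers of lines containing bytes invalid in the current locale
# (requires GNU grep and a UTF-8 locale)
grep -naxv '.*' dump.sql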

Justice
Just tried using iconv to try converting from ASCII to UTF8. I think it's already in UTF8 because I got errors from using ASCII. Tried iconv -f UTF8 -t UTF8 and didn't get any errors. Perhaps something is being escaped in the dump file that shouldn't be.
mikelikespie
It can't be in UTF8 already if PostgreSQL won't accept it as UTF8. There are plenty of other possible encodings it could be in (7-bit ASCII vs 8-bit ANSI, or maybe UTF16, or any of the hundreds of other encodings in existence). There is nothing being "escaped" at the encoding level.
jalf
ASCII itself is 7-bit, and perfectly valid UTF-8. Perhaps something is using an 8-bit ASCII extension. There are plenty of them out there, and they aren't valid UTF-8.
David Thornley
+1  A: 

If you know the encoding of the dump file, you can convert it to utf-8 by using recode. For example, if it is encoded in latin-1:

recode latin-1..utf-8 < dump_file > new_dump_file
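If recode isn't installed, iconv should do an equivalent conversion (still assuming latin-1 really is the source encoding):

iconv -f latin1 -t utf-8 < dump_file > new_dump_file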

If you are not sure about the encoding, you should see how sqlite was configured, or maybe try some trial-and-error.

Roberto Bonvallet
+1  A: 

I figured it out. It wasn't really an encoding issue.

SQLite's output escaped strings differently than Postgres expects. There were some cases where 'asdf\xd\foo' was output. I believe the '\x' was causing it to interpret the following characters as an escaped byte sequence.
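For what it's worth, that matches Postgres' old backslash handling: with standard_conforming_strings turned off (the historical default), \x inside an ordinary string literal is read as a hex escape, so raw bytes end up in the string and can fail UTF-8 validation. A rough psql illustration with a made-up literal (exact warning/error text varies by server version):

SET standard_conforming_strings = off;
-- '\xde' becomes the single byte 0xDE, which is not valid UTF-8 on its own,
-- so a UTF8-encoded database rejects the literal
SELECT 'asdf\xde70';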

The solution is to dump each table individually in CSV mode from sqlite3.

First, load the schema:

sqlite3 db/trac.db .schema | psql

Now, this does the trick, for the most part, for copying the data back in:

for table in `sqlite3 db/trac.db .schema | grep TABLE | sed 's/.*TABLE \(.*\) (/\1/'`
do
    # printf (rather than echo) so the \n is interpreted consistently across shells
    printf '.mode csv\nselect * from %s;\n' "$table" | sqlite3 db/trac.db | psql -c "copy $table from stdin with csv"
done

Yeah, kind of a hack, but it works.

mikelikespie