We have an MS Access .mdb file produced, I think, by an Access 2000 database. I am trying to export a table to SQL with mdbtools, using this command:

mdb-export -S -X \\ -I orig.mdb Reviewer > Reviewer.sql

That produces the file I expect, except for one thing: some of the characters are represented as question marks. This: "He wasn't ready" shows up like this: "He wasn?t ready". It happens only in some cases (primarily single and double curly quotes), where the content may have been pasted into the DB from MS Word. Otherwise, the data look great.

I have tried various values for "export MDB_ICONV=". I've tried using iconv on the resulting file, with ISO-8859-1 in the from/to, with UTF-8 in the from/to, with WINDOWS-1250 and WINDOWS-1252 and WINDOWS-1256 in the from, in various combinations. But I haven't succeeded in getting those curly quotes back.

Frankly, based on the way the resulting file looks, I suspect the issue is either in the original .mdb file, or in mdbtools. The malformed characters are all single question marks, but it is clear that they are not malformed versions of the same thing; so (my gut says) there's not enough data in the resulting file; so (my gut says) the issue can't be fixed in the resulting file.
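To illustrate the reasoning above, here is a minimal Python sketch, assuming the damage comes from a lossy conversion to a charset such as ISO-8859-1 that has no curly quotes. The sample characters are illustrative and are not taken from the actual .mdb file.

```python
# Four distinct curly quote characters, as Word might insert them.
curly = ["\u2018", "\u2019", "\u201c", "\u201d"]

# Encode each one to a charset that lacks them, substituting on failure.
# Python's "replace" error handler writes '?' for anything unencodable.
lossy = [ch.encode("iso-8859-1", errors="replace") for ch in curly]

# All four inputs collapse to the same single byte, 0x3F.
distinct_after = set(lossy)
print(lossy)  # [b'?', b'?', b'?', b'?']
```

If that is what happened, the output file cannot tell a left quote from a right quote or an apostrophe, so no later pass of iconv over the file could restore them.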

Has anyone run into this one before? Any tips for moving forward? FWIW, I don't have and never have had MS Access -- the file is coming from a 3rd party -- so this could be as simple as changing something on the database, and I would be very glad to hear that.

Thanks.

+2  A: 

Looks like "smart quotes" have claimed yet another victim.

MS Word converts plain ASCII quotes into left and right curly quote characters and converts a plain apostrophe into a curly apostrophe. These characters belong to a Microsoft code page (Windows-1252, where they are the single bytes 0x91 to 0x94) that is roughly compatible with ISO-8859-1 and Unicode, except for the silly quote characters, which sit in a range that ISO-8859-1 does not use for printable characters.

There is a Perl script called 'demoroniser.pl' which undoes all this malarkey and converts the quotes back to plain ASCII.
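As a rough idea of the kind of translation involved, here is a minimal Python sketch. It is not the demoroniser's actual code; the function name `straighten_quotes` and the choice of Windows-1252 as the input encoding are assumptions for illustration.

```python
# Map the four curly quote code points to their plain ASCII equivalents.
SMART_TO_ASCII = {
    0x2018: "'",  # left single quote
    0x2019: "'",  # right single quote / apostrophe
    0x201C: '"',  # left double quote
    0x201D: '"',  # right double quote
}


def straighten_quotes(raw: bytes, encoding: str = "cp1252") -> str:
    """Decode raw bytes (assumed Windows-1252) and replace curly quotes."""
    return raw.decode(encoding).translate(SMART_TO_ASCII)


print(straighten_quotes(b"He wasn\x92t ready"))
```

This only helps if the curly quote bytes are still present in the data. It cannot recover anything from a literal question mark.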

James Anderson
Thank you for replying! I'm glad to know about the demoroniser (and its cousin, the unmoroniser). As near as I can tell, these tools deal with HTML entities ("I wrote a Perl program, the demoroniser, to transform Microsoft's 'junk HTML' into at least a starting point for something I'd consider presentable on my site"). But in the data I'm looking at (in vi), the malformed characters aren't HTML; they're question marks. In any event, running demoroniser made no improvement. I think I need to run something like demoroniser against the characters themselves. Does that make sense?
Hoosteeno
Demoroniser contains the basic logic to get rid of them (if indeed this is the problem!). Have a look at the translations the Perl program is doing. Then run "od -x" on your file to see if these are the characters you are dealing with.
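The same byte-level check can be sketched in Python as an alternative to reading od output by eye. This is a heuristic under stated assumptions: the function name `quote_byte_report` is hypothetical, and it looks only for Windows-1252 quote bytes, the UTF-8 encodings of the four curly quotes, and literal question marks.

```python
from collections import Counter

# Single-byte curly quotes in Windows-1252.
CP1252_QUOTES = {0x91, 0x92, 0x93, 0x94}

# UTF-8 encodings of U+2018, U+2019, U+201C, U+201D.
UTF8_QUOTES = [b"\xe2\x80\x98", b"\xe2\x80\x99", b"\xe2\x80\x9c", b"\xe2\x80\x9d"]


def quote_byte_report(data: bytes) -> dict:
    """Count bytes that suggest curly quotes survived, or were replaced."""
    counts = Counter(data)
    return {
        "cp1252_quotes": sum(counts[b] for b in CP1252_QUOTES),
        "utf8_quotes": sum(data.count(seq) for seq in UTF8_QUOTES),
        "question_marks": counts[0x3F],
    }


print(quote_byte_report(b"He wasn?t ready"))
```

To run it on the export, read the file in binary mode, for example `quote_byte_report(open("Reviewer.sql", "rb").read())`. If both quote counts are zero and only 0x3F bytes appear where the quotes should be, the characters were already lost before the file was written.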
James Anderson
A: 

It's most likely because the data in the Access file is stored as Unicode, and MDB Tools is trying to convert it to ASCII, Latin-1 (ISO-8859-1), or some other encoding. Since these encodings can't represent every Unicode character, you end up with question marks. The information here may help you fix your encoding issues by getting MDB Tools to use the correct encoding.

Kibbee
Thanks! I'd seen that page, but you reminded me to try again. I wrote a script to set mdbtools' charset to every possible value in `iconv -l`, run mdbtools, and check a specific phrase in the result for correctness. Unfortunately, no charset changed the output of mdbtools.
Hoosteeno