
OK, I have a small test file that contains utf-8 codes. Here it is (the language is Wolof)

Fˆndeen d‘kk la bu ay wolof aki seereer a fa nekk. DigantŽem ak
Cees jur—om-benni kilomeetar la. MbŽyum gerte ‘pp ci diiwaan bi mu

that is what it looks like in a vanilla editor, but in hex it is:

xxd test.txt
0000000: 46cb 866e 6465 656e 2064 e280 986b 6b20  F..ndeen d...kk 
0000010: 6c61 2062 7520 6179 2077 6f6c 6f66 2061  la bu ay wolof a
0000020: 6b69 2073 6565 7265 6572 2061 2066 6120  ki seereer a fa 
0000030: 6e65 6b6b 2e20 4469 6761 6e74 c5bd 656d  nekk. Digant..em
0000040: 2061 6b0d 0a43 6565 7320 6a75 72e2 8094   ak..Cees jur...
0000050: 6f6d 2d62 656e 6e69 206b 696c 6f6d 6565  om-benni kilomee
0000060: 7461 7220 6c61 2e20 4d62 c5bd 7975 6d20  tar la. Mb..yum 
0000070: 6765 7274 6520 e280 9870 7020 6369 2064  gerte ...pp ci d
0000080: 6969 7761 616e 2062 6920 6d75 0d0a       iiwaan bi mu..

The second character [cb 86] is a non-standard encoding of a-grave [à] that turns up quite consistently in web documents; in real UTF-8, a-grave would be c3 a0. Real UTF-8 works beautifully on Macs and under Windows.
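For reference, the byte values can be checked from within Tcl itself. A quick sketch (the `binary encode hex` subcommand needs Tcl 8.6 or later):

```tcl
# Real UTF-8 for a-grave (U+00E0) is the two bytes c3 a0:
puts [binary encode hex [encoding convertto utf-8 \u00e0]]   ;# c3a0

# The bytes cb 86 are valid UTF-8, but they decode to U+02C6,
# the spacing circumflex (the "little caret"), not to a-grave:
puts [format U+%04X [scan [encoding convertfrom utf-8 \xCB\x86] %c]]   ;# U+02C6
```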

I handle the fake utf-8 by using a character map that includes the pair { ˆ à }, because that little caret is what cb 86 generates, and everything works fine ON A MAC for displaying text (in a text widget) like this:

Fàndeen dëkk la bu ay wolof aki seereer a fa nekk. Digantéem ak
Cees juróom-benni kilomeetar la. Mbéyum gerte ëpp ci diiwaan bi mu

On a PC, using the same (shared) file, the first three characters read in are 46 cb 20 (with no fconfigure). I have run through ALL the possible encodings and can never get the same map to work. [There are twenty that will allow 46 cb 86.]

Sorry this is so long, but if anyone has a clue, I would love to hear it.

Tel Monks

+1  A: 

I don't know Wolof at all. However, I'm fairly sure the problem is that your file is in a mixed encoding: non-standard code points (instead of standard Unicode) that were then converted to bytes using the UTF-8 scheme. This is messy!

The way to deal with this is to first read the bytes into Tcl using a channel that is configured to use the utf-8 encoding:

set f [open $filename]
fconfigure $f -encoding utf-8
set contents [read $f]
close $f

Then, you need to apply a transformation using string map that converts the “wrong” characters to the right ones. For example, this would do it (as far as I can tell) for the specific characters you listed:

set mapping {"\u02c6" "\u00e0"}
set fixed [string map $mapping $contents]
# You should now be able to do anything you want with $fixed
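For the sample you posted, the map needs more pairs than the one shown. Reading the hex dump against the expected text, a fuller mapping would be the following (this is my inference from your sample alone, so treat it as a sketch; the pairs are ˆ→à, ‘→ë, Ž→é, —→ó):

```tcl
# wrong char (what the fake UTF-8 decodes to) -> right char
set mapping [list \
    \u02c6 \u00e0 \
    \u2018 \u00eb \
    \u017d \u00e9 \
    \u2014 \u00f3]
set fixed [string map $mapping $contents]
```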

However, that might be all wrong! The problem is that I don't know what the contents of the file should be (at the level of characters, not bytes). Which gets back to my comment “I don't know Wolof at all”.

Update

Now that dan04 has identified what had been done to that poor text, I can show how to decode it. Read the file in as above, but use a different conversion step:

set fixed [encoding convertfrom macRoman [encoding convertto cp1252 $contents]]

On the sample supplied, that produces the expected output.
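Why this works, character by character: each "wrong" character is what you get when a Mac Roman byte is misread as cp1252. The `encoding convertto cp1252` step undoes the misreading to recover the original byte, and `encoding convertfrom macRoman` then decodes that byte correctly. A sketch using the first mangled character:

```tcl
# The caret seen in the file, U+02C6, is byte 0x88 in cp1252:
set byte [encoding convertto cp1252 \u02c6]
puts [binary encode hex $byte]              ;# 88

# Byte 0x88 in Mac Roman is a-grave (U+00E0, à), which is what
# the text should have contained all along:
set right [encoding convertfrom macRoman $byte]
```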

Donal Fellows
Thanks all for the help, but... These files were originally on a PC, not a Mac. It so happens that the Mac has no trouble with them, but the PC does. I do use fconfigure with -encoding utf-8; it does not help. I have determined that the "right" thing is happening to output files, but they will not display correctly on a PC, nor in the console window, nor in a text widget. I will try the convertfrom and convertto (although why macRoman?) and let you know how it turns out.
Tel Monks
that works ! Thanks, but can you explain it to me? Since there was never a Mac involved before me...
Tel Monks
The file sounds like it started out on an *old* Mac, then was shared with Windows where someone converted it to UTF-8 without understanding (or even looking at) the contents. The converted file (on your *new* Mac) is what you've now got. How this all came about, I have no idea; welcome to the strange life of data! :-)
Donal Fellows
Thanks again for getting me over this hump.
Tel Monks
A: 

The data was originally encoded using a Mac encoding (most likely Roman, but Turkish and Icelandic are also possible for this example), misinterpreted as windows-1252, and then correctly converted to UTF-8.
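In Tcl terms, that chain of events can be reproduced directly (a sketch; `cp1252` is Tcl's name for the windows-1252 encoding):

```tcl
set original "F\u00e0ndeen"
# 1. The text was encoded with the old Mac encoding:
set macBytes [encoding convertto macRoman $original]
# 2. Those bytes were misinterpreted as windows-1252:
set misread [encoding convertfrom cp1252 $macBytes]
# 3. The misread text, written out as UTF-8, is the file in question:
puts $misread   ;# Fˆndeen
```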

dan04