
I've been working on a statistical translation system for Haiti (code.google.com/p/ccmts) that uses a C++ backend (http://www.statmt.org/moses/?n=Development.GetStarted), with Python driving the C++ engine.

I've passed a UTF-8 Python string into a C++ std::string, done some processing, and gotten a result back into Python. Here is the string (when printed from C++ to a Linux terminal):

mwen bezwen ã ¨ d medikal

  1. What encoding is that? Is it a double encoded string?
  2. How do I "fix it" so it's renderable?
  3. Is that printed in that fashion because I'm missing a font or something?

The Python chardet library says:

{'confidence': 0.93812499999999999, 'encoding': 'utf-8'}

but when I run a string/unicode/codecs decode in Python, I get the familiar:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 30: ordinal not in range(128)

Oh, and Python prints that same exact string to standard output.

A repr() call prints the following: ' mwen bezwen \xc3\xa3 \xc2\xa8 d medikal '
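
Here is a minimal Python 2 session reproducing that error with the bytes above (my guess: my decode call never named a codec, so the 'ascii' default kicked in):

>>> s = ' mwen bezwen \xc3\xa3 \xc2\xa8 d medikal '
>>> s.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 13: ordinal not in range(128)
>>> s.decode('utf-8')
u' mwen bezwen \xe3 \xa8 d medikal '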

+1  A: 

Looks like your default encoding is ASCII.

You can either explicitly convert your unicode strings:

print u"Hellö, Wörld".encode("utf-8")

Or, if you want to change this globally in your script, replace sys.stdout with a wrapper that encodes output as UTF-8 on write:

import sys, codecs
# codecs.getwriter wraps stdout so unicode strings are UTF-8-encoded as they are written
sys.stdout = codecs.getwriter("utf-8")(sys.stdout)
print u"Hellö, Wörld!"

Furthermore, you can change the default encoding once and for all (site-wide) via sys.setdefaultencoding, but that function is only available from sitecustomize.py (site.py deletes it from the sys module after startup). I wouldn't do this, however -- convenient as it may seem, it affects all Python scripts on your system and might have unintended side effects.
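
If you want to confirm the diagnosis first, a quick check (the stdout value depends on your terminal, and is None when output is piped or redirected):

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> sys.stdout.encoding
'UTF-8'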

oefe
I'm still getting a UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 13: ordinal not in range(128) for 'mwen bezwen ã ¨ d medikal'. Wow, this is annoying!
ct
It might be helpful if you posted what 'mwen bezwen ã ¨ d medikal' actually should be, and also the `repr` of the resulting unicode string (to check that it is correct).
oefe
repr() output: ' mwen bezwen \xc3\xa3 \xc2\xa8 d medikal '
ct
Unfortunately, I have no idea what it's supposed to be - it's a statistical language translation system, so the output isn't really something I'm familiar with...
ct
So what you got back is a `string`, not a `unicode` object. Now, the first thing you need to know is its encoding. You seem to assume that it is UTF-8, but decoded as UTF-8 this would read "ã ¨", which is not very likely; on the other hand, it is also unlikely that random bytes would happen to be valid UTF-8, as these are. My guess is that you already sent garbage into the C++ routine and received garbage back - maybe you didn't decode or encode the input data properly. You should also really find out what the data is supposed to be; otherwise it's a lot of unnecessary guesswork.
oefe
The big question is this - does std::string handle UTF-8, or should I be using std::wstring?
ct
Python is copying a unicode string into a std::string.
ct
@ct: EDIT YOUR QUESTION, don't show important info in comments! "copying a unicode string into std::string" is not much help, only worrying; SHOW YOUR CODE! Tell us where the C++ package can be inspected! Give us a few examples of (input, repr(expected_output), repr(actual_output)) where the expected output contains accented characters and we may be able to detect what is causing the mangling.
John Machin
+1  A: 

Edit: Never mind that junk I posted before; it was wrong.

As others have suggested, this will get you the correct unicode object in Python, assuming that it's meant to be UTF-8:

>>> ' mwen bezwen \xc3\xa3 \xc2\xa8 d medikal '.decode('utf-8')
u' mwen bezwen \xe3 \xa8 d medikal '
>>> print _
 mwen bezwen ã ¨ d medikal

It really does seem to be a case of your library giving you garbage, whether garbage went into it or not.

Jorenko
`>>> guff = '\xc3\xa3\xc2\xa8'` then `>>> print guff.decode('utf-16be')` gives 쎣슨 ... so each Unicode character is encoded in ASCII if possible, otherwise UTF-16BE? Bit hard to believe!
John Machin
Oh, never mind, I see what I did wrong -- the Unicode character isn't just the concatenation of the encoding bytes, but the non-control bits of them. Fixing for that, both my examples are the same!
Jorenko
If you truly believe that it's real utf-8, then ''.decode('utf-8') is correct. If it could be another encoding, you need to find out which one and look at John's answer.
Jorenko
@ct: My recommendation is that you answer simple questions when asked, and answer them by editing your question.
John Machin
+3  A: 

It looks like a case of garbage in, garbage out. Here are a few clues on how to see what you've got in your data. repr() and unicodedata.name() are your friends.

>>> s = ' mwen bezwen \xc3\xa3 \xc2\xa8 d medikal '
>>> print repr(s.decode('utf8'))
u' mwen bezwen \xe3 \xa8 d medikal '
>>> import unicodedata
>>> unicodedata.name(u'\xe3')
'LATIN SMALL LETTER A WITH TILDE'
>>> unicodedata.name(u'\xa8')
'DIAERESIS'
>>>

Update:

If (as A. N. Other implies) you are letting the package choose the output language at random, and you suspect its choice is e.g. Korean: (a) tell us, and (b) try to decode the output using a codec relevant to that language. Here are attempts with not only Korean but also two codecs each for Chinese, Japanese, and Russian:

>>> s = ' mwen bezwen \xc3\xa3 \xc2\xa8 d medikal '
>>> for enc in 'euc-kr big5 gb2312 shift-jis euc-jp cp1251 koi8-r'.split():
...     print enc, s.decode(enc)
...


euc-kr  mwen bezwen 찾 짢 d medikal 
big5  mwen bezwen 瓊 穡 d medikal 
gb2312  mwen bezwen 茫 篓 d medikal 
shift-jis  mwen bezwen テ」 ツィ d medikal 
euc-jp  mwen bezwen 達 即 d medikal 
cp1251  mwen bezwen ГЈ ВЁ d medikal 
koi8-r  mwen bezwen цё б╗ d medikal 
>>> 

None very plausible, really, especially the koi8-r. Further suggestions:

(1) Inspect the documentation of the package you are interfacing with (URL please!) ... what does it say about encoding?
(2) Between which two languages are you trying it? Does "mwen bezwen" make any sense in the expected output language?
(3) Try a much larger sample of text -- does chardet still indicate UTF-8? Does any of the larger output make sense in the expected output language? (A chardet sketch follows this list.)
(4) Try translating English to another language that uses only ASCII -- do you get meaningful ASCII output?
(5) Do you care to divulge your Python code and your SWIG interface code?
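
The chardet check on a larger sample would look something like this (the file name is hypothetical and the output line is illustrative; chardet.detect returns the same encoding/confidence dict quoted in the question):

>>> import chardet
>>> data = open('bigger_sample.txt', 'rb').read()
>>> chardet.detect(data)
{'confidence': 0.99, 'encoding': 'utf-8'}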

Update 2: The information flow is interesting: "a string processing app" -> "a statistical language translation system" -> "a machine translation system (opensource/freesoftware) to help out in haiti (crisiscommons.org)"

Please try to replace "unknown" by the facts in the following:

Input language: English (guess)
Output language: Haitian Creole
Operating system: linux
Python version: unknown
C++ package name: unknown
C++ package URL: unknown
C++ package output encoding: unknown

Test 1 input: unknown
Test 1 expected output: unknown
Test 1 actual output (utf8): ' mwen bezwen \xc3\xa3 \xc2\xa8 d medikal '
[Are all of those internal spaces really in the string?]

Test 2 input: 'I need medical aid.'
Test 2 expected output (utf8): 'Mwen bezwen \xc3\xa8d medikal.'
Test 2 actual output (utf8): unknown

Test 2 obtained from both Google Translate (alpha) and Microsoft Translate (beta):
Mwen bezwen èd medikal.
The third word is LATIN SMALL LETTER E with GRAVE (U+00E8) followed by 'd'.
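
For what it's worth, here is one hypothesis (a sketch, not a confirmed diagnosis) that reproduces the observed bytes exactly: the UTF-8 bytes for e-grave get misread as Latin-1 somewhere in the pipeline, lowercased (Moses training pipelines commonly lowercase their text), and re-encoded as UTF-8:

>>> egrave = u'\xe8'                        # LATIN SMALL LETTER E WITH GRAVE
>>> utf8_bytes = egrave.encode('utf-8')     # '\xc3\xa8'
>>> misread = utf8_bytes.decode('latin-1')  # u'\xc3\xa8', i.e. A-tilde + diaeresis
>>> misread.lower().encode('utf-8')         # lowercasing turns \xc3 (A-tilde) into \xe3 (a-tilde)
'\xc3\xa3\xc2\xa8'

Those are exactly the two byte pairs in the repr() above; the space between them would be the tokenizer's doing.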

Update 3

You said """input: utf8 (maybe, i think a couple of my files might have improperly coded text in them) """

Assuming (you've never stated this explicitly) that all your files should be encoded in UTF-8:

The zip file of the aligned en-fr-ht corpus has several files that crash when one attempts to decode them as UTF-8.

Diagnosis of why this happens:

chardet is useless (in this case); it faffs about for a long time and comes back with a guess of ISO-8859-2 (Eastern Europe, aka Latin-2) with a confidence level of 80 to 90%.

Next step: chose the ht-en directory (ht uses fewer accented characters than fr, so it is easier to see what is going on).

Expectation: e-grave is the most frequent non-ASCII character in presumed-good ht text (a web site, CMU files) ... about 3 times as many as the next one, o-grave. The 3rd most frequent one is lost in the noise.

Got counts of non-ASCII bytes in file hten.txt. Top 5:

8a 99164
95 27682
c3 8210
a8 6004
b2 2159
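
Those counts can be reproduced with something along these lines (a sketch; assumes Python 2.7 for collections.Counter):

>>> from collections import Counter
>>> counts = Counter(c for c in open('hten.txt', 'rb').read() if c >= '\x80')
>>> for byte, n in counts.most_common(5):
...     print byte.encode('hex'), n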

The last three rows are explained by

e-grave is c3 a8 in UTF-8
o-grave is c3 b2 in UTF-8
2159 + 6004 approx == 8210
6004 approx == 3 * 2159

The first 2 rows are explained by

e-grave is 8a in old Western Europe DOS encodings like cp850!!
o-grave is 95 in old Western Europe DOS encodings like cp850!!
99164 approx == 3 * 27682

Explanations that include latin1 or cp1252 don't hold water (8a is a control character in latin1; 8a is S-caron in cp1252).

Inspection of the contents reveals that the file is a conglomeration of multiple original files, some UTF-8, at least one cp850 (or similar). The culprit appears to be the Bible!!!
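
One way to locate the non-UTF-8 portion (a sketch; the line numbers of the failures point back at the offending original file):

>>> for i, line in enumerate(open('hten.txt', 'rb')):
...     try:
...         line.decode('utf-8')
...     except UnicodeDecodeError:
...         print i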

The mixture of encodings explains why chardet was struggling.

Suggestions:

(1) Implement checking of encoding on all input files. Ensure that they are converted to UTF-8 right up front, like at border control (a minimal sketch follows this list).

(2) Implement a script to check UTF-8 decodability before release.

(3) The orthography of the Bible text appears (at a glance) to be different to that of the websites (many more apostrophes). You may wish to discuss with your Creole experts whether your corpus is being distorted by a different orthography ... there is also the question of the words; do you expect to get much use out of unleavened bread and sackcloth & ashes? Note that the cp850 stuff appears to be about 90% of the conglomeration; some Bible might be OK, but 90% seems over the top.

(4) Why is Moses not complaining about non-UTF-8 input? Possibilities: (a) it is working on raw bytes, i.e. it doesn't convert to Unicode; (b) it attempts to convert to Unicode, but silently ignores failure :-(
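
A minimal sketch for (1) and (2), assuming UTF-8 is the canonical encoding and cp850 the legacy fallback diagnosed above (the function name and default are illustrative, not part of any existing tool):

def to_utf8(path, legacy_encoding='cp850'):
    # Pass the bytes through untouched if they are already strict UTF-8;
    # otherwise assume the legacy DOS encoding and transcode.
    raw = open(path, 'rb').read()
    try:
        raw.decode('utf-8')
        return raw
    except UnicodeDecodeError:
        return raw.decode(legacy_encoding).encode('utf-8')

Run it over every corpus file at border control; the release check in (2) is the same try/except with a report instead of a conversion.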

John Machin
Being Russian, I can tell you that the koi8-r one is actually quite funny -- "цё" reads as a slangy "da" (roughly "the"), and as for "б╗", well, there is an extremely obscene word starting with "б╗" (so this reads as "б...unprintable") :))) But the whole thing is still meaningless.
mlvljr
I'm building a machine translation system (open source/free software) to help out in Haiti (crisiscommons.org). Unfortunately, I'm not a native speaker, and I'm pretty sure the 2 characters that are acting wacky are French accented chars. Any advice on how to handle that charset?
ct
@ct: "http://traduiapp.com/"? If it is open-source, may be you could point to the source part that causes trouble?
mlvljr
@ct: Bondye!!! Why didn't you say so right at the start?? We don't have crystal balls. Yeah so the output is an attempt at Haitian Creole (mwen bezwen / mon besoin / my need); "handle charset" like French. Standard Unicode advice. Native speaker or not is irrelevant. Edit your question: What version of Python, what's in sys.stdout.encoding, URL for the C++ package, tell us how the English to Creole dictionary is stored (how is the Creole encoded?), show us your code, trace the translation of a couple of very simple English sentences into Creole -- see where it is going wrong, ...
John Machin
@John Machin: That's the site: http://code.google.com/p/ccmts/ - I have the text files I'm using in svn, and you can download the corpus off the site. As for the Python/C++ MT code, we're using Moses NLP - I've had some issues figuring out how to get my Moses hacks into svn (my repo or Moses' repo), and sorting out administrivia with the project is very time-consuming, which makes it harder to work on the project itself. I am going to get the system up on the net this weekend. I've also had a lot of issues with too many chefs in the kitchen.
ct
@John Machin: Input: UTF-8 (maybe - I think a couple of my files might have improperly coded text in them). Expected output: I have no idea (it's a statistical system, so I have no clue what to expect; it uses pattern matching over text files to build heuristics that it uses to conduct the translation - this is not a simple dictionary lookup problem). As for the spaces, that's what repr printed... Moses is here -> http://www.statmt.org/moses/?n=Development.GetStarted
ct
@ct: Which part of "EDIT YOUR QUESTION" don't you understand? How will you know if the app works if you don't compare expected output (obtainable from a Creole expert, the CMU parallel text corpus, Google|Microsoft Translate, the Wikipedia article) with actual output?? Why TF don't you tell us the INPUT text that produced the gibberish? What are your "Moses hacks"? Which files might have improperly coded text? What makes you think so? Why haven't you run my Test 2?
John Machin