It looks like a case of garbage in, garbage out. Here are a few clues on how to see what you've got in your data: repr() and unicodedata.name() are your friends.
>>> s = ' mwen bezwen \xc3\xa3 \xc2\xa8 d medikal '
>>> print repr(s.decode('utf8'))
u' mwen bezwen \xe3 \xa8 d medikal '
>>> import unicodedata
>>> unicodedata.name(u'\xe3')
'LATIN SMALL LETTER A WITH TILDE'
>>> unicodedata.name(u'\xa8')
'DIAERESIS'
>>>
Update:
If (as A. N. Other implies) you are letting the package choose the output language at random, and you suspect its choice is e.g. Korean: (a) tell us; (b) try to decode the output using a codec that's relevant to that language. Here are attempts with not only Korean but also two each of Chinese, Japanese, and Russian codecs:
>>> s = ' mwen bezwen \xc3\xa3 \xc2\xa8 d medikal '
>>> for enc in 'euc-kr big5 gb2312 shift-jis euc-jp cp1251 koi8-r'.split():
...     print enc, s.decode(enc)
...
euc-kr mwen bezwen 찾 짢 d medikal
big5 mwen bezwen 瓊 穡 d medikal
gb2312 mwen bezwen 茫 篓 d medikal
shift-jis mwen bezwen テ」 ツィ d medikal
euc-jp mwen bezwen 達 即 d medikal
cp1251 mwen bezwen ГЈ ВЁ d medikal
koi8-r mwen bezwen цё б╗ d medikal
>>>
None very plausible, really, especially the koi8-r.
Further suggestions:
(1) Inspect the documentation of the package you are interfacing with (URL, please!) ... what does it say about encoding?
(2) Between which two languages are you trying it?
(3) Does "mwen bezwen" make any sense in the expected output language?
(4) Try a much larger sample of text -- does chardet still indicate UTF-8? (A sketch follows this list.)
(5) Does any of the larger output make sense in the expected output language?
(6) Try translating English into another language that uses only ASCII -- do you get meaningful ASCII output?
(7) Do you care to divulge your Python code and your SWIG interface code?
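For the chardet check in (4), a minimal sketch (the sample file name is hypothetical, and the output shown is merely illustrative):

>>> import chardet
>>> raw = open('bigger_sample.txt', 'rb').read()  # hypothetical larger sample
>>> chardet.detect(raw)
{'encoding': 'utf-8', 'confidence': 0.99}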
Update 2
The information flow is interesting: "a string processing app" -> "a statistical language translation system" -> "a machine translation system (opensource/freesoftware) to help out in Haiti (crisiscommons.org)".
Please try to replace "unknown" by the facts in the following:
Input language: English (guess)
Output language: Haitian Creole
Operating system: linux
Python version: unknown
C++ package name: unknown
C++ package URL: unknown
C++ package output encoding: unknown
Test 1 input: unknown
Test 1 expected output: unknown
Test 1 actual output (utf8): ' mwen bezwen \xc3\xa3 \xc2\xa8 d medikal '
[Are all of those internal spaces really in the string?]
Test 2 input: 'I need medical aid.'
Test 2 expected output (utf8): 'Mwen bezwen \xc3\xa8d medikal.'
Test 2 actual output (utf8): unknown
Test 2 obtained from both Google Translate (alpha) and Microsoft Translate (beta):
Mwen bezwen èd medikal.
The third word is LATIN SMALL LETTER E WITH GRAVE (U+00E8) followed by 'd'.
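You can check that expected byte sequence for yourself:

>>> import unicodedata
>>> unicodedata.name(u'\xe8')
'LATIN SMALL LETTER E WITH GRAVE'
>>> u'Mwen bezwen \xe8d medikal.'.encode('utf8')
'Mwen bezwen \xc3\xa8d medikal.'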
Update 3
You said """input: utf8 (maybe, i think a couple of my files might have improperly coded text in them) """
Assuming (you've never stated this explicitly) that all your files should be encoded in UTF-8:
The zip file of the aligned en-fr-ht corpus has several files that crash when one attempts to decode them as UTF-8.
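Something like this (a sketch; the directory name is whatever you unzipped to) will identify the offending files:

import os

for root, dirs, files in os.walk('en-fr-ht'):  # hypothetical unzip directory
    for name in files:
        path = os.path.join(root, name)
        raw = open(path, 'rb').read()
        try:
            raw.decode('utf8')
        except UnicodeDecodeError as e:
            print path, e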
Diagnosis of why this happens:
chardet is useless (in this case); it faffs about for a long time and comes back with a guess of ISO-8859-2 (Eastern Europe aka Latin2) with a confidence level of 80 to 90 pct.
Next step: chose the ht-en directory (ht uses fewer accented characters than fr, so it is easier to see what is going on).
Expectation: e-grave is the most frequent non-ASCII character in presumed-good ht text (a web site, CMU files) ... about 3 times as many as the next one, o-grave. The 3rd most frequent one is lost in the noise.
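A quick byte histogram (a sketch; assumes the file fits in memory) is enough to check that expectation:

data = open('hten.txt', 'rb').read()
counts = {}
for byte in data:  # Python 2: iterating over a str yields 1-char strings
    if byte >= '\x80':  # count non-ASCII bytes only
        counts[byte] = counts.get(byte, 0) + 1
for byte, n in sorted(counts.items(), key=lambda kv: -kv[1])[:5]:
    print '%02x %d' % (ord(byte), n)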
Got counts of non-ASCII bytes in file hten.txt. Top 5:
8a 99164
95 27682
c3 8210
a8 6004
b2 2159
The last three rows are explained by:
e-grave is c3 a8 in UTF-8
o-grave is c3 b2 in UTF-8
2159 + 6004 approx == 8210
6004 approx == 3 * 2159
The first two rows are explained by:
e-grave is 8a in old Western Europe DOS encodings like cp850!!
o-grave is 95 in old Western Europe DOS encodings like cp850!!
99164 approx == 3 * 27682
Explanations that include latin1 or cp1252 don't hold water (8a is a control character in latin1; 8a is S-caron in cp1252).
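You can verify those byte mappings directly:

>>> import unicodedata
>>> unicodedata.name('\x8a'.decode('cp850'))
'LATIN SMALL LETTER E WITH GRAVE'
>>> unicodedata.name('\x8a'.decode('cp1252'))
'LATIN CAPITAL LETTER S WITH CARON'
>>> '\x8a'.decode('latin1')  # a C1 control character, not a letter
u'\x8a'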
Inspection of the contents reveals that the file is a conglomeration of multiple original files, some UTF-8, at least one cp850 (or similar). The culprit appears to be the Bible!!!
The mixture of encodings explains why chardet was struggling.
Suggestions:
(1) Implement checking of encoding on all input files. Ensure that they are converted to UTF-8 right up front, like at border control (see the sketch after this list).
(2) Implement a script to check UTF-8 decodability before release.
(3) The orthography of the Bible text appears (at a glance) to be different from that of the websites (many more apostrophes). You may wish to discuss with your Creole experts whether your corpus is being distorted by a different orthography ... there is also the question of the vocabulary; do you expect to get much use out of unleavened bread and sackcloth & ashes? Note that the cp850 stuff appears to be about 90% of the conglomeration; some Bible might be OK, but 90% seems over the top.
(4) Why is Moses not complaining about non-UTF-8 input? Possibilities: (a) it is working on raw bytes, i.e. it doesn't convert to Unicode; (b) it attempts to convert to Unicode but silently ignores failure :-(
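For (1), a minimal border-control sketch (the cp850 fallback is my assumption based on the diagnosis above; adjust it to whatever your encoding check actually finds):

def read_as_utf8(path, fallback='cp850'):
    # Try strict UTF-8 first; fall back to the suspected legacy encoding.
    raw = open(path, 'rb').read()
    try:
        return raw.decode('utf8')
    except UnicodeDecodeError:
        return raw.decode(fallback)

Everything downstream then deals in Unicode only, and you re-encode to UTF-8 at output time.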