ansaurus

Question

Python - codec encoding ascii to unicode: error

Answer 1

A:

Your source is riddled with bytestrings and you're not using codecs.open().

Ignacio Vazquez-Abrams 2010-02-15 10:35:19

Answer 2

+1 A:

f1.write(' '.join(list1))

list1, at this point, contains Unicode strings. You can't write Unicode directly to a file; a file is a byte interface. You should either encode it explicitly (' '.join(list1).encode('utf-8')) or, as Ignacio suggests, use a codecs wrapper that implicitly encodes the Unicode strings you send to it. At the moment you are defining a variable CODEC but not doing anything with it.
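A minimal sketch of the two options just described, written so that it should run on Python 2.7 or 3. The names list1 and CODEC follow the question; the sample strings and the temporary file names are made up for illustration.

```python
import codecs
import os
import tempfile

CODEC = 'utf-8'
# Hypothetical sample data standing in for the question's list1
list1 = [u'\u090f\u0915', u'\u092a\u0915\u094d\u0937\u0940']

tmpdir = tempfile.mkdtemp()
explicit_path = os.path.join(tmpdir, 'explicit.txt')  # illustrative name
wrapped_path = os.path.join(tmpdir, 'wrapped.txt')    # illustrative name

# Option 1: encode explicitly, then write the bytes to a binary-mode file
with open(explicit_path, 'wb') as f1:
    f1.write(u' '.join(list1).encode(CODEC))

# Option 2: codecs.open() returns a wrapper that encodes on each write
with codecs.open(wrapped_path, 'w', CODEC) as f1:
    f1.write(u' '.join(list1))

# Read both files back as raw bytes for comparison
with open(explicit_path, 'rb') as fa:
    explicit_bytes = fa.read()
with open(wrapped_path, 'rb') as fb:
    wrapped_bytes = fb.read()
```

Either route should leave the same UTF-8 bytes on disk; the wrapper simply saves you from calling .encode() at every write.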

bobince 2010-02-15 10:43:53

Answer 3

+1 A:

Are you sure you want to remove all the hyphens (-)? Looking at your input file, it looks like all the replacements are two- or three-character codes, such as u'I-':u'इ'. If so, you could do something like the following, but make sure you're using Unicode strings for all the keys and values in the dictionary:

import codecs

# read the whole file at once
f = codecs.open(input_file,'r','ascii')
data = f.read()
f.close()

# perform all the replacements
for k,v in english_hindi_dict.items():
    data = data.replace(k,v)

# write the whole file result
f = codecs.open(output_file,'w',CODEC)
f.write(data)
f.close()

Following that theory, I got the result below. It suggests that translations such as 'z*', 't-', 'ng', and 'ei' are missing from the dictionary. I don't read Hindi, but Google Translate came up with some of the English words in your translation, so I think I'm on the right track.

-z*धिमैन पक्षी

एक घने जngगल मेng एक बहुt- ऊँचै पेड तै
उस की पt-z*t-ोng से लदी शैखैयेng मज*zबूt- बैजुओng की t-रह फeiली हुई तीng
वन हँसोng कै एक झुnhz*ड इस पेड पर निवैस करt-ै तै
वे सब यहैँ सुरक्षिt- ते ौर बडे आरैम से रहt-े ते
उन मेng से एक पक्षी बहुt- बुदz*धिमैन तै
इस बुदz*धिमैन पक्षी ने एक दिन पेड की जड मेng से एक लt-ै को उगt-े देखै 
इस के बैरे मेng उसने दूसरे पक्षियोng से बैt- की
"कz*यै t-ुमz*हेng वह लt-ै दिखैई देt-ी हei", उस ने उन से पूछै "t-ुमz*हेng इसे नShz*ट कर देनै चैहिए"
"इसे कz*योng नShz*ट कर देनै चैहिए?" हँसोng ने आशz*च*rय से पूछै "यह t-ो इt-नी छोटी से हei
हमेng यह कz*यै हैनि पहुँचै सकt-ी हei"
"मेरे मित्रोng," बुदz*धिमैन पक्षी ने उt-z*t-र दियै "वह छोटी सी लt-ै जलz*दी ही बडी हो जैयेगी
यह हमैरे पेड पर चढ*z कर उस से लिपटt-ी जैयेगी ौर फिर मोटी ौर मज*zबूt- हो जैयेगी"
"t-ो कz*यै हुआ"

Mark Tolonen 2010-02-15 19:09:29

Thank you for your answer, Sir. I will get back to you in some time; I am a bit held up...

mgj 2010-02-21 23:47:21

Answer 4

+2 A:

You have a few problems other than the one which you asked about.

(1) A conceptual problem: "E-k- b-u-d-z*dhi-m-aan- p-ksii#" is not "english". It is Hindi written in ASCII using some romanization scheme. It looks like ITRANS, but ITRANS doesn't have AA and A; it has only aa and a. Does the scheme have a name? Can you supply a URL? Your objective is better described as "transliterate some Hindi text from the unnamed romanization to Devanagari script".

(2) Showing the result of translating your text from Hindi to English ("A WISE OLD BIRD" etc) is only moderately useful. The expected Devanagari output would be a better idea.

(3) As remarked by @kaiser.se, the transliteration dictionary has multi-byte (up to 3 bytes!) keys, some of which are prefixes of others. Presumably AA must be recognised in priority to A, gh must be recognised before g, etc. Iterating over the items of a dictionary happens in an order that is predictable but for your purposes should be regarded as random. In the code that follows, I've given priority to longer "keys".

(4) Either the dictionary is missing some letter keys (a S t z) or the transliteration rules are more complicated than any of us has guessed so far.

(5) The meaning of the characters # * and - is not 100% obvious. It appears from your input text that z and * appear only in combination, as z*.

(6) It would be a good idea if you explained the interpretation of e.g. shaakhaay-e-ng ... does it start with sh then aa or does it start with sha then a? What are the rules?
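Point (3) can be seen on a tiny example. This is only a sketch: the two-entry table is a hypothetical excerpt (without the trailing space that the real 'AA' entry carries), and transliterate() is a helper made up for the illustration.

```python
# Hypothetical two-entry excerpt; 'A' is a prefix of 'AA'
table = {u'A': u'\u0905', u'AA': u'\u0906'}  # DEVANAGARI LETTER A, AA


def transliterate(text, keys):
    """Apply a plain replace for each key, in the order given."""
    for key in keys:
        text = text.replace(key, table[key])
    return text


# Shorter key first: both halves of 'AA' are consumed as 'A'
short_first = transliterate(u'AA', [u'A', u'AA'])

# Longer key first: 'AA' is recognised as one unit
long_first = transliterate(u'AA', sorted(table, key=len, reverse=True))
```

With the short key first the input comes out as two copies of the letter A; with the long key first it comes out as the single letter AA, which is presumably what the scheme intends.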

The answer to the problem that you asked about is, of course, as several others have pointed out, that you need to encode your unicode output using an encoding that is supported by your display device, e.g. UTF-8.

Here's some code:

#!/usr/bin/python
# -*- coding: UTF-8 -*-

input_data = """
E-k- b-u-d-z*dhi-m-aan- p-ksii#

E-k- ghn-e- j-ngg-l- m-e-ng E-k- b-h-u-t- UUNNc-aa p-e-dr thaa#
[snip]
"t-o- k-z*y-aa h-u-AA"#
"""

roman_devanagari_dict={'A' : u'अ' ,  'AA' : u'आ ' , 'I' : u'इ' , 'II' : u'ई ' , 'U' : u'उ ' ,\
[snip]
            '2' : u'२' , '5' : u'५' , '3' : u'३' , '7' : u'७' , '9' : u'९' , '1' : u'१'}

#Presuming we need to do the 3-letter cases then the 2-letter then the 1-letter
replacements = [(-len(k), unicode(k), v) for k, v in roman_devanagari_dict.items()]
replacements.sort()

data = input_data.decode('ascii')

for _junk, from_text, to_text in replacements:
    data = data.replace(from_text, to_text)

# Presuming the '-' are inter-character markers, delete them last, not first
data = data.replace(u'-', '')
data = data.replace(u'#', '')
print "untransliterated:", set(c for c in data if 0x20 < ord(c) < 0x7f)

BOM = u'\ufeff'
outf = open('devanagari.txt', 'w')
outf.write(BOM.encode('utf8')) # for the benefit of clueless Windows s/w
outf.write(data.encode('utf8'))
outf.close()

Output:

एक बुदz*धिमैन पक्षी

एक घने जनगगल मेनग एक बहुt ऊँचै पेड थa उ स की पtz*tोनग से लदी षaखैयेनग मज*zबूt बैजुओनग की tरह फेिली हुई तीनग वन हँसोनग कै एक झुनहz*ड इस पेड पर निवैस करtै थa वे सब यहैँ सुरक्षिt ते ौर बडे आ रैम से रहtे ते उ न मेनग से एक पक्षी बहुt बुदz*धिमैन थa इस बुदz*धिमैन पक्षी ने एक दिन पेड की जड मेनग से एक लtै को उ गtे देखै इस के बैरे मेनग उ सने दूसरे पक्षियोनग से बैt की "कz*यै tुमz*हेनग वह लtै दिखैई देtी हेि", उ स ने उ न से पूछै "tुमz*हेनग इसे नSहz*ट कर देनै चैहिए" "इसे कz*योनग नSहz*ट कर देनै चैहिए?" हँसोनग ने आ शz*च*रय से पूछै "यह tो इtनी छोटी से हेि हमेनग यह कz*यै हैनि पहुँचै सकtी हेि" "मेरे मित्रोनग," बुदz*धिमैन पक्षी ने उ tz*tर दियै "वह छोटी सी लtै जलz*दी ही बडी हो जैयेगी यह हमैरे पेड पर चढ*z कर उ स से लिपटtी जैयेगी ौर फिर मोटी ौर मज*zबूt हो जैयेगी" "tो कz*यै हुआ "

which has only a few recognisable words when shoved through Google Translate.
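The script above writes the byte order mark by hand. As a side note, Python's standard 'utf-8-sig' codec prepends the same three bytes when encoding, so the two write calls can be collapsed into one. A small sketch, assuming a stand-in string for the transliterated text and a temporary output path chosen here for illustration:

```python
import io
import os
import tempfile

data = u'\u090f\u0915'  # stand-in for the transliterated text

# Illustrative location; the script above writes to the current directory
out_path = os.path.join(tempfile.mkdtemp(), 'devanagari.txt')

# 'utf-8-sig' emits the UTF-8 BOM (EF BB BF) before the encoded text
with io.open(out_path, 'w', encoding='utf-8-sig') as outf:
    outf.write(data)

# Read the raw bytes back to inspect what landed on disk
with io.open(out_path, 'rb') as inf:
    raw = inf.read()
```

Reading the file back with the same codec strips the BOM again, so the round trip returns the original text.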

Update after examining the transliteration table more closely:

Three of the entries (AA, II, and U) have a space after the Devanagari equivalent. Perhaps the spaces should be removed.
The general pattern for consonants appears to be:

DEVANAGARI LETTER XA is represented by x
DEVANAGARI LETTER XXA is represented by X
DEVANAGARI LETTER XHA is represented by xh
DEVANAGARI LETTER XXHA is represented by Xh

However 3 entries break the pattern:
SSA -> sha but pattern says S
TA -> th but pattern says t
THA -> tha but pattern says th

Note: changing the above 3 entries stopped my code from complaining that S and t were left unchanged when transliterating your sample text, and removed the seemingly-anomalous sha and tha entries.

Entries (D and dr) are mapped to the same character, DEVANAGARI LETTER DDA. D is the expected entry for that character; perhaps dr should be mapped elsewhere.
There is no entry for DEVANAGARI LETTER NGA (U+0919); perhaps it should be encoded as ng -- there are a few words ending in ng in the sample text.
Are the uncatered-for "z*" occurrences in the sample text anything to do with DEVANAGARI LETTER ZA (U+095B)?
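Two of the observations above (stray spaces after some values, and two keys sharing one character) can be checked mechanically. The following is a sketch on a hypothetical six-entry excerpt of the table, not the full dictionary; the variable names are made up for the illustration.

```python
# Hypothetical excerpt of the transliteration table, keeping the quirks
# described above: trailing spaces, and two keys mapped to DDA (U+0921).
table = {
    u'A': u'\u0905',
    u'AA': u'\u0906 ',
    u'II': u'\u0908 ',
    u'U': u'\u0909 ',
    u'D': u'\u0921',
    u'dr': u'\u0921',
}

# Keys whose Devanagari value carries stray whitespace
padded = sorted(k for k, v in table.items() if v != v.strip())

# Group keys by (stripped) value to find characters reached from several keys
by_value = {}
for key, value in table.items():
    by_value.setdefault(value.strip(), []).append(key)
shared = {v: sorted(ks) for v, ks in by_value.items() if len(ks) > 1}

# A cleaned copy with the stray spaces removed
cleaned = {k: v.strip() for k, v in table.items()}
```

Running the same checks over the real dictionary would list every padded entry and every duplicated target in one go, rather than relying on reading the table by eye.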

John Machin 2010-02-16 02:04:54

Hi John :) First of all, many thanks for your valuable time and help. I am sorry about not having accepted any of the answers yet; I have been a bit held up recently with related work. I will surely get back to you ASAP. Hope you understand.

mgj 2010-02-21 23:46:24
