tags:

views:

53

answers:

2

I have a long list of domain names which I need to generate some reports on. The list contains some IDN domains, and although I know how to convert them in python on the command line:

>>> domain = u"pfarmerü.com"
>>> domain
u'pfarmer\xfc.com'
>>> domain.encode("idna")
'xn--pfarmer-t2a.com'
>>> 

I'm struggling to get it to work with a small script reading data from the text file.

#!/usr/bin/python

import sys

infile = open(sys.argv[1])

for line in infile:
    print line,
    domain = unicode(line.strip())
    print type(domain)
    print "IDN:", domain.encode("idna")
    print

I get the following output:

$ ./idn.py ./test 
pfarmer.com
<type 'unicode'>
IDN: pfarmer.com

pfarmerü.com
Traceback (most recent call last):
  File "./idn.py", line 9, in <module>
    domain = unicode(line.strip())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 7: ordinal not in range(128)

I have also tried:

#!/usr/bin/python

import sys
import codecs

infile = codecs.open(sys.argv[1], "r", "utf8")

for line in infile:
    print line,
    domain = line.strip()
    print type(domain)
    print "IDN:", domain.encode("idna")
    print

Which gave me:

$ ./idn.py ./test       
Traceback (most recent call last):
  File "./idn.py", line 8, in <module>
    for line in infile:
  File "/usr/lib/python2.6/codecs.py", line 679, in next
    return self.reader.next()
  File "/usr/lib/python2.6/codecs.py", line 610, in next
    line = self.readline()
  File "/usr/lib/python2.6/codecs.py", line 525, in readline
    data = self.read(readsize, firstline=True)
  File "/usr/lib/python2.6/codecs.py", line 472, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-5: unsupported Unicode code range

Here is my test data file:

pfarmer.com
pfarmerü.com

I'm very aware of my need to understand unicode now.

Thanks,

Peter

+4  A: 

you need to know in which encoding you file was saved. This would be something like 'utf-8' (which is NOT Unicode) or 'iso-8859-1' or 'cp1252' or alike.

Then you can do (assuming 'utf-8'):


infile = open(sys.argv[1])

for line in infile:
    print line,
    domain = line.strip().decode('utf-8')
    print type(domain)
    print "IDN:", domain.encode("idna")
    print

Convert encoded strings to unicode with decode. Convert unicode to string with encode. If you try to encode something which is already encoded, python tries to decode first, with the default codec 'ascii' which fails for non-ASCII-values.

knitti
+2  A: 

Your first example is fine, except that:

domain = unicode(line.strip())

you have to specify a particular encoding here: unicode(line.strip(), 'utf-8'). Otherwise you get the default encoding which for safety is 7-bit ASCII, hence the error. Alternatively you can spell it line.strip().decode('utf-8') as in knitti's example; there is no difference in behaviour between the two syntaxes.

However judging by the error “can't decode byte 0xfc”, I think you haven't actually saved your test file as UTF-8. Presumably this is why the second example, that also looks OK in principle, fails.

Instead it's ISO-8859-1 or the very similar Windows code page 1252. If it's come from a text editor on a Western Windows box it will certainly be the latter; Linux machines use UTF-8 by default instead nowadays. Either make sure to save your file as UTF-8, or read the file using the encoding 'cp1252' instead.

bobince