Suppose I type line = line.decode('gb18030') and get the error:

UnicodeDecodeError: 'gb18030' codec can't decode bytes in position 142-143: illegal multibyte sequence

Is there a nice way to automatically get the error bytes? That is, is there a way to get 142 & 143 or line[142:144] from a built-in command or module? Since I'm fairly confident that there will be only one such error, at most, per line, my first thought was along the lines of:

for i in range(len(line)):
    try:    
        line[i].decode('gb18030')
    except UnicodeDecodeError:
        error = i

The problem is that gb18030 is a variable-width encoding, so this byte-by-byte method fails as soon as it reaches a Chinese character (2 bytes).

+1  A: 

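Access the `start` and `end` attributes of the caught exception object. Note that `end` is exclusive, so the offending bytes are the slice from `e.start` up to, but not including, `e.end`.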

u = u'áiuê©'
try:
  l = u.encode('latin-1')
  print repr(l)
  l.decode('utf-8')
except UnicodeDecodeError, e:
  print e
  print e.start, e.end
Ignacio Vazquez-Abrams
Beat me by 9 seconds. :-)
Omnifarious
Why so much code in the try clause? Why not have only the `decode()` there?
EOL
+2  A: 
try:
    line = line.decode('gb18030')
except UnicodeDecodeError, e:
    # e.end is exclusive, so the last bad byte is at index e.end - 1
    print "Error in bytes %d through %d" % (e.start, e.end - 1)
Omnifarious
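
For anyone on Python 3, here is a minimal sketch of the same idea that also returns the offending bytes themselves, which is what the question asked for with line[142:144]. The helper name `find_bad_bytes` and the sample input are made up for illustration; only `start`, `end`, and `object` are actual attributes of `UnicodeDecodeError`. Because `end` is exclusive, `e.object[e.start:e.end]` is exactly the undecodable slice.

```python
def find_bad_bytes(line, encoding="gb18030"):
    """Return (start, end, bad_bytes) for the first undecodable span, or None.

    Hypothetical helper for illustration; `line` must be a bytes object.
    """
    try:
        line.decode(encoding)
    except UnicodeDecodeError as e:
        # e.object is the input being decoded; e.end is exclusive,
        # so this slice is exactly the offending bytes.
        return e.start, e.end, e.object[e.start:e.end]
    return None


# Sample input (an assumption for the demo): 0xFF is not a valid
# lead byte in gb18030, so decoding stops at index 3.
line = b"abc\xffdef"
print(find_bad_bytes(line))
```

This only reports the first error in a line, which matches the "at most one error per line" assumption in the question. How many bytes the codec reports for a given bad sequence can differ between Python versions, so it seems safer to rely on the slice than on a fixed width.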