
Hi All,

I am testing chardet in one of my scripts. I want to identify the encoding of a result variable, and chardet seems to do fine here.

So this is what I am doing:

    myvar1 = ...                       # gets its value from other functions
    myvar2 = chardet.detect(myvar1)    # detect the encoding of myvar1

Now when I do print myvar2, I receive this output:

{'confidence': 1.0, 'encoding': 'ascii'}

Question 1: Can someone give a pointer on how to extract just the encoding value from this, i.e. ascii?
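
For reference, detect() returns a plain dict (as the printed output above shows), so the value can be read by key; a minimal sketch with a stand-in input:

    import chardet

    myvar1 = 'Some ASCII text'          # stands in for the real input
    myvar2 = chardet.detect(myvar1)     # {'confidence': 1.0, 'encoding': 'ascii'}
    encoding = myvar2['encoding']       # 'ascii'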

Edit: The scenario is as follows:

I am using unicode(myvar1) to write all input as unicode. But as soon as myvar1 gets a value like 0xab, unicode(myvar1) fails with the error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position xxx: ordinal not in range(128)

Therefore, I am trying to (see the sketch after this list):

  1. first identify the encoding of the input that comes in myvar1,
  2. store that encoding in myvar2,
  3. decode the input (myvar1) with this encoding (myvar2) using decode() [?],
  4. pass it on to unicode.
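
A minimal sketch of that flow (the helper name to_unicode is made up here), assuming myvar1 holds a byte string; chardet's guess is not guaranteed to be right:

    import chardet

    def to_unicode(raw):
        # Guess the encoding, then decode with the guess; fall back
        # to UTF-8 if detection returns no encoding at all.
        guess = chardet.detect(raw)              # e.g. {'confidence': 1.0, 'encoding': 'ascii'}
        encoding = guess['encoding'] or 'utf-8'
        return raw.decode(encoding)              # a unicode object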

The input coming in is variable and not in my control.

I am sure there are other ways to do this, but I am new to this and open to trying them.

Any pointers, please.

Many Thanks.

+1  A: 
nosklo
Thanks. But the input coming into myvar1 is variable; I don't know the specific encoding of each value that comes into myvar1. Currently I am using unicode(myvar1), but I have found it to fail on certain inputs. That's why I am looking at chardet's autodetect scheme.
sunshine
@sunshine: You can't read text whose encoding you don't know. It is ***impossible*** -- the same byte sequence can mean different chars in different encodings (example below). In other words, encodings are ambiguous. `chardet` is just a **guess**; it can and will fail in the wild. The best and **only** reliable way is to ask whoever generated the string which encoding was used in the first place.
nosklo
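
For instance, the same two bytes decode cleanly under two different encodings, giving different characters:

    raw = '\xc3\xa9'
    print repr(raw.decode('utf-8'))     # u'\xe9'      -- 'é'
    print repr(raw.decode('latin-1'))   # u'\xc3\xa9'  -- 'Ã©'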
Okay. Could you please look at the scenario in my question above and share your thoughts?
sunshine
@sunshine: I've added information to my answer above.
nosklo
I understand. I checked the files at my end. There are two file types: (1) plain text and (2) XML. I did a cat -v file > newfile and then fed this file into the script. The script still breaks with the same error.
sunshine
I read about using a BOM for similar issues. Any idea on that? I am thinking of doing something like: if myvar1 == '\xab' then decode(myvar1, 'UTF-8') (action 1), else (action 2)... what do you think?
sunshine
@sunshine: As I have said before, you need to know the actual encoding. Trying to decode in one encoding and falling back to another isn't reliable, because the same byte can be valid in multiple encodings while meaning different chars. A BOM won't help you here either. XML files often include an encoding declaration, `<?xml version="1.0" encoding="iso-8859-1" ?>`, inside the file; without an encoding declaration you can use the BOM, and if there's no BOM, XML defaults to UTF-8 (sketch below). That's in the specification of the XML format; you can't apply it to other file types.
nosklo
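
A rough sketch of that BOM check (sniff_xml_encoding is an illustrative name; a real XML parser would also honour the encoding declaration):

    import codecs

    def sniff_xml_encoding(data):
        # Look for a BOM first; with no BOM (and no declaration),
        # the XML spec says to assume UTF-8.
        for bom, name in [(codecs.BOM_UTF8, 'utf-8'),
                          (codecs.BOM_UTF16_LE, 'utf-16-le'),
                          (codecs.BOM_UTF16_BE, 'utf-16-be')]:
            if data.startswith(bom):
                return name
        return 'utf-8'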
Okay, the encoding in use is UTF-8, *in the XML*.
sunshine
I am now trying to add a new exception-handling block. Hopefully it will show some info. Thanks.
sunshine
Okay, finally got it to run :) I added an exception catch block (sketched below): on any exception, I first decode myvar1 with ISO-8859-1 and then write the output back using unicode(myvar1).
sunshine
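
Roughly, that fallback looks like this; ISO-8859-1 never raises, because every byte maps to some code point, but it can silently produce wrong characters for non-Latin-1 data:

    myvar1 = '\xab...'                      # stand-in for the real input
    try:
        text = unicode(myvar1)              # fine for pure-ASCII input
    except UnicodeDecodeError:
        text = myvar1.decode('iso-8859-1')  # never fails, may mis-map bytes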
I am sure there are more ways to do it; for now it's cool. It's been a long night, but I learnt things. Thanks for your help.
sunshine
A: 

Second problem: as the traceback says, aBuf is an int but a string is expected. You need to find out why.

Uhhh ... just worked it out: you are feeding it a single byte expressed as an integer (0xab) instead of a string ('\xab'). In any case, chardet requires much more than one byte to be able to guess an encoding; feeding any charset detector a single byte is utterly pointless.
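
A minimal illustration of the difference:

    import chardet

    # chardet.detect(0xab)                 # wrong: an int has no len(), so chardet blows up
    result = chardet.detect('\xab' * 100)  # pass a byte string, and plenty of bytes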

John Machin
0xab is 171. len(aBuf) is trying to get the length of a number, so it's failing...? Isn't this a bug in chardet then?
sunshine
No, it's a bug with you :-) See my edit.
John Machin
:) Then how can I identify the encoding of a single-byte input?
sunshine
When using unicode(myvar1), it fails with: UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position xxx.
sunshine
You can't. There are multiple possibilities for every byte. Why do you think that you need to identify the encoding for a single byte???
John Machin
Updated my question above to be clearer.
sunshine