
Hi All,

I am testing chardet in one of my scripts. I want to identify the encoding of a result variable, and chardet seems to do fine here.

So this is what I am doing:

    myvar1 = ...                       # gets its value from other functions
    myvar2 = chardet.detect(myvar1)    # detect the encoding of myvar1

Now when I do print myvar2, I receive this output:

{'confidence': 1.0, 'encoding': 'ascii'}

Question 1: Can someone give a pointer on how to extract just the encoding value from this, i.e. ascii?
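
For reference, detect() returns a plain dict (as the printed output above shows), so the value can be read by key; a minimal sketch with a stand-in input:

    import chardet

    myvar1 = 'Some ASCII text'          # stands in for the real input
    myvar2 = chardet.detect(myvar1)     # {'confidence': 1.0, 'encoding': 'ascii'}
    encoding = myvar2['encoding']       # 'ascii'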

Edit: The scenario is as follows:

I am using unicode(myvar1) to write all input as unicode. But as soon as myvar1 gets a value like 0xab, unicode(myvar1) fails with the error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position xxx: ordinal not in range(128)

Therefore, I am trying to (see the sketch after this list):

  1. first identify the encoding of the input that comes in myvar1,
  2. store that encoding in myvar2,
  3. decode the input (myvar1) with this encoding (myvar2) using decode() [?],
  4. pass it on to unicode.
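
A minimal sketch of that flow (the helper name to_unicode is made up here), assuming myvar1 holds a byte string; chardet's guess is not guaranteed to be right:

    import chardet

    def to_unicode(raw):
        # Guess the encoding, then decode with the guess; fall back
        # to UTF-8 if detection returns no encoding at all.
        guess = chardet.detect(raw)              # e.g. {'confidence': 1.0, 'encoding': 'ascii'}
        encoding = guess['encoding'] or 'utf-8'
        return raw.decode(encoding)              # a unicode object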

The input coming in is variable and not in my control.

I am sure there are other ways to do this, but I am new to this and open to trying them.

Any pointers, please.

Many Thanks.

+1  A: 
nosklo
Thanks. But the input coming into myvar1 is variable; I don't know the specific encoding of each value that comes into myvar1. Currently I am using unicode(myvar1), but I have found it to fail on certain inputs. That's why I am looking at chardet's autodetect scheme.
sunshine
@sunshine: You can't read text whose encoding you don't know. It is ***impossible*** -- the same byte sequence can mean different chars in different encodings (example below). In other words, encodings are ambiguous. `chardet` is just a **guess**; it can and will fail in the wild. The best and **only** reliable way is to ask whoever generated the string which encoding was used in the first place.
nosklo
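
For instance, the same two bytes decode cleanly under two different encodings, giving different characters:

    raw = '\xc3\xa9'
    print repr(raw.decode('utf-8'))     # u'\xe9'      -- 'é'
    print repr(raw.decode('latin-1'))   # u'\xc3\xa9'  -- 'Ã©'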
Okay. Could you please look at the scenario in my question above and share your thoughts?
sunshine
@sunshine: I've added information to my answer above.
nosklo
I understand. I checked the files at my end. There are two file types: (1) plain text and (2) XML. I did a cat -v file > newfile and then fed this file into the script. The script still breaks with the same error.
sunshine
I read about using a BOM for similar issues. Any idea on that? I am thinking of doing something like: if myvar1 == '\xab' then decode(myvar1, 'UTF-8') (action 1), else (action 2)... what do you think?
sunshine
@sunshine: As I have said before, you need to know the actual encoding. Trying to decode in one encoding and falling back to another isn't reliable, because the same byte can be valid in multiple encodings while meaning different chars. A BOM won't help you here either. XML files often include an encoding declaration, `<?xml version="1.0" encoding="iso-8859-1" ?>`, inside the file; without an encoding declaration you can use the BOM, and if there's no BOM, XML defaults to UTF-8 (sketch below). That's in the specification of the XML format; you can't apply it to other file types.
nosklo
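
A rough sketch of that BOM check (sniff_xml_encoding is an illustrative name; a real XML parser would also honour the encoding declaration):

    import codecs

    def sniff_xml_encoding(data):
        # Look for a BOM first; with no BOM (and no declaration),
        # the XML spec says to assume UTF-8.
        for bom, name in [(codecs.BOM_UTF8, 'utf-8'),
                          (codecs.BOM_UTF16_LE, 'utf-16-le'),
                          (codecs.BOM_UTF16_BE, 'utf-16-be')]:
            if data.startswith(bom):
                return name
        return 'utf-8'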
Okay, the encoding in use is UTF-8, *in the XML*.
sunshine
I am now trying to add a new exception-handling block. Hopefully it will show some info. Thanks.
sunshine
Okay, finally got it to run :) I added an exception catch block (sketched below): on any exception, I first decode myvar1 with ISO-8859-1 and then write the output back using unicode(myvar1).
sunshine
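
Roughly, that fallback looks like this; ISO-8859-1 never raises, because every byte maps to some code point, but it can silently produce wrong characters for non-Latin-1 data:

    myvar1 = '\xab...'                      # stand-in for the real input
    try:
        text = unicode(myvar1)              # fine for pure-ASCII input
    except UnicodeDecodeError:
        text = myvar1.decode('iso-8859-1')  # never fails, may mis-map bytes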
I am sure there are more ways to do it; for now it's cool. It's been a long night, but I learnt things. Thanks for your help.
sunshine
A: 

Second problem: as the traceback says, aBuf is an int but a string is expected. You need to find out why.

Uhhh ... just worked it out: you are feeding it a single byte expressed as an integer (0xab) instead of a string ('\xab'). In any case, chardet requires much more than one byte to be able to guess an encoding; feeding any charset detector a single byte is utterly pointless.
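
A minimal illustration of the difference:

    import chardet

    # chardet.detect(0xab)                 # wrong: an int has no len(), so chardet blows up
    result = chardet.detect('\xab' * 100)  # pass a byte string, and plenty of bytes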

John Machin
0xab is 171. len(aBuf) is trying to get the length of a number, so it's failing...? Isn't this a bug in chardet then?
sunshine
No, it's a bug with you :-) See my edit.
John Machin
:) Then how can I identify the encoding of a single-byte input?
sunshine
When using unicode(myvar1), it fails with: UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position xxx.
sunshine
You can't. There are multiple possibilities for every byte. Why do you think that you need to identify the encoding for a single byte???
John Machin
Updated my question above to be clearer.
sunshine