I am trying to use the Universal Encoding Detector (chardet) in Python to detect the most probable character encoding in a text file ('infile') and use that in further processing.

While chardet is designed primarily for detecting the character encoding of webpages, I have found an example of it being used on individual text files.

However, I cannot work out how to get the script to assign the most likely character encoding to the variable 'charenc' (which is used several times throughout the script).

My code, based on a combination of the aforementioned example and chardet's own documentation, is as follows:

import chardet
rawdata = open(infile, "rb").read()
chardet.detect(rawdata)

Character detection is necessary as the script goes on to run the following (as well as several similar uses):

inF=open(infile,"rb")
s=unicode(inF.read(),charenc)
inF.close()

Any help would be greatly appreciated.

+1  A: 

chardet.detect returns a dictionary which provides the encoding as the value associated with the key 'encoding'. So you can do this:

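import chardet

# Read the file as raw bytes; chardet works on undecoded data
rawdata = open(infile, "rb").read()
result = chardet.detect(rawdata)
# The detected encoding name is stored under the 'encoding' key
charenc = result['encoding']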
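
One thing to watch for: the dictionary also carries a 'confidence' value, and 'encoding' can be None when chardet cannot make a guess. Below is a minimal sketch of a guard for that case. The helper name choose_encoding, the fallback of "utf-8" and the 0.5 threshold are all assumptions made for illustration, not part of chardet, and the two dictionaries are hand-written stand-ins for what chardet.detect(rawdata) might return.

```python
def choose_encoding(result, fallback="utf-8", min_confidence=0.5):
    """Pick an encoding name from a chardet.detect()-style result dict.

    Hypothetical helper: falls back when no encoding was detected or
    when the reported confidence is below min_confidence.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding


# Hand-written stand-ins for the output of chardet.detect(rawdata)
confident = {"encoding": "ISO-8859-2", "confidence": 0.87}
undetected = {"encoding": None, "confidence": 0.0}

# A confident guess is used as-is
charenc = choose_encoding(confident)

# With no guess, the caller-supplied fallback is used for decoding
text = b"caf\xe9".decode(choose_encoding(undetected, fallback="latin-1"))
```

In the script itself you would pass the real result, i.e. charenc = choose_encoding(chardet.detect(rawdata)), and pick whatever fallback suits your files.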
David Zaslavsky
Thank you! I thought it would be something simple!
Haidon