I know that I can do the following:

>>> import encodings, pprint
>>> pprint.pprint(sorted(encodings.aliases.aliases.values()))
['ascii',
 'base64_codec',
 'big5',
 'big5hkscs',
 'bz2_codec',
 'cp037',
 'cp1026',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'cp424',
 'cp437',
 'cp500',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp857',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp932',
 'cp949',
 'cp950',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_jp',
 'euc_kr',
 'gb18030',
 'gb2312',
 'gbk',
 'hex_codec',
 'hp_roman8',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'iso8859_10',
 'iso8859_11',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'iso8859_16',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'johab',
 'koi8_r',
 'latin_1',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'mbcs',
 'ptcp154',
 'quopri_codec',
 'rot_13',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'tactis',
 'tis_620',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_32',
 'utf_32_be',
 'utf_32_le',
 'utf_7',
 'utf_8',
 'uu_codec',
 'zlib_codec']

I also know for sure that this is not a complete list, since it includes only the encodings for which an alias exists (e.g. "cp737" is missing), and at least some pseudo-encodings are missing as well (e.g. "string_escape").
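
For instance (an illustrative check; output may vary slightly between Python versions), "cp737" has no alias entry, yet the codec can still be looked up on demand:

>>> import codecs, encodings
>>> 'cp737' in encodings.aliases.aliases.values()
False
>>> codecs.lookup('cp737').name  # the codec exists, it just has no alias
'cp737'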

As the title of the question says: how can I programmatically get a list of all codecs/encodings known to Python?

If not programmatically: is there a complete list available online?

A: 

I don't think the complete list is stored anywhere in the Python standard library. Instead, encodings are loaded on demand through calls to encodings.search_function(encoding). If you study the code there, the encoding string is first normalized and then the encodings package is searched for a submodule whose name matches the normalized encoding.
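
For example, codecs.lookup() goes through that search machinery, so differently spelled names resolve to the same codec (an illustrative session, assuming a standard CPython install):

>>> import codecs
>>> codecs.lookup('Latin-1').name   # normalized, then resolved via the aliases table
'iso8859-1'
>>> codecs.lookup('UTF8').name
'utf-8'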

The following uses pkgutil to list all the submodules of the encodings package, and then adds them to the names listed in encodings.aliases.aliases.

Unfortunately, encodings.aliases.aliases contains one encoding, tactis, that does not correspond to any submodule, so I generate the complete list by taking the union of the two sets.

import encodings
import os
import pkgutil

# Names of the submodules of the encodings package (one per codec module)
modnames = {modname for importer, modname, ispkg in pkgutil.walk_packages(
    path=[os.path.dirname(encodings.__file__)], prefix='')}
# Canonical names reachable through an alias
aliases = set(encodings.aliases.aliases.values())

print(modnames - aliases)
# set(['charmap', 'unicode_escape', 'cp1006', 'unicode_internal', 'punycode', 'string_escape', 'aliases', 'palmos', 'mac_centeuro', 'mac_farsi', 'mac_romanian', 'cp856', 'raw_unicode_escape', 'mac_croatian', 'utf_8_sig', 'mac_arabic', 'undefined', 'cp737', 'idna', 'koi8_u', 'cp875', 'cp874', 'iso8859_1'])

print(aliases - modnames)
# set(['tactis'])

codec_names = modnames.union(aliases)
print(codec_names)
# set(['bz2_codec', 'cp1140', 'euc_jp', 'cp932', 'punycode', 'euc_jisx0213', 'aliases', 'hex_codec', 'cp500', 'uu_codec', 'big5hkscs', 'mac_romanian', 'mbcs', 'euc_jis_2004', 'iso2022_jp_3', 'iso2022_jp_2', 'iso2022_jp_1', 'gbk', 'iso2022_jp_2004', 'unicode_internal', 'utf_16_be', 'quopri_codec', 'cp424', 'iso2022_jp', 'mac_iceland', 'raw_unicode_escape', 'hp_roman8', 'iso2022_kr', 'cp875', 'iso8859_6', 'cp1254', 'utf_32_be', 'gb2312', 'cp850', 'shift_jis', 'cp852', 'cp855', 'iso8859_3', 'cp857', 'cp856', 'cp775', 'unicode_escape', 'cp1026', 'mac_latin2', 'utf_32', 'mac_cyrillic', 'base64_codec', 'ptcp154', 'palmos', 'mac_centeuro', 'euc_kr', 'hz', 'utf_8', 'utf_32_le', 'mac_greek', 'utf_7', 'mac_turkish', 'utf_8_sig', 'mac_arabic', 'tactis', 'cp949', 'zlib_codec', 'big5', 'iso8859_9', 'iso8859_8', 'iso8859_5', 'iso8859_4', 'iso8859_7', 'cp874', 'iso8859_1', 'utf_16_le', 'iso8859_2', 'charmap', 'gb18030', 'cp1006', 'shift_jis_2004', 'mac_roman', 'ascii', 'string_escape', 'iso8859_15', 'iso8859_14', 'tis_620', 'iso8859_16', 'iso8859_11', 'iso8859_10', 'iso8859_13', 'cp950', 'utf_16', 'cp869', 'mac_farsi', 'rot_13', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'shift_jisx0213', 'johab', 'mac_croatian', 'cp1255', 'latin_1', 'cp1257', 'cp1256', 'cp1251', 'cp1250', 'cp1253', 'cp1252', 'cp437', 'cp1258', 'undefined', 'cp737', 'koi8_r', 'cp037', 'koi8_u', 'iso2022_jp_ext', 'idna'])
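
If you only want names that actually resolve to codecs, one possible refinement (a sketch, assuming the same modnames/aliases union as above) is to filter the candidates through codecs.lookup(), which also drops helper submodules such as aliases that are not codecs:

import codecs
import encodings
import os
import pkgutil

# Recompute the union from above, then keep only the names that codecs.lookup()
# can resolve, recording each codec's canonical name.
candidates = {modname for importer, modname, ispkg in pkgutil.walk_packages(
    path=[os.path.dirname(encodings.__file__)], prefix='')}
candidates |= set(encodings.aliases.aliases.values())

resolvable = set()
for name in candidates:
    try:
        resolvable.add(codecs.lookup(name).name)  # triggers the on-demand search
    except LookupError:
        pass  # e.g. 'aliases' is a helper module, not a codec

print(sorted(resolvable))

Note that codecs.lookup() reports each codec's canonical name, so aliases collapse onto a single entry (e.g. 'latin_1' comes back as 'iso8859-1').
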
unutbu
@killown: did you downvote this? If yes, you shouldn't, because it gives a programmatically useful result (a set), which was my primary target. How can I iterate over `help(encodings)` in a program?
ΤΖΩΤΖΙΟΥ
In general: anyone who downvoted this answer has absolutely no grasp of what up/downvoting means here and should hover their mouse pointers over the arrows before clicking.
ΤΖΩΤΖΙΟΥ
@killown: indeed; see who voted to close the question. Unfortunately, the list of similar questions shown *after* the question is posted is *not* the same as the one shown *while* typing the question; also, searching for "[python] encodings list" did not turn up the old question.
ΤΖΩΤΖΙΟΥ