Is there a way to iterate over every character in a given encoding and print its code? Say, UTF-8?
dude, do you have any idea how many code points there are in Unicode...
btw, from the Python docs:
unichr(i)
Return the Unicode string of one character whose Unicode code is the integer i. For example, unichr(97) returns the string u'a'. This is the inverse of ord() for Unicode strings. The valid range for the argument depends how Python was configured – it may be either UCS2 [0..0xFFFF] or UCS4 [0..0x10FFFF]. ValueError is raised otherwise. For ASCII and 8-bit strings see chr().
New in version 2.0.
So:

    import sys
    for i in xrange(sys.maxunicode + 1):
        print unichr(i)
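Note that unichr and xrange are Python 2 only. A rough Python 3 equivalent, skipping the surrogate range (lone surrogates cannot be encoded for printing), would be:

```python
import sys

# chr() replaces unichr() and always covers the full Unicode range in Python 3.
for i in range(sys.maxunicode + 1):
    if 0xD800 <= i <= 0xDFFF:  # lone surrogates can't be encoded to stdout
        continue
    print(chr(i))
```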
For single-byte encodings you can use:
''.join(chr(x) for x in range(256)).decode(encoding, 'ignore')
to get a string containing all the valid characters in the given encoding.
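In Python 3 the str type has no .decode(), so the same trick goes through bytes instead; a sketch, using cp1252 as my example encoding:

```python
# Every possible byte value, with the undefined ones dropped by 'ignore'.
all_bytes = bytes(range(256))
valid = all_bytes.decode('cp1252', 'ignore')
# cp1252 leaves a handful of bytes undefined, so len(valid) is just under 256.
print(len(valid))
```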
For fixed-size multibyte encodings, careful use of struct.pack() in place of chr() should work.
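For instance, treating UTF-16-BE as a fixed two-byte-per-unit encoding, struct.pack('>H', ...) can play the role of chr(). A sketch; the surrogate halves are skipped because they are not valid on their own:

```python
import struct

chars = []
for x in range(0x10000):
    if 0xD800 <= x <= 0xDFFF:  # lone surrogates are not valid UTF-16
        continue
    two_bytes = struct.pack('>H', x)  # big-endian 16-bit value in place of chr()
    chars.append(two_bytes.decode('utf-16-be'))
print(len(chars))  # 0x10000 minus the 2048 surrogate values
```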
All Unicode characters can be represented in UTF-n for all defined n. What are you trying to achieve?
If you really want to do something like print all the valid characters in a particular encoding, without needing to know whether the encoding is single-byte or multibyte, or whether its size is fixed or not:
    import unicodedata as ucd
    import sys

    def dump_encoding(enc):
        for i in xrange(sys.maxunicode + 1):
            u = unichr(i)
            try:
                s = u.encode(enc)
            except UnicodeEncodeError:
                continue
            try:
                name = ucd.name(u)
            except ValueError:
                name = '?'
            print "U+%06X %r %s" % (i, s, name)

    if __name__ == "__main__":
        dump_encoding(sys.argv[1])
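Under Python 3 the same routine shrinks a little, since chr() handles the full range and unicodedata.name() accepts a default value; a sketch (call dump_encoding(enc) yourself rather than via sys.argv):

```python
import sys
import unicodedata as ucd

def dump_encoding(enc):
    # Walk all code points; keep those the codec can encode.
    for i in range(sys.maxunicode + 1):
        u = chr(i)
        try:
            s = u.encode(enc)
        except (UnicodeEncodeError, ValueError):
            continue
        name = ucd.name(u, '?')  # default argument instead of a try/except
        print("U+%06X %r %s" % (i, s, name))
```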
Suggestions: Try it out on something small, like cp1252. Redirect stdout to a file.