Is there a way to iterate over every character in a given encoding and print its code? Say, UTF-8?

+2  A: 

Dude, do you have any idea how many code points there are in Unicode...

btw, from the Python docs:

unichr(i)

Return the Unicode string of one character whose Unicode code is the integer i. For example, unichr(97) returns the string u'a'. This is the inverse of ord() for Unicode strings. The valid range for the argument depends how Python was configured – it may be either UCS2 [0..0xFFFF] or UCS4 [0..0x10FFFF]. ValueError is raised otherwise. For ASCII and 8-bit strings see chr().

New in version 2.0.

So

import sys
for i in xrange(sys.maxunicode + 1):
    print unichr(i)
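
On Python 3, `unichr` and `xrange` are gone and `chr` covers the whole range, so a rough equivalent that also shows each code might look like the sketch below. The name `format_code_points` and the `U+XXXX` layout are just illustrative choices, not anything from the docs. Surrogates are skipped on the assumption that the output goes to a UTF-8 terminal, which cannot encode them.

```python
import sys

def format_code_points(start=0, stop=sys.maxunicode + 1):
    """Yield 'U+XXXX <char>' lines for the code points in [start, stop)."""
    for i in range(start, stop):
        # Assumption: lone surrogates are skipped because they cannot be
        # encoded when printed to a UTF-8 stream.
        if 0xD800 <= i <= 0xDFFF:
            continue
        yield "U+{:04X} {}".format(i, chr(i))
```

Something like `for line in format_code_points(0x61, 0x64): print(line)` keeps the output small while trying it out.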
wich
That's why I don't wanna do it by hand :)
Geo
You don't want to do it automatically either: it's over one million code points. It'd take a tree to print that; in fact, it's called the Unicode Standard book ;)
wich
Speaking of printing: http://ian-albert.com/misc/unichart.php
Ignacio Vazquez-Abrams
@wich: So, don't print it :) Not all of our computers are connected directly to a line printer anymore :) Over a million?! You'd need some kind of machine, like a computer, to handle that much data! :)
Ian Clelland
+1  A: 

For single-byte encodings you can use:

''.join(chr(x) for x in range(256)).decode(encoding, 'ignore')

to get a string containing all the valid characters in the given encoding.
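
The one-liner above is Python 2. A minimal Python 3 sketch of the same idea, assuming `bytes(range(256))` as the replacement for joining `chr(x)` values (the helper name `single_byte_chars` is made up for illustration):

```python
def single_byte_chars(encoding):
    """Decode all 256 byte values at once, dropping bytes the codec rejects."""
    all_bytes = bytes(range(256))
    # 'ignore' silently skips byte values that have no mapping in `encoding`.
    return all_bytes.decode(encoding, "ignore")

def chars_with_codes(encoding):
    """Pair each decoded character with its Unicode code point."""
    return [(ch, ord(ch)) for ch in single_byte_chars(encoding)]
```

With `'ignore'` the result loses track of which byte produced which character, so this only answers "which characters exist", not "which byte maps to what".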

For fixed-size multibyte encodings, careful use of struct.pack() in place of chr() should work.
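
A hedged Python 3 sketch of what the struct.pack() approach could look like. The helper `fixed_width_chars`, its `fmt` and `limit` parameters, and the choice to decode each unit separately (instead of relying on `'ignore'`) are my assumptions. With `'utf-16-le'` and 2-byte units this only reaches the BMP, since anything beyond it needs a surrogate pair.

```python
import struct

def fixed_width_chars(encoding, fmt="<H", limit=0x10000):
    """Return (unit, char) pairs for every fixed-width unit that decodes.

    fmt   -- struct format for one code unit, e.g. '<H' (2 bytes, little-endian)
    limit -- number of unit values to try; must fit in what `fmt` can pack
    """
    pairs = []
    for unit in range(limit):
        raw = struct.pack(fmt, unit)
        try:
            pairs.append((unit, raw.decode(encoding)))
        except UnicodeDecodeError:
            # This unit is not a valid character on its own in `encoding`
            # (for UTF-16 that means a lone surrogate).
            continue
    return pairs
```

For a 4-byte encoding one might call `fixed_width_chars('utf-32-le', '<I', 0x110000)`, which takes noticeably longer.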

Ignacio Vazquez-Abrams
+3  A: 

All Unicode characters can be represented in UTF-n for all defined n. What are you trying to achieve?

If you really want to do something like print all the valid characters in a particular encoding, without needing to know whether the encoding is "single byte" or "multi byte" or whether its size is fixed or not:

import unicodedata as ucd
import sys

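def dump_encoding(enc):
    # sys.maxunicode is itself a valid code point, hence the + 1
    for i in xrange(sys.maxunicode + 1):
        u = unichr(i)
        try:
            s = u.encode(enc)
        except UnicodeEncodeError:
            # not representable in this encoding
            continue
        try:
            name = ucd.name(u)
        except ValueError:
            # unnamed code point (e.g. a control character)
            name = '?'
        print "U+%06X %r %s" % (i, s, name)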

if __name__ == "__main__":
    dump_encoding(sys.argv[1])

Suggestions: Try it out on something small, like cp1252. Redirect stdout to a file.
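
The script above targets Python 2. A rough Python 3 port, written as a generator so the rows can be filtered or counted before printing, might look like this. The name `iter_encoding` is hypothetical, and I'm assuming the codec signals unencodable characters with UnicodeEncodeError, which holds for the usual charmap and UTF codecs.

```python
import sys
import unicodedata as ucd

def iter_encoding(enc):
    """Yield (code_point, encoded_bytes, name) for each character `enc` can encode."""
    for i in range(sys.maxunicode + 1):
        u = chr(i)
        try:
            s = u.encode(enc)
        except UnicodeEncodeError:
            # Not representable in this encoding (includes lone surrogates
            # for the UTF codecs on Python 3).
            continue
        # The second argument is returned when the code point has no name.
        yield i, s, ucd.name(u, "?")

def dump_encoding(enc, out=sys.stdout):
    """Write one 'U+XXXXXX bytes name' line per encodable character."""
    for i, s, name in iter_encoding(enc):
        out.write("U+%06X %r %s\n" % (i, s, name))
```

The same suggestions apply: start with something small like `dump_encoding('cp1252')` and send the output to a file.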

John Machin