Is there a way to iterate over every character in a given encoding and print its code? Say, UTF-8?
dude, do you have any idea how many code points there are in Unicode...
btw, from the Python docs:
unichr(i)
Return the Unicode string of one character whose Unicode code is the integer i. For example, unichr(97) returns the string u'a'. This is the inverse of ord() for Unicode strings. The valid range for the argument depends how Python was configured – it may be either UCS2 [0..0xFFFF] or UCS4 [0..0x10FFFF]. ValueError is raised otherwise. For ASCII and 8-bit strings see chr().
New in version 2.0.
So:

    import sys
    for i in xrange(sys.maxunicode + 1):
        print unichr(i)
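Note that unichr and xrange are Python 2 only. A rough Python 3 equivalent, skipping the surrogate range (lone surrogates cannot be encoded for printing), would be:

```python
import sys

# chr() replaces unichr() and always covers the full Unicode range in Python 3.
for i in range(sys.maxunicode + 1):
    if 0xD800 <= i <= 0xDFFF:  # lone surrogates can't be encoded to stdout
        continue
    print(chr(i))
```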
For single-byte encodings you can use:
''.join(chr(x) for x in range(256)).decode(encoding, 'ignore')
to get a string containing all the valid characters in the given encoding.
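In Python 3 the str type has no .decode(), so the same trick goes through bytes instead; a sketch, using cp1252 as my example encoding:

```python
# Every possible byte value, with the undefined ones dropped by 'ignore'.
all_bytes = bytes(range(256))
valid = all_bytes.decode('cp1252', 'ignore')
# cp1252 leaves a handful of bytes undefined, so len(valid) is just under 256.
print(len(valid))
```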
For fixed-size multibyte encodings, careful use of struct.pack() in place of chr() should work.
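For instance, treating UTF-16-BE as a fixed two-byte-per-unit encoding, struct.pack('>H', ...) can play the role of chr(). A sketch; the surrogate halves are skipped because they are not valid on their own:

```python
import struct

chars = []
for x in range(0x10000):
    if 0xD800 <= x <= 0xDFFF:  # lone surrogates are not valid UTF-16
        continue
    two_bytes = struct.pack('>H', x)  # big-endian 16-bit value in place of chr()
    chars.append(two_bytes.decode('utf-16-be'))
print(len(chars))  # 0x10000 minus the 2048 surrogate values
```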
All Unicode characters can be represented in UTF-n for all defined n. What are you trying to achieve?
If you really want to do something like print all the valid characters in a particular encoding, without needing to know whether the encoding is single-byte or multibyte, or whether its size is fixed or not:
    import unicodedata as ucd
    import sys

    def dump_encoding(enc):
        for i in xrange(sys.maxunicode + 1):
            u = unichr(i)
            try:
                s = u.encode(enc)
            except UnicodeEncodeError:
                continue
            try:
                name = ucd.name(u)
            except ValueError:
                name = '?'
            print "U+%06X %r %s" % (i, s, name)

    if __name__ == "__main__":
        dump_encoding(sys.argv[1])
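Under Python 3 the same routine shrinks a little, since chr() handles the full range and unicodedata.name() accepts a default value; a sketch (call dump_encoding(enc) yourself rather than via sys.argv):

```python
import sys
import unicodedata as ucd

def dump_encoding(enc):
    # Walk all code points; keep those the codec can encode.
    for i in range(sys.maxunicode + 1):
        u = chr(i)
        try:
            s = u.encode(enc)
        except (UnicodeEncodeError, ValueError):
            continue
        name = ucd.name(u, '?')  # default argument instead of a try/except
        print("U+%06X %r %s" % (i, s, name))
```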
Suggestions: Try it out on something small, like cp1252. Redirect stdout to a file.