I'm looking for a table (or a way to generate one) for every character in each of the following C Character Sets:

  • Basic Character Set
  • Basic Execution Character Set
  • Basic Source Character Set
  • Execution Character Set
  • Extended Character Set
  • Source Character Set

C99 mentions all six of these under section 5.2.1. However, I've found it extremely cryptic to read and lacking in detail.

The only character sets that it clearly defines are the Basic Execution Character Set and the Basic Source Character Set:

52 upper- and lower-case letters in the Latin alphabet:

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

a b c d e f g h i j k l m n o p q r s t u v w x y z

Ten decimal digits:

0 1 2 3 4 5 6 7 8 9

29 graphic characters:

! " # % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ { | } ~

4 whitespace characters:

space, horizontal tab, vertical tab, form feed

I believe these are the same as the Basic Character Set, though I'm guessing as C99 does not explicitly state this. The remaining Character Sets are a bit of a mystery to me.

Thanks for any help you can offer! :)

+2  A: 

You might have a look at GNU iconv. Among many others, it will print or convert both Java and C99 strings. iconv is a command line interface to libiconv which, very likely, is what your C99 compiler is using internally for these character conversions.

Type iconv -l to see what character sets are available on your system. You will need to recompile from source to change that set.

On OS X, I have 141 character sets. On Ubuntu, I have 1,168 character sets (with most of those being aliases).
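If you would rather generate a table from C than from the command line, the same library is reachable through the POSIX iconv(3) calls. The following is only a minimal sketch: convert_bytes is a hypothetical helper name, and the encoding names you pass must be ones that your own iconv -l reports.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (not part of any library): convert inlen bytes of
 * `in` from encoding `from` to encoding `to`, writing into `out`.
 * Returns the number of bytes written, or (size_t)-1 on any failure. */
size_t convert_bytes(const char *from, const char *to,
                     const char *in, size_t inlen,
                     char *out, size_t outcap)
{
    iconv_t cd = iconv_open(to, from);   /* note the order: target first */
    if (cd == (iconv_t)-1)
        return (size_t)-1;               /* encoding name not supported */

    char *inp = (char *)in;              /* iconv() wants char **; input is only read */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outp, &outleft);  /* flush any shift state */
    iconv_close(cd);

    return rc == (size_t)-1 ? (size_t)-1 : outcap - outleft;
}
```

Calling it once per character of interest and printing the resulting bytes gives one row of a table for that encoding. On some platforms you may need to link with -liconv.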

drewk
Why the down vote?
drewk
+5  A: 

Except for the Basic Character Set as you mentioned, all of the rest of the character sets are implementation-defined. That means that they could be anything, but the implementation (that is, the C compiler/libraries/toolchain implementation) must document those decisions. The key paragraphs here are:

§3.4.1 implementation-defined behavior
unspecified behavior where each implementation documents how the choice is made

§3.4.2 locale-specific behavior
behavior that depends on local conventions of nationality, culture, and language that each implementation documents

§5.2.1 Character sets (paragraph 1)
Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined.

So, look at your C compiler's documentation to find out what the other character sets are. For example, in my man page for gcc, some of the command line options state:

   -fexec-charset=charset
       Set the execution character set, used for string and character
       constants.  The default is UTF-8.  charset can be any encoding
       supported by the system's "iconv" library routine.

   -fwide-exec-charset=charset
       Set the wide execution character set, used for wide string and
       character constants.  The default is UTF-32 or UTF-16, whichever
       corresponds to the width of "wchar_t".  As with -fexec-charset,
       charset can be any encoding supported by the system's "iconv"
       library routine; however, you will have problems with encodings
       that do not fit exactly in "wchar_t".

   -finput-charset=charset
       Set the input character set, used for translation from the
       character set of the input file to the source character set used by
       GCC.  If the locale does not specify, or GCC cannot get this
       information from the locale, the default is UTF-8.  This can be
       overridden by either the locale or this command line option.
       Currently the command line option takes precedence if there's a
       conflict.  charset can be any encoding supported by the system's
       "iconv" library routine.

To get a list of the encodings supported by iconv, run iconv -l. My system has 143 different encodings to choose from.
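One way to see what -fexec-charset actually changes is to look at the bytes the compiler stored for a string literal. This is a rough sketch, not anything taken from the GCC documentation; hex_dump is a hypothetical helper name.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: write the bytes of the string s into out as
 * space-separated two-digit hex values. Stops early if out is too small.
 * Returns the number of bytes of s that were dumped. */
size_t hex_dump(const char *s, char *out, size_t cap)
{
    size_t n = 0;      /* bytes of s consumed */
    size_t used = 0;   /* characters written to out */

    if (cap == 0)
        return 0;
    out[0] = '\0';

    for (; s[n] != '\0'; ++n) {
        int w = snprintf(out + used, cap - used, n ? " %02X" : "%02X",
                         (unsigned)(unsigned char)s[n]);
        if (w < 0 || (size_t)w >= cap - used) {
            out[used] = '\0';   /* drop the partial entry */
            break;
        }
        used += (size_t)w;
    }
    return n;
}
```

Passing a literal such as "AZ" and printing the result shows the values in whatever execution character set the file was compiled with. Rebuilding with a different -fexec-charset (any name your iconv -l lists) should change those values. Keep in mind that the option applies to every string literal in the file, format strings included, so the printed text itself may look wrong on your terminal.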

Adam Rosenfield
+1  A: 

As far as I can see, the standard doesn't talk about a basic character set as something distinct from the source character set and execution character set. The standard lays out that there are 2 character sets it's concerned with: the source character set and the execution character set. Each of these has a 'basic' and an 'extended' component (and the extended component of either can be the empty set).

You have a "source character set" that is comprised of a "basic source character set" and zero or more "extended characters". The combination of the basic source character set and those extended characters is called the extended source character set.

Similarly for the execution character set (there's a basic execution character set that, combined with zero or more extended characters, makes up the extended execution character set).

The standard (and your question) enumerates the characters that must be in the basic character set; there can be other characters in the basic set.

As for the difference between the basic 'range' and the extended 'range' of each character set: the values of the members of the basic character set must fit within a byte, while that restriction doesn't hold for the extended characters. Also note that this doesn't necessarily mean that the source file encoding must be a single-byte encoding.

The values of characters in the source character sets do not need to agree with the values in the execution character sets (for example, the source character set might be comprised of ASCII, while the execution character set might be EBCDIC).
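To get an actual table of execution values for your own toolchain, you can let the compiler do the translation and print the results. The sketch below covers only the characters listed in the question (the basic execution character set additionally requires a few control characters such as new-line and the null character). The name print_basic_charset is made up for this example.

```c
#include <stdio.h>
#include <stddef.h>

/* Graphic characters listed in the question:
 * 52 letters, 10 digits and 29 punctuation marks. */
static const char basic_graphic[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "!\"#%&'()*+,-./:;<=>?[\\]^_{|}~";

/* The four whitespace members from the question, with printable names. */
static const struct { const char *name; char ch; } basic_space[] = {
    { "space", ' ' }, { "HT", '\t' }, { "VT", '\v' }, { "FF", '\f' },
};

/* Hypothetical helper: print one row per character, showing its value in
 * the execution character set of the compiler that built this file.
 * Returns the number of rows printed. */
size_t print_basic_charset(FILE *out)
{
    size_t rows = 0;

    for (const char *p = basic_graphic; *p != '\0'; ++p, ++rows)
        fprintf(out, "%-5c %3d 0x%02X\n", *p, *p,
                (unsigned)(unsigned char)*p);

    for (size_t i = 0; i < sizeof basic_space / sizeof basic_space[0]; ++i, ++rows)
        fprintf(out, "%-5s %3d 0x%02X\n", basic_space[i].name,
                basic_space[i].ch, (unsigned)(unsigned char)basic_space[i].ch);

    return rows;
}
```

Calling print_basic_charset(stdout) from main prints the table. The character literals are written in the source character set, and the numbers that come out are whatever the compiler mapped them to, so an ASCII-based build and an EBCDIC-based build would print different values for the same source file.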

Michael Burr