Every now and then, someone on SO points out that char (aka 'byte') isn't necessarily 8 bits.

It seems that an 8-bit char is almost universal. I would have thought that any mainstream platform needs an 8-bit char to be viable in the marketplace.

Both now and historically, what platforms use a char that is not 8 bits, and why would they differ from the "normal" 8 bits?

When writing code, and thinking about cross-platform support (e.g. for general-use libraries), what sort of consideration is it worth giving to platforms with non-8-bit char?

In the past I've come across some Analog Devices DSPs for which char is 16 bits. DSPs are a bit of a niche architecture I suppose. (Then again, at the time hand-coded assembler easily beat what the available C compilers could do, so I didn't really get much experience with C on that platform.)

+3  A: 

The C and C++ programming languages, for example, define byte as "addressable unit of data large enough to hold any member of the basic character set of the execution environment" (clause 3.6 of the C standard). Since the C char integral data type must contain at least 8 bits (clause 5.2.4.2.1), a byte in C is at least capable of holding 256 different values. Various implementations of C and C++ define a byte as 8, 9, 16, 32, or 36 bits.

Quoted from http://en.wikipedia.org/wiki/Byte#History

Not sure about other languages though.

http://en.wikipedia.org/wiki/IBM_7030_Stretch#Data_Formats

That article describes the byte on that machine as being variable-length.

petantik
"Not sure about other languages though" -- historically, most languages allowed the machine's architecture to define its own byte size. Actually historically so did C, until the standard set a lower bound at 8.
Windows programmer
+8  A: 

When writing code, and thinking about cross-platform support (e.g. for general-use libraries), what sort of consideration is it worth giving to platforms with non-8-bit char?

It's not so much that it's "worth giving consideration" to something as it is playing by the rules. In C++, for example, the standard says all bytes will have "at least" 8 bits. If your code assumes that bytes have exactly 8 bits, you're violating the standard.

This may seem silly now -- "of course all bytes have 8 bits!", I hear you saying. But lots of very smart people have relied on assumptions that were not guarantees, and then everything broke. History is replete with such examples.

For instance, most early-90s developers assumed that a particular no-op CPU timing delay taking a fixed number of cycles would take a fixed amount of clock time, because most consumer CPUs were roughly equivalent in power. Unfortunately, computers got faster very quickly. This spawned the rise of boxes with "Turbo" buttons -- whose purpose, ironically, was to slow the computer down so that games using the time-delay technique could be played at a reasonable speed.


One commenter asked where in the standard it says that char must have at least 8 bits. It's in section 5.2.4.2.1. That section defines CHAR_BIT, the number of bits in the smallest addressable entity, and lists a minimum value of 8 for it. It also says:

Their implementation-defined values shall be equal or greater in magnitude (absolute value) to those shown, with the same sign.

So any value of 8 or greater is a suitable choice of CHAR_BIT for an implementation.
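
(A minimal sketch of making that guarantee -- or a stronger assumption of exactly 8 -- visible at compile time; it assumes a C11 compiler for _Static_assert.)

```c
#include <limits.h>   /* CHAR_BIT */
#include <stdio.h>

/* The standard only promises CHAR_BIT >= 8. */
_Static_assert(CHAR_BIT >= 8, "guaranteed by the standard");
/* If your code silently depends on exactly 8, say so and let the build fail elsewhere: */
/* _Static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes"); */

int main(void)
{
    printf("bits per char on this implementation: %d\n", CHAR_BIT);
    return 0;
}
```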

John Feminella
I haven't seen a Turbo button in at least 20 years - do you really think it's germane to the question?
Mark Ransom
@Mark Ransom: That's the whole point. Developers often rely on assumptions which seem to be true at the moment, but which are much shakier than they initially appear. (Can't count the number of times I've made _that_ mistake!) The Turbo button should be a painful reminder not to make unnecessary assumptions, and certainly not to make assumptions that aren't guaranteed by a language standard as if they were immutable facts.
John Feminella
Could you point out the place in the C++ Standard which says that a byte has at least 8 bits? It is a common belief, but I personally failed to find it in the Standard. The only thing I found in the Standard is which characters must be representable by `char`; there are more than 64 of them but fewer than 128, so 7 bits would be enough.
Adam Badura
Section 18.2.2 invokes the C standard for it. In the C standard it's section 7.10 and then section 5.2.4.2.1. Page 22 in the C standard.
Windows programmer
+6  A: 

Machines with 36-bit architectures have 9-bit bytes. According to Wikipedia, machines with 36-bit architectures include:

  • Digital Equipment Corporation PDP-6/10
  • IBM 701/704/709/7090/7094
  • UNIVAC 1103/1103A/1105/1100/2200
R Samuel Klatchko
Actually, the DEC-10 also had 6-bit characters - you could pack 6 of them into a 36-bit word (ex-DEC-10 programmer talking)
anon
The DEC-20 used five 7-bit ASCII characters per 36-bit word on the TOPS-20 O/S.
Loadmaster
As far as I remember, on the PDP-10, 7-bit ASCII packed 5 bytes to a word was the most common format for text files (dropping one bit which, when set, was interpreted in some contexts as an indication that the word was a line number). The SIXBIT charset (a subset of ASCII, dropping the control and lower-case columns) was used for some things (for instance for names in object files), but not for text files, as there was no way to indicate the end of lines... 9-bit characters were not in common use, except perhaps to port C programs to the PDP-10.
AProgrammer
This reminds me of the joke UTF-9 http://tools.ietf.org/html/rfc4042
jleedev
+4  A: 

A few of which I'm aware:

  • DEC PDP-10: variable, but most often 7-bit chars packed 5 per 36-bit word, or else 9-bit chars, 4 per word
  • Control Data mainframes (CDC-6400, 6500, 6600, 7600, Cyber 170, Cyber 176 etc.): 6-bit chars, packed 10 per 60-bit word
  • Unisys mainframes: 9 bits/byte
  • Windows CE: simply doesn't support the `char` type at all -- requires 16-bit wchar_t instead
Jerry Coffin
I'd be surprised by C support for the first few...
ephemient
@ephemient: I'm pretty sure there was at least one (pre-standard) C compiler for the PDP-10/DecSystem 10/DecSystem 20. I'd be *very* surprised at a C compiler for the CDC mainframes though (they were used primarily for numeric work, so the Fortran compiler was the big thing there). I'm pretty sure the others do have C compilers.
Jerry Coffin
Did the Windows CE compiler really not support the `char` type at all? I know that the system libraries only supported the wide char versions of functions that take strings, and that at least some versions of WinCE removed the ANSI string functions like strlen, to stop you doing char string-handling. But did it really not have a char type at all? What was `sizeof(TCHAR)`? What type did malloc return? How was the Java `byte` type implemented?
Steve Jessop
@Steve: Well, it's been a while since I wrote any code for CE, so I can't swear to it, but my recollection is that even attempting to define a char variable leads to a compiler error. Then again, that *is* depending on my memory, which means it isn't exactly certain.
Jerry Coffin
How strange. And certainly not C. I worked at a company with a multi-platform product that included at least two versions of WinCE, but I never interacted much with Windows code, and the portable code in the product (that is, most of the product) wasn't compiled with Microsoft's compiler.
Steve Jessop
Windows CE supports char, which is a byte. See Craig McQueen's comment on Richard Pennington's answer. Bytes are needed just as much in Windows CE as everywhere else, no matter what sizes they are everywhere else.
Windows programmer
Huh, I thought C skipped over the PDP-10. But perhaps there was a port; all of this is before my time anyhow ;-)
ephemient
There are (were?) at least two implementations of C for the PDP-10: KCC and a port of gcc (http://pdp10.nocrew.org/gcc/).
AProgrammer
+1  A: 

ints used to be 16 bits (PDP-11, etc.). Going to 32-bit architectures was hard. People are getting better: hardly anyone assumes a pointer will fit in a long any more (you don't, right?). Or that file offsets will, or timestamps, or ...

8-bit characters are already somewhat of an anachronism. We already need 32 bits to hold all the world's character sets.
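
(A sketch, not from the answer, of turning that kind of hidden assumption into an explicit check; it assumes C11 _Static_assert is available.)

```c
#include <stdint.h>

/* If legacy code really does store pointers in a long, spell the assumption
 * out and let the build fail on platforms where it is false (e.g. 64-bit
 * Windows, where long stays 32 bits) instead of corrupting pointers silently.
 * intptr_t from <stdint.h> is the portable alternative. */
_Static_assert(sizeof(void *) <= sizeof(long),
               "this code assumes a pointer fits in a long");
```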

Richard Pennington
True. The name `char` is a bit quaint now in Unicode days. I care more about 8-bit units (octets) when dealing with binary data, e.g. file storage, network communications. `uint8_t` is more useful.
Craig McQueen
+1  A: 

It appears that you can still buy an IM6100 (i.e. a PDP-8 on a chip) out of a warehouse. That's a 12-bit architecture.

dmckee
+8  A: 

char is also 16-bit on the Texas Instruments C54x DSPs, which turned up for example in OMAP2. There are other DSPs out there with 16- and 32-bit char. I think I even heard about a 24-bit DSP, but I can't remember what, so maybe I imagined it.

Another consideration is that POSIX mandates CHAR_BIT == 8. So if you're using POSIX you can probably assume it, and if someone later needs to port your code to a near-implementation of POSIX that just so happens to have the functions you use but a different-size char, that's their bad luck.

In general, though, I think it's almost always easier to work around the issue than to think about it. Just type CHAR_BIT. If you want an exact 8 bit type, use int8_t. Your code will noisily fail to compile on implementations which don't provide one, instead of silently using a size you didn't expect. At the very least, if I hit a case where I had a good reason to assume it, then I'd assert it.
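
(A minimal sketch of the approach described above, assuming a hosted C99-or-later implementation; the helper name bits_in is mine, not from the answer.)

```c
#include <assert.h>
#include <limits.h>   /* CHAR_BIT */
#include <stddef.h>
#include <stdint.h>   /* int8_t is optional: simply absent where no exact 8-bit type exists */

/* Written with CHAR_BIT instead of a hard-coded 8, so the arithmetic
 * stays correct on 9-, 16- or 32-bit-char machines. */
static size_t bits_in(size_t n_bytes)
{
    return n_bytes * CHAR_BIT;
}

int main(void)
{
    int8_t octet = 0;       /* fails to compile on implementations without an 8-bit type */
    (void)octet;
    assert(CHAR_BIT == 8);  /* or assert the assumption only where you have reason to make it */
    return bits_in(sizeof octet) == 8 ? 0 : 1;
}
```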

Steve Jessop
TI C62xx and C64xx DSPs also have 16-bit chars. (uint8_t isn't defined on that platform.)
msemack
+1  A: 

For one thing, Unicode characters are longer than 8 bits. As someone mentioned earlier, the C spec defines data types by their minimum sizes. Use sizeof and the values in limits.h if you want to interrogate your data types and discover exactly what size they are for your configuration and architecture.

For this reason, I try to stick to data types like uint16_t when I need a data type of a particular bit length.

Edit: Sorry, I initially misread your question.

The C spec says that a char object is "large enough to store any member of the execution character set". limits.h lists a minimum size of 8 bits, but the definition leaves the max size of a char open.

Thus, a char is at least as long as the largest character from your architecture's execution set (typically rounded up to the nearest 8-bit boundary). If your architecture has longer opcodes, your char size may be longer.

Historically, the x86 platform's opcode was one byte long, so char was initially an 8-bit value. Current x86 platforms support opcodes longer than one byte, but the char is kept at 8 bits in length since that's what programmers (and the large volumes of existing x86 code) are conditioned to.

When thinking about multi-platform support, take advantage of the types defined in stdint.h. If you use (for instance) a uint16_t, then you can be sure that this value is an unsigned 16-bit value on whatever architecture, whether that 16-bit value corresponds to a char, short, int, or something else. Most of the hard work has already been done by the people who wrote your compiler/standard libraries.

If you need to know the exact size of a char because you are doing some low-level hardware manipulation that requires it, I typically use a data type that is large enough to hold a char on all supported platforms (usually 16 bits is enough) and run the value through a convert_to_machine_char routine when I need the exact machine representation. That way, the platform-specific code is confined to the interface function and most of the time I can use a normal uint16_t.
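
(The conversion routine above is only named, not shown; the following is one guess at what such an interface might look like, assuming uint16_t is available as the "portable char" carrier.)

```c
#include <limits.h>   /* CHAR_BIT */
#include <stdint.h>   /* uint16_t */

/* Hypothetical sketch: portable code passes characters around as uint16_t
 * and converts to the native char only at the hardware boundary. */
static unsigned char convert_to_machine_char(uint16_t portable_ch)
{
#if CHAR_BIT >= 16
    return (unsigned char)portable_ch;            /* the native char holds the whole value */
#else
    return (unsigned char)(portable_ch & 0xFFu);  /* keep only the low octet */
#endif
}
```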

bta
The question didn't ask about characters (whether Unicode or not). It asked about char, which is a byte.
Windows programmer
Also, the execution character set has nothing to do with opcodes, it's the character set used at execution, think of cross-compilers.
ninjalj
+1  A: 

Many DSP chips have 16- or 32-bit char. TI routinely makes such chips, for example.

Alok
That's an interesting link.
Craig McQueen
+1  A: 

The DEC PDP-8 family had a 12-bit word, although you usually used 8-bit ASCII for output (on a Teletype, mostly). However, there was also a 6-bit character code that allowed you to encode 2 chars in a single 12-bit word.
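
(Purely illustrative of the packing described, in modern C and assuming uint8_t/uint16_t exist; the 12-bit word is carried in a uint16_t since C has no 12-bit type.)

```c
#include <stdint.h>

/* Pack two 6-bit character codes into one 12-bit word. */
static uint16_t pack_sixbit(uint8_t hi, uint8_t lo)
{
    return (uint16_t)(((hi & 0x3Fu) << 6) | (lo & 0x3Fu));
}

/* And unpack them again. */
static void unpack_sixbit(uint16_t word, uint8_t *hi, uint8_t *lo)
{
    *hi = (uint8_t)((word >> 6) & 0x3Fu);
    *lo = (uint8_t)(word & 0x3Fu);
}
```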

PrgTrdr