I want to know how the "isupper" macro is defined in C/C++. Could you please show me the definition or point me to available resources? I tried looking at ctype.h but couldn't figure it out.

+5  A: 

It's a function, not a macro. The function definition of isupper() differs depending on things like locale and the current character set - that's why there's a function specifically for this purpose.

For ASCII, because of the way the letters are assigned, it's actually quite easy to test for this. If the ASCII code of the character falls in between 0x41 and 0x5A inclusive, then it is an upper case letter.
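As a minimal sketch of that range check, assuming an ASCII execution character set (the name `ascii_isupper` is made up for illustration and is not part of any library):

```c
/* Hypothetical helper: ASCII-only uppercase test that ignores the locale. */
int ascii_isupper(int c)
{
    /* 0x41 is 'A' and 0x5A is 'Z' in ASCII; both ends are inclusive. */
    return c >= 0x41 && c <= 0x5A;
}
```

Unlike the real isupper(), this never consults the locale, so it only answers the question for plain ASCII letters.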

In silico
+11  A: 

It's implementation defined -- every vendor can, and usually does, do it differently.

The most common approach usually involves a "traits" table: an array with one element for each character, where the value of each element is a collection of flags describing that character. An example would be:

 traits[(int) 'C'] = ALPHA | UPPER | PRINTABLE;

In which case, isupper() would be something like:

 #define isupper(c) ((traits[(int)(c)] & UPPER) == UPPER)
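
To make the table idea concrete, here is a small self-contained sketch. The flag values, the `init_traits` helper and the `my_isupper` name are all assumptions chosen for illustration; a real library builds its table from locale data and picks its own bit layout.

```c
#include <limits.h>

/* Hypothetical flag bits; real libraries choose their own values. */
enum { UPPER = 1 << 0, LOWER = 1 << 1, ALPHA = 1 << 2, PRINTABLE = 1 << 3 };

/* One entry per unsigned char value; static storage starts out all zero. */
static unsigned char traits[UCHAR_MAX + 1];

/* Fill in the letter entries. Assumes 'A'..'Z' and 'a'..'z' are
   contiguous, as they are in ASCII. */
static void init_traits(void)
{
    for (int c = 'A'; c <= 'Z'; c++)
        traits[c] = ALPHA | UPPER | PRINTABLE;
    for (int c = 'a'; c <= 'z'; c++)
        traits[c] = ALPHA | LOWER | PRINTABLE;
}

/* Table lookup: the argument is evaluated once, and the cast to
   unsigned char keeps the index inside the table. */
#define my_isupper(c) ((traits[(unsigned char)(c)] & UPPER) != 0)
```

After calling `init_traits()`, `my_isupper('C')` is true and `my_isupper('c')` is false. A real isupper() also has to accept EOF, which this sketch does not treat specially.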
James Curran
+5  A: 

It's implementation-specific. One obvious way to implement it would be:

extern char *__isupper;
#define isupper(x) ((int)__isupper[(x)])

Where __isupper points to an array of 0's and 1's determined by the locale. However this sort of technique has gone out of favor since accessing global variables in shared libraries is rather inefficient and creates permanent ABI requirements, and since it's incompatible with POSIX thread-local locales.

Another obvious way to implement it on ASCII-only or UTF-8-only implementations is:

#define isupper(x) ((unsigned)(x)-'A'<='Z'-'A')
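
The reason the unsigned subtraction works: anything below 'A' wraps around to a very large unsigned value, so one comparison covers both ends of the range and the argument is evaluated only once. A rough sketch under the same ASCII assumption (the names `ascii_isupper_once` and `count_upper` are hypothetical):

```c
#include <stddef.h>

/* Hypothetical macro name; assumes 'A'..'Z' are contiguous, as in ASCII.
   Values below 'A' wrap to huge unsigned numbers and fail the test. */
#define ascii_isupper_once(x) ((unsigned)(x) - 'A' <= 'Z' - 'A')

/* Counts uppercase ASCII letters. Passing *s++ is safe here because
   the macro mentions its argument exactly once. */
size_t count_upper(const char *s)
{
    size_t n = 0;
    while (*s != '\0')
        n += ascii_isupper_once(*s++);
    return n;
}
```

For example, `count_upper("Hello World")` counts the 'H' and the 'W'.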
R..
very nice, i never thought of that (then again i never tried :])
Matt Joiner
By the way, **all** implementations should `#define isdigit(x) ((unsigned)(x)-'0'<10)` because ISO C requires the behavior to be identical to this expression and it's optimal.
R..
I'm not as confident as you seem to be about UTF8 there. Surely all those other languages outside the ASCII range have uppercase as well? And, if you leave it there, you should probably say "Unicode". UTF8 is the encoding, not the character set.
paxdiablo
@paxdiablo, this is `isupper` not `iswupper`. In UTF-8, all bytes outside the ASCII range have no meaning by themselves, only as part of multibyte sequences, so the non-wide `is*` functions always return 0 for non-ASCII bytes.
R..
@R, I think you're confusing the term byte here. A byte _is_ a char. There are _no_ multibyte chars in ISO C. If the underlying character set is Unicode (whatever the encoding), isupper and its brethren must handle other languages as well - it's locale-specific.
paxdiablo
@paxdiablo, you're wrong. ISO C very clearly defines "multibyte character" and the functions to convert between multibyte characters and `wchar_t`. Whether any multibyte characters exist and the nature of their encoding is implementation- and locale-specific. But I said specifically a UTF-8-only implementation. If the encoding is UTF-8, the only values of `char` which correspond to (wide) characters by themselves are the ASCII characters 0-0x7f. The values 0x80-0xbf and 0xc2-0xf4 are used as components in proper multibyte sequences, and any remaining values are purely invalid (EILSEQ).
R..
Perhaps you think that `isupper(0x100)` (A with macron) should return 1. If char is 8-bit, this is bogus since the argument to isupper must be a valid value of unsigned char. If char is larger than 8-bit and used directly for storing Unicode codepoints outside of 0-0x7f, the encoding is not UTF-8 but rather UTF-16, UTF-32, or some nonstandard hybrid encoding and irrelevant to my comments about a UTF-8-only implementation.
R..
Fair enough, it looks like we were talking at cross-purposes here. My understanding (which is a misunderstanding as you have pointed out) was that Unicode (full UTF-32) was the underlying encoding in which case other characters would be required to be upper. Thanks for clearing that up.
paxdiablo
If the `char` encoding is UTF-8, then `wchar_t` will almost certainly be expressed as Unicode codepoints (equivalent to UTF-32), and the `isw*` functions will need to handle all the extra characters, but the non-wide `is*` functions simply can't.
R..
@R: You're wrong on isdigit(x). While isdigit('0') must be true, and isdigit('a') false, the standard does not define what should happen for any chars outside the ASCII range. In particular, char may also support non-arabic digits. Your `((unsigned)(x)-'0'<10)` expression fails for them.
MSalters
@MSalters: you're wrong. Read the standard. ISO C specifies that the digits are 0 1 2 3 4 5 6 7 8 9 and nothing else. You're free not to like that, but it's a fact.
R..
Bods, if you're going to call on the standard as authority, at least quote the section - you'll be taken more seriously that way :-) and it won't be such a pain to verify.
paxdiablo
Yeah, wish I had a copy with me. I'm on 2G GPRS and the source I know offhand for that is a reference in POSIX (either base definitions or rationale, I forget which) stating that the requirements on locales were strengthened for alignment with ISO C to require that the digits be exactly 0-9.
R..
The statement comes from the part which introduces the basic character set (5.2.1). Of _those_ characters, only 0-9 are digits, and only those digits can be used to form number literals as in `int i = 42; /* must use digits 0-9 here */`. However, `isdigit()` works at runtime and is not restricted to the basic character set.
MSalters
@MSalters: it is equally restricted. The specification of `isdigit` refers to 5.2.1. 7.4.1.5 reads: `The isdigit function tests for any decimal-digit character (as defined in 5.2.1).` Compare with 7.4.1.2: `The isalpha function tests for any character for which isupper or islower is true, or any character that is one of a locale-specific set of alphabetic characters for which none of iscntrl, isdigit, ispunct, or isspace is true.` Stop arguing about topics you know nothing about.
R..
+1  A: 

The real definition is fairly complicated, in GCC for instance. But the simplest possible implementation of isupper (although it has a double-evaluation bug) could be defined as:

#define isupper(c) (c >= 'A') & (c <= 'Z')

http://ideone.com/GlN05

GCC (strictly speaking glibc, the C library it is usually paired with) specifically checks whether bit 0 is set in the character's entry for the current locale:

(*__ctype_b_loc ())[(int) (c)] & (unsigned short int) (1 << (0))

Here __ctype_b_loc() is a function that returns a pointer to a pointer into a locale-specific array, which holds a set of flag bits for each character in the current character set.

Scott S. McCoy
This macro is broken because it's missing parentheses around the argument and it evaluates its argument twice (think of `isupper(*s++)`...). You need to cast to `unsigned` and use unsigned overflow semantics to test the range without evaluating the argument more than once.
R..
To be fair, I called out the double-evaluation bug. :-)
Scott S. McCoy