tags:

views:

280

answers:

5

I'm currently reading "The C Programming Language 2nd edition" and I'm not clear about this exercise:

Functions like isupper can be implemented to save space or to save time. Explore both possibilities.

  • How can I implement this function?
  • How should I write two versions, one to save time and one to save space (some pseudo code would be nice)?

I would appreciate some advice about this.

+1  A: 

One method can do some comparisons with 'A', 'Z'.

The other method can use a pre-computed lookup table.

It's the first suggestion mentioned on the Wikipedia page for time-space tradeoffs.

Mark Byers
+4  A: 

The classic trade-off is speed versus memory: either compute a result, or look it up in a table.

It should not be hard to figure out how these would look, for the isupper() function.

A few things make it perhaps unexpectedly complicated on today's mainstream CPU:s, though:

A table to support ASCII needs 128 bits, or 256 bits if you don't want to mask off the topmost bit yourself, assuming an 8-bit char. This is only 32 bytes, but that is probably still more than code that exploits the sequential nature of the ASCII mapping. Large code size is generally bad for performance, since it affects cache efficiency and generally exposes the large difference in bandwidth between today's CPU:s and their memory subsystems.

Code using explicit comparisons to compute the result, without exploiting the sequential mapping, will be pretty large, larger than the corresponding look-up table. This is not typical; it's easier to see the difference in the speed-versus-memory trade-off for cases where the code to compute a value is more compact than the look-up table.

unwind
A: 

You can save both space and time, actually. Upper case characters are contiguous in the ASCII table, so you just have to do this (this doesn't handle locales obviously, but I doubt that is the point of your exercise):

BOOL isUpper(char c)
{
    return (c >= 'A' && c <= 'Z');
}
Terry Mahaffey
But note that this doesn't work for foreign languages, or even the ANSI character set, it works only for ASCII, which is undefined for char values > 127
John Knoeller
...which is why i said "this doesn't handle locales obviously, but I doubt that is the point of your exercise"
Terry Mahaffey
A: 

Upper case characters are ASCII hex 0x41 through 0x5A. That means bit 0x40 is always set. If you are sure that the argument is an ASCII character, you can just return:

(c & 0x40);

Bruce
That returns true for 0x40 and 0x5B..0x5F, even restricting yourself to 7-bit ASCII (0x00..0x7F). However, and much worse, it also returns true for the lower-case letters (0x61..0x7A), plus more 'punctuation' 0x60, 0x7B..0x7F. Not a good test!
Jonathan Leffler
Shoot. Didn't do my own walkthrough.
Bruce
+7  A: 

One version uses an array initialized with appropriate values, one byte per character in the code set (plus 1 to allow for EOF, which may also be passed to the classification functions):

static const char bits[257] = { ...initialization... };

int isupper(int ch)
{
    assert(ch == EOF || (ch >= 0 && ch <= 255));
    return((bits+1)[ch] & UPPER_MASK);
}

Note that the 'bits' can be used by all the various functions like isupper(), islower(), isalpha(), etc with appropriate values for the mask. And if you make the 'bits' array changeable at runtime, you can adapt to different (single-byte) code sets.

This takes space - the array.

The other version makes assumptions about the contiguousness of upper-case characters, and also about the limited set of valid upper-case characters (fine for ASCII, not so good for ISO 8859-1 or its relatives):

int isupper(int ch)
{
    return (ch >= 'A' && ch <= 'Z');  // ASCII only - not a good implementation!
}

This can (almost) be implemented in a macro; it is hard to avoid evaluating the character twice, which is not actually permitted in the standard. Using non-standard (GNU) extensions, it can be implemented as a macro that evaluates the character argument just once. To extend this to ISO 8859-1 would require a second condition, along the lines of:

int isupper(int ch)
{
    return ((ch >= 'A' && ch <= 'Z')) || (ch >= 0xC0 && ch <= 0xDD));
}

Repeat that as a macro very often and the 'space saving' rapidly becomes a cost as the bit masking has a fixed size.

Given the requirements of modern code sets, the mapping version is almost invariably used in practice; it can adapt at run-time to the current code set, etc, which the range-based versions cannot.

Jonathan Leffler
Thanks I will study this.Just one question, why didn't you use 'unsigned' keyword when you defined 'bits' variable?
paleman
Rather than asserting in the first implementation, shouldn't it simply return `0` if the passed parameter is outside the range that the `bits` array represents? There is no restriction on the value that can be passed to `isupper()` (other than being an `int` type).
Michael Burr
The legitimate values that may be passed to isupper() are the values of int that may be stored in an unsigned char or EOF (C99: Section 7.4: "the argument is an int, the value of which shall berepresentable as an unsigned char or shall equal the value of the macro EOF. If theargument has any other value, the behavior is undefined." I've defined the 'undefined' behaviour as 'assert'.
Jonathan Leffler
@paleman: laziness - mainly.
Jonathan Leffler
@Jonathan - I stand corrected.
Michael Burr
@Jonathan: is it certain that the value of the macro EOF is gonna be equal to -1 on every system?And what would be the value of the UPPER_MASK for ASCII and what for ISO 8859-1?
paleman
@paleman: In theory - no, the macro EOF can be a value other than -1. However, when you are designing 'the implementation', the value -1 for EOF is both conventional and convenient (so it is almost universally used). The value of UPPER_MASK would be of your choosing. The idea is that your initialization would characterize each of the 128 (256) characters for all the various classifications, probably using bits for 'upper', 'lower', 'digit', 'punctuation', 'control', 'space', 'hex', etc. You might even need more than 8 bits these days - so you'd use (unsigned) short instead of char.
Jonathan Leffler
Thanks, all of this is been very helpful.
paleman