views:

520

answers:

13

As a learning exercise, my three functions—ToggleCase, LowerCase and UpperCase—each expect a pointer to an ASCII char string, terminated by the null character; they work as expected. Are there more efficient or faster methods of accomplishing this task? Am I breaking any unspoken rules of good C coding? I've made use of macros because, I think, it makes the code look better and it is more efficient than function calls. Is this typical or overkill?

Please feel free to nit-pick and critique the code (but do be nice).

case_conversion.h

#define CASE_FLAG 32
#define a_z(c) (c >= 'a' && c <= 'z')
#define A_Z(c) (c >= 'A' && c <= 'Z')

void ToggleCase(char* c);
void LowerCase(char* c);
void UpperCase(char* c);

case_conversion.c

#include "case_conversion.h"

void ToggleCase(char* c)
{
 while (*c)
 {
  *c ^= a_z(*c) || A_Z(*c) ? CASE_FLAG : 0;
  c++;
 }
}
void LowerCase(char* c)
{
 while (*c)
 {
  *c ^= A_Z(*c) ? CASE_FLAG : 0;
  c++;
 }
}
void UpperCase(char* c)
{
 while (*c)
 {
  *c ^= a_z(*c) ? CASE_FLAG : 0;
  c++;
 }
}
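A single-file copy for experimenting (my arrangement; the macro arguments are parenthesized here, which the header above does not do):

```c
#define CASE_FLAG 32
#define a_z(c) ((c) >= 'a' && (c) <= 'z')
#define A_Z(c) ((c) >= 'A' && (c) <= 'Z')

/* XOR with bit 5 (value 32) flips between 'a'..'z' and 'A'..'Z' in ASCII. */
void ToggleCase(char* c)
{
    while (*c)
    {
        *c ^= a_z(*c) || A_Z(*c) ? CASE_FLAG : 0;
        c++;
    }
}

void LowerCase(char* c)
{
    while (*c)
    {
        *c ^= A_Z(*c) ? CASE_FLAG : 0;
        c++;
    }
}

void UpperCase(char* c)
{
    while (*c)
    {
        *c ^= a_z(*c) ? CASE_FLAG : 0;
        c++;
    }
}
```

For example, `char buf[] = "FooBar"; UpperCase(buf);` leaves buf holding "FOOBAR".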

Background

I have been a C# developer for most of my career (about 9 years), but I have also worked a little in C/C++ when it's been required. In the near future my responsibilities will be shifting from being primarily a C# dev to working with embedded systems applications written in C. In an effort to grapple with the various methods of writing readable code that is also highly efficient, I'm writing a few mostly pointless libraries for the sake of learning how to write great C code.

One of the tasks that I've created for myself is to write a few functions for converting text strings to all upper or lower case, or toggling the case of the entire string. Yes, I know that there are functions that do this in the standard C library. That's not the point. The point of the exercise is to learn how to write efficient, usable code while conforming to community standards.

A: 

First thing is I'd say rename a_z and A_Z to something like is_ASCII_Lowercase and is_ASCII_Uppercase. It's not as C-ish, but it's easier to understand.

Also, the use of ^= and ?: works, but again, I find it less readable than a simple if-statement.

FrustratedWithFormsDesigner
+8  A: 

There are at least two major problems with your macros. Consider what happens if I call one of them like

a_z(*p | 32);

Because `>=` and `<=` bind more tightly than `|`, this expands to `(*p | (32 >= 'a') && *p | (32 <= 'z'))`, which is true for any nonzero character -- not what the caller intended. This is easy to fix by parenthesizing the parameter:

#define a_z(c) ((c) >= 'a' && (c) <= 'z')

But they can also be called like this:

a_z(i++);

This call will increment i twice! And that is not easily fixable (if at all) in a macro. I would recommend using inline functions instead (if needed - see below).

The fastest way to convert between upper/lowercase I know of is using lookup tables. Of course, this trades memory for speed - pick your preference knowing your specific platform :-)

You need two arrays, one for either direction. Initialize them like

char toUpper[128]; // we care only about standard ASCII
for (int i = 0; i < 128; i++)
  toUpper[i] = i;
toUpper['a'] = 'A';
...
toUpper['z'] = 'Z';

And the conversion is trivial:

char toUpperCase(char c)
{
  return toUpper[c];
}

(for production code, this should be improved to extend the array to all possible char values on given platform (or shrink it to only the legal values and do parameter checking), but for illustration, this will do.)
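A complete sketch along those lines (my code, not Péter's; the table is sized to all unsigned char values as the note above suggests, and the names are illustrative):

```c
#include <limits.h>

/* Table covering every unsigned char value, so any char can index it
 * safely after a cast; non-letters map to themselves. */
static unsigned char upper_table[UCHAR_MAX + 1];

static void init_upper_table(void)
{
    int i;
    for (i = 0; i <= UCHAR_MAX; i++)
        upper_table[i] = (unsigned char)i;          /* identity by default */
    for (i = 'a'; i <= 'z'; i++)
        upper_table[i] = (unsigned char)(i - 32);   /* 'a'..'z' -> 'A'..'Z' */
}

static char to_upper_ascii(char c)
{
    return (char)upper_table[(unsigned char)c];
}
```

Call init_upper_table() once at startup; the per-character conversion is then a single indexed load.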

Péter Török
@Peter Torok-- Ah, I forgot about the idea of a look-up table. That would be quick. Also, I'll modify my code and 'inject' error cases like you have mentioned for experimentation sake. Thanks for the comments. They are all quite useful.
RLH
for embedded systems lookup tables *may be* too large.
aaa
@aaa, just added a note about this while you were commenting :-)
Péter Török
check out my take: 'char slot[] = { 0, 31, 63, 63 }; *c = slot[*c/32] + *c%32;'. it almost works :-)
aaa
Question asked for optimal implementation. I think that branchfree will outperform table-lookup in the long run because branchfree can process multiple chars simultaneously using SWAR/SIMD, and that's not practical with tables because the table size grows exponentially per char processed.
Adisak
@RLH - lookup tables are often a good way to go. However, they do take up space, particularly in the processor cache (which might be quite small in an embedded system), and require an indexed dereference. An `if` that doesn't branch in the common case might actually be faster if you know that your strings are mostly characters ... which is yet another way to say "don't optimize unless you profile."
Anon
A: 

how about (almost works):

char slot[] = { 0, 31, 63, 63 };
*c = slot[*c/32] + *c%32;

Couple things you could change:

*c -= a_z(*c)*CASE_FLAG; // uppercases: subtracts either zero or thirty-two
// you could also replace multiplication with the shift (1<<5) trick

strings are actually arrays:

char upper[] = "ABC..ABC..."; // 
...
*c = upper[*c+offset];

or

char upper[] = "ABC.."; // 
...
*c = upper[*c%32];

or

*c = 'A' - 1 + *c%32;

or whatever ...

aaa
A: 

Perhaps I've spent too much time with C++ and not enough with C, but I'm not a big fan of macros that have parameters... as Peter Torok points out they can lead to some problems. Your definition of CASE_FLAG is okay (it doesn't take any parameters), but I would replace the macros a_z and A_Z with functions instead.

andand
His macros do have problems in that they repeat their argument more than once. Used in his code this is alright, but generally it is bad (outside of a few compiler extensions that can sometimes be used to make it alright). With a good compiler inline functions would be fine for this, but with a bad compiler macros will make better code (though the original code is far from optimal).
nategoose
You can make macros out of my versions, which are careful not to evaluate their argument more than once. See my answer.
R..
A: 

Three #defines - the benefit of the macros is that they get inlined, so there is no function-call overhead:

#define ToggleCase(str) char *s_=str; while(*s_) { *s_=((*s_>='A')&&(*s_<='Z'))?(*s_+32):(((*s_>='a')&&(*s_<='z'))?(*s_-32):*s_); s_++; }

#define UpperCase(str) char *s_=str; while(*s_) { *s_=((*s_>='a')&&(*s_<='z'))?(*s_-32):*s_; s_++; }

#define LowerCase(str) char *s_=str; while(*s_) { *s_=((*s_>='A')&&(*s_<='Z'))?(*s_+32):*s_; s_++; }

or you can use inline functions:

inline void ToggleCase(char *str) { 
    char *s = str;
    while (*s) { 
        *s = ((*s>='a') && (*s<='z')) ? (*s-32)
           : ((*s>='A') && (*s<='Z')) ? (*s+32) : *s;
        s++;
    }
}

inline void UpperCase(char *str) {
    char *s = str;
    while (*s) {
        *s = ((*s>='a')&&(*s<='z')) ? (*s-32) : *s;
        s++;
    }
}

inline void LowerCase(char *str) {
    char *s = str;
    while (*s) {
        *s = ((*s>='A')&&(*s<='Z')) ? (*s+32) : *s;
        s++;
    }
}

The thing with 'inline' is that the compiler takes the keyword as a suggestion, not an imperative, so it may or may not actually inline the function, while with the macros it has no choice but to inline them.

Lastly: A very good source of correct programming techniques can be gotten from Thinking in C by Bruce Eckel - freely downloadable.

slashmais
downside is that they are completely unreadable
Idan K
Won't the compiler just inline them if that is proper... I don't see how a macro is justified in this example
ronag
@Idan K: It uses the ternary operator: boolean_test ? value_if_true : value_if_false - read it like that and it's clear.
slashmais
This is hideous. At least wrap them in `do { ... } while (0)` so they behave as functions. But on an embedded system, space is likely a more precious commodity than time, and anything that leads to the compiler generating more inline code when you don't *really* need to is a very very bad idea.
R..
Without the do-while wrapper that "R" suggested, you can't use multiple ToggleCase() macro calls within a single scope due to duplicate "char *s" variable definitions. As for the forced inlining from macros, it's really better to leave as function and let the programmer or compiler determine inlining priorities. However, I don't see how this answer deals with the question at all since other than the "inline" it offers nothing for speed enhancements and in fact uses quite slow branching comparisons and ternary conditional operations.
Adisak
+2  A: 

Ok, here goes. Writing on this tab ... scrolling through your code on the other tab :-)

header

  1. #define a_z(c) (c >= 'a' && c <= 'z')

    • the name of the function like macro should be in ALL CAPS (maybe IS_LOWERCASE) to alert users it's a macro
    • c in the expansion should be inside parentheses to prevent strange side-effects
    • personal choice: I like to reorder the conditions to read more like the English 'a' <= c <= 'z', as (('a' <= (c)) && ((c) <= 'z'))
  2. I'd make the functions (void ToggleCase(char* c) and friends) return a char* (the same pointer that was sent in) to be able to use them in sequence: printf("%s\n", UpperCase(LowerCase(str))); (note this needs a writable str -- calling them on a string literal would be undefined behavior)

source code

  1. The ternary operator does not make your code faster or easier to read. I'd write a simple if

That's it.

Oh! One more thing: Your code assumes ASCII (you said so yourself) but doesn't document that. I'd add a note about that in the header file.
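Suggestion 2 from the header list might look like this (my sketch, not pmg's code; the names are illustrative, and the chained call needs a writable buffer since modifying a string literal is undefined behavior):

```c
/* In-place conversions that return their argument so calls can be chained. */
char *LowerCaseChained(char *s)
{
    char *p;
    for (p = s; *p; p++)
        if ('A' <= *p && *p <= 'Z')
            *p += 32;   /* ASCII: set the case bit */
    return s;
}

char *UpperCaseChained(char *s)
{
    char *p;
    for (p = s; *p; p++)
        if ('a' <= *p && *p <= 'z')
            *p -= 32;   /* ASCII: clear the case bit */
    return s;
}
```

Usage: `char buf[] = "FooBar"; puts(UpperCaseChained(LowerCaseChained(buf)));` prints FOOBAR.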

pmg
R..
+1 for hating casts ... and thank you for the lesson about the unsigned trick
pmg
"The ternary operator does not make your code faster" -- pretty true. It's required for some things like initialization of a constant, but it can actually slow down your code if you use it on non-intrinsic types. This is because ?: can invoke type promotion, copy construction, and assignment construction, as well as generating temporaries to evaluate the expression for non-intrinsic types.
Adisak
+10  A: 

My favorites:

*c += (*c-'A'<26U)<<5; /* lowercase */
*c -= (*c-'a'<26U)<<5; /* uppercase */
*c ^= ((*c|32U)-'a'<26)<<5; /* toggle case */

Since your target will be embedded systems, you should be learning how to eliminate unnecessary code bloat, branches, etc. Your conditional for determining if an ascii character is alphabetical is 4 comparison/branch operations; mine is 1. I would recommend looking up some good resources on arithmetic and bit manipulation tricks.

Note: I changed the *32 operations to <<5 after posting my answer, because a number of embedded systems compilers are too poor to do this for you. When writing code for a good compiler, *32 would probably illustrate your intent better.
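For example, the first line dropped into the OP's loop shape (my arrangement, not R..'s code) gives a LowerCase with a single range test per character; the unsigned char pointer matters, as noted at the end of this answer:

```c
/* One comparison per character: *c - 'A' underflows to a huge unsigned
 * value for anything below 'A', so the single test covers both bounds. */
void LowerCaseFast(char *s)
{
    unsigned char *c = (unsigned char *)s;
    for (; *c; c++)
        *c += (*c - 'A' < 26U) << 5;    /* add 32 only for 'A'..'Z' */
}
```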

Edit: Regarding the charge that my code has too many implicit compiler-generated operations, I believe that's completely false. Here is the pseudo-asm any half-decent compiler should generate for the first line:

  1. Load *c and zero- or sign-extend it to fill an int-sized word (depending on whether plain char is signed or unsigned).
  2. Subtract the constant 26 using an unsigned (non-overflow-trapping) sub instruction.
  3. Conditional-jump past the rest of the code if the carry flag is not set.
  4. Else, add 32 to the value at *c.

Steps 2 and 3 may be combined on architectures that use a comparing-jump operation instead of flags. The only way I can see any significant behind-the-scenes costs cropping up is if the machine cannot directly address chars, or if it uses a nasty (sign/magnitude or ones complement) signed value representation, in which case the conversion to unsigned would be nontrivial. As far as I know, no modern embedded architecture has these problems; they're mostly isolated to legacy mainframes (and to a lesser extent, DSPs).

If anyone is worried about bad compilers actually performing arithmetic for the <<5, you might try:

if (*c-'A'<26U) *c+=32;

instead of my code. That's probably cleaner anyway, but I generally like avoiding statements so I can shove the code inside a loop condition or function-like macro.

Edit 2: By request, a branch-free version of the first line:

*c += (64U-*c & *c-91U) >> (CHAR_BIT*sizeof(unsigned)-1) << 5;

In order for this to work reliably, c should have type unsigned char * and unsigned int should be strictly wider than unsigned char.

R..
pretty cool - didn't know about the <<5 ;)
slashmais
Note that while you don't have explicit branches in the code, many compilers will generate branches due to the '<' operation. Plus compiler-generated boolean-to-int is sometimes a slow op. Your method is "graceful" in code for using a simple unsigned compare to check for range (it forces signed chars to become unsigned ints). However, it has many implicit compiler generated operations that will probably be slower than an explicit branchfree version with no comparison operations and no boolean-int conversions.
Adisak
'Plus 1' kudos though for using '|32' to allow a single range check for toggling case. I'll definitely steal that idea if I ever write a branchfree toggle variant.
Adisak
@Adisak: As I said, my version involves one branch. You could eliminate that too using some fancy bitshifting since you know the possible range of values `*c` could have, but I think it would be more expensive on most platforms.
R..
@R: I work on video games. The platforms we work on have significant branch mispredict penalties. In fact even CPU's like ATOM have significant branch mispredict penalties. A single mispredict may cost the same amount of time as converting several characters without mispredicts. Also, in order to do SIMD/SWAR (process multiple chars at once for fastest results), you need to eliminate the branch.
Adisak
@R: Also, even without SIMD/SWAR methods, the branchfree version can be unrolled and interleaved to hide latencies resulting in a faster version.
Adisak
OK I'll add branch-free.
R..
Adisak
Any compiler which can't optimize *32 to shift-by-five is probably going to yield something really icky taking the result of a conditional and shifting it. Better would be something like c ^= (((unsigned char)(ch-'A')>=26)-1)
supercat
The branch free version I wrote has two subtracts, two ands, and a shift to compute the value added. The branch free version above has one less "and" operation but unfortunately doesn't work and to fix it requires adding another 'and' operation (and changing the shift slightly). This makes it equivalent in operations to a one line version of my function which is merely expanded for clarity.
Adisak
@supercat: that still has branches in it. Sorry, I'm really pushing branchfree as the fastest answer -- it's the only one that's unrollable or SIMD-ifyable.
Adisak
@Adisak: For a few seconds after I posted it, my answer accidentally had the ASCII value of `'z'+1` in place of `'Z'+1`, but I quickly fixed that. My code has no mask, so I'm confused what you're saying is wrong with any mask. It simply ands the high bits from subtraction in each direction and shifts them down into the ASCII case bit.
R..
Adisak
Fixed using one of them, after failing to find any ways to do it without adding an operation.
R..
@R..: Yeah, I'm pretty sure that the version I wrote had the minimum number of ops for branchfree. That's why I double-checked yours when it had one less. If you find a faster version, let me know. FWIW, I still prefer using an 'and' rather than the extra shift because due to barrel-shifter costs, some platforms only allow shift instructions to be issued from a particular pipeline slot (i.e. Atom) which makes interleaved unrolled versions (generated by some compilers) more likely to have stalls.
Adisak
supercat
@supercat: Yes some CPUs have predicate generation instructions and some have conditional selection as well. Even the modern X86 instruction set has been expanded to include some of these instructions. However, none of the common compilers use them :( You have to write asm for most of them. The thing about writing branchfree code is that it will generate code without branches even if your compiler is not smart enough to do so for you. This is necessary for the compiler to automatically unroll and interleave loops. Also, it's necessary to use this technique to expand to SIMD.
Adisak
@supercat: If someone asks for an "optimal" function, I'm never comfortable leaving the actual performance in the hands of a possible magic compiler optimization that might happen on some CPUs -- especially when I've noticed compilers don't generally perform that optimization for you.
Adisak
@Adisak: As an embedded developer, I've often had the joy of writing really goofy code to get around compiler silliness. Unfortunately, a nasty-looking rewrite which works beautifully on one compiler version may produce horrible code on another. On some machines, code using mixed data types will work nicely; on others it won't.
supercat
@supercat: BTW, the branchfree version you have in your reply won't work. The expression prior to shifting is an unsigned char and >>8 results in 0 for all unsigned chars.
Adisak
@supercat: The answer I provide isn't "silly", it's pretty standard for branchfree C/C++ techniques and will generate pretty good code on any modern compiler. Of course there are times to use branching code rather than branchfree... for example on a very old-school unpipelined CPU which has no prefetch and no branch mispredict penalty and also is missing a barrel shifter -- in this case branching will always be faster. It's always best to time things and look at compiler output but I've found branchfree code to be very fast in general.
Adisak
@Adisak: Your criticism of supercat's code is wrong. The lefthand side of the `>>8` is not an `unsigned char` but and `int` due to default promotions. There is no such thing as an expression of type `unsigned char` in C.
R..
@Adisak: I didn't say your code was silly. I said some compilers are silly. If a processor includes an instruction which yields one if a flag is set and zero if it's cleared, and if that instruction is faster than branching code, I would think a sensible compiler should make use of it instead of branching. Is there any good reason why a compiler shouldn't?
supercat
@Adisak: BTW, a number of processors like the ARM include a selective-instruction-skip ability. An expression such as that given could be done as an unconditonal subtract "A" to a temp register, a subtract-while-updating-flags of 26, and an XOR-if-carry of the original register with 32. Three instructions; three cycles. I don't know that a compiler would find that code sequence, but it's pretty good. If two registers are preloaded with constants, processing words would take 9 cycles/4 bytes, though a compiler wouldn't find that approach.
supercat
@supercat: No reason in theory a compiler shouldn't do that optimization. However, in practice, most compilers that I have used do not :-( To avoid the gap between theory and practice, if you want branchfree code in real world practice (rather than in theory), you need to write it so the compiler CAN NOT generate a branch -- not write it so the compiler might optimize out a branch. Branch free is a huge win on the optimization of this code. If you look at my post, I actually have a snapshot of asm output from a compiler showing how well interleaving works with branchfree.
Adisak
@Adisak: All arithmetic operators extend to type 'int' any operands which are smaller than an 'int'. The size of a result may be forced back down via typecast, but it will re-grow when the next arithmetic operation is performed.
supercat
@supercat: Yeah, I like the ARM "IT" (if-then) instruction which has a fast predicate to skip a couple following instructions. It's still worth knowing that on all superscalar pipelined ARM architectures, the interleaved branchfree version will be 3-4X faster than a version using IT instructions. That's before using SWAR/SIMD. Using 32-bit SWAR with branchfree, it will be 12-16X faster for the heart of the unrolled loop. Of course, with loop unrolling, you need loop prolog/epilogs so it might be worthwhile having short strings go through an "IT" path and long strings go through mine.
Adisak
OK, actually tested supercat's code in a compiler (best way to verify stuff). It works. I thought the promotion to integer was a new feature in C# but apparently it's been around forever in C/C++ as well. It's good to learn something new everyday. I've always just explicitly promoted (by casting) when I need an operation to be a larger size.
Adisak
@supercat: Oh, and the "IT" instruction on ARM is the rare exception where all the good compilers actually normally do generate a predicate instruction. They seem to not do it on X86 so much though :-(
Adisak
@Adisak: The ARM machines I use allow any function to be conditionally executed or skipped based upon the state of CPU flags, and each instruction may be either allowed to change CPU flags or not. There isn't a need for an explicit "If/Then" instruction. I don't know to what extent compilers can exploit conditional execution, but if an if/else clause would have three or fewer instructions in each condition, it's better to conditionally execute all (up to) six instructions than to branch. BTW, I posted a SIMD-style solution; what do you think of it?
supercat
@supercat: The if-then ("IT") instruction is the ARM version of predicates. It determines if the next N instructions execute based on CPU flags without branching. I think we're talking about the same thing. The IT instruction is part of Thumb-2 IS. http://en.wikipedia.org/wiki/ARM_architecture **"IT" (if-then) instruction, which permits up to four successive instructions to execute based on a tested condition.**
Adisak
Ah, I was thinking in terms of the ARM 32-bit opcodes rather than Thumb. The 32-bit opcodes include a 4-bit condition field which allows any instruction to be conditionally executed, as well as an "S" bit which indicates whether the instruction should affect status. I've not looked much at the Thumb instruction set, since the little bit of ARM programming I've done has been extremely speed-critical.
supercat
+4  A: 

NOTE: The Question Title was edited - original title was about optimization "Please Critique-- An optimal function for converting string cases in C" which explains why my answer deals with optimization only rather than generically "improving" the functions.

If you are really looking for the absolute fastest way to do it, a branch-free version is going to be the way to go in the long run because it can use SIMD. Plus it avoids having tables (which may be too large on an embedded system if memory is really cramped).

Here is a simple non-SIMD branch-free example and ToLower is a trivial change from this.

char BranchFree_AsciiToUpper(char inchar) 
{ 
        // Branch-Free / No-Lookup 
        // toupper() for ASCII-only 
        const int ConvertVal = 'A' - 'a'; 
        // Bits to Shift Arithmetic to Right : 9 == (char-bits + 1) 
        const int AsrBits = 9; 

        int c=(int)inchar; 
        //if( (('a'-1)<c) && (c<('z'+1)) ) { c += 'A'-'a'; } 
        int LowerBound = ('a'-1) - c; 
        int UpperBound = c - ('z' + 1); 
        int BranchFreeMask = (LowerBound & UpperBound)>>AsrBits;
        c = c + (BranchFreeMask & ConvertVal); 
        return((char)c); 
}

My function is expanded for clarity and uses non-hardcoded constants. You can do the same thing in a single line with hardcoded values but I like readable code; however, here is the "compressed" version of my algorithm. It's not any faster since it does the EXACT same thing "compressed" to one line.

c+=(((96-(int)c)&((int)c-123))>>9)&(-32);

There are a number of optimizations you can make here to make it even faster. You can hardcode more optimal numbers for ASCII because the example doesn't assume any encoding mapping other than a-z and A-Z are contiguous ranges. For example with ASCII, if you don't have a barrel shifter, you can actually change the AsrBits to 4 (9-5) since ConvertVal will be +/-32 depending on the toupper or tolower operation.

Once you have working branchfree versions, you can use SIMD or bit-twiddling SWAR (SIMD Within A Register) techniques to convert 4-16 bytes at a time (or even possibly more depending on how wide your registers go and if you unroll to hide latency). This will be much faster than any lookup method, which is pretty much limited to single-byte conversion unless you have immensely large tables which grow exponentially per byte processed simultaneously.

Also, you can generate the branchfree predicate without using int upcasting but then you have to do a couple more operations (with upcasting it's just one subtract per range). You may need to do the expanded operations for SWAR but most SIMD implementations have a compare operation that will generate a mask for you for free.

The SWAR/SIMD operations also can benefit from fewer reads/writes to memory and the writes that do occur can be aligned. This is much faster on processors that have load-hit-store penalties (such as the PS3 Cell Processor). Combine this with simple prefetching in an unrolled version and you can avoid memory stalls nearly altogether.
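To make the SWAR idea concrete, here is a minimal 32-bit sketch of my own (not code from this answer): it uppercases four 7-bit ASCII bytes packed into one word, using a per-byte guard bit so the subtractions cannot borrow across byte lanes. It assumes the input bytes are plain ASCII (high bit clear).

```c
#include <stdint.h>

/* Branch-free SWAR toupper for four ASCII bytes at once. */
uint32_t ToUpper4(uint32_t x)
{
    uint32_t b    = x | 0x80808080u;           /* set guard bit in each byte */
    uint32_t ge_a = b - 0x61616161u;           /* bit 7 set iff byte >= 'a'  */
    uint32_t le_z = ~(b - 0x7b7b7b7bu);        /* bit 7 set iff byte <= 'z'  */
    uint32_t mask = ge_a & le_z & ~x & 0x80808080u; /* 0x80 per lowercase byte */
    return x - (mask >> 2);                    /* 0x80 >> 2 == 0x20: clear case bit */
}
```

The final subtraction removes 0x20 from exactly the lowercase lanes; it cannot borrow across lanes because any masked byte is at least 0x61.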

I know it seems like there is a lot of code in my example there, but there are ZERO branches (implicit or explicit) and no branch mispredictions as a result. If you are on a platform with significant branch mispredict penalties (which is true for many pipelined embedded processors), then even without SIMD, your optimized release build of the above code should run faster than something that seems much less complicated but creates implicit branches.

Even without SIMD/SWAR, it is possible for a smart compiler to unroll and interleave the above implementation to hide latencies and result in a very fast version - especially on modern superscalar processors that can issue more than one non-dependent instruction per cycle. This is not usually possible with any of the branching versions.

If you manually unroll, I would group loads and gather stores to make it easier for the compiler to interleave the non-branching non-dependent instructions in between. Example:

// Unrolled inner loop where 'char *c' is the string we're converting
char c0=c[0],c1=c[1],c2=c[2],c3=c[3];  // Grouped-Loads
c[0]=BranchFree_AsciiToUpper(c0);
c[1]=BranchFree_AsciiToUpper(c1);
c[2]=BranchFree_AsciiToUpper(c2);
c[3]=BranchFree_AsciiToUpper(c3);
c+=4;

A decent compiler should be able to inline the ToUpper and fully interleave the above code since there are no branches, no memory aliasing, and no codependent instructions between each one. Just for kicks, I decided to actually compile this; a compiler targeting PowerPC generated perfect interleaving for a dual-issue superscalar core that will easily outperform any code with branches.

mr               r31,r3
mr               r13,r13
lbz              r11,0(r31)
lbz              r10,1(r31)
extsb            r11,r11
lbz              r9,2(r31)
extsb            r10,r10
lbz              r8,3(r31)
subfic           r7,r11,96
addi             r6,r11,-123
srawi            r5,r7,9
srawi            r4,r6,9
subfic           r3,r10,96
addi             r7,r10,-123
extsb            r9,r9
srawi            r6,r3,9
srawi            r3,r7,9
subfic           r7,r9,96
addi             r30,r9,-123
extsb            r8,r8
srawi            r7,r7,9
srawi            r30,r30,9
subfic           r29,r8,96
addi             r28,r8,-123
srawi            r29,r29,9
srawi            r28,r28,9
and              r5,r5,r4
and              r3,r6,r3
and              r7,r7,r30
and              r30,r29,r28
clrrwi           r4,r5,5
clrrwi           r6,r7,5
clrrwi           r5,r3,5
clrrwi           r7,r30,5
add              r4,r4,r11
add              r3,r5,r10
add              r11,r6,r9
stb              r4,0(r31)
add              r10,r7,r8
stb              r3,1(r31)
stb              r11,2(r31)
stb              r10,3(r31)

The proof is in the pudding, and the above compiled code is going to be really fast compared to branching versions even before going to SWAR or SIMD.

In summary, reasons why this should be the fastest method:

  1. No branch misprediction penalties
  2. Ability to SIMD-ify algorithm for 4-16 (or more) bytes at a time
  3. Compiler (or programmer) can unroll and interleave to eliminate latencies and to take advantage of superscalar (multi-issue) processors
  4. No memory latencies (i.e. table lookups)
Adisak
A: 

My approach is "trim only when needed".

Depending on your system and your cpu architecture, a lot of things could be done differently.

There are a few design points I would have in relation to your code. First, macros. Macros have some brutal pitfalls, and should be used with care. Second, the use of a global to toggle case. I would rewrite to look something like this -

 enum CASE {UPPER, LOWER};

void ToggleCase(char* c, enum CASE newcase)
{
    if(newcase == UPPER)
       UpperCase(c);
    else if(newcase == LOWER)
       LowerCase(c);
    else 
       { ; } //null
}

In the micro-efficiency sense, this adds about 1 extra instruction per call. There is also some branching, which can potentially cause a branch misprediction.

void LowerCase(char* c)
{
  for (; *c; c++)  //standard idiom for moving through a string.
  {
    *c = *c <= 'Z' ? *c + 32 : *c;
  }
}


void UpperCase(char* c)
{
  for (; *c; c++)
  {
    *c = *c >= 'a' ? *c - 32 : *c;
  }
}

Now, there are some criticisms of my code.

First, it's branchy. Second, it's assuming that the input is [a-zA-Z]+. Third, it's ASCII-only (what about EBCDIC?). Fourth, it's assuming null termination (some string formats store a length at the start instead - Pascal strings, I think). Fifth, it's not 100% obvious at a glance that the code upper/lowercases. Also note that an enum is a badly-veiled integer: you can pass ToggleCase("some string", 1024) and it will compile.

These things aren't to say my code is super bad. It serves and will serve - just under some conditions.

Paul Nathan
A: 

I've made use of macros because, I think, it makes the code look better and it is more efficient than function calls.

Is it more efficient? What are your requirements on code size? (For the generated executable code, not the C source code.) On modern desktop systems, that's rarely an issue and speed matters much more; but you've not given us any more details beyond "embedded systems applications," so there's no way we can answer this for you. However, it's not an issue here, because the code inside the macros is truly so small—but you can't assume that avoiding function calls is always more efficient!

You can use inline functions, if you're allowed. They've been officially part of C since '99, but supported for far longer in several compilers. Inline functions are much cleaner than macros, but, again depending on your exact target requirements, it can be difficult to predict the generated code from the source. More commonly, however, people are stuck with outdated (now over a decade!) C compilers that don't support them.
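For instance, the two test macros from the question as C99 inline functions might look like this (my sketch; the names are illustrative):

```c
/* static inline replacements for the a_z/A_Z macros: the argument is
 * evaluated exactly once, so is_ascii_lower(*p++) behaves sanely. */
static inline int is_ascii_lower(int c) { return c >= 'a' && c <= 'z'; }
static inline int is_ascii_upper(int c) { return c >= 'A' && c <= 'Z'; }
```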

In short, you always have to know your exact requirements to determine what's optimal. And then you have to test to verify your performance predictions.

Roger Pate
+1  A: 

I hesitated to answer this because it's been over 20 years since I worked with small devices. However, I think the rules are pretty much the same (with one possible addition):

  1. Minimize memory accesses
  2. Minimize CPU cycles
  3. Minimize code size

When I was developing low-level code, rule #1 overshadowed all the others. There wasn't any on-board cache, and memory was incredibly slow relative to the processor; that's the reason the "register" storage class exists in C. Today the situation has changed somewhat, but it's still one of the top two concerns. As I commented on one post, a lookup table is a good idea, but recognize that it means an extra memory access for each test. Once it gets into cache that may not be an issue, but you're going to pay the price of several cache misses each time you enter the function (unless you're calling it so often that the lookup table can remain in cache).

Rule number 2 seems like "duh, of course you want to do that, why isn't it rule #1?" but the reasoning actually goes deeper. In fact, in some ways it's a restatement of rule #1, since each instruction has to be fetched from memory before it can be executed. There's a delicate tradeoff: on an integer-only processor, it's a clear win to use a lookup table to compute trigonometric functions; on a chip with embedded floating point, maybe not.

I'm not sure that rule #3 still applies. In my experience, there was always the scramble to cut code, fit the proverbial 20 pounds into a 10 pound sack. But it seems that today the smallest sack is 50 pounds. However, even with a 50 pound sack (or many-megabyte ROM) to hold your code/data, you still need to pull it into the cache (if you have one).

The new rule #1: keep the pipeline full

Modern processors have deep instruction pipelines (if you're not familiar with this term, see this article: http://arstechnica.com/old/content/2004/09/pipelining-1.ars/1). The general rule of thumb with deep pipelines is that branching -- an "if" test -- is expensive, because a mispredicted branch means the pipeline may have to be flushed and refilled with new code. So you write your code to branch on the unlikely case (see Adisak's post for a perhaps-justified no-branch implementation; +1 if I could).
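A no-branch version in the spirit Adisak describes might look like this (my sketch, not his actual code): relational operators in C evaluate to 0 or 1, and using bitwise `&`/`|` instead of the short-circuiting `&&`/`||` avoids introducing hidden branches, so the result can simply be shifted into the case bit.

```c
/* Sketch of a branch-free toggle: each comparison yields 0 or 1, and
   bitwise & and | (unlike && and ||) introduce no short-circuit branch.
   Shifting the 0/1 result left by 5 gives 0x20 exactly for letters. */
static void toggle_case_nobranch(char *c)
{
    for (; *c; c++) {
        unsigned char u = (unsigned char)*c;
        unsigned char is_letter =
            ((u >= 'a') & (u <= 'z')) | ((u >= 'A') & (u <= 'Z'));
        *c = (char)(u ^ (is_letter << 5));
    }
}
```

Whether this actually beats a well-predicted branch is, again, something only a profiler on your target can tell you.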

Someone with more recent experience than me will probably comment, and say "modern processors load the pipeline with both branches, so there's no cost penalty." Which is all well and good, but it brings up an overarching rule:

Rule 0: optimization depends on your architecture and workload

The microprocessor inside my dishwasher probably doesn't have a pipeline and maybe doesn't have a cache. Of course, it's probably not going to do a lot of text processing either. Then again, maybe it has both: there seem to be only a few major embedded processors on the market, so maybe there's a Pentium on that circuit board rather than an 8051 derivative. Even so, there's a wide range even within the Pentium-based embedded processors (http://en.wikipedia.org/wiki/List_of_Intel_Pentium_microprocessors#Embedded_processors). What is best for one might not be best for another.

Then there's the issue of what sort of data you're processing. If you're processing text, then it's likely (but not guaranteed) that most of your data will be letters, versus numbers or punctuation; so you can optimize for that.

However, there's more: I commented "ASCII only, huh?" on the OP; another commenter was more explicit: if you're processing text in 2010, you probably aren't processing ASCII. At the very least, you'll be dealing with ISO-8859-1 or a similar 8-bit character set. And in this case, maybe a no-branch or smart-branch (paying attention to the pipeline) solution will still be faster than a lookup table (yes, that's an assumption on my part). But if you're dealing with Unicode BMP (16 bits), you'll pretty much have to use the table, whatever its cost in terms of memory, because there are no easy rules to determine what's lower- versus upper-case. And if you're dealing with the higher planes of Unicode ... well, maybe capitalization of "Old Italics" isn't that important (especially because it doesn't have upper- and lower-case).

Ultimately, the only way to know for sure is to profile given realistic workloads.

Finally: Clear Code FTW

This post started when I wrote a comment to the OP that his/her use of macros was a bad idea (and couldn't enter it because SO went into maintenance mode). Peter Torok (sorry, I don't support Unicode or even ISO-8859-1) gave one reason, but there's another: they're black boxes.

The OP's code looks nice and clean: short code, heavy use of bitwise and ternary operators, easy to understand if you understand the language. But it would have been a lot easier to understand the actual work if you saw A_Z in its expanded form. That might have gotten you thinking about how much branching you were doing, particularly in the ToggleCase function. And then you might have given some thought to how you could re-arrange those branches to minimize the number of actual tests that you're doing. And perhaps given some thought to maintaining the pipeline.
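To illustrate (this is just the OP's ToggleCase with the macros expanded by hand), here is what the "clean" one-liner actually does per character:

```c
/* The OP's ToggleCase with a_z/A_Z mechanically expanded. Written out
   this way, the up-to-four comparisons and the conditional per
   character are plainly visible instead of hidden behind macro names. */
static void toggle_case_expanded(char *c)
{
    while (*c) {
        *c ^= ((*c >= 'a' && *c <= 'z') || (*c >= 'A' && *c <= 'Z'))
                  ? 32 : 0;
        c++;
    }
}
```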

Anon
A: 
supercat
You could mask on the high bit and use that as a pass-thru selection, and then it would work for all values 0-255. But if you knew your ASCII was 0-127, you could skip the extra work :-)
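One way to read that comment (my interpretation, sketched below for UpperCase only): `(v - 'a') | ('z' - v)` has its high bit set exactly when `v` falls outside `'a'..'z'`, so shifting that bit down yields a pass-thru mask with no branch. Done in a fixed-width unsigned type, the shift is well-defined for all values 0-255.

```c
#include <stdint.h>

/* Sketch of the high-bit "pass-thru mask" idea. In 32-bit unsigned
   arithmetic, (v - 'a') has bit 31 set iff v < 'a', and ('z' - v) has
   bit 31 set iff v > 'z'; OR-ing and shifting gives 1 for non-lowercase
   input and 0 for lowercase, which selects the case bit branch-free. */
static void upper_case_masked(char *c)
{
    for (; *c; c++) {
        uint32_t v = (unsigned char)*c;                      /* 0..255 */
        uint32_t out = ((v - 'a') | ((uint32_t)'z' - v)) >> 31;
        *c = (char)(v ^ ((1u - out) << 5));  /* XOR 0x20 only if lowercase */
    }
}
```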
Adisak