NOTE: The question title was edited; the original title, "Please Critique-- An optimal function for converting string cases in C", asked specifically about optimization, which is why my answer deals only with optimization rather than with generically "improving" the functions.
If you are really looking for the absolute fastest way to do it, a branch-free version is the way to go in the long run, because it can be extended to SIMD. It also avoids lookup tables, which may be too large on an embedded system where memory is really cramped.
Here is a simple non-SIMD branch-free example; ToLower is a trivial change from it.
char BranchFree_AsciiToUpper(char inchar)
{
    // Branch-free / no-lookup
    // toupper() for ASCII only
    const int ConvertVal = 'A' - 'a';
    // Bits to shift arithmetically right: 9 == (char-bits + 1)
    const int AsrBits = 9;
    int c = (int)inchar;
    //if( (('a'-1)<c) && (c<('z'+1)) ) { c += 'A'-'a'; }
    int LowerBound = ('a' - 1) - c;   // negative only when c >= 'a'
    int UpperBound = c - ('z' + 1);   // negative only when c <= 'z'
    // The arithmetic shift smears the sign bit across the word:
    // all-ones when c is in 'a'..'z', all-zeros otherwise
    int BranchFreeMask = (LowerBound & UpperBound) >> AsrBits;
    c = c + (BranchFreeMask & ConvertVal);
    return (char)c;
}
My function is expanded for clarity and uses non-hardcoded constants. You can do the same thing in a single line with hardcoded values, but I like readable code; here is the "compressed" version of my algorithm anyway. It is no faster, since it does exactly the same thing squeezed onto one line.
c+=(((96-(int)c)&((int)c-123))>>9)&(-32);
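If you want to sanity-check either version, a quick exhaustive test against the library's toupper() is only a few lines (this assumes the default "C" locale and that BranchFree_AsciiToUpper from above is in scope):

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    for (int i = 0; i < 256; ++i) {
        char expect = (char)toupper((unsigned char)i);
        char actual = BranchFree_AsciiToUpper((char)i);
        if (expect != actual)
            printf("mismatch at %d: got %d, expected %d\n", i, actual, expect);
    }
    return 0;
}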
There are a number of optimizations you can make here to squeeze out a bit more speed. The example above assumes nothing about the encoding beyond a-z and A-Z being contiguous ranges, so for ASCII you can hardcode more optimal numbers. For example, ConvertVal is +/-32 = +/-2^5 depending on whether you are doing toupper or tolower, so only bits 5 and up of the mask survive the final AND; if you don't have a barrel shifter, you can therefore reduce AsrBits to 4 (9 - 5), and the garbage in the mask's low bits gets discarded anyway.
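In compressed form, that variant of the one-liner would look like this (same algorithm, just the smaller shift; the low five bits of the mask are junk, but the final AND with -32 throws them away):

// Reduced shift: valid because only bits 5+ of the mask are ever used
c += (((96 - (int)c) & ((int)c - 123)) >> 4) & (-32);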
Once you have a working branch-free version, you can use SIMD or bit-twiddling SWAR (SIMD Within A Register) techniques to convert 4-16 bytes at a time (possibly even more, depending on how wide your registers are and whether you unroll to hide latency). This will be much faster than any lookup method, which is pretty much limited to single-byte conversion unless you use immensely large tables that grow exponentially per byte processed simultaneously.
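As a sketch of the SWAR approach, here is a version that converts eight bytes per call in a plain 64-bit register (the function name and constants are mine; it assumes 8-bit bytes and uint64_t from <stdint.h>):

#include <stdint.h>

uint64_t Swar_AsciiToUpper8(uint64_t w)
{
    const uint64_t ones  = 0x0101010101010101ULL;
    const uint64_t highs = 0x8080808080808080ULL;
    // Work on the low 7 bits of each byte so the adds below
    // can never carry into a neighboring byte.
    uint64_t heptets = w & ~highs;
    // After this add, bit 7 of each byte is set iff that byte >= 'a'.
    uint64_t ge_a = heptets + (0x80 - 'a') * ones;
    // Bit 7 of each byte is set iff that byte >= 'z' + 1.
    uint64_t gt_z = heptets + (0x80 - ('z' + 1)) * ones;
    // Lowercase iff >= 'a', not > 'z', and the original bit 7 was
    // clear (i.e. the byte really was ASCII).
    uint64_t is_lower = ge_a & ~gt_z & ~w & highs;
    // Move each marker from bit 7 down to bit 5 (0x20) and XOR it off:
    // clearing bit 5 of an ASCII lowercase letter uppercases it.
    return w ^ (is_lower >> 2);
}

Load and store the eight bytes with memcpy to keep the accesses alignment-safe and free of aliasing problems.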
Also, you can generate the branch-free predicate without the int upcasting, but then you have to do a couple more operations (with upcasting it is just one subtract per range). You may need the expanded form of the operations for SWAR, but most SIMD instruction sets have a compare operation that generates the mask for you for free.
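For instance, with SSE2 intrinsics on x86 (the function name is mine), the signed byte compares hand you the 0x00/0xFF per-byte mask directly, so the shift trick disappears entirely:

#include <emmintrin.h>   // SSE2

__m128i Sse2_AsciiToUpper16(__m128i v)
{
    // Bytes >= 0x80 are negative in a signed compare, so they never match.
    __m128i ge_a = _mm_cmpgt_epi8(v, _mm_set1_epi8('a' - 1));  // v > '`'
    __m128i le_z = _mm_cmpgt_epi8(_mm_set1_epi8('z' + 1), v);  // v < '{'
    __m128i mask = _mm_and_si128(ge_a, le_z);  // 0xFF for 'a'..'z' bytes
    // Add 'A'-'a' (-32) only to the bytes selected by the mask.
    return _mm_add_epi8(v, _mm_and_si128(mask, _mm_set1_epi8('A' - 'a')));
}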
The SWAR/SIMD operations also benefit from fewer reads and writes to memory, and the writes that do occur can be aligned. This is much faster on processors with load-hit-store penalties (such as the PS3's Cell processor). Combine this with simple prefetching in an unrolled version and you can avoid memory stalls almost entirely (see the loop sketch after the unrolled example below).
I know it seems like there is a lot of code in my example, but there are ZERO branches (implicit or explicit) and therefore no branch mispredictions. If you are on a platform with significant branch-mispredict penalties (true of many pipelined embedded processors), then even without SIMD, an optimized release build of the code above should run faster than something that looks much less complicated but creates implicit branches.
Even without SIMD/SWAR, a smart compiler can unroll and interleave the above implementation to hide latencies, yielding a very fast version - especially on modern superscalar processors that can issue more than one non-dependent instruction per cycle. That is not usually possible with any of the branching versions.
If you manually unroll, group the loads together and gather the stores together to make it easier for the compiler to interleave the non-branching, non-dependent instructions in between. Example:
// Unrolled inner loop where 'char *c' is the string we're converting
char c0=c[0],c1=c[1],c2=c[2],c3=c[3]; // Grouped-Loads
c[0]=BranchFree_AsciiToUpper(c0);
c[1]=BranchFree_AsciiToUpper(c1);
c[2]=BranchFree_AsciiToUpper(c2);
c[3]=BranchFree_AsciiToUpper(c3);
c+=4;
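Putting it together with the prefetch idea from earlier, a full conversion routine might look like this (a sketch: __builtin_prefetch is the GCC/Clang spelling of the hint, and the 64-byte lookahead is a guess to tune per platform, not a measured value):

#include <stddef.h>

void AsciiToUpperString(char *s, size_t len)
{
    size_t i = 0;
    for (; i + 4 <= len; i += 4) {
        __builtin_prefetch(s + i + 64);  // hint at a future cache line
        char c0 = s[i], c1 = s[i+1], c2 = s[i+2], c3 = s[i+3];  // grouped loads
        s[i]   = BranchFree_AsciiToUpper(c0);
        s[i+1] = BranchFree_AsciiToUpper(c1);
        s[i+2] = BranchFree_AsciiToUpper(c2);
        s[i+3] = BranchFree_AsciiToUpper(c3);  // gathered stores
    }
    for (; i < len; ++i)                 // leftover bytes
        s[i] = BranchFree_AsciiToUpper(s[i]);
}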
A decent compiler should be able to inline the ToUpper calls and fully interleave the unrolled snippet, since there are no branches, no memory aliasing, and no codependent instructions between the four conversions. Just for kicks, I decided to actually compile it, and a compiler targeting PowerPC generated perfect interleaving for a dual-issue superscalar core that will easily outperform any code with branches:
mr r31,r3
mr r13,r13
lbz r11,0(r31)
lbz r10,1(r31)
extsb r11,r11
lbz r9,2(r31)
extsb r10,r10
lbz r8,3(r31)
subfic r7,r11,96
addi r6,r11,-123
srawi r5,r7,9
srawi r4,r6,9
subfic r3,r10,96
addi r7,r10,-123
extsb r9,r9
srawi r6,r3,9
srawi r3,r7,9
subfic r7,r9,96
addi r30,r9,-123
extsb r8,r8
srawi r7,r7,9
srawi r30,r30,9
subfic r29,r8,96
addi r28,r8,-123
srawi r29,r29,9
srawi r28,r28,9
and r5,r5,r4
and r3,r6,r3
and r7,r7,r30
and r30,r29,r28
clrrwi r4,r5,5
clrrwi r6,r7,5
clrrwi r5,r3,5
clrrwi r7,r30,5
add r4,r4,r11
add r3,r5,r10
add r11,r6,r9
stb r4,0(r31)
add r10,r7,r8
stb r3,1(r31)
stb r11,2(r31)
stb r10,3(r31)
The proof is in the pudding: the compiled code above is going to be very fast compared to branching versions, even before moving on to SWAR or SIMD.
In summary, reasons why this should be the fastest method:
- No branch misprediction penalties
- Ability to SIMD-ify the algorithm to process 4-16 (or more) bytes at a time
- Compiler (or programmer) can unroll and interleave to eliminate latencies and to take advantage of superscalar (multi-issue) processors
- No memory latencies (i.e. no table lookups)