ansaurus

Question

Does anyone know of a C compiler that fails to compile this?

Answer 1

+2 A:

I get warnings, but no errors (gcc). Seems to compile and operate fine. May not work for big-endian systems, though!

I wouldn't suggest this method, though. Perhaps you can xor instead of or-shift, to create a single byte. Then use the case statement on a byte (or, faster, use a LUT of the first N bits).

strager 2008-11-30 00:01:29

That's effectively a perfect hash. It doesn't matter too much for this application, but the downside of a perfect hash is that it can return wrong answers for bad input.

Dustin 2008-11-30 00:06:55

@Dustin, then you can do a strcmp once you have your possible answer to check it. Use a LUT for this. Easy.

strager 2008-11-30 00:09:04

One of my earlier attempts at this was to write a program that took a set of words as input and generated a hierarchical case statement switching on letters before doing a strcmp. For example: http://pastebin.com/f30e09462 (optionally breaking the Js down further).

Dustin 2008-11-30 00:13:53

@Dustin, you can perhaps break your switch up into further switches. This should be faster than calling memcmp. If you don't want to do that, you can take advantage of the fact that you already know the first character, and don't need to compare it.

strager 2008-11-30 00:41:07

@strager I wrote this thing to build that for me: http://github.com/dustin/snippets/tree/master/python/misc/gensearch.py

Dustin 2008-11-30 00:41:17

Answer 2

+3 A:

There are at least 3 things that keep this program from being portable:

Multi-character constants are implementation-defined so different compilers may handle them differently.
A byte can be more than 8 bits, there is plenty of hardware where the smallest addressable unit of memory is 16 or even 32 bits, you often find this in DSPs for example. If a byte is more than 8 bits then so will char since char is by definition one byte long; your program will not function properly on such systems.
Lastly, there are many machines where int is only 16-bits (which is the smallest size allowed for int) including embedded devices and legacy machines, your program will fail on these machines as well.

Robert Gamble 2008-11-30 00:01:49

My 64-bit processor has 32-bit int's, and it works fine. Maybe you meant less than?

strager 2008-11-30 00:02:45

Really or the BYTE thing? AFAIK byte is by definition 8 bits

JaredPar 2008-11-30 00:03:17

@JaredPar: A byte in C has at least 8 bits but may have more. There are plenty of systems around today (new ones too!) with bytes larger than 8 bits.

Robert Gamble 2008-11-30 00:06:29

Okay, you're using the C definition (notoriously vague). I've gotten to the point now that I use explicit sized types in my code to stop confusing other people. Ex: __int16, __int32, etc ...

JaredPar 2008-11-30 00:18:40

@JaredPar: There is plenty of hardware where the smallest addressable unit of memory is 16 or 32 bits, this is not just a C definition thing. On such systems a char will be 16 or 32 bits and an int may be only 1 byte. The world isn't a Windows PC ;)

Robert Gamble 2008-11-30 00:32:04

@Robert Gamble, Doesn't the C spec say the size order must be: char <= short <= int <= long ?

strager 2008-11-30 00:40:02

@strager, yes and on a system where a byte is 32 bits: char, short, int and long could all be 1 byte.

Robert Gamble 2008-11-30 00:51:47

@strager: specifically, there were Cray machines where sizeof(char) == sizeof(short) == sizeof(int); I believe they were all 32-bit quantities.

Jonathan Leffler 2008-11-30 21:15:06

Answer 3

+9 A:

I only know what the C Standard says about this (C99):

The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-deﬁned. If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int.

(6.4.4.4/10 taken from a draft)

So it's implementation defined. Meaning it is not guaranteed it works the same everywhere, but the behavior must be documented by the implementation. For example if int is only 16 bits wide in a particular implementation, then 'Jan/' can't be represented anymore like you intend it (char must be at least 8 bits, while a character literal is always of type int).

Johannes Schaub - litb 2008-11-30 00:06:08

It's theoretically wrong, I understand. I'm wondering if there are any C compilers where it doesn't work in practice. It's completely out of curiosity as it worked for me even on an ancient compiler that fails on over half the open source stuff I find.

Dustin 2008-11-30 00:10:09

@Dustin, try compiling for PPC (non-Intel Mac OS X). I think that's big-endian.

strager 2008-11-30 00:11:53

@strager Did... I ran this on my G3 as well as my SGI (which is sort of bidirectional, but runs IRIX in big-endian).

Dustin 2008-11-30 00:14:56

@Dustin, I have an old 8051 compiler that fails to compile this - it only had 16-bit ints and it gags on the 'xxxx' constants (it would also fail at runtime with the shift-or code). In fact it also gags on 'xx' constants.

paxdiablo 2008-11-30 00:21:56

Dustin. you will probably not figure out by compiler, but by their backends. i can change by GCC backend to make int 16bits wide, and then gcc-eco32 will miscompile it.

Johannes Schaub - litb 2008-11-30 00:24:19

@Pax neat, thanks. I suppose I should've qualified it with a 32-bit compiler, but the second failure is half of the type of error I was looking for.

Dustin 2008-11-30 00:26:25

In practice? For example the world's first C compiler? (16-bit ints.)

Windows programmer 2008-12-02 04:11:13

Answer 4

A:

The fact that a four character constant is equivalent to an particular 32-bit integer is a non-standard feature often seen on compilers for MS Windows and Mac computers (and PalmOS, AFAICR).

On theses systems a four character string is commonly used as a tag for identifying chunks of data files, or as an application / data-type identifier (e.g. "APPL").

It's a convenience then for the developer that they can store such a string into various data-structures without worrying about zero-byte termination, pointers, etc.

Alnitak 2008-11-30 00:12:47

Answer 5

A:

Machine word size issues aside, your compiler may promote input[i] to a negative integer which will just set the upper bits of inputInt with or operation, so I suggest you to be explicit about signedness of char variables.

But since in US, no one cares about the 8th bit, it is probably a non-issue for you.

artificialidiot 2008-11-30 00:13:27

IBM cares a great deal about the 8th bit, they even do Unicode :-)

paxdiablo 2008-11-30 01:01:30

Answer 6

+9 A:

if ( !input[0] || !input[1] || !input[2] || input[3] != '/' )
    return -1;

switch ( input[0] )
{
    case 'F': return 1; // Feb
    case 'S': return 8; // Sep
    case 'O': return 9; // Oct
    case 'N': return 10; // Nov
    case 'D': return 11; // Dec;
    case 'A': return input[1] == 'p' ? 3 : 7; // Apr, Aug
    case 'M': return input[2] == 'r' ? 2 : 4; // Mar, May
    default: return input[1] == 'a' ? 0 : (input[2] == 'n' ? 5 : 6); // Jan, Jun, Jul
}

Slightly less readable and not so much validating, but perhaps even faster, no?

Vilx- 2008-11-30 00:14:43

Typo on the last line ('a' ? 0 -- not 1), but yes, this is slightly over twice as fast on my machine. Where performance matters, you win. I'm wondering where this isn't readable enough, though.

Dustin 2008-11-30 00:22:44

@Dustin, it's readable if you add comments explaining it - won't affect run speed, will make it as readable as yours for the next bod looking at it. I would make sure the 31-day months are listed first since that would give a tiny bit of extra speed (statistically).

paxdiablo 2008-11-30 00:27:31

Assuming the case statements are compiled in the given order, that is.

paxdiablo 2008-11-30 00:28:09

@Pax Agreed. I didn't find it particularly confusing and my test pointed out a typo that led to an incorrect return. This may just be a case where the right answer to my question is that I asked the wrong question.

Dustin 2008-11-30 00:35:35

I'm not convinced that 'Zax' should be accepted as January; nor that Xxx should be accepted as July.

Jonathan Leffler 2008-11-30 21:02:52

As I said - it's not validating. If you want validation, then these are note the lines are looking for.

Vilx- 2008-12-06 16:47:14

OK - the original code was validating (rv is initialized to -1, though it would be clearer if that was in a default case in the switch), so you're addressing a different problem.

Jonathan Leffler 2008-12-13 04:14:40

Answer 7

+8 A:

You're just computing a hash of those four characters. Why not predefine some integer constants that compute the hash in the same way and use those? Same readability and you're not depending on any implementation specific idiosyncrasies of the compiler.

uint32_t MONTH_JAN = 'J' << 24 + 'a' << 16 + 'n' << 8 + '/';
uint32_t MONTH_FEB = 'F' << 24 + 'e' << 16 + 'b' << 8 + '/';

...

static uint32_t parseMonth(const char *input) {
    uint32_t rv=-1;
    uint32_t inputInt=0;
    int i=0;

    for(i=0; i<4 && input[i]; i++) {
        inputInt = (inputInt << 8) | (input[i] & 0x7f); // clear top bit
    }

    switch(inputInt) {
        case MONTH_JAN: rv=0; break;
        case MONTH_FEB: rv=1; break;

        ...
    }

    return rv;
}

tvanfosson 2008-11-30 00:38:53

I do like this one aesthetically. I think it looks just as nice as what I made, but can be made to work more consistently and without the occasional compiler warning. Thanks.

Dustin 2008-11-30 01:54:37

Answer 8

+10 A:

Solaris 10 - SPARC - SUN Compiler.

Test code:

#include <stdio.h>

static int parseMonth(const char *input) {
    int rv=-1;
    int inputInt=0;
    int i=0;

    for(i=0; i<4 && input[i]; i++) {
        inputInt = (inputInt << 8) | input[i];
    }

    switch(inputInt) {
        case 'Jan/': rv=0; break;
        case 'Feb/': rv=1; break;
        case 'Mar/': rv=2; break;
        case 'Apr/': rv=3; break;
        case 'May/': rv=4; break;
        case 'Jun/': rv=5; break;
        case 'Jul/': rv=6; break;
        case 'Aug/': rv=7; break;
        case 'Sep/': rv=8; break;
        case 'Oct/': rv=9; break;
        case 'Nov/': rv=10; break;
        case 'Dec/': rv=11; break;
    }

    return rv;
}

static const struct
{
    char *data;
    int   result;
} test_case[] =
{
    { "Jan/", 0 },
    { "Feb/", 1 },
    { "Mar/", 2 },
    { "Apr/", 3 },
    { "May/", 4 },
    { "Jun/", 5 },
    { "Jul/", 6 },
    { "Aug/", 7 },
    { "Sep/", 8 },
    { "Oct/", 9 },
    { "Nov/", 10 },
    { "Dec/", 11 },
    { "aJ/n", -1 },
};

#define DIM(x) (sizeof(x)/sizeof(*(x)))

int main(void)
{
    size_t i;
    int    result;

    for (i = 0; i < DIM(test_case); i++)
    {
        result = parseMonth(test_case[i].data);
        if (result != test_case[i].result)
            printf("!! FAIL !! %s (got %d, wanted %d)\n",
                   test_case[i].data, result, test_case[i].result);
    }
    return(0);
}

Results (GCC 3.4.2 and Sun):

$ gcc -O xx.c -o xx
xx.c:14:14: warning: multi-character character constant
xx.c:15:14: warning: multi-character character constant
xx.c:16:14: warning: multi-character character constant
xx.c:17:14: warning: multi-character character constant
xx.c:18:14: warning: multi-character character constant
xx.c:19:14: warning: multi-character character constant
xx.c:20:14: warning: multi-character character constant
xx.c:21:14: warning: multi-character character constant
xx.c:22:14: warning: multi-character character constant
xx.c:23:14: warning: multi-character character constant
xx.c:24:14: warning: multi-character character constant
xx.c:25:14: warning: multi-character character constant
$ ./xx
$ cc -o xx xx.c
$ ./xx
!! FAIL !! Jan/ (got -1, wanted 0)
!! FAIL !! Feb/ (got -1, wanted 1)
!! FAIL !! Mar/ (got -1, wanted 2)
!! FAIL !! Apr/ (got -1, wanted 3)
!! FAIL !! May/ (got -1, wanted 4)
!! FAIL !! Jun/ (got -1, wanted 5)
!! FAIL !! Jul/ (got -1, wanted 6)
!! FAIL !! Aug/ (got -1, wanted 7)
!! FAIL !! Sep/ (got -1, wanted 8)
!! FAIL !! Oct/ (got -1, wanted 9)
!! FAIL !! Nov/ (got -1, wanted 10)
!! FAIL !! Dec/ (got -1, wanted 11)
$

Note that the last test case still passed - that is, it generated a -1.

Here's a revised - more verbose - version of parseMonth() which does work the same under both GCC and Sun C compiler:

#include <stdio.h>

/* MONTH_CODE("Jan/") does not reduce to an integer constant */
#define MONTH_CODE(x)   ((((((x[0]<<8)|x[1])<<8)|x[2])<<8)|x[3])

#define MONTH_JAN       (((((('J'<<8)|'a')<<8)|'n')<<8)|'/')
#define MONTH_FEB       (((((('F'<<8)|'e')<<8)|'b')<<8)|'/')
#define MONTH_MAR       (((((('M'<<8)|'a')<<8)|'r')<<8)|'/')
#define MONTH_APR       (((((('A'<<8)|'p')<<8)|'r')<<8)|'/')
#define MONTH_MAY       (((((('M'<<8)|'a')<<8)|'y')<<8)|'/')
#define MONTH_JUN       (((((('J'<<8)|'u')<<8)|'n')<<8)|'/')
#define MONTH_JUL       (((((('J'<<8)|'u')<<8)|'l')<<8)|'/')
#define MONTH_AUG       (((((('A'<<8)|'u')<<8)|'g')<<8)|'/')
#define MONTH_SEP       (((((('S'<<8)|'e')<<8)|'p')<<8)|'/')
#define MONTH_OCT       (((((('O'<<8)|'c')<<8)|'t')<<8)|'/')
#define MONTH_NOV       (((((('N'<<8)|'o')<<8)|'v')<<8)|'/')
#define MONTH_DEC       (((((('D'<<8)|'e')<<8)|'c')<<8)|'/')

static int parseMonth(const char *input) {
    int rv=-1;
    int inputInt=0;
    int i=0;

    for(i=0; i<4 && input[i]; i++) {
        inputInt = (inputInt << 8) | input[i];
    }

    switch(inputInt) {
        case MONTH_JAN: rv=0; break;
        case MONTH_FEB: rv=1; break;
        case MONTH_MAR: rv=2; break;
        case MONTH_APR: rv=3; break;
        case MONTH_MAY: rv=4; break;
        case MONTH_JUN: rv=5; break;
        case MONTH_JUL: rv=6; break;
        case MONTH_AUG: rv=7; break;
        case MONTH_SEP: rv=8; break;
        case MONTH_OCT: rv=9; break;
        case MONTH_NOV: rv=10; break;
        case MONTH_DEC: rv=11; break;
    }

    return rv;
}

static const struct
{
    char *data;
    int   result;
} test_case[] =
{
    { "Jan/", 0 },
    { "Feb/", 1 },
    { "Mar/", 2 },
    { "Apr/", 3 },
    { "May/", 4 },
    { "Jun/", 5 },
    { "Jul/", 6 },
    { "Aug/", 7 },
    { "Sep/", 8 },
    { "Oct/", 9 },
    { "Nov/", 10 },
    { "Dec/", 11 },
    { "aJ/n", -1 },
    { "/naJ", -1 },
};

#define DIM(x) (sizeof(x)/sizeof(*(x)))

int main(void)
{
    size_t i;
    int    result;

    for (i = 0; i < DIM(test_case); i++)
    {
        result = parseMonth(test_case[i].data);
        if (result != test_case[i].result)
            printf("!! FAIL !! %s (got %d, wanted %d)\n",
                   test_case[i].data, result, test_case[i].result);
    }
    return(0);
}

I wanted to use MONTH_CODE() but the compilers did not cooperate.

Jonathan Leffler 2008-11-30 01:17:04

Excellent. My failure is your success. :)

Dustin 2008-11-30 01:28:46

Or, your success is @Leffler's failer. =]

strager 2008-11-30 01:33:16

Is SPARC is big-endian? If so, reversing the characters (or the loop ordering) ought to fix it.

Mr Fooz 2008-11-30 01:34:54

SPARC is big-endian.

Jonathan Leffler 2008-11-30 01:41:07

I'm very intrigued at the difference between GCC and Sun C. Granted this is at best implementation defined behaviour, but normally, GCC emulates the native compilers pretty well. I don't regard it as a bug, though - not in either compiler. It is simply 'implementation specific' behaviour and AOK.

Jonathan Leffler 2008-11-30 01:42:40

Also, I tried: "/naJ" wanting -1 and got back a 'test failure' with the function returning 0. So, 4-byte word reverse ordering lets the test pass. But it hardly resolves the problem - that the parseMonth() code is not reliable across platforms.

Jonathan Leffler 2008-11-30 01:45:38

@Jonathan - agreed. People have already posted a couple faster solutions, or solutions just as pretty to my problem. I have no excuse to leave this code in, pretty as I think it is, when there are real compilers I might encounter that will fail on it.

Dustin 2008-11-30 01:52:15

@Dustin: see my revision which provides a portable solution.

Jonathan Leffler 2008-11-30 02:03:33

@Jonathan Great. Should MONTH_CODE be used on the input instead of the loop?

Dustin 2008-11-30 02:08:06

@Dustin - yes, you could do that; it would probably be marginally faster.

Jonathan Leffler 2008-11-30 02:26:38

A MONTH_CODE variant which probably would work: #define MONTH_CODE(a,b,c,d) (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | ((uint32_t)(c) << 8) | (uint32_t)(d))

CesarB 2008-11-30 15:35:36

@CesarB; yes, that should work - invoked MONTH_CODE('J','a','n','/'); it would be tempting to hard-code the '/' and reduce the parameters to three.

Jonathan Leffler 2008-11-30 16:55:46

Technically, of course, the Sun C compiler does compile the code fine; it just doesn't work the same as it does under GCC or on Intel chips. So, this isn't quite an accurate answer for the question. :D

Jonathan Leffler 2008-12-01 01:32:57

Answer 9

A:

I'd sure love to see the profiling that shows this is your most significant bottleneck, but in any case if you're going to pull something like this, use a union instead of 50 instructions looping and shifting. Here's a little example program, I'll leave it to you to fit it into your program.

/* union -- demonstrate union for characters */

#include <stdio.h>

union c4_i {
    char c4[5];
    int  i ;
} ;

union c4_i ex;

int main (){
    ex.c4[0] = 'a';
    ex.c4[1] = 'b';
    ex.c4[2] = 'c';
    ex.c4[3] = 'd';
    ex.c4[4] = '\0';
    printf("%s 0x%08x\n", ex.c4, ex.i );
    return 0;
}

Here's example output:

bash $ ./union
abcd 0x64636261
bash $

Charlie Martin 2008-11-30 01:31:04

I did a similar thing by just casting the char* to an int* and then reading from within it. That's where I got into endian problems.

Dustin 2008-11-30 01:36:52

None of these things is going to be immune to problems with endian-ness, etc, plus this is vulnerable to someone mixing case.

Charlie Martin 2008-11-30 20:56:31

Answer 10

+4 A:

char *months = "Jan/Feb/Mar/Apr/May/Jun/Jul/Aug/Sep/Oct/Nov/Dec/";
char *p = strnstr(months, input, 4);
return p ? (p - months) / 4 : -1;

Scott Evernden 2008-11-30 01:54:08

I did that one as well. It's certainly smaller, but wasn't faster as it took considerably longer to parse December dates than January dates. I had one that would adapt the haystack to take advantage of the input being chronological as well. More complicated, but not really faster.

Dustin 2008-11-30 01:57:16

that surprising as the strstr() should compile into nearly a single machine instruction on most PCs

Scott Evernden 2008-11-30 04:17:33

@Dustin, I ran some benchmarks without optimization using your original method and Scott's method on my 5 year old computer. Scott found Jan ten million times in 1.6s and Dec in 8.3s, your program found both Jan and Dec in 2.6s. So while significantly clearer, Scott's method is a little slower...

Robert Gamble 2008-11-30 06:27:29

in some cases. My question is this: what exactly are you doing that only being able to call this function a million times a second causes a bottleneck?

Robert Gamble 2008-11-30 06:29:09

This is a neat solution; the only downside I see is that strnstr() is not a standard C function.

Jonathan Leffler 2008-11-30 21:17:34

@Jonathan's right about strnstr, I tested with strstr instead which I forgot to mention.

Robert Gamble 2008-12-01 02:12:04

Answer 11

+1 A:

Comeau compiler

Comeau C/C++ 4.3.10.1 (Oct  6 2008 11:28:09) for ONLINE_EVALUATION_BETA2
Copyright 1988-2008 Comeau Computing.  All rights reserved.
MODE:strict errors C99 

"ComeauTest.c", line 11: warning: multicharacter character literal (potential
          portability problem)
          case 'Jan/': rv=0; break;
               ^

"ComeauTest.c", line 12: warning: multicharacter character literal (potential
          portability problem)
          case 'Feb/': rv=1; break;
               ^

"ComeauTest.c", line 13: warning: multicharacter character literal (potential
          portability problem)
          case 'Mar/': rv=2; break;
               ^

"ComeauTest.c", line 14: warning: multicharacter character literal (potential
          portability problem)
          case 'Apr/': rv=3; break;
               ^

"ComeauTest.c", line 15: warning: multicharacter character literal (potential
          portability problem)
          case 'May/': rv=4; break;
               ^

"ComeauTest.c", line 16: warning: multicharacter character literal (potential
          portability problem)
          case 'Jun/': rv=5; break;
               ^

"ComeauTest.c", line 17: warning: multicharacter character literal (potential
          portability problem)
          case 'Jul/': rv=6; break;
               ^

"ComeauTest.c", line 18: warning: multicharacter character literal (potential
          portability problem)
          case 'Aug/': rv=7; break;
               ^

"ComeauTest.c", line 19: warning: multicharacter character literal (potential
          portability problem)
          case 'Sep/': rv=8; break;
               ^

"ComeauTest.c", line 20: warning: multicharacter character literal (potential
          portability problem)
          case 'Oct/': rv=9; break;
               ^

"ComeauTest.c", line 21: warning: multicharacter character literal (potential
          portability problem)
          case 'Nov/': rv=10; break;
               ^

"ComeauTest.c", line 22: warning: multicharacter character literal (potential
          portability problem)
          case 'Dec/': rv=11; break;
               ^

"ComeauTest.c", line 1: warning: function "parseMonth" was declared but never
          referenced
  static int parseMonth(const char *input) {
             ^

uvts_cvs 2008-11-30 10:45:57

Thanks for the compiler pointer. I'm not worried about compilers complaining about potential portability problems, just compilers producing code that behaves differently.

Dustin 2008-11-30 20:14:23

Answer 12

A:

As mentioned by others, that code throws a bunch of warnings and is probably not endian-safe.

Was your original date parser hand-written as well? Have you tried strptime(3)?

HUAGHAGUAH 2008-11-30 18:38:45

Answer 13

+2 A:

National Instrument's CVI 8.5 for Windows compiler fails on your original code with multiple warnings:

  Warning: Excess characters in multibyte character literal ignored.

and errors of the form:

  Duplicate case label '77'.

It succeeds on Jonathan's code.

AShelly 2008-12-02 02:37:54

ansaurus

tags:

views:

answers:

Does anyone know of a C compiler that fails to compile this?

related questions