views:

162

answers:

7

I need to parse strings of four hex characters to an integer. The characters appear inside a longer string, and there are no separators - I just know the offset they can be found in. The hex characters are case insensitive. Example with offset 3:

"foo10a4bar" -> 4260

I'm looking for a snippet that is

  • Short (too much code always creates complexity)
  • Simple (simple to understand and verify that it is correct)
  • Safe (invalid input is detected and signalled, no potential memory problems)

I'm a bit leery of using the 'sscanf' family of functions for this, but if there's a safe ANSI C solution using them, they can be used.

+3  A: 

strtol is simple with good error handling:

const int OFFSET = 3, LEN = 4;
char hex[LEN + 1];
int i;
for(i = 0; i < LEN && str[OFFSET + i]; i++)
{
  hex[i] = str[OFFSET + i];
  if(!isxdigit((unsigned char) hex[i]))
  {
    // signal error, return
  }
}
if(i != LEN)
{
  // signal error, return
}
hex[LEN] = '\0';
char *end;
int result = (int) strtol(hex, &end, 16);
if(end != hex + LEN)
{
  // signal error, return
}
Matthew Flaschen
Except that you cannot specify the maximum amount of characters to read I believe. So it will read past the buffer if the digits just continue. Also, it allows for preceding spaces, sign prefixes, 0x, etc. so atleast it requires pre-validation to accept *only* the 4 hex characters.
Nakedible
If it can be made short and simple enough, a function that copies four characters to a buffer, adds a terminating '\0', verifies that each character is [0-9a-fA-F] and then calls strtol could be good enough...
Nakedible
@Naked, maybe you should wait for the complete answer before down-voting (if that was you)? However, you do raise a possible point about the 0x, sign, and whitespace.
Matthew Flaschen
Downvotes luckily can be changed (it was me), I was voting based on the answer that was there - which used without caution, would've caused a buffer overflow and other things.
Nakedible
'ishexdigit' doesn't seem to be a standard C function, so that would need to be implemented as well, and would count towards the number of lines...
Nakedible
@Naked, I fail to see how "strtol is simple with good error handling" would cause a buffer overflow. Any function can be misused. And `isxdigit` (sorry for the typo) is standard C.
Matthew Flaschen
@Matthew Flaschen: I don't think that you need the `isxdigit` test in the loop at all. You'd have to initialize `hex` with `{0}` and then your final test will detect this as well.
Jens Gustedt
The argument to `isxdigit()` should be cast to `unsigned char` (it is defined only to work on values in the range of `unsigned char`, or `EOF`).
caf
@Jens, the purpose is to exclude whitespace, signs, 0x, etc, which `strtol` will tolerate. @EOF, thanks, added.
Matthew Flaschen
@caf: no, casting to `unsigned char` is not necessary since `char` values will in any case always be representable in `unsigned char`, so they fit the requirement of `isxdigit`.
Jens Gustedt
@Matthew Flaschen: I see. Still then I would initialize `hex`, do the assignment to `hex[i]` conditionally and capture the error just at the end. But I admit, that this is just nit picking...
Jens Gustedt
@Jens Gustedt: No, `-1` is a simple example of a possible `char` value (on systems where `char` is signed) that is not in the range of `unsigned char` (and in fact, it is quite common for the `is*()` functions to crash on negative values other than EOF). No character in the basic execution set can have a negative value, but it's not clear that input to this function is limited to such characters.
caf
+2  A: 

Here's my attempt

#include <assert.h>

static int h2d(char c) {
    int x;
    switch (c) {
        default: x = -1; break; /* invalid hex digit */
        case '0': x = 0; break;
        case '1': x = 1; break;
        case '2': x = 2; break;
        /* ... */
        case 'E': case 'e': x = 14; break;
        case 'F': case 'f': x = 15; break;
    }
    return x;
}

int hex4(const char *src, int offset) {
    int tmp, val = 0;
    tmp = h2d(*(src+offset+0)); assert(tmp >= 0); val += tmp << 12;
    tmp = h2d(*(src+offset+1)); assert(tmp >= 0); val += tmp << 8;
    tmp = h2d(*(src+offset+2)); assert(tmp >= 0); val += tmp << 4;
    tmp = h2d(*(src+offset+3)); assert(tmp >= 0); val += tmp;
    return val;
}

Of course, instead of assert use your preferred method of validation!

And you can use it like this

int val = hex4("foo10a4bar", 3);
pmg
Looks like it should work nicely - and it is definitely simple to understand. Unfortunately it's also a bit long.
Nakedible
Again, this doesn't check whether the original string is invalid (too short).
Matthew Flaschen
I think checking shouldn't even be part of the function (if you need to check, write another function). But if the string is too short, the h2d() function will catch a '\0'
pmg
Nobody said the string would be '\0' terminated. But, as for my use, it has already been verified that the string has atleast 4 characters after the offset.
Nakedible
If it isn't '\0' terminated it isn't a string.
pmg
It isn't a C-string, but it sure is a string. There are many different kinds of strings: Pascal strings, ropes, etc. And just because a null terminated string is called a C-string and string literals are C-strings doesn't mean that other types of strings wouldn't be used in C-code.
Nakedible
I agree there are all kinds of strings ... but mixing `C` and `string` in the same sentence without clarifications makes the string a nul-terminated one :)
pmg
A: 

Here's my own try at it now that I thought about it for a moment - I'm not at all sure this is the best, so I will wait a while and then accept the answer that seems best to me.

    val = 0;
    for (i = 0; i < 4; i++) {
        val <<= 4;
        if (ptr[offset+i] >= '0' && ptr[offset+i] <= '9')
            val += ptr[offset+i] - '0';
        else if (ptr[offset+i] >= 'a' && ptr[offset+i] <= 'f')
            val += (ptr[offset+i] - 'a') + 10;
        else if (ptr[offset+i] >= 'A' && ptr[offset+i] <= 'F')
            val += (ptr[offset+i] - 'A') + 10;
        else {
            /* signal error */
        }
    }
Nakedible
You're not checking whether the input is too short.
Matthew Flaschen
True, that wasn't included there as it was already done in an earlier step in my code. However, none of the other code snippets check it either, so it should be comparable.
Nakedible
Actually, Frank's (deleted), caf's, and mine all do.
Matthew Flaschen
A: 

Code

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int offset = atoi(argv[2]);
    argv[1][offset + 4] = '\0';
    printf("%lu\n", strtol(argv[1] + offset, NULL, 0x10));
}

Usage

matt@stanley:$ make small_hex_converter
cc     small_hex_converter.c   -o small_hex_converter
matt@stanley:$ ./small_hex_converter f0010a4bar 3
4260
Matt Joiner
This doesn't check for too short input and it permanently modifies the original string, which the question doesn't allow.
Matthew Flaschen
Yup, needs modifications for those. And does not check for the presence of '-' or '0x' or anything like that.
Nakedible
Why would you want any of that... It's a minimal example, as minimal as it gets. If you don't want to modify the original, use `strndup`. If you want to handle `0x` and `-`, use `scanf`.
Matt Joiner
strndup calls malloc, I wouldn't want to use that. And the problem is that strtol accepts those '-' and other options that should not be accepted! I'm looking for a minimal answer, but not a minimal answer that doesn't satisfy all of the original criteria.
Nakedible
+1  A: 

Here's an alternative based on character arithmetic:

int hexdigits(char *str, int ndigits)
{
    int i;
    int n = 0;
    for (i=0; i<ndigits; ++i) {
        int d = *str++ - '0';
        if (d > 9 || d < 0)
            d += '0' - 'A' + 10;
        if (d > 15 || d < 0)
            d += 'A' - 'a';
        if (d > 15 || d < 0)
            return -1;
        n <<= 4;
        n |= d;
    }
    return n;
}

It should handle digits in both cases, and work for both ASCII and EBCDIC. Using it for more than 7 digits invites integer overflow, and may make the use of -1 as an error value indistinguishable from a valid conversion.

Just call it with the offset added to the base string: e.g. w = hexdigits(buf+3, 4); for the suggested offset of 3 chars into a string stored in buf.

Edit: Here's a version with fewer conditions that is guaranteed to work for ASCII. I'm reasonably certain it will work for EBCDIC as well, but don't have any text of that flavor laying around to prove it.

Also, I fixed a stupid oversight and made the accumulator an int instead of unsigned short. It wouldn't affect the 4-digit case, but it made it overflow at only 16-bit numbers instead of the full capacity of an int.

int hexdigits2(char *str, int ndigits)
{
    int i;
    int n = 0;
    for (i=0; i<ndigits; ++i) {
        unsigned char d = *str++ - '0';
        if (d > 9)
            d += '0' - 'A' + 10;
        if (d > 15)
            d += 'A' - 'a';
        if (d > 15)
            return -1;
        n <<= 4;
        n |= d;
    }
    return n;
}

Usage is the same as the earlier version, but the generated code could be a bit smaller.

RBerteig
Looks good, but it's a bit hard to follow in my opinion. I'm not sure the consecutive if sequencing is any better than just an ifelse sequence.
Nakedible
@Nakedible, "better" depends strongly on the optimizer and target platform. It certainly isn't the only way to skin this cat. I thought it worth showing for the alternative point of view as much as anything else. Also, it probably reads more clearly to someone with too many years experience writing assembly code for 8-bit micros with less than 16KB of RAM plus ROM total, where the extra support for lower case would have been left out of course.
RBerteig
Quite true. Performance or compiled code size are not on my personal priority list for this, so "better" was referring to clarity - but you are right, an assembly coder probably considers this as the simplest.
Nakedible
The second version I added is slightly trickier. It depends on observing that any error in the character range will produce a digit result in an unsigned char that is too big. As a consequence, you only need half the conditions.
RBerteig
I'm not entirely sure that it's necessary to work for EBCDIC, that's been dead for like, over 9000 years.
DeadMG
@DeadMG, EBCDIC still isn't quite as dead as we'd all like it to be. But if you don't need to support it, just don't test against it and don't worry about it because for 99.44% of us it will never be an actual issue.
RBerteig
+2  A: 

It's usually best to use standard functions where you can, to get concise and simple code:

#define HEXLEN 4

long extract_hex(const char *src, size_t offset)
{
    char hex[HEXLEN + 1] = { 0 };
    long val;

    if (strlen(src) < offset + HEXLEN)
        return -1;

    memcpy(hex, src + offset, HEXLEN);

    if (strspn(hex, "0123456789AaBbCcDdEeFf") < HEXLEN)
        return -1;

    errno = 0;
    val = strtol(hex, NULL, 16);

    /* Out of range - can't occur unless HEXLEN > 7 */
    if (errno)
        return -1;

    return val;
}
caf
Has the same problems as earlier solutions - strtol accepts negative numbers, '0x' prefix, spaces, and other things. This solution accepts invalid input.
Nakedible
@Nakedible: I wasn't clear on what the exact validity criteria were, but see the updated version.
caf
Looking pretty good! Although running the strspn *before* memcpy seems a bit weird as the strspn will scan through the end of the string, if there's nothing but hex there - and requires the string to be null terminated. However, if it is moved to be after memcpy, the string is nicely null terminated and also strspn does not ever need to scan more than HEXLEN characters...
Nakedible
@Nakedible: Good points, updated. Though the use of `strlen()` to validate the input is long enough requires the input to be nul-terminated. The alternative is to pass in the length of the string.
caf
I finally picked your answer to be the accepted one - memcpy+strspn+strtol with a couple checks is very compact and uses the standard library very well. It is obvious what is being done. So, short, simple and safe. For the actual code, I used my own solution - as the code may possibly have to run on a non-ANSI C compiler, and expecting strtol and friends to have been implemented properly is taking a chance.
Nakedible
A: 
/* evaluates the first containing hexval in s */
int evalonehexFromStr( const char *s, unsigned long *val )
{
  while( *s )
   if( 1==sscanf(s++, "%04lx", val ) )
    return 1;
  return 0;
}

It works for exactly 4 hex-digits, eg:

unsigned long result;
if( evalonehexFromStr("foo10a4bar", &result) )
  printf("\nOK - %lu", result);

If you need other hex-digit sizes, replace "4" to your size or take "%lx" for any hexval for values up to MAX_ULONG.

This has the same problems as discussed in a previously deleted answer. The `scanf` x specifier is equivalent to `strtoul`, which allows certain unwanted characters including whitespace, +/0, and 0x.
Matthew Flaschen