views:

380

answers:

5

What would be an efficient way of converting a delimited string into an array of strings in C (not C++)? For example, I might have:

char *input = "valgrind --leak-check=yes --track-origins=yes ./a.out";

Tokens in the source string are always separated by exactly one space. I would like a malloc'ed array of malloc'ed strings char *myarray[] such that:

myarray[0]=="valgrind"
myarray[1]=="--leak-check=yes"
...

Edit: I have to assume that there is an arbitrary number of tokens in the input string, so I can't just limit it to 10 or something.

I've attempted a messy solution with strtok and a linked list I've implemented, but valgrind complained so much that I gave up.

(If you're wondering, this is for a basic Unix shell I'm trying to write.)

+1  A: 

Were you remembering to malloc an extra byte for the terminating null that marks the end of string?

Arthur Kalliokoski
Yes: `char *singleToken = (char *)malloc(strlen(tokPtr)*sizeof(char)+1);` where `tokPtr` was the return value of `strtok`.
yankle
+1  A: 

From the strsep(3) manpage on OSX:

   char **ap, *argv[10], *inputstring;

   for (ap = argv; (*ap = strsep(&inputstring, " \t")) != NULL;)
           if (**ap != '\0')
                   if (++ap >= &argv[10])
                           break;

Edited for arbitrary # of tokens:

char **ap, **argv, *inputstring;

int arglen = 10;
argv = calloc(arglen, sizeof(char*));
for (ap = argv; (*ap = strsep(&inputstring, " \t")) != NULL;)
    if (**ap != '\0')
        if (++ap >= &argv[arglen])
        {
            arglen += 10;
            argv = realloc(argv, arglen * sizeof(char*));  /* size is in bytes, not elements */
            ap = &argv[arglen-10];
        }

Or something close to that. The above may not work, but if not it's not far off. Building a linked list would be more efficient than continually calling realloc, but that's really beside the point - the point is how best to make use of strsep.
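
For comparison, here is a minimal sketch of the same grow-as-you-go idea in standard C. It swaps strsep (a BSD/glibc extension) for strchr, doubles the array instead of adding 10 slots at a time, and copies each token so the input is left untouched. The names split_args and free_args are hypothetical helpers, not part of any library, and the single-space delimiter is taken from the question.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: split `input` on spaces into a NULL-terminated,
   malloc'ed array of malloc'ed strings. Empty tokens are skipped.
   Returns NULL on allocation failure. */
char **split_args(const char *input, size_t *count)
{
    assert(input != NULL);

    size_t cap = 8, n = 0;
    char **argv = malloc(cap * sizeof *argv);
    if (argv == NULL)
        return NULL;

    const char *p = input;
    while (*p != '\0') {
        const char *end = strchr(p, ' ');
        size_t len = end ? (size_t)(end - p) : strlen(p);

        if (len > 0) {
            if (n + 1 >= cap) {            /* keep one slot for the final NULL */
                char **tmp = realloc(argv, 2 * cap * sizeof *argv);
                if (tmp == NULL)
                    goto fail;
                argv = tmp;
                cap *= 2;
            }
            char *tok = malloc(len + 1);   /* +1 for the terminating '\0' */
            if (tok == NULL)
                goto fail;
            memcpy(tok, p, len);
            tok[len] = '\0';
            argv[n++] = tok;
        }
        p += len;
        if (*p == ' ')
            p++;                           /* step over the delimiter */
    }
    argv[n] = NULL;                        /* execv expects a NULL terminator */
    if (count != NULL)
        *count = n;
    return argv;

fail:
    while (n > 0)
        free(argv[--n]);
    free(argv);
    return NULL;
}

/* Hypothetical helper: free everything split_args allocated. */
void free_args(char **argv)
{
    if (argv == NULL)
        return;
    for (char **p = argv; *p != NULL; p++)
        free(*p);
    free(argv);
}
```

Doubling keeps the number of realloc calls logarithmic in the token count, and because every token is a separate allocation, free_args can release them one by one.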

Ben Collins
Thanks. I forgot to mention that I have to assume that there's an arbitrary number of tokens in the `inputString`- I can't assume 10, for instance.
yankle
+2  A: 

What about something like:

char* string = "valgrind --leak-check=yes --track-origins=yes ./a.out";
char** args = (char**)malloc(MAX_ARGS*sizeof(char*));
memset(args, 0, sizeof(char*)*MAX_ARGS);

char* curToken = strtok(string, " \t");

for (int i = 0; curToken != NULL; ++i)
{
  args[i] = strdup(curToken);
  curToken = strtok(NULL, " \t");
}
Jack
Actually I think that using a 256 buffer of pointers to strings wouldn't be such a waste, unless you really need to preserve memory..
Jack
strtok() modifies the input string, so using it on a string literal will crash on some platforms.
bk1e
I could assume that `MAX_ARGS` is something safe like 10,000, but the code still ought to work for 10,001 args...
yankle
that's true, actually strtok() usually replaces the first delimiter at the end of a token with \0 to easily return the token. It was just to explain the snippet :)
Jack
The application of this is eventually going to be using the array as a parameter to `execv`, so it's the array of arguments to whatever command I'm calling.
yankle
Yes, but using a list will waste double the space.. so do you really need to handle 10000+ params? Is it a constraint of the project or what?
Jack
Ok, so the only difference is using a linked list.. what's your problem with it? you just strdup() in a list element
Jack
Yes, it's a constraint of the project: "Do not assume that there is a limit on the number of args to a given command." The input in general is limited to 4096 bytes, so I suppose 4096 would work for `MAX_ARGS` now that I think of it, no?
yankle
ofc, unless you can use half-bytes :D btw if you need help with the list I can edit the snippet..
Jack
Beautiful! It works! Thanks, Jack.
yankle
It'd be easy to make MAX_ARGS not be a constant and determine it at runtime. Either just iterate over the input and count spaces, or call `strlen()` and assume a worst case scenario where every character is a space.
jamesdlin
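
A minimal sketch of the count-the-spaces approach from the comment above, assuming the single-space delimiter from the question. The names max_tokens and alloc_args are hypothetical, introduced only for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: upper bound on the number of tokens in `input`.
   N spaces can separate at most N + 1 tokens. */
size_t max_tokens(const char *input)
{
    assert(input != NULL);

    size_t spaces = 0;
    for (const char *p = input; *p != '\0'; p++)
        if (*p == ' ')
            spaces++;
    return spaces + 1;
}

/* Hypothetical helper: allocate a zeroed pointer array sized for `input`,
   with one extra slot for the NULL terminator that execv expects. */
char **alloc_args(const char *input)
{
    return calloc(max_tokens(input) + 1, sizeof(char *));
}
```

The array returned by alloc_args could then stand in for the fixed-size MAX_ARGS buffer in the strtok loop, since the loop can never produce more tokens than max_tokens reports.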
+2  A: 

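If you have all of the input in input to begin with, then you can never have more tokens than strlen(input). If you don't allow "" as a token, then you can never have more than (strlen(input)+1)/2 tokens, since every token but the last needs at least one character plus a delimiter. So unless input is huge you can safely write:

/* at most (strlen(input)+1)/2 tokens, plus one slot for a terminating NULL */
char ** myarray = malloc( ((strlen(input)+1)/2 + 1) * sizeof(char*) );

int NumActualTokens = 0;
char * pToken;
/* get_token_copy() and skip_token() are placeholder helpers, not library functions */
while ((pToken = get_token_copy(input)) != NULL)
{
   myarray[NumActualTokens++] = pToken;
   input = skip_token(input);
}
myarray[NumActualTokens] = NULL;

/* shrink to the size actually used (keeping the NULL slot) */
myarray = realloc(myarray, (NumActualTokens + 1) * sizeof(char*));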

As a further optimization, you can keep input around and just replace spaces with \0 and put pointers into the input buffer into myarray[]. No need for a separate malloc for each token unless for some reason you need to free them individually.
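
A minimal sketch of that in-place variant, assuming the caller passes a writable buffer (not a string literal) and that tokens are separated by spaces. The name split_in_place is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: overwrite each space in `buf` with '\0' and return a
   NULL-terminated, malloc'ed array of pointers into `buf`. Only the array
   itself needs to be freed; the tokens live inside `buf`. */
char **split_in_place(char *buf, size_t *count)
{
    assert(buf != NULL);

    /* Non-empty tokens separated by spaces: at most (len + 1) / 2 of them. */
    size_t max = (strlen(buf) + 1) / 2;
    char **argv = malloc((max + 1) * sizeof *argv);
    if (argv == NULL)
        return NULL;

    size_t n = 0;
    char *p = buf;
    while (*p != '\0') {
        if (*p == ' ') {
            *p++ = '\0';               /* terminate the previous token */
            continue;
        }
        argv[n++] = p;                 /* start of a new token */
        while (*p != '\0' && *p != ' ')
            p++;
    }
    argv[n] = NULL;
    if (count != NULL)
        *count = n;
    return argv;
}
```

This costs a single allocation regardless of the number of tokens, at the price of modifying the input and tying the tokens' lifetime to the buffer.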

John Knoeller
Using your `strlen(input)/2` idea- Thanks!
yankle
A: 
tommieb75