views:

321

answers:

6

For brushing up my C, I'm writing some useful library code. When it came to reading text files, it's always useful to have a convenient tokenization function that does most of the heavy lifting (looping on strtok is inconvenient and dangerous).

When I wrote this function, I'm amazed at its intricacy. To tell the truth, I'm almost convinced that it contains bugs (especially with memory leaks in case of an allocation error). Here's the code:

/* Given an input string and separators, returns an array of 
** tokens. Each token is a dynamically allocated, NUL-terminated
** string. The last element of the array is a sentinel NULL 
** pointer. The returned array (and all the strings in it) must
** be deallocated by the caller.
**
** In case of errors, NULL is returned.
** 
** This function is much slower than a naive in-line tokenization,
** since it copies the input string and does many allocations.
** However, it's much more convenient to use.
*/ 
char** tokenize(const char* input, const char* sep)
{
    /* strtok ruins its input string, so we'll work on a copy 
    */
    char* dup;

    /* This is the array filled with tokens and returned
    */
    char** toks = 0;

    /* Current token
    */
    char* cur_tok;

    /* Size of the 'toks' array. Starts low and is doubled when
    ** exhausted.
    */
    size_t size = 2;

    /* 'ntok' points to the next free element of the 'toks' array
    */
    size_t ntok = 0;
    size_t i;

    if (!(dup = strdup(input)))
        return NULL;

    if (!(toks = malloc(size * sizeof(*toks))))
        goto cleanup_exit;

    cur_tok = strtok(dup, sep);

    /* While we have more tokens to process...
    */
    while (cur_tok)
    {
        /* We should still have 2 empty elements in the array, 
        ** one for this token and one for the sentinel.
        */
        if (ntok > size - 2)
        {
            char** newtoks;
            size *= 2;

            newtoks = realloc(toks, size * sizeof(*toks));

            if (!newtoks)
                goto cleanup_exit;

            toks = newtoks;
        }

        /* Now the array is definitely large enough, so we just
        ** copy the new token into it.
        */
        toks[ntok] = strdup(cur_tok);

        if (!toks[ntok])
            goto cleanup_exit;

        ntok++;
        cur_tok = strtok(0, sep);
    }    

    free(dup);
    toks[ntok] = 0;
    return toks;

cleanup_exit:
    free(dup);
    for (i = 0; i < ntok; ++i)
        free(toks[i]);
    free(toks);
    return NULL;
}

And here's simple usage:

int main()
{
    char line[] = "The quick brown fox jumps over the lazy dog";
    char** toks = tokenize(line, " \t");
    int i;

    for (i = 0; toks[i]; ++i)
        printf("%s\n", toks[i]);

    /* Deallocate
    */
    for (i = 0; toks[i]; ++i)
        free(toks[i]);
    free(toks);

    return 0;
}

Oh, and strdup:

/* strdup isn't ANSI C, so here's one...
*/
char* strdup(const char* str)
{
    size_t len = strlen(str) + 1;
    char* dup = malloc(len);

    if (dup)
        memcpy(dup, str, len);

    return dup;
}

A few things to note about the code of the tokenize function:

  1. strtok has the impolite habit of writing over its input string. To save the user's data, I only call it on a duplicate of the input. The duplicate is obtained using strdup.
  2. strdup isn't ANSI-C, however, so I had to write one
  3. The toks array is grown dynamically with realloc, since we have no idea in advance how many tokens there will be. The initial size is 2 just for testing, in real-life code I would probably set it to a much higher value. It's also returned to the user, and the user has to deallocate it after use.

  4. In all cases, extreme care is taken not to leak resources. For example, if realloc returns NULL, it won't run over the old pointer. The old pointer will be released and the function returns. No resources leak when tokenize returns (except in the nominal case where the array returned to the user must be deallocated after use).

  5. A goto is used for more convenient cleanup code, according to the philosophy that goto can be good in some cases (this is a good example, IMHO).

The following function can help with simple deallocation in a single call:

/* Given a pointer to the tokens array returned by 'tokenize',
** frees the array and sets it to point to NULL.
*/
void tokenize_free(char*** toks)
{
    if (toks && *toks)
    {
        int i;

        for (i = 0; (*toks)[i]; ++i)
            free((*toks)[i]);
        free(*toks);
        *toks = 0;
    }
}

I'd really like to discuss this code with other users of SO. What could've been done better? Would you recommend a difference interface to such a tokenizer? How is the burden of deallocation taken from the user? Are there memory leaks in the code anyway?

Thanks in advance

A: 

If you want to find memory leaks, one possibility is to run it with valgrind.

Kinopiko
+2  A: 
R Samuel Klatchko
Michael Burr
@Michael Burr - frack, I forgot that strtok() doesn't return pointers to 0 length tokens.
R Samuel Klatchko
A: 

Hi, there is a great tools to detect Memory leak which is called Valgrind.

http://valgrind.org/

Phong
+1  A: 

I don't see anything wrong with the strtok approach to modifying a string in-line - it's the callers choice if they want to operate on a duplicated string or not as the semantics are well understood. Below is the same method slightly simplified to use strtok as intended, yet still return a handy array of char * pointers (which now simply point to the tokenized segments of the original string). It gives the same output for your original main() call.

The main advantage of this approach is that you only have to free the returned character array, instead of looping through to clear all of the elements - an aspect which I thought took away a lot of the simplicity factor and something a caller would be very unlikely to expect to do by any normal C convention.

I also took out the goto statements, because with the code refactored they just didn't make much sense to me. I think the danger of having a single cleanup point is that it can start to grow too unwieldy and do extra steps that are not needed to clean up issues at specific locations.

Personally I think the main philosophical point I would make is that you should respect what other people using the language are going to expect, especially when creating library kinds of calls. Even if the strtok replacement behavior seems odd to you, the vast majority of C programmers are used to placing \0 in the middle of C strings to split them up or create shorter strings and so this will seem quite natural. Also as noted no-one is going to expect to do anything beyond a single free() with the return value from a function. You need to write your code in whatever way needed to make sure then that the code works that way, as people will simply not read any documentation you might offer and will instead act according to the memory convention of your return value (which is char ** so a caller would expect to have to free that).

char** tokenize(char* input, const char* sep)
{
  /* Size of the 'toks' array. Starts low and is doubled when
  ** exhausted.
  */
  size_t size = 4;

  /* 'ntok' points to the next free element of the 'toks' array
   */
  size_t ntok = 0;

  /* This is the array filled with tokens and returned
   */
  char** toks = malloc(size * sizeof(*toks));

  if ( toks == NULL )
    return;

  toks[ntok] = strtok( input, sep );


  /* While we have more tokens to process...
   */

  do
    {
      /* We should still have 2 empty elements in the array, 
      ** one for this token and one for the sentinel.
      */
      if (ntok > size - 2)
        {
      char** newtoks;
      size *= 2;

      newtoks = realloc(toks, size * sizeof(*toks));

      if (newtoks == NULL)
        {
          free(toks);
          return NULL;
        }

      toks = newtoks;
        }
      ntok++;
      toks[ntok] = strtok(0, sep);
    }  while (toks[ntok]);

  return toks;
}
Kendall Helmstetter Gelner
The 'free' problem is solved by providing a function to the user that frees the tokens array. After all it's a common practice when writing ADTs in C to provide a cleanup function that's not a simple 'free'. But you're making very good points with regards to convention. Indeed this makes the implementation much simpler. However, there are tradeoffs. The user could be interested in keeping his string as well. If he isn't, your solution definitely wins in the tidyness department.
Eli Bendersky
As I said no user of your call is *EVER* going to read your documentation, so they will not know to call that - it's simply not a normal C convention to have calls like that. Yes it works technically but in practice it's the choice that launches a million memory leaks. Again if the user wishes to keep the string intact they can simply choose to call strdup() themselves before the call, C is more about freedom of options than convenience so it's dangerous for coders to assume too much what the caller wants. If we were talking about a higher level language my advice would differ drastically.
Kendall Helmstetter Gelner
You're making a very valid point. However, I'm concerned about your first statement. C libraries all around have semi-complex ADTs that have custom cleanup functions. Don't you think C programmers are used to release ADTs using such functions? Or do you mean that once a C programmer sees char** he thinkg he needs no custom cleanup. To make it into a real ADT, I guess one would have to wrap it into a struct and provide access functions. THEN, the user would automatically know a custom cleanup call is required.
Eli Bendersky
BTW, I've just looked at the code of glib, and its g_strsplit function takes the approach I've taken, allocating a 2D array of strings and returning it for the user to work with and free
Eli Bendersky
+1  A: 

Just a few things:

  1. Using gotos is not intrinsically evil or bad, much like the preprocessor, they are often abused. In cases like yours where you have to exit a function differently depending on how things went, they are appropriate.
  2. Provide a functional means of freeing the returned array. I.e. tok_free(pointer).
  3. Use the re-entrant version of strtok() initially, i.e. strtok_r(). It would not be cumbersome for someone to pass an additional argument (even NULL if not needed) for that.
Tim Post
strtok_r isn't ANSI C, right? or is it part of C99?
Eli Bendersky
strtok_r() is required by POSIX but not ISO C99 (or C89).
Jonathan Leffler
A public domain `strtok_r()` for anyone who wants but doesn't have one: http://snipplr.com/view/16918/strtokr/
Michael Burr
+1  A: 

You don't need to strdup() each token; you duplicate the input string, and could let strtok() chop that up. It simplifies releasing the resources afterwards, too - you only have to release the array of pointers and the single string.

I agree with those who say that you need a function to release the data - unless you change the interface radically and have the user provide the array of pointers as an input parameter, and then you would probably also decide that the user is responsible for duplicating the string if it must be preserved. That leads to an interface:

int tokenize(char *source, const char *sep, char **tokens, size_t max_tokens);

The return value would be the number of tokens found.

You have to decide what to do when there are more tokens than slots in the array. Options include:

  • returning an error indication (negative number, likely -1), or
  • the full number of tokens found but the pointers that can't be assigned aren't, or
  • just the number of tokens that fitted, or
  • one more than the number of tokens, indicating that there were more, but no information on exactly how many more.

I chose to return '-1', and it lead to this code:

/*
@(#)File:           $RCSfile: tokenise.c,v $
@(#)Version:        $Revision: 1.9 $
@(#)Last changed:   $Date: 2008/02/11 08:44:50 $
@(#)Purpose:        Tokenise a string
@(#)Author:         J Leffler
@(#)Copyright:      (C) JLSS 1987,1989,1991,1997-98,2005,2008
@(#)Product:        :PRODUCT:
*/

/*TABSTOP=4*/

/*
**  1.  A token is 0 or more characters followed by a terminator or separator.
**      The terminator is ASCII NUL '\0'.  The separators are user-defined.
**  2.  A leading separator is preceded by a zero-length token.
**      A trailing separator is followed by a zero-length token.
**  3.  The number of tokens found is returned.
**      The list of token pointers is terminated by a NULL pointer.
**  4.  The routine returns 0 if the arguments are invalid.
**      It returns -1 if too many tokens were found.
*/

#include "jlss.h"
#include <string.h>

#define NO  0
#define YES 1

#define IS_SEPARATOR(c,s,n) (((c) == *(s)) || ((n) > 1 && strchr((s),(c))))
#define DIM(x)              (sizeof(x)/sizeof(*(x)))

#ifndef lint
/* Prevent over-aggressive optimizers from eliminating ID string */
const char jlss_id_tokenise_c[] = "@(#)$Id: tokenise.c,v 1.9 2008/02/11 08:44:50 jleffler Exp $";
#endif /* lint */

int             tokenise(
    char   *str,            /* InOut: String to be tokenised */
    char   *sep,            /* In:    Token separators */
    char  **token,          /* Out:   Pointers to tokens */
    int     maxtok,         /* In:    Maximum number of tokens */
    int     nulls)          /* In:    Are multiple separators OK? */
{
    int             c;
    int             n_tokens;
    int             tokenfound;
    int             n_sep = strlen(sep);

    if (n_sep <= 0 || maxtok <= 2)
        return(0);

    n_tokens = 1;
    *token++ = str;
    while ((c = *str++) != '\0')
    {
        tokenfound = NO;
        while (c != '\0' && IS_SEPARATOR(c, sep, n_sep))
        {
            tokenfound = YES;
            *(str - 1) = '\0';
            if (nulls)
                break;
            c = *str++;
        }
        if (tokenfound)
        {
            if (++n_tokens >= maxtok - 1)
                return(-1);
            if (nulls)
                *token++ = str;
            else
                *token++ = str - 1;
        }
        if (c == '\0')
            break;
    }
    *token++ = 0;
    return(n_tokens);
}

#ifdef TEST

struct
{
    char           *sep;
    int             nulls;
}               data[] =
{
    {   "/.",   0   },
    {   "/.",   1   },
    {   "/",    0   },
    {   "/",    1   },
    {   ".",    0   },
    {   ".",    1   },
    {   "",     0   }
};

static char string[] = "/fred//bill.c/joe.b/";

int main(void)
{
    int             i;
    int             j;
    int             n;
    char            input[100];
    char           *token[20];

    for (i = 0; i < DIM(data); i++)
    {
        strcpy(input, string);
        printf("\n\nTokenising <<%s>> using <<%s>>, null %d\n",
               input, data[i].sep, data[i].nulls);
        n = tokenise(input, data[i].sep, token, DIM(token),
                     data[i].nulls);
        printf("Return value = %d\n", n);
        for (j = 0; j < n; j++)
            printf("Token %d: <<%s>>\n", j, token[j]);
        if (n > 0)
            printf("Token %d: 0x%08lX\n", n, (unsigned long)token[n]);
    }
    return(0);
}

#endif /* TEST */
Jonathan Leffler