tags:

views:

83

answers:

2

For various reasons I won't discuss here, I am writing a simple tokenizer in C. Below is an example I hacked out which resizes the token buffer in predetermined increments as necessary when reading characters from the input stream. It will ultimately reach the size of the largest token which can obviously accommodate smaller tokens. Is this an acceptable approach? If not, what is the best way to determine the correct amount of memory to allocate for each token?

#include <stdio.h>
#include <string.h>
#include <ctype.h>

#define BUFF_CHUNK_SIZE 4

int main() {
    int c, i, n;
    char *buffer;

    i = 0;
    n = BUFF_CHUNK_SIZE;
    buffer = (char *)calloc(n, sizeof(char));

    while ((c = getchar()) != EOF) {
        if (isalnum(c)) {
            buffer[i] = (char)c;
            i++;
            if (i == n) {
                n += BUFF_CHUNK_SIZE;
                buffer = (char *)realloc(buffer, n * sizeof(char));
            }
        }
        else {
            if (i == 0) {
                continue;
            }
            i = 0;
            printf("%s\n", buffer);
            memset(buffer, 0, sizeof(char) * n);
        }
    }
    return 0;
}
+5  A: 

This is pretty much the right approach - with two tweaks.

Firstly, instead of adding a constant BUFF_CHUNK_SIZE, it's usually better to multiply it by a fixed amount. This means that your number of reallocs on a long string of length N becomes proportional to log N rather than N - meaning that the time spent in realloc() is proportional to N log N rather than N2. It doesn't really matter what the constant is - 1.5 might be a good choice (n += n / 2;).

Secondly, in a longer program you should really check for realloc() failing.

caf
Yeap, otherwise that code just wastes all the time reallocating and copying instead of doing real work. Also maybe the initial block size should be bigger - depending on how frequent are tokens of different lengths.
sharptooth
I know I shold be checking return of failed *allocs, but this was a quick and dirty mock up. But taking the advice from @caf and previously @RBerteig I should resize like this, correct?n *= 1.5;buffer = (char *)realloc(buffer, n);memset(buffer + i, 0, n - i);
Timothy
A: 

realloc was right, but you should use a char as token-separator?!

#define BUFF_CHUNK_SIZE 4
#define TOKSEP ";"

char *getOneToken(char *s,size_t n)
{
  int c;
  char *p=s;
  while( p-s < n-1 && !feof(stdin) && ((c=getchar())=='\n'?c=getchar():1) )
    if( isalnum(c) )
      *p++=c;
  *p=0;
  return s;
}

main() 
{
  char *buffer=calloc(1,1),
        tok[BUFF_CHUNK_SIZE+1];

  while( *getOneToken(tok,sizeof tok) )
  {
    buffer=realloc(buffer,strlen(buffer)+strlen(tok)+2);
    if( *buffer ) strcat(buffer,TOKSEP);
    strcat(buffer,tok);
  }

  puts(buffer);
  free(buffer);
  return 0;
}