views:

102

answers:

3

How does this work?

I know to use it you pass in:

  • start: string (e.g. "Item 1, Item 2, Item 3")
  • delim: delimiter string (e.g. ",")
  • tok: reference to a string which will hold the token
  • nextpos (optional): reference to a the position in the original string where the next token starts
  • sdelim (optional): pointer to a character which will hold the starting delimeter of the token
  • edelim (optional): pointer to a character which will hold the ending delimeter of the token

Code:

#include <stdlib.h>
#include <string.h>

int token(char* start, char* delim, char** tok, char** nextpos, char* sdelim, char* edelim) {
    // Find beginning:
    int len = 0;
    char *scanner;
    int dictionary[8];
    int ptr;

    for(ptr = 0; ptr < 8; ptr++) {
        dictionary[ptr] = 0;
    }

    for(; *delim; delim++) {
        dictionary[*delim / 32] |= 1 << *delim % 32;
    }

    if(sdelim) {
        *sdelim = 0;
    }

    for(; *start; start++) {
        if(!(dictionary[*start / 32] & 1 << *start % 32)) {
            break;
        }
        if(sdelim) {
            *sdelim = *start;
        }
    }

    if(*start == 0) {
        if(nextpos != NULL) {
            *nextpos = start;
        }
        *tok = NULL;
        return 0;
    }

    for(scanner = start; *scanner; scanner++) {
        if(dictionary[*scanner / 32] & 1 << *scanner % 32) {
            break;
        }
        len++;
    }

    if(edelim) {
        *edelim = *scanner;
    }

    if(nextpos != NULL) {
        *nextpos = scanner;
    }

    *tok = (char*)malloc(sizeof(char) * (len + 1));

    if(*tok == NULL) {
        return 0;
    }

    memcpy(*tok, start, len);
    *(*tok + len) = 0;


    return len + 1;
}

I get most of it except for:

dictionary[*delim / 32] |= 1 << *delim % 32;

and

dictionary[*start / 32] & 1 << *start % 32

Is it magic?

+4  A: 

Since each character of the delimiter is 8 bits (sizeof(char) == 1 byte), it is limited to 256 possible values.

The dictionary is broken into 8 pieces (int dictionary[8]), 32 possibilities per piece (sizeof(int) is >= 4 bytes) and 32 * 8 = 256.

This forms a 256 bit matrix of values. It then turns on the flag for each character in the delimiter (dictionary[*delim / 32] |= 1 << *delim % 32;). The index of the array is *delim / 32, or the ASCII value of the character divided by 32. Since the ASCII value ranges from 0 to 255, this divide yields a value of 0 to 7 with a remainder. The remainder is which bit to turn on, decided by the modulus operation.

All this does is flag certain bits of the 256 bit matrix as true, if the corresponding ASCII character exists in the delimiter.

Then determining if a character is in the delimiter is simply a lookup in the 256 bit matrix (dictionary[*start / 32] & 1 << *start % 32)

Brandon Horsley
+1  A: 

They store which characters have occurred by making an 8 x 32 = 256 table of bits stored in dictionary.

dictionary[*delim / 32] |= 1 << *delim % 32;

sets the bit corresponding to *delim

dictionary[*start / 32] & 1 << *start % 32

checks the bit

deinst
A: 

OK, so if we send in the string "," for the delimiter then dictionary[*delim / 32] |= 1 << *delim % 32 will be dictionary[1] = 4096. The expression dictionary[*start / 32] & 1 << *start % 32 simply checks for a matching character.

What puzzles me is why they are not using direct char comparison.

ChaosPandion