views:

236

answers:

7

I won't go into details, but I'm attempting to implement an algorithm similar to the Boyer-Moore-Horspool algorithm, only using hex color values instead of characters (i.e., there is a much greater range).

Following the example on Wikipedia, I originally had this:

size_t jump_table[0xFFFFFF + 1];
memset(jump_table, default_value, sizeof(jump_table);

However, 0xFFFFFF is obviously a huge number and this quickly causes C to seg-fault (but not stack-overflow, disappointingly).

Basically, what I need is an efficient associative array mapping integers to integers. I was considering using a hash table, but having a malloc'd struct for each entry just seems overkill to me (I also do not need hashes generated, as each key is a unique integer and there can be no duplicate entries).

Does anyone have any alternatives to suggest? Am I being overly pragmatic about this?

Update

For those interested, I ended up using a hash table via the uthash library.

+7  A: 

0xffffff is rather too large to put on the stack on most systems, but you absolutely can malloc a buffer of that size (at least on current computers; not so much on a smartphone). Whether or not you should do it for this task is a separate issue.

Edit: Based on the comment, if you expect the common case to have a relatively small number of entries other than the "this color doesn't appear in the input" skip value, you should probably just go ahead and use a hash map (obviously only storing values that actually appear in the input).

(ignore earlier discussion of other data structures, which was based on an incorrect recollection of the algorithm under discussion -- you want to use a hash table)

Stephen Canon
It depends on the input. Depending on the input buffer size, it could be as little as 10 or as much as 1,764,000 (or theoretically 0xFFFFFF = 16,777,216, but that would be a very very rare case).
meeselet
+3  A: 

If the array you were going to make (of size 0xFFFFFF) was going to be sparse you could try making a smaller array to act as a simple hash table, with the size being 0xFFFFFF / N and the hash function being hexValue / N (or hexValue % (0xFFFFFF / N)). You'll have to be creative to handle collisions though.

This is the only way I can foresee getting out of mallocing structs.

Graphics Noob
Sorry if I'm being dense, but what do you mean by N? The number of colors in the input?
meeselet
N would just be a scalar that you choose. It just needs to be big enough that an array of 0xFFFFFF / N integers will fit on the stack. 0xFFFFFF integers equates to 64MB (assuming a 32 bit integer), and gcc has a default stack size of 8MB IIRC.
Graphics Noob
A: 

At possibly the cost of some speed, you could try modifying the algorithm to find only matches that are aligned to some boundary (every three or four symbols), then perform the search at byte level.

Brian
+1  A: 

You can malloc(3) 0xFFFFFF blocks of size_t on the heap (for simplicity), and address them as you do with an array.

As for the stack overflow. Basically the program receives a SIGSEGV, which can be a result of a stack overflow or accessing illegal memory or writing on a read-only segment etc... They are all abstracted under the same error message "Segmentation fault".

But why don't you use a higher level language like python that supports associate arrays?

prime_number
A: 

You could create a sparse array of sorts which has "pages" like this (this example uses 256 "pages", so the upper most byte is the page number):

int *pages[256]; 

/* call this first to make sure all of the pages start out NULL! */
void init_pages(void) {
    for(i = 0; i < 256; ++i) {
        pages[i] = NULL;
    }
}

int get_value(int index) {
    if(pages[index / 0x10000] == NULL) {
        pages[index / 0x10000] = calloc(0x10000, 1); /* calloc so it will zero it out */
    }
    return pages[index / 0x10000][index % 0x10000];
}

void set_value(int index, int value) {
    if(pages[index / 0x10000] == NULL) {
        pages[index / 0x10000] = calloc(0x10000, 1); /* calloc so it will zero it out */
    }
    pages[index / 0x10000][index % 0x10000] = value;
}

this will allocate a page the first time it is touched, read or write.

Evan Teran
A: 

To avoid the overhead of malloc you can use a hashtable where the entries in the table are your structs, assuming they are small. In your case a pair of integers should suffice, with a special value to indicate emptyness of the slot in the table.

starblue
+1 Excellent suggestion.
Stephen Canon
A: 

How many values are there in your output space, i.e. how many different values do you map to in the range 0-0xFFFFF?

Using randomized universal hashing you can come up with a collision-free hash function with a table no bigger than 2 times the number of values in your output space (for a static table)

David Claridge