views:

114

answers:

4

Hi,

I'm trying to build my own Hash Table in C from scratch as an exercise and I'm doing one little step at a time. But I'm having a little issue...

I'm declaring the Hash Table structure as pointer so I can initialize it with the size I want and increase it's size whenever the load factor is high.

The problem is that I'm creating a table with only 2 elements (it's just for testing purposes), I'm allocating memory for just those 2 elements but I'm still able to write to memory locations that I shouldn't. And I also can read memory locations that I haven't written to.

Here's my current code:

#include <stdio.h>
#include <stdlib.h>


#define HASHSIZE 2


typedef char *HashKey;
typedef int HashValue;

typedef struct sHashTable {
    HashKey key;
    HashValue value;
} HashEntry;

typedef HashEntry *HashTable;


void hashInsert(HashTable table, HashKey key, HashValue value) {
}

void hashInitialize(HashTable *table, int tabSize) {
    *table = malloc(sizeof(HashEntry) * tabSize);

    if(!*table) {
        perror("malloc");
        exit(1);
    }

    (*table)[0].key = "ABC";
    (*table)[0].value = 45;
    (*table)[1].key = "XYZ";
    (*table)[1].value = 82;
    (*table)[2].key = "JKL";
    (*table)[2].value = 13;
}


int main(void) {
    HashTable t1 = NULL;

    hashInitialize(&t1, HASHSIZE);

    printf("PAIR(%d): %s, %d\n", 0, t1[0].key, t1[0].value);
    printf("PAIR(%d): %s, %d\n", 1, t1[1].key, t1[1].value);
    printf("PAIR(%d): %s, %d\n", 3, t1[2].key, t1[2].value);
    printf("PAIR(%d): %s, %d\n", 3, t1[3].key, t1[3].value);

    return 0;
}

You can easily see that I haven't allocated space for (*table)[2].key = "JKL"; nor (*table)[2].value = 13;. I also shouldn't be able read the memory locations in the last 2 printfs in main().

Can someone please explain this to me and if I can/should do anything about it?

EDIT:
Ok, I've realized a few things about my code above, which is a mess... But I have a class right now and can't update my question. I'll update this when I have the time. Sorry about that.

EDIT 2:
I'm sorry, but I shouldn't have posted this question because I don't want my code like I posted above. I want to do things slightly different which makes this question a bit irrelevant. So, I'm just going to assume this was question that I needed an answer for and accept one of the correct answers below. I'll then post my proper questions...

+7  A: 

Just don't do it, it's undefined behavior.

It might accidentially work because you write/read some memory the program doesn't actually use. Or it can lead to heap corruption because you overwrite metadata used by the heap manager for its purposes. Or you can overwrite some other unrelated variable and then have hard times debugging the program that goes nuts because of that. Or anything else harmful - either obvious or subtle yet severe - can happen.

Just don't do it - only read/write memory you legally allocated.

sharptooth
+1: Welcome to C programming. There's no bounds checking that's part of the language. So you can do all kinds of things that you shouldn't do.
S.Lott
A: 

Usually malloc will allocate more memory than you require to for alignment purpose. Also because the process really have read/write access to the heap memory region. So reading a few bytes outside of the allocated region seldom trigger any errors.

But still you should not do it. Since the memory you're writing to can be regarded as unoccupied or is in fact occupied by others, anything can happen e.g. the 2nd and 3rd key/value pair will become garbage later or an irrelevant vital function will crash due to some invalid data you've stomped onto its malloc-ed memory.

(Also, either use char[≥4] as the type of key or malloc the key, because if the key is unfortunately stored on the stack it will become invalid later.)

KennyTM
"ABC" is string literal and stored in read-only data segment, not on stack. OP is copying pointers, not actual strings, so declaration for `HashEntry` is completely correct.
qrdl
@qrdl: So in your opinion the heap is going to support constant string literal keys only?
KennyTM
@Kenny I didn't get your question, and I didn't mention the heap in my comment. I said that all string literals are located in read-only data segment, not on stack, therefore it is completely safe to assign some global pointer an address of string literal within a function.
qrdl
@qrdl: OK I rephrased that sentence.
KennyTM
Your edit didn't solve the issue. The way OP is doing it now is absolutely fine - he isn't using `strcmp()` to copy the string, he is just assigning pointers. You don't need to allocate memory for the pointer before such assignment, moreover, if you do so, you are leaking memory.
qrdl
@qrdl: `char s[1000];fgets(s,1000,stdin);(*table)[n].key=s;(*table)[n].value=0;return;`
KennyTM
@Kenny What does it suppose to mean? OP is assigning address of **string literal** to global pointer. Consider string literal as global `const char const *`. Your example is completely different - you are assigning the address of automatic variable to global pointer, which is a bad thing, no doubt about it.
qrdl
@qrdl: And that goes back to my 2nd comment. In your opinion the hash table is going to support constant string literal keys only?
KennyTM
@Kenny You are speculating on OP's intentions. His code is correct (except the out-of-bound array access) while he is doing the way he is doing. As to your second comment, did you mean "hash table" rather than "heap"? It would explain a lot. Try to be more careful with terms - they are important. Out.
qrdl
+2  A: 

Generally speaking (different implementation for different platforms) when a malloc or similar heap based allocation call is made, the underlying library translates it into a system call. When the library does that, it generally allocates space in sets of regions - which would be equal or larger than the amount the program requested.

Such an arrangement is done so as to prevent frequent system calls to kernel for allocation, and satisfying program requests for Heap faster (This is certainly not the only reason!! - other reasons may exist as well).

Fall through of such an arrangement leads to the problem that you are observing. Once again, its not always necessary that your program would be able to write to a non-allocated zone without crashing/seg-faulting everytime - that depends on particular binary's memory arrangement. Try writing to even higher array offset - your program would eventually fault.

As for what you should/should-not do - people who have responded above have summarized fairly well. I have no better answer except that such issues should be prevented and that can only be done by being careful while allocating memory.

One way of understanding is through this crude example: When you request 1 byte in userspace, the kernel has to allocate a whole page atleast (which would be 4Kb on some Linux systems, for example - the most granular allocation at kernel level). To improve efficiency by reducing frequent calls, the kernel assigns this whole page to the calling Library - which the library can allocate as when more requests come in. Thus, writing or reading requests to such a region may not necessarily generate a fault. It would just mean garbage.

Shrey
+1  A: 

In C, you can read to any address that is mapped, you can also write to any address that is mapped to a page with read-write areas.

In practice, the OS gives a process memory in chunks (pages) of normally 8K (but this is OS-dependant). The C library then manages these pages and maintains lists of what is free and what is allocated, giving the user addresses of these blocks when asked to with malloc.

So when you get a pointer back from malloc(), you are pointing to an area within an 8k page that is read-writable. This area may contain garbage, or it contain other malloc'd memory, it may contain the memory used for stack variables, or it may even contain the memory used by the C library to manage the lists of free/allocated memory!

So you can imagine that writing to addresses beyond the range you have malloc'ed can really cause problems:

  • Corruption of other malloc'ed data
  • Corruption of stack variables, or the call stack itself, causing crashes when a function return's
  • Corruption of the C-library's malloc/free management memory, causing crashes when malloc() or free() are called

All of which are a real pain to debug, because the crash usually occurs much later than when the corruption occurred.

Only when you read or write from/to the address which does not correspond to a mapped page will you get a crash... eg reading from address 0x0 (NULL)

Malloc, Free and pointers are very fragile in C (and to a slightly lesser degree in C++), and it is very easy to shoot yourself in the foot accidentally

There are many 3rd party tools for memory checking which wrap each memory allocation/free/access with checking code. They do tend to slow your program down, depending on how much checking is applied..

CuriousPanda
As per my knowledge, page size is not fixed to 8K on all platforms.
Adil
true, edited to reflect this
CuriousPanda