tags:

views:

32

answers:

2

I'm trying to debug a hashset implementation (for a school assignment). Collisions are being managed via linear probing, and I need to count the number of clusters of a given size for the debug routine. I worked this up:

// really just hoping that a 50+ cluster doesn't occur
int clusters[50];
int count = 0;
for (int i=0; i < hashset->dim; i++) {
   if (hashset->array[i] != NULL) {
      count++;
   } else {
      if (count == 0) continue;
      if (clusters[count] == NULL) clusters[count] = 0;
      clusters[count]++;
      count = 0;
   }
}
for (int i=1; i < 50; i++) {
   if (clusters[i] != NULL && clusters[i] != 0)
      printf("%d clusters of size %d\n", clusters[i], i);
}

Seems to make sense, but when I run it I get...

25143 entries in hashset
50286 dimension of the hash array
4585 clusters of size 1
2134 clusters of size 2
1102 clusters of size 3
696 clusters of size 4
388 clusters of size 5
264 clusters of size 6
173 clusters of size 7
104 clusters of size 8
89 clusters of size 9
51 clusters of size 10
46 clusters of size 11
35 clusters of size 12
26 clusters of size 13
22 clusters of size 14
17 clusters of size 15
134553327 clusters of size 16
134634407 clusters of size 17
112 clusters of size 18
6 clusters of size 19
134553324 clusters of size 20
134634399 clusters of size 21
107 clusters of size 22
3 clusters of size 23
2 clusters of size 24
134634401 clusters of size 25
107 clusters of size 26
134107784 clusters of size 27
134556210 clusters of size 28
[... more nonsense]

So initially it seems to give sane output.. lots of clusters, but whatever. My thinking is that the way-too-huge numbers should actually be 0 - they're clusters which don't actually exist but for some reason are still getting printed. I just have no idea why...

A: 

Here is a short C program:

#include <stdio.h>
int main() {
  char buf[20];
  for (int i = 0; i < 20; i++) {
    printf("%d ", buf[i]);
  }
  printf("\n"); return 0;
}

Think carefully about what it ought to print when you run it.

  • Will it print the same thing every time you run it?
  • On every machine you run it on?

Spoiler below:

(spoiler padding)













.

Here's what I saw it print for one run: 0 0 0 0 0 0 0 0 -128 5 64 0 0 0 0 0 -16 -60 -55 31

The same bug afflicts both of our programs, but it might be easier to see in mine.

Ben Karel
A: 

You write

they're clusters which don't actually exist but for some reason are still getting printed

You're implicitly assuming that the element clusters[i] doesn't exist if you didn't assign to it. But this is not true. Every value in the the clusters array does exist all the time, whether or not you assign a value to it. If you don't assign a known value, the value is unpredictable, perhaps 134634399. So, if you want all the elements in clusters to be predictable, what do you have to do?

This misunderstanding of C's memory model leads to the following code (excerpted from your question):

int clusters[50];
/* ... */
for (int i=1; i < 50; i++) {
   if (clusters[i] != NULL && clusters[i] != 0)
      printf("%d clusters of size %d\n", clusters[i], i);
}

What is the purpose of the test clusters[i] != NULL? You're trying to decide if clusters[i] has ever been set by the loop I've omitted, but testing against NULL can't do that. The elements of clusters are not some sort of pointer, they are raw 4-byte integer values. They always have some value, but unless you set them to something, that value is unpredictable.

Dale Hagglund