I'm writing a Huffman compressor and decompressor (in C++) that needs to work on arbitrary binary files, and I need a bit of data-structure advice. Right now, my compression process is as follows:

  • Read the bytes of the file in binary mode into a char* buffer.
  • Use an std::map to count the frequency of each byte value in the file. (This is where I think I'm asking for trouble.)
  • Build the binary tree from the frequency histogram. Each internal node holds the sum of its children's frequencies, and each leaf node holds a char* representing the actual byte.

This is where I'm at so far.

My question is what exactly I'm counting if I just use a map from char* to int. If I understand correctly, that isn't what I need: the map would compare and count the pointer values themselves (4 bytes on my machine), not the bytes they point to.

So what I plan to do is use a map keyed by char for the histogram, and a plain char for the byte stored at each leaf node. Is my logic sound here? My reasoning tells me yes, but since this is my first time dealing with binary data, I'd like to be careful about pitfalls that only show up in strange ways.
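For concreteness, here is a rough, untested sketch of the node layout and tree construction I have in mind (the names are just illustrative, and the merge step is the textbook min-heap construction):

    #include <map>
    #include <queue>
    #include <vector>

    struct HuffNode {
        unsigned long freq;  // leaf: count of this byte; internal: sum of children
        char byte;           // the raw byte; only meaningful for leaves
        HuffNode* left;      // both children are null for a leaf
        HuffNode* right;
    };

    // Orders the priority_queue so the lowest-frequency node is on top.
    struct CompareFreq {
        bool operator()(const HuffNode* a, const HuffNode* b) const {
            return a->freq > b->freq;
        }
    };

    // Textbook greedy construction: repeatedly merge the two rarest nodes.
    HuffNode* buildTree(const std::map<char, unsigned long>& histogram) {
        std::priority_queue<HuffNode*, std::vector<HuffNode*>, CompareFreq> pq;
        for (const auto& kv : histogram)
            pq.push(new HuffNode{kv.second, kv.first, nullptr, nullptr});
        while (pq.size() > 1) {
            HuffNode* a = pq.top(); pq.pop();
            HuffNode* b = pq.top(); pq.pop();
            pq.push(new HuffNode{a->freq + b->freq, 0, a, b});
        }
        return pq.empty() ? nullptr : pq.top();
    }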

Thanks.

+3  A: 

You don't need a map; there are only 256 possible byte values. Just declare int freq[256] = {0} and count with freq[(unsigned char)data[idx]]++ for each byte of the input. (The cast matters: plain char may be signed, and a negative byte would then index outside the array.)
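As a minimal, self-contained sketch of that counting step (the filename handling is just illustrative; note that fgetc already returns each byte as an int in 0..255, which sidesteps the signed-char issue entirely):

    #include <cstdio>

    int main(int argc, char** argv) {
        if (argc < 2) return 1;
        std::FILE* f = std::fopen(argv[1], "rb");  // "rb": binary mode, no newline translation
        if (!f) return 1;

        long long freq[256] = {0};
        for (int c; (c = std::fgetc(f)) != EOF; )
            ++freq[c];  // fgetc yields 0..255, so it is safe as an array index
        std::fclose(f);

        for (int b = 0; b < 256; ++b)
            if (freq[b])
                std::printf("0x%02x: %lld\n", b, freq[b]);
        return 0;
    }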

If you REALLY want a map, use map<unsigned char, int>; your suspicion about using a map keyed by char* is correct.
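For completeness, the map version of the counting loop looks roughly like this (the buffer and length names are made up):

    #include <cstddef>
    #include <map>

    // The cast matters here too: plain char may be signed on your platform.
    void countBytes(const char* data, std::size_t len,
                    std::map<unsigned char, int>& freq) {
        for (std::size_t i = 0; i < len; ++i)
            ++freq[static_cast<unsigned char>(data[i])];
    }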

Walter Mundt
That makes a lot of sense actually. I'm all for avoiding debugging STL container errors.
RyanG
Beyond that, maps actually have a fair bit of overhead: every entry is a separate heap allocation, and every lookup involves around lg(N) pointer dereferences. That's well and good if you're storing a lot of items with widely spaced or complex keys, but it's definitely better to stick with an array when you can get away with it.
Walter Mundt