I'm writing a Huffman compressor and decompressor (in C++) that needs to work on arbitrary binary files, and I need a bit of data-structure advice. Right now, my compression process is as follows:

  • Read the bytes of the file in binary mode into a char* buffer.
  • Use an std::map to count the frequency of each byte value in the file. (This is where I think I'm asking for trouble.)
  • Build the binary tree from the frequency histogram. Each internal node holds the sum of its children's frequencies, and each leaf node holds a char* representing the actual byte.

This is where I'm at so far.

My question is what exactly I'm counting if I just use a map from char* to int. If I understand correctly, that isn't what I need: the map would compare and count the pointer values themselves (4 bytes on my machine), not the bytes they point to.

So what I plan to do is use a map keyed by char for the histogram, and a plain char for the byte stored at each leaf node. Is my logic sound here? My reasoning tells me yes, but since this is my first time dealing with binary data, I'd like to be careful about pitfalls that only show up in strange ways.
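For concreteness, here is a rough, untested sketch of the node layout and tree construction I have in mind (the names are just illustrative, and the merge step is the textbook min-heap construction):

    #include <map>
    #include <queue>
    #include <vector>

    struct HuffNode {
        unsigned long freq;  // leaf: count of this byte; internal: sum of children
        char byte;           // the raw byte; only meaningful for leaves
        HuffNode* left;      // both children are null for a leaf
        HuffNode* right;
    };

    // Orders the priority_queue so the lowest-frequency node is on top.
    struct CompareFreq {
        bool operator()(const HuffNode* a, const HuffNode* b) const {
            return a->freq > b->freq;
        }
    };

    // Textbook greedy construction: repeatedly merge the two rarest nodes.
    HuffNode* buildTree(const std::map<char, unsigned long>& histogram) {
        std::priority_queue<HuffNode*, std::vector<HuffNode*>, CompareFreq> pq;
        for (const auto& kv : histogram)
            pq.push(new HuffNode{kv.second, kv.first, nullptr, nullptr});
        while (pq.size() > 1) {
            HuffNode* a = pq.top(); pq.pop();
            HuffNode* b = pq.top(); pq.pop();
            pq.push(new HuffNode{a->freq + b->freq, 0, a, b});
        }
        return pq.empty() ? nullptr : pq.top();
    }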

Thanks.

+3  A: 

You don't need a map; there are only 256 possible byte values. Just declare int freq[256] = {0} and count with freq[(unsigned char)data[idx]]++ for each byte of the input. (The cast matters: plain char may be signed, and a negative byte would then index outside the array.)
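As a minimal, self-contained sketch of that counting step (the filename handling is just illustrative; note that fgetc already returns each byte as an int in 0..255, which sidesteps the signed-char issue entirely):

    #include <cstdio>

    int main(int argc, char** argv) {
        if (argc < 2) return 1;
        std::FILE* f = std::fopen(argv[1], "rb");  // "rb": binary mode, no newline translation
        if (!f) return 1;

        long long freq[256] = {0};
        for (int c; (c = std::fgetc(f)) != EOF; )
            ++freq[c];  // fgetc yields 0..255, so it is safe as an array index
        std::fclose(f);

        for (int b = 0; b < 256; ++b)
            if (freq[b])
                std::printf("0x%02x: %lld\n", b, freq[b]);
        return 0;
    }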

If you REALLY want a map, use map<unsigned char, int>; your suspicion about using a map keyed by char* is correct.
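For completeness, the map version of the counting loop looks roughly like this (the buffer and length names are made up):

    #include <cstddef>
    #include <map>

    // The cast matters here too: plain char may be signed on your platform.
    void countBytes(const char* data, std::size_t len,
                    std::map<unsigned char, int>& freq) {
        for (std::size_t i = 0; i < len; ++i)
            ++freq[static_cast<unsigned char>(data[i])];
    }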

Walter Mundt
That makes a lot of sense actually. I'm all for avoiding debugging STL container errors.
RyanG
Beyond that, maps actually have a fair bit of overhead: every entry is a separate heap allocation, and every lookup involves around lg(N) pointer dereferences. That's well and good if you're storing a lot of items with widely spaced or complex keys, but it's definitely better to stick with an array when you can get away with it.
Walter Mundt