I'm writing a huffman compressor and decompressor (in C++) that needs to work on arbitrary binary files. I need a bit of data structure advice. Right now, my compression process is as follows:
- Read the bytes of the file in binary form to a char* buffer
- Use an std::map to count the frequencies of each byte pattern in the file. (This is where I think I'm asking for trouble.)
- Build the binary tree based on the frequency histogram. Each internal node has the sum of the frequencies of its children and each leaf node has a char* to represent the actual byte.
This is where I'm at so far.
My question is what exactly I'm measuring if I just use a map from char* to int. If I'm correct, this isn't actually what I need. What I think I'm really doing is tracking the actual 4-byte pointer values by using char*.
So, what I plan to do is use a map for the histogram and a char for the data stored at leaf nodes. Is my logic sound here? My reasoning tells me yes, but since this is my first time dealing with binary data, I'd like to be careful of pitfalls that will only show up in strange ways.
Thanks.