tags:

views:

484

answers:

5

Can anyone tell me the best way to decode binary data with variable length bit strings in java?

For example:

The binary data is 10101000 11100010 01100001 01010111 01110001 01010110

I might need to find the first match of any of the following 01, 100, 110, 1110, 1010...

In this case the match would be 1010. I then need to do the same for the remainder of the binary data. The bit strings can be up to 16 bits long and cross the byte boundaries.

Basically, I'm trying to Huffman decode jpegs using the bit strings I created from the Huffman tables in the headers. I can do it, only it's very messy, I'm turning everything, binary data included, into Stringbuffers first and I know that isn't the right way.

Before I loaded everything in string buffers I tried using just numbers in binary but of course I can't ignore the leading 0s in a code like 00011. I'm sure there must be some clever way using bit wise operators and the like to do this, but I've been staring at pages explaining bit masks and leftwise shifts etc and I still don't have a clue!

Thanks a lot for any help!

EDIT:

Thanks for all the suggestions. I've gone with the binary tree approach as it seems to be the standard way with Huffman stuff. Makes sense really as Huffman codes are created using trees. I'll also look into to storing the binary data I need to search in a big integer. Don't know how to mark multiple answers as correct, but thanks all the same.

+1  A: 

You could try stuffing it into a BigInteger then using the shift and test methods. Then use loop to walk and accept each sub pattern.

If the huffman code are in a tree, 1 == right node, 0 == left node.

for( int i =numbitsTotal; i > 0; --i )
{
   int bit = bigInt.testBit( i );
   if( bit == 1 )
   {
       // take right node -- if null accept code, apply from top
   }
   else
   {
      // take left node -- if null accept code, apply from top
   }
}
sfossen
Thanks for the suggestion sfossen. I'm not sure what the advantage of BigInteger is though over just putting the data in a string buffer. Won't a biginteger keep on stripping 0s from the front of the binary bits and the bit codes?
joinJpegs
That data is stores as a bit per bit, versus a byte per bit, so 8x space savings. Since you aren't converting, it wouldn't strip the leading zeros.
sfossen
Sounds good thanks!
joinJpegs
+3  A: 

You might use a state machine consuming zeros and ones. The state machine would have final states for all the patterns that you want to detect. Whenever it enters one of the final states, is sends a message to you with the matched pattern and goes back to the initial state.

Finally you would have only one state machine in form of a DAG which contains all your patterns.

To implement it use the state pattern (http://en.wikipedia.org/wiki/State_pattern) or any other implementation of a state machine.

lewap
This is good, another way to think of it is to store the Huffman codes as a binary tree and you are transverse it. Once you hit a leaf you return the code and start all over again.
Pyrolistical
A: 

You could use a java.util.BitSet to store your binary data and then you can implement some search functions to find the position of a smaller BitSet inside the big one...

Chochos
The BitSet, doesn't have a way to initialize from an array of bytes.
sfossen
Well then he's gonna have to write some incredibly complex code<code>int pos2 = 0;for (int pos1 = 0; pos1 < byteArray.length; pos1++) { for (int i = 128; i > 0; i = i << 1) { if ((byteArray[pos1] } }}</code>
Chochos
+1  A: 

Since you are decoding Huffman encoded-data, you should create a binary tree, where leaves hold the decoded bit string as data, and the bits of each Huffman code are the path to the corresponding data. The bits of the Huffman code are accessed with bit-shift and bit-mask operations. When you get to a leaf, you output the data at that leaf and go back to the root of the tree. It's very fast and efficient.

erickson
A: 

I would suggest a trie. It is explicitly designed for prefix searching. In your case, it would be a binary trie.

Matt J