views:

794

answers:

7

I have an extremely large array of integers which I would like to compress.
However, the way to do it in Java is to use something like this -

int[] myIntArray;
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(1024);
ObjectOutputStream objectOutputStream = new ObjectOutputStream(new DeflaterOutputStream(byteArrayOutputStream));
objectOutputStream.writeObject(myIntArray);

Note that the int array first needs to be converted to bytes by Java. Now I know that is fast, but it still needs to create a whole new byte array, scan through the entire original int array converting each element to bytes, and copy the values into the new byte array.

Is there any way to skip the byte conversion and make it compress the integers right away?

+4  A: 

Skip the ObjectOutputStream and just store the ints directly as four bytes each. DataOutputStream.writeInt for instance is an easy way to do it.
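For instance, a sketch of that approach (the class and method names here are mine, not a standard API): each int goes through `DataOutputStream.writeInt` straight into the deflater, so no intermediate byte[] copy of the whole array is ever built.

```java
import java.io.*;
import java.util.zip.*;

public class IntCompress {
    // Write each int as four big-endian bytes directly into the deflater,
    // skipping ObjectOutputStream's serialization header and buffering.
    static byte[] compress(int[] data) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream(1024);
        DataOutputStream out = new DataOutputStream(
                new DeflaterOutputStream(bytes));
        for (int value : data) {
            out.writeInt(value);
        }
        out.close(); // finishes the deflater and flushes remaining output
        return bytes.toByteArray();
    }

    // Read the ints back; the caller must know how many were written.
    static int[] decompress(byte[] compressed, int count) throws IOException {
        DataInputStream in = new DataInputStream(
                new InflaterInputStream(new ByteArrayInputStream(compressed)));
        int[] result = new int[count];
        for (int i = 0; i < count; i++) {
            result[i] = in.readInt();
        }
        return result;
    }
}
```

Note that unlike `writeObject`, this doesn't record the array length, so you either store the count separately or write it first with another `writeInt`.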

Tom Hawtin - tackline
+1  A: 

It doesn't make sense. The compression algorithm needs bytes, not ints.

Stephan Eggermont
+1  A: 

Hmm. A general-purpose compression algorithm won't necessarily do a good job compressing an array of binary values, unless there's a lot of redundancy. You might do better to develop something of your own, based on what you know about the data.

What is it that you're actually trying to compress?

Mark Bessey
+1  A: 

You could use the representation used by Protocol Buffers. Each integer is represented by 1-5 bytes, depending on its magnitude.

Additionally, the new "packed" representation means you get basically a bit of "header" to say how big it is (and which field it's in) and then just the data. That's probably what ObjectOutputStream does as well, but it's a recent innovation in PB :)

Note that this will compress based on magnitude, not based on how often the integer has been seen. That will dramatically affect whether it's useful for you or not.
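A rough sketch of that representation (my own minimal version of the base-128 varint idea, not the actual Protocol Buffers library): each byte holds 7 bits of the value, with the high bit flagging whether more bytes follow, so small magnitudes take fewer bytes.

```java
import java.io.*;

public class Varint {
    // Base-128 varint as used by Protocol Buffers: 1 byte for values
    // under 128, up to 5 bytes for the full int range. Note negative
    // ints always take five bytes here; real PB offers zigzag encoding
    // to keep small negative values short.
    static void writeVarint(OutputStream out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // low 7 bits, "more" flag set
            value >>>= 7;
        }
        out.write(value); // final byte, high bit clear
    }

    static int readVarint(InputStream in) throws IOException {
        int result = 0, shift = 0, b;
        do {
            b = in.read();
            result |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return result;
    }
}
```

So an array of mostly small ints shrinks to roughly a quarter of its raw size before any general-purpose compression is even applied.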

Jon Skeet
A: 

A byte array is not going to save you much memory unless you make it a byte array holding unsigned ints, which is very dangerous in Java. It will replace memory overhead with extra processing time for the range-checking code. This may be all right for data storage, but there are already data storage solutions out there.
Unless you are doing this for serialization purposes, I think that you are wasting your time.

WolfmanDragon
A: 

If the array of ints is guaranteed to have no duplicates, you can use a java.util.BitSet, instead.

As its base implementation is an array of bits, with each bit indicating if a certain integer is present or not in the BitSet, its memory usage is quite low, therefore needing less space to be serialized.
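A sketch of that idea (the `pack`/`unpack` helpers are mine): one bit per possible value instead of 32 bits per element. It only works for non-negative, distinct values, the original ordering is lost, and memory grows with the largest value present, so it pays off when the values are reasonably dense.

```java
import java.util.BitSet;

public class BitSetPack {
    // Record each distinct non-negative int as a single set bit.
    static BitSet pack(int[] values) {
        BitSet set = new BitSet();
        for (int v : values) {
            set.set(v);
        }
        return set;
    }

    // Recover the members; they come back in ascending order.
    static int[] unpack(BitSet set) {
        int[] out = new int[set.cardinality()];
        int n = 0;
        for (int i = set.nextSetBit(0); i >= 0; i = set.nextSetBit(i + 1)) {
            out[n++] = i;
        }
        return out;
    }
}
```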

Reginaldo
But BitSet would help only if the values are 0 or 1, while I have values ranging up to Integer.MAX_VALUE, with no duplicates of course
pdeva
No, java.util.BitSet can store as many unique integers as you need.
Reginaldo
A: 

In your example, you are writing the compressed stream to the ByteArrayOutputStream. Your compressed array needs to exist somewhere, and if the destination is memory, then ByteArrayOutputStream is your likely choice. You could also write the stream to a socket or file. In that case, you wouldn't duplicate the stream in memory. If your array is 800MB and you're running in a 1GB heap, you could easily write the array to a compressed file with the example you included. The change would be replacing the ByteArrayOutputStream with a file stream.
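A minimal sketch of that change (the class and method names are mine): the same pipeline as in the question, but the compressed bytes go straight to disk, so the compressed copy never has to fit in memory alongside the array.

```java
import java.io.*;
import java.util.zip.*;

public class CompressToFile {
    // As in the question, but with a FileOutputStream as the final
    // destination instead of a ByteArrayOutputStream.
    static void writeCompressed(int[] data, File dest) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new DeflaterOutputStream(
                        new BufferedOutputStream(new FileOutputStream(dest))))) {
            out.writeObject(data);
        }
    }

    // Reverse the pipeline to get the array back.
    static int[] readCompressed(File src)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new InflaterInputStream(
                        new BufferedInputStream(new FileInputStream(src))))) {
            return (int[]) in.readObject();
        }
    }
}
```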

The ObjectOutputStream format is actually fairly efficient. It will not duplicate your array in memory, and has special code for efficiently writing arrays.

Are you wanting to work with the compressed array in memory? Would your data lend itself well to a sparse array? Sparse arrays are good when you have large gaps in your data.

brianegge