views:

129

answers:

1

I'm working on Scala with VERY larg lists of Int (maybe large) and I need to compress them and to hold it in memory.

The only requirement is that I can pull (and decompress) the first number on the list to work with, whithout touching the rest of the list.

I have many good ideas but most of them translate the numbers to bits. Example:

you can write any number x as the tuple |log(x)|,x-|log(x)| the first element we right it as a string of 1's and a 0 at the end (Unary Code) and the second in binary. e.g:

1 -> 0,1 -> 0 1

...

5 -> 2,1 -> 110 01

...

8 -> 3,0 -> 1110 000

9 -> 3,1 -> 1110 001

...

While a Int takes a fixed 32 bits of memory and a long 64, with this compression x requires 2log(x) bits for storage and can grow indefinetly. This Compression does reducememory in most cases.

How would you handle such type of data? Is there something such as bitarray or something?

Any other way to compress such data in Scala?

Thanks

+2  A: 

Depending on the sparseness and range of your data set, you may keep your data as a list of deltas instead of numbers. That's used for sound compression, for instance, and can be both lossy or lossless, depending on your needs.

For instance, if you have Int numbers but know they will hardly ever be more than a (signed) Byte apart, you could do something like this list of bytes:

-1           // Use -1 to imply the next number cannot be computed as a byte delta
0, 0, 4, 0   // 1024 encoded as bytes
1            // 1025 as a delta
-5           // 1020 as a delta
-1           // Next number can't be computed as a byte delta
0, 0, -1, -1 // 65535 encoded as bytes -- -1 doesn't have special meaning here
10           // 65545 as a delta

So you don't have to handle bits using this particular encoding. But, really, you won't get good answers without a very clear indication of the particular problem, the characteristics of the data, etc.

Rereading your question, it seems you are not discarding compression techniques that turn data into bits. If not, then I suggest Huffman -- predictive if needed -- or something from the Lempel-Ziv family.

And, no, Scala has no library to handle binary data, unfortunately. Though paulp probably has something like that in the compiler itself.

Daniel
Thanks, that is a good idea.I guess we could use an array of bytes to "package" bits. It would serve as a list of bits (packed in groups of 8). Do you think this could work in terms of performance?
Skuge