Let's say you have a List<List<Boolean>> and you want to encode that into binary form in the most compact way possible.

I don't care about read or write performance. I just want to use the minimal amount of space. Also, the example is in Java, but we are not limited to the Java system. The length of each "List" is unbounded, therefore any solution that encodes the length of each list must itself use a variable-length encoding.

Related to this problem is encoding of variable length integers. You can think of each List<Boolean> as a variable length unsigned integer.

Please read the question carefully. We are not limited to the Java system.

EDIT

I don't understand why a lot of the answers talk about compression. I am not trying to do compression per se, but just to encode random sequences of bits. Except each sequence of bits is of a different length, and order needs to be preserved.

You can think of this question in a different way. Let's say you have an arbitrary list of random unsigned integers (unbounded). How do you encode this list in a binary file?

Research

I did some reading and found that what I am really looking for is a Universal code

Result

I am going to use a variant of Elias Omega Coding described in the paper A new recursive universal code of the positive integers

I now understand that a smaller representation for the small integers is a trade-off against the representation of the large integers. By simply choosing a universal code with a "large" representation of the very first integer, you save a lot of space in the long run when you need to encode arbitrarily large integers.
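For reference, here is a minimal Python sketch of plain Elias omega coding (my own illustration, not the variant from the paper); it reproduces the `17 -> '10 100 10001 0'` example discussed in the answers below:

```python
# Elias omega: recursively prefix the value with its own length, then
# the length of that length, etc., and terminate with a single 0 bit.
def elias_omega_encode(n):
    assert n >= 1
    code = "0"                      # terminating zero
    while n > 1:
        group = format(n, "b")
        code = group + code
        n = len(group) - 1          # next, encode this group's length - 1
    return code

def elias_omega_decode(bits):
    n, i = 1, 0
    while bits[i] == "1":
        length = n + 1              # each group is one bit longer than n says
        n = int(bits[i:i + length], 2)
        i += length
    return n
```

Note how `elias_omega_encode(17)` yields `10 100 10001 0` (11 bits), which is exactly the trade-off between small and large integers mentioned above.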

A: 

You need Huffman Coding

Mick Sharpe
How does Huffman coding help here? Using a prefix code here would only use more space, wouldn't it?
Pyrolistical
The prefix codes actually *replace* the symbols they represent. (Though one must also consider the overhead of passing the tree used to decode the shorter prefixes.)
ladenedge
A: 

You could convert each List into a BitSet and then serialize the BitSet-s.

Steve Emmerson
bitsets are finite 32-bits
Pyrolistical
@Pyrolistical: A BitSet uses a "long" to hold the bit information, but it uses as many "long"s as necessary to hold *all* the bits.
Steve Emmerson
My main point is that it's a finite data type which won't work as a solution
Pyrolistical
Well obviously no solution will successfully encode an infinite number of digits. Get real please.
Joe Koberg
+2  A: 

I guess for "the most compact way possible" you'll want some compression, but Huffman coding may not be the way to go, as I think it works best with alphabets that have static per-symbol frequencies.

Check out Arithmetic Coding - it operates on bits and can adapt to dynamic input probabilities. I also see that there is a BSD-licensed Java library that'll do it for you, and which seems to expect single bits as input.

I suppose for maximum compression you could concatenate each inner list (prefixed with its length) and run the coding algorithm again over the whole lot.

ladenedge
Any compression algorithm that depends on symbol frequency won't work because the Boolean lists are basically random.
Pyrolistical
+3  A: 

I don't see how encoding an arbitrary set of bits differs from compressing/encoding any other form of data. Note that you only impose a loose restriction on the bits you're encoding: namely, they are lists of lists of bits. With this small restriction, this list of bits becomes just data, arbitrary data, and that's what "normal" compression algorithms compress.

Of course, most compression algorithms work on the assumption that the input is repeated in some way in the future (or in the past), as in the LZxx family of compressors, or have a given frequency distribution for symbols.

Given your prerequisites and how compression algorithms work, I would advise doing the following:

  1. Pack the bits of each list using the fewest possible number of bytes, using bytes as bitfields, encoding the length, etc.
  2. Try huffman, arithmetic, LZxx, etc. on the resulting stream of bytes.

One can argue that this is the pretty obvious and easiest way of doing this, and that it won't work as your sequence of bits has no known pattern. But the fact is that this is the best you can do in any scenario.

UNLESS, you know something from your data, or some transformation on those lists that make them raise a pattern of some kind. Take for example the coding of the DCT coefficients in JPEG encoding. The way of listing those coefficients (diagonal and in zig-zag) is made to favor a pattern in the output of the different coefficients for the transformation. This way, traditional compressions can be applied to the resulting data. If you know something of those lists of bits that allow you to re-arrange them in a more-compressible way (a way that shows some more structure), then you'll get compression.

Diego Sevilla
The problem is then to obtain the "bytes" from the data :)
Matthieu M.
Matthieu: I don't think packing bits into bytes is the main problem here anyway :)
Diego Sevilla
Assume the bits are random, so no compression would work. I am looking for a good encoding method, not compression.
Pyrolistical
If the bits are truly random, no encoding method will really work much better than the bit strings themselves. If the integers are unbounded in length and vary in size radically, all you need to do is store pairs <size,bitstring> where size=ceil(log2(bitstring)). The size of "size" is some constant you choose that enables you to represent any large enough bitstring. For most modern computers, size(size) being 32 or so is probably more than adequate.
Ira Baxter
+2  A: 

Theoretical Limits

This is a difficult question to answer without knowing more about the data you intend to compress; the answer to your question could be different with different domains.

For example, from the Limitations section of the Wikipedia article on Lossless Compression:

Lossless data compression algorithms cannot guarantee compression for all input data sets. In other words, for any (lossless) data compression algorithm, there will be an input data set that does not get smaller when processed by the algorithm. This is easily proven with elementary mathematics using a counting argument. ...

Basically, since it's theoretically impossible to compress all possible input data losslessly, it's not even possible to answer your question effectively.

Practical compromise

Just use Huffman, DEFLATE, 7Z, or some ZIP-like off-the-shelf compression algorithm and encode the bits as variable-length byte arrays (or lists, or vectors, or whatever they are called in Java or whatever language you like). Of course, to read the bits back out may require a bit of decompression, but that could be done behind the scenes. You can make a class which hides the internal implementation methods and returns a list or array of booleans in some range of indices, despite the fact that the data is stored internally in packed byte arrays. Updating the boolean at a given index or indices may be a problem, but is by no means impossible.
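As an illustration of this compromise, a small Python sketch (the function names are mine): pack each boolean list into bytes behind a fixed-width length prefix, then run DEFLATE (via zlib) over the whole stream:

```python
import zlib

def pack_bits(bits):
    """Pack a list of booleans into bytes, first bool in the MSB."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        chunk = bits[i:i + 8]
        byte = 0
        for b in chunk:
            byte = (byte << 1) | int(b)
        out.append(byte << (8 - len(chunk)))    # left-align a partial last byte
    return bytes(out)

def serialize(lists):
    stream = bytearray()
    for bits in lists:
        stream += len(bits).to_bytes(4, "big")  # naive 4-byte bit-count prefix
        stream += pack_bits(bits)
    return zlib.compress(bytes(stream), 9)      # 9 = best compression
```

The read side simply reverses the two steps; the fixed-width length prefix is deliberately naive here, on the assumption that DEFLATE will squeeze out its redundant zero bytes anyway.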

Jared Updike
+3  A: 

I don't know much about Java, so I guess my solution will HAVE to be general :)

1. Compact the lists

Since Booleans are inefficient, each List<Boolean> should be compacted into a List<Byte>, it's easy, just grab them 8 at a time.

The last "byte" may be incomplete, so you need to store how many bits have been encoded of course.

2. Serializing a list of elements

You have 2 ways to proceed: either you encode the number of items of the list, or you use a pattern to mark the end. I would recommend encoding the number of items; the pattern approach requires escaping and it's creepy, plus it's more difficult with packed bits.

To encode the length you can use a variable scheme, i.e. the number of bytes necessary to encode a length should be proportional to the length. Here is one I already used: you indicate how many bytes are used to encode the length itself with a prefix on the first byte:

0... .... > this byte encodes the number of items (7 effective bits)
10.. .... / .... .... > 2 bytes
110. .... / .... .... / .... .... > 3 bytes

It's quite space efficient, and decoding occurs on whole bytes, so not too difficult. One could remark it's very similar to the UTF8 scheme :)

3. Apply recursively

List< List< Boolean > > becomes [Length Item ... Item] where each Item is itself the representation of a List<Boolean>

4. Zip

I suppose there is a zlib library available for Java, or anything else like deflate or LZW. Pass it your buffer and make sure to specify that you want as much compression as possible, however long it takes.

If there is any repetitive pattern (even ones you did not see) in your representation, it should be able to compress it. Don't trust it blindly though and DO check that the "compressed" form is lighter than the "uncompressed" one; it's not always the case.

5. Examples

Where one notices that keeping track of the edge of the lists is space consuming :)

// Tricky here, we indicate how many bits are used, but they are packed into bytes ;)
List<Boolean> list = [false,false,true,true,false,false,true,true]
encode(list) == [0x08, 0x33] // [00001000, 00110011]  (2 bytes)

// Easier: the length actually indicates the number of elements
List<List<Boolean>> super = [list,list]
encode(super) == [0x02, 0x08, 0x33, 0x08, 0x33] // [00000010, ...] (5 bytes)
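A Python sketch of my reading of this scheme (not the author's code) that reproduces both examples:

```python
def encode_length(n):
    # UTF-8-like prefix: the leading bits of the first byte say how many
    # bytes the length occupies (sketch limited to 3 bytes, n < 2^21).
    if n < 1 << 7:
        return bytes([n])
    if n < 1 << 14:
        return bytes([0x80 | (n >> 8), n & 0xFF])
    if n < 1 << 21:
        return bytes([0xC0 | (n >> 16), (n >> 8) & 0xFF, n & 0xFF])
    raise ValueError("extend the prefix scheme for larger lengths")

def encode_bool_list(bits):
    # length counts bits; bits are then packed 8 per byte, first bit in MSB
    payload = bytearray()
    for i in range(0, len(bits), 8):
        chunk = bits[i:i + 8]
        byte = 0
        for b in chunk:
            byte = (byte << 1) | int(b)
        payload.append(byte << (8 - len(chunk)))
    return encode_length(len(bits)) + bytes(payload)

def encode_nested(lists):
    # length counts items; each item is an encoded List<Boolean>
    out = bytearray(encode_length(len(lists)))
    for bits in lists:
        out += encode_bool_list(bits)
    return bytes(out)
```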

6. Space consumption

Suppose we have a List<Boolean> of n booleans, the space consumed to encode it is:

booleans = ceil( n / 8 )

To encode the number of bits (n), we need:

length = 1   for 0    <= n < 2^7   ~ 128
length = 2   for 2^7  <= n < 2^14  ~ 16384
length = 3   for 2^14 <= n < 2^21  ~ 2097152
...
length = ceil( log(n) / 7 )  # for n != 0 ;)

Thus to fully encode a list:

bytes =
 if n == 0: 1
 else     : ceil( log(n) / 7 ) + ceil( n / 8 )

7. Small Lists

There is one corner case though: the low end of the spectrum (ie almost empty list).

For n == 1, bytes evaluates to 2, which may indeed seem wasteful. I would not however try to guess what will happen once the compression kicks in.

You may wish though to pack even more. It's possible if we abandon the idea of preserving whole bytes...

  1. Keep the length encoding as is (on whole bytes), but do not "pad" the List<Boolean>. A one element list becomes 0000 0001 x (9 bits)
  2. Try to 'pack' the length encoding as well

The second point is more difficult, we are effectively down to a double length encoding:

  1. Indicates how many bits encode the length
  2. Actually encode the length on these bits

For example:

0  -> 0 0
1  -> 0 1
2  -> 10 10
3  -> 10 11
4  -> 110 100
5  -> 110 101
8  -> 1110 1000
16 -> 11110 10000 (=> 1 byte and 2 bits)
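A sketch of this double encoding in Python (my own implementation of the table above; it is essentially an Elias-gamma-style code extended to accept 0):

```python
def encode_tiny(n):
    # header: (width - 1) ones then a 0, where width = bits in n;
    # value: n written on exactly width bits
    width = max(n.bit_length(), 1)
    return "1" * (width - 1) + "0" + format(n, "b").zfill(width)

def decode_tiny(bits):
    width = bits.index("0") + 1
    return int(bits[width:2 * width], 2)
```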

It works pretty well for very small lists, but it quickly degenerates:

# Original scheme
length = ceil( log(n) / 7 )

# New scheme
length = 2 * ceil( log(n) )

The breaking point? 8

Yep, you read it right, it's only better for lists with fewer than 8 elements... and only better by a few "bits".

n         -> bits spared
[0,1]     ->  6
[2,3]     ->  4
[4,7]     ->  2
[8,15]    ->  0    # Turn point
[16,31]   -> -2
[32,63]   -> -4
[64,127]  -> -6
[128,255] ->  0    # Interesting eh ? That's the whole byte effect!

And of course, once the compression kicks in, chances are it won't really matter.

I understand you may appreciate recursive's algorithm, but I would still advise computing the figures of the actual space consumption, or even better, actually testing it with archiving applied on real test sets.

8. Recursive / Variable coding

I have read with interest TheDon's answer, and the link he submitted to Elias Omega Coding.

They are sound answers in the theoretical domain. Unfortunately they are quite impractical. The main issue is that while they have extremely interesting asymptotic behaviors, when do we actually need to encode a gigabyte worth of data? Rarely if ever.

A recent study of memory usage at work suggested that most containers were used for a dozen items (or a few dozen). Only in some very rare cases do we reach the thousands. Of course for your particular problem the best way would be to actually examine your own data and see the distribution of values, but from experience I would say you cannot just concentrate on the high end of the spectrum, because your data lies in the low end.

An example of TheDon's algorithm. Say I have a list [0,1,0,1,0,1,0,1]

len('01010101') = 8 -> 1000
len('1000')     = 4 -> 100
len('100')      = 3 -> 11
len('11')       = 2 -> 10

encode('01010101') = '10' '0' '11' '0' '100' '0' '1000' '1' '01010101'

len(encode('01010101')) = 2 + 1 + 2 + 1 + 3 + 1 + 4 + 1 + 8 = 23
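My own Python rendering of this recursion (it emits the chain of lengths outermost first and stops once a length fits in 2 bits, as in the example; the sketch assumes the value is at least 2 bits long):

```python
def encode_thedon(value_bits):
    # Chain of lengths: value length, length of that length (as a bit
    # string), ... down to the 2-bit length '10'.  Emit outermost first,
    # each followed by a flag bit: 0 = another length follows,
    # 1 = the value itself follows.
    lengths = [format(len(value_bits), "b")]
    while lengths[-1] != "10":
        lengths.append(format(len(lengths[-1]), "b"))
    out = ""
    for i, group in enumerate(reversed(lengths)):
        out += group + ("1" if i == len(lengths) - 1 else "0")
    return out + value_bits
```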

Let's make a small table, with various 'thresholds' to stop the recursion. It represents the number of bits of overhead for various ranges of n.

threshold     2    3    4    5      My proposal
-----------------------------------------------
[0,3]    ->   3    4    5    6           8
[4,7]    ->   10   4    5    6           8
[8,15]   ->   15   9    5    6           8
[16,31]  ->   16   10   5    6           8
[32,63]  ->   17   11   12   6           8
[64,127] ->   18   12   13   14          8
[128,255]->   19   13   14   15         16

To be fair, I concentrated on the low end, and my proposal is suited for this task. I wanted to underline that it's not so clear cut though. Especially because near 1, the log function is almost linear, and thus the recursion loses its charm. The threshold helps tremendously, and 3 seems to be a good candidate...

As for Elias omega coding, it's even worse. From the wikipedia article:

17 -> '10 100 10001 0'

That's it, a whopping 11 bits.

Moral: You cannot choose an encoding scheme without considering the data at hand.

So, unless your List<Boolean> lists have lengths in the hundreds, don't bother and stick to my little proposal.

Matthieu M.
+1: I think this more in line with what Pyrolistical is asking for.
James
This is close to what I am looking for. But as I commented in zneak's answer, this doesn't scale well when you mix large and small integers, as you have arbitrarily decided to fix your length encoding to mean 1 bit per byte.
Pyrolistical
Honestly... I did not understand your comment :/ Having used this approach, it normally scales pretty well. I'll edit the post with some figures if you wish.
Matthieu M.
Your bit encoding scheme can be made considerably better. In your case, encoding a length of 10^2000 would take 6644 bits to represent the number, and an additional 6644 bits to represent the length of that number. A total of 13288 bits. And your byte encoding scheme would take 6644 + 949 bits to encode the number. Recursion would be a more efficient way of storing the lengths, given the unbounded nature of the problem.
TheDon
I think I am slowly coming to visualize what you are talking about... and you are ever so slowly... blasting my mind! I need to make some computations on the gains but it really seems worth it... I'll check it out later.
Matthieu M.
A: 

Well, first off you will want to pack those booleans together so that you are getting eight of them to a byte. C++'s standard bitset was designed for this purpose. You should probably be using it natively instead of vector<bool>, if you can.

After that, you could in theory compress it when you save to get the size even smaller. I'd advise against this unless your back is really up against the wall.

I say in theory because it depends a lot on your data. Without knowing anything about your data, I really can't say any more on this, as some algorithms work better than others on certain kinds of data. In fact, simple information theory tells us that in some cases any compression algorithm will produce output that takes up more space than you started with.

If your bitset is rather sparse (not a lot of 0's, or not a lot of 1's), or is streaky (long runs of the same value), then it is possible you could get big gains with compression. In almost every other circumstance it won't be worth the trouble. Even in that circumstance it may not be. Remember that any code you add will need to be debugged and maintained.

T.E.D.
A: 

As you point out, there is no reason to store your boolean values using any more space than a single bit. If you combine that with some basic construct, such as each row begins with an integer coding the number of bits in that row, you'll be able to store a 2D table of any size where each entry in the row is a single bit.

However, this is not enough. A string of arbitrary 1's and 0's will look rather random, and any compression algorithm breaks down as the randomness of your data increases - so I would recommend a process like Burrows-Wheeler Block sorting to greatly increase the amount of repeated "words" or "blocks" in your data. Once that's complete a simple Huffman code or Lempel-Ziv algorithm should be able to compress your file quite nicely.

To allow the above method to work for unsigned integers, you would compress the integers using Delta Codes, then perform the block sorting and compression (a standard practice in Information Retrieval postings lists).

redlightbulb
It's not an even 2D array. Each row/col can be different.
Pyrolistical
+3  A: 

I'd use variable-length integers to encode how many bits there are to read. The MSB would indicate if the next byte is also part of the integer. For instance:

11000101 10010110 00100000

Would actually mean:

   10001 01001011 00100000

Since the integer is continued 2 times.

These variable-length integers would tell how many bits there are to read. And there'd be another variable-length int at the beginning of all to tell how many bit sets there are to read.

From there on, supposing you don't want to use compression, the only way I can see to optimize it size-wise is to adapt it to your situation. If you often have larger bit sets, you might want for instance to use short integers instead of bytes for the variable-length integer encoding, making you potentially waste less bits in the encoding itself.
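This continuation-bit scheme is easy to sketch in Python (my own code; I emit the 7-bit groups most significant first so the payload reads left to right, matching the example above):

```python
def encode_varint(n):
    # split n into 7-bit groups, most significant first; set the MSB on
    # every byte except the last to say "the integer continues"
    groups = []
    while True:
        groups.append(n & 0x7F)
        n >>= 7
        if n == 0:
            break
    groups.reverse()
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])

def decode_varint(data):
    n = 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if not byte & 0x80:     # MSB clear: last byte of this integer
            break
    return n
```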


EDIT I don't think there exists a perfect way to achieve all you want, all at once. You can't create information out of nothing, and if you need variable-length integers, you obviously have to encode the integer length too. There is necessarily a tradeoff between space and information, but there is also minimal information that you can't cut out to use less space. No system where factors grow at different rates will ever scale perfectly. It's like trying to fit a straight line over a logarithmic curve. You can't do that. (And besides, that's pretty much exactly what you're trying to do here.)

You cannot encode the length of the variable-length integer outside of the integer and get unlimited-size variable integers at the same time, because that would require the length itself to be variable-length, and whatever algorithm you choose, it seems common sense to me that you'll be better off with just one variable-length integer instead of two or more of them.

So here is my other idea: in the integer "header", write one 1 for each byte the variable-length integer requires from there. The first 0 denotes the end of the "header" and the beginning of the integer itself.

I'm trying to grasp the exact equation to determine how many bits are required to store a given integer for the two ways I gave, but my logarithms are rusty, so I'll plot it down and edit this message later to include the results.


EDIT 2 Here are the equations:

  • Solution one, 7 bits per encoding bit (one full byte at a time):
    y = 8 * ceil(log(x) / (7 * log(2)))
  • Solution one, 3 bits per encoding bit (one nibble at a time):
    y = 4 * ceil(log(x) / (3 * log(2)))
  • Solution two, 1 byte per encoding bit plus separator:
    y = 9 * ceil(log(x) / (8 * log(2))) + 1
  • Solution two, 1 nibble per encoding bit plus separator:
    y = 5 * ceil(log(x) / (4 * log(2))) + 1

I suggest you take the time to plot them (best viewed with a logarithmic-linear coordinates system) to get the ideal solution for your case, because there is no perfect solution. In my opinion, the first solution has the most stable results.

zneak
This is the closest answer yet, but it doesn't scale at all. If you mix large and small integers you'll be wasting bits here or there.
Pyrolistical
Are you sure you're looking for something achievable? I don't think there exists a solution that scales perfectly for what you want, save compression. I'll edit my post for my further thoughts.
zneak
+3  A: 

I have a sneaking suspicion that you simply can't encode a truly random set of bits into a more compact form in the worst case. Any kind of RLE is going to inflate the set on just the wrong input even though it'll do well in the average and best cases. Any kind of periodic or content specific approximation is going to lose data.

As one of the other posters stated, you've got to know SOMETHING about the dataset to represent it in a more compact form and / or you've got to accept some loss to get it into a predictable form that can be more compactly expressed.

In my mind, this is an information-theoretic problem with the constraint of infinite information and zero loss. You can't represent the information in a different way and you can't approximate it as something more easily represented. Ergo, you need at least as much space as you have information and no less.

http://en.wikipedia.org/wiki/Information_theory

You could always cheat, I suppose, and manipulate the hardware to encode a discrete range of values on the media to tease out a few more "bits per bit" (think multiplexing). You'd spend more time encoding it and reading it though.

Practically, you could always try the "jiggle" effect where you encode the data multiple times in multiple ways (try interpreting it as audio, video, 3d, periodic, sequential, key based, diffs, etc...) and in multiple page sizes, and pick the best. You'd be pretty much guaranteed to have the best REASONABLE compression and your worst case would be no worse than your original data set.

Dunno if that would get you the theoretical best though.

James
I am not trying to do compression, just encoding. I know I need to use as many bits as I need to, but _HOW_ should I use those bits?
Pyrolistical
@Pyrolistical: are you concerned about the "data bloat" of converting those boolean values to disk? I.e., how to convert a list of Java Boolean values into a set of disk-based booleans? In particular, the encoding of the variable run length? If so, take another look at the way that Matthieu M. is suggesting: encode the data in two passes. First, encode the binary integers as bits, then encode the number of elements in the list. Finally, blast the whole structure out to disk one bit at a time.
James
@Pyrolistical: BTW, lossless compression is just more efficient encoding.
James
A: 

@zneak's answer (beat me to it), but use huffman encoded integers, especially if some lengths are more likely.

Just to be self-contained: Encode the number of lists as a huffman encoded integer, then for each list, encode its bit length as a huffman encoded integer. The bits for each list follow with no intervening wasted bits.

If the order of the lists doesn't matter, sorting them by length would reduce the space needed, only the incremental length increase of each subsequent list need be encoded.

ergosys
+7  A: 

I am thinking of encoding a bit sequence like this:

head  | value
------+------------------
00001 | 0110100111000011

Head has variable length. Its end is marked by the first occurrence of a 1. Count the number of zeroes in head: the length of the value field will be 2 ^ zeroes. Since the length of value is known, this encoding can be repeated. And since the size of head is the log of the size of value, the overhead converges to 0% as the encoded values grow.

Addendum

If you want to fine tune the length of value more, you can add another field that stores the exact length of value. The length of the length field could be determined by the length of head. Here is an example with 9 bits.

head  | length | value
------+--------+-----------
00001 | 1001   | 011011001
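A Python sketch of the addendum (my own implementation: with z zeroes in head, the length field is z bits, so value can be up to 2^z - 1 bits; empty lists are not handled):

```python
def encode_rec(value_bits):
    # head: z zeroes then a 1, where z = bits needed for len(value);
    # then the exact length on z bits; then the value itself
    z = len(value_bits).bit_length()
    return "0" * z + "1" + format(len(value_bits), "b").zfill(z) + value_bits

def decode_rec(bits):
    # returns the decoded value and the remaining bit stream,
    # so the encoding can be repeated back to back
    z = bits.index("1")
    length = int(bits[z + 1:z + 1 + z], 2)
    start = z + 1 + z
    return bits[start:start + length], bits[start + length:]
```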
recursive
I think this might be the answer I am looking for. It scales from small to large integers.
Pyrolistical
Similar to Golomb Coding http://en.wikipedia.org/wiki/Golomb_coding ?
Joe Koberg
It does not work so well: the problem is that you need the exact number of bits in the list. You would have to actually encode: the length of the length, then the length, then the bits... and if you see my answer you'll notice it's only worth it for lists that are less than `8` bits... and not by much.
Matthieu M.
I don't think this is adequate either. For instance, you waste 31 data bits to represent 2^33, whereas you only waste 7 data bits using the encodings I suggested yesterday. The encoding-bits vs. data-bits size tradeoff is not worth it for huge ranges of numbers.
zneak
zneak: I've moved my comment into the answer. It addresses dealing with lengths other than powers of 2, and still converges to 0% overhead.
recursive
@zneak You are correct this does take more bits for larger sizes.
Pyrolistical
@recursive a few tweaks can save bits. make it so the head counts zeros plus one and in your addendum make length == 2^length of head, but now its a trade off as zneak mentioned earlier. you can always shrink your header, but the encoding of small values suffer.
Pyrolistical
Did some reading and this is Elias gamma and delta coding. http://en.wikipedia.org/wiki/Elias_delta_coding It's actually worse, because Elias exploited the fact that you can drop the MSB (duh, now that you think about it).
Pyrolistical
@recursive while nobody really came up with the answer I am going to use, I would say you are the closest, so I am giving you the points. Could you provide some insights into my other open question? http://stackoverflow.com/questions/2186519/can-you-encode-to-less-bits-when-you-dont-need-to-preserve-order
Pyrolistical
A: 

If I understood the question correctly, the bits are random, and we have a random-length list of independently random-length lists. Since nothing here deals in bytes, I will discuss this as a bit stream. Since files actually contain bytes, you will need to pack eight bits into each byte and leave the last 0..7 bits of the final byte unused.

The most efficient way of storing the boolean values is as-is. Just dump them into the bitstream as a simple array.

In the beginning of the bitstream you need to encode the array lengths. There are many ways to do it, and you can save a few bits by choosing the one most optimal for your arrays. For this you will probably want to use Huffman coding with a fixed codebook, so that commonly used small values get the shortest sequences. If the list is very long, you probably won't care so much about its size being encoded in a longer form.

A precise answer as to what the codebook (and thus the huffman code) is going to be cannot be given without more information about the expected list lengths.

If all the inner lists are of the same size (i.e. you have a 2D array), you only need the two dimensions, of course.

Deserializing: decode the lengths and allocate the structures, then read the bits one by one, assigning them to the structure in order.

Tronic
+1  A: 

List-of-Lists-of-Ints-Encoding:

  • When you come to the beginning of a list, write down the bits for ASCII '['. Then proceed into the list.

• When you come to any arbitrary binary number, write down the bits corresponding to the decimal representation of the number in ASCII. For example, for the number 100, write 0x31 0x30 0x30. Then write the bits corresponding to ASCII ','.

  • When you come to the end of a list, write down the bits for ']'. Then write ASCII ','.

This encoding will encode any arbitrarily-deep nesting of arbitrary-length lists of unbounded integers. If this encoding is not compact enough, follow it up with gzip to eliminate the redundancies in ASCII bit coding.
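For concreteness, a tiny Python sketch of this ASCII encoding (my own, covering one level of nesting as in the question):

```python
def encode_ascii(lists):
    # '[' opens a sublist; each integer is written in decimal followed
    # by ','; ']' closes the sublist and is itself followed by ','
    out = ""
    for sublist in lists:
        out += "[" + "".join(str(n) + "," for n in sublist) + "],"
    return out.encode("ascii")
```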

Joe Koberg
"If you could encode a list of unbounded integers obviously somehow, then you have answered my question."
Joe Koberg
sigh. I knew the moment I typed that you would answer in this fashion. You know what I mean when I say "compactly". There are more compact ways than you have described.
Pyrolistical
No, I don't know what you mean, really. Imagine that instead of ASCII terminators and flags, you used a binary flag string. You can't know (or write down) the lengths of unbounded lists or ints; so prefixing anything with its length is right out. You need to know when lists start. In your special case of a single hierarchy level, the start flag can be the same as the end-previous-list flag - so my encoding is redundant there. You need to know when numbers start (or, again, end). These codes are arbitrary.
Joe Koberg
Tell me this - just exactly what would your input look like? Obviously it must be encoded in some way for the solution to work upon it... So how does your input encoding go exactly?
Joe Koberg
A: 

List-of-List-of-Ints-binary:

Start traversing the input list
For each sublist:
    Output 0xFF 0xFE
    For each item in the sublist:
        Output the item as a stream of bits, LSB first.
          If the pattern 0xFF appears anywhere in the stream,
          replace it with 0xFF 0xFD in the output.
        Output 0xFF 0xFC

Decoding:

If the stream has ended then end any previous list and end reading.
Read bits from input stream. If pattern 0xFF is encountered, read the next 8 bits.
   If they are 0xFE, end any previous list and begin a new one.
   If they are 0xFD, assume that the value 0xFF has been read (discard the 0xFD)
   If they are 0xFC, end any current integer at the bit before the pattern, and begin reading a new one at the bit after the 0xFC.
   Otherwise indicate error. 
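A byte-granularity Python sketch of this framing (the answer describes a bit stream; mine works on whole bytes, and the helper names are hypothetical):

```python
def encode_framed(lists):
    out = bytearray()
    for sublist in lists:
        out += b"\xff\xfe"                             # begin block
        for item in sublist:
            size = max((item.bit_length() + 7) // 8, 1)
            body = item.to_bytes(size, "little")
            out += body.replace(b"\xff", b"\xff\xfd")  # escape literal 0xFF
            out += b"\xff\xfc"                         # end of this integer
    return bytes(out)

def decode_framed(data):
    lists, buf, i = [], bytearray(), 0
    while i < len(data):
        if data[i] == 0xFF:
            marker = data[i + 1]
            if marker == 0xFE:                         # begin a new sublist
                lists.append([])
            elif marker == 0xFD:                       # escaped literal 0xFF
                buf.append(0xFF)
            elif marker == 0xFC:                       # integer terminator
                lists[-1].append(int.from_bytes(bytes(buf), "little"))
                buf.clear()
            i += 2
        else:
            buf.append(data[i])
            i += 1
    return lists
```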
Joe Koberg
For more "compactness" you can substitute 0xffffffffffffffffffffffffffffff for 0xff above, and you'll be escaping a lot fewer flags in big integers, but then again every terminator you insert will be that many times longer.
Joe Koberg
You could presumably encode short flags for short integers and long flags for long integers, but then you need to encode some state bits saying what flags to use for the next integer, and there might be a run of 1-bit integers and then you're wasting a whole extra bit for the state bits, or a huge integer that is _ALL_ 0xFF, so add some more modes and some super-state bits... But if you notice. all these improvements come directly from knowledge ("information") about the *structure* of the input. And it's just becoming a lossless compression algorithm.
Joe Koberg
A: 

This question has a certain induction feel to it. You want a function: (bool list list) -> (bool list) such that an inverse function (bool list) -> (bool list list) generates the same original structure, and the length of the encoded bool list is minimal, without imposing restrictions on the input structure. Since this question is so abstract, I'm thinking these lists could be mind bogglingly large - 10^50 maybe, or 10^2000, or they can be very small, like 10^0. Also, there can be a large number of lists, again 10^50 or just 1. So the algorithm needs to adapt to these widely different inputs.

I'm thinking that we can encode the length of each list as a (bool list), and add one extra bool to indicate whether the next sequence is another (now larger) length or the real bitstream.

let encode2d(list1d::Bs) = encode1d(length(list1d), true) @ list1d @ encode2d(Bs)
    encode2d(nil)       = nil

let encode1d(1, nextIsValue) = true :: nextIsValue :: []
    encode1d(len, nextIsValue) = 
               let bitList = toBoolList(len) @ [nextIsValue] in
               encode1d(length(bitList), false) @ bitList

let decode2d(bits) = 
               let (list1d, rest) = decode1d(bits, 1) in
               list1d :: decode2d(rest)

let decode1d(bits, n) = 
               let length = fromBoolList(take(n, bits)) in
               let nextIsValue :: bits' = skip(n, bits) in
               if nextIsValue then bits' else decode1d(bits', length)
assumed library functions
-------------------------

toBoolList : int -> bool list
   this function takes an integer and produces the boolean list representation
   of the bits.  All leading zeroes are removed, except for input '0' 

fromBoolList : bool list -> int
   the inverse of toBoolList

take : int * a' list -> a' list
   returns the first count elements of the list

skip : int * a' list -> a' list
   returns the remainder of the list after removing the first count elements

The overhead is per individual bool list. For an empty list, the overhead is 2 extra list elements. For 10^2000 bools, the overhead would be 6645 + 14 + 5 + 4 + 3 + 2 = 6673 extra list elements.

TheDon
I honestly do not understand the breakdown of your numbers: could you illustrate the computation `6645 + ... = 6673` and perhaps give an example of encoding (for a small list of lists) so that we can better visualize? I'd be happy to be enlightened, but my brain stopped working I fear oO
Matthieu M.
I've tried to understand the breakdown and I can't get past... the problem I face is that the length `2` is encoded in `2` bits. From your decoding you are using `1` bit to begin with, but that `1` bit will either be `0` or `1`, making it incapable of going anywhere. I think that you need to begin with 2 bits, unless the beginning is treated separately.
Matthieu M.
The http://en.wikipedia.org/wiki/Elias_omega_coding mentioned in the recent question edit solves this better than my solution.
TheDon
No, it's worse. You focus too much on the high end of the spectrum while most of the lists will be at the low end. Even if he were representing a picture `2000x3000` pixels in black and white, the length of the lists would "only" be `3000`. It takes 15 bits to represent `1000` in Elias omega coding while it takes 10 bits in regular binary. Your solution (with a threshold of 4) would do it in 16 bits, as would mine. So `1000` is effectively the break-even point for Elias encoding... and below it, it's worse.
Matthieu M.
Yeah, having a probability function to show the likelihood of the length of the lists would allow for much better encoding to be written, I agree on that point. Absent such a function, I think it's reasonable to look at very large numbers.
TheDon
A: 

If I understand correctly our data structure is ( 1 2 ( 33483 7 ) 373404 9 ( 337652222 37333788 ) )

Format like so:

byte 255 - escape code
byte 254 - begin block
byte 253 - list separator
byte 252 - end block

So we have:

 struct bigdat {
    int nmem; /* Won't overflow -- out of memory first */
    int kind; /* 0 = number, 1 = recurse */
    void *data; /* points to array of bytes for kind 0, array of bigdat for kind 1 */
 };

 int serialize(FILE *f, struct bigdat *op) {
   int i;
   if (op->kind == 0) {
      unsigned char *num = (unsigned char *)op->data;
      for (i = 0; i < op->nmem; i++) {
         if (num[i] >= 252)
            fputc(255, f);        /* escape bytes that collide with markers */
         fputc(num[i], f);
      }
   } else {
      struct bigdat *blocks = (struct bigdat *)op->data;
      fputc(254, f);              /* begin block */
      for (i = 0; i < op->nmem; i++) {
          if (i) fputc(253, f);   /* list separator */
          serialize(f, &blocks[i]);
      }
      fputc(252, f);              /* end block */
   }
   return 0;
 }

There is a law about numeric digit distribution that says, for sets of sets of arbitrary unsigned integers, the higher the byte value, the less often it occurs - so put the special codes at the end.

Not encoding a length in front of each element takes up far less room, but makes deserializing a difficult exercise.

Joshua