views: 252

answers: 2
Hello all, I'm writing a compression library as a little side project, and I'm far enough along (my library can extract any standard gzip file, as well as produce compliant, but certainly not yet optimal, gzip output) that it's time to figure out a meaningful block termination strategy. Currently, I just cut the blocks off after every 32k of input (the LZ77 window size) because it was convenient and quick to implement -- now I am going back and trying to actually improve compression efficiency.

The Deflate spec has only this to say about it: "The compressor terminates a block when it determines that starting a new block with fresh trees would be useful, or when the block size fills up the compressor's block buffer", which isn't all that helpful.

I sorted through the SharpZipLib code (as I figured it would be the most easily readable open source implementation), and found that it terminates a block every 16k literals of output, ignoring the input. This is easy enough to implement, but it seems like there must be some more targeted approach, especially given the language in the spec: "determines that starting a new block with fresh trees would be useful".
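
(For reference, that fixed-count rule amounts to something like the following sketch -- my own rough Python with made-up names, not SharpZipLib's actual code.)

LITERALS_PER_BLOCK = 16 * 1024   # flush threshold, per the behaviour described above

def on_symbol_emitted(state):
    # count every symbol written into the current block
    state.literal_count += 1
    if state.literal_count >= LITERALS_PER_BLOCK:
        state.flush_block()        # emit the trees plus coded data and end the block
        state.literal_count = 0    # the next symbol starts a fresh block with fresh trees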

So does anyone have any ideas for new strategies, or examples of existing ones?

Thanks in advance!

+1  A: 

As a suggestion to get you going.

Try a speculative look-ahead, with a buffer large enough that an indication of superior compression makes the change worthwhile.

This changes the streaming behaviour (more data is required to be input before output occurs) and significantly complicates operations like flush. It also adds a considerable amount of extra work to the compression itself.

In the general case it would be possible to ensure that this produced the optimal output simply by branching at each point where it is possible to start a new block, taking both branches and recursing as necessary until all routes have been explored. The path with the best behaviour wins. This is not likely to be feasible on non-trivial input sizes, since the choice of when to start a new block is so open.

Simply restricting it to a minimum of 8K output literals, while preventing more than 32K literals in a block, would give a relatively tractable basis for trying speculative algorithms. Call 8K a sub block.

The simplest of these would be (pseudo code):

create empty sub block called definite
create empty sub block called specChange
create empty sub block called specKeep
target = definite
while (incomingData)
{
  compress data into target(s)
  if ((definite.length % SUB_BLOCK_SIZE) == 0)
  {
    if (target is definite)
    {
      target becomes {specChange, specKeep}
      // specChange assumes starting a new block with fresh trees,
      // specKeep assumes continuing the same block as definite
    }
    else
    {
      if (compression of specChange minus OVERHEAD is better than specKeep)
      {
        flush definite as a block
        definite = specChange
        specKeep, specChange = empty
        // target remains {specChange, specKeep} as before,
        // but update the metadata associated with specChange to be fresh
      }
      else
      {
        definite += specKeep
        specKeep, specChange = empty
        // again update the block metadata
        if (definite.length >= MAX_BLOCK_SIZE)
        {
          flush definite
          target becomes definite
        }
      }
    }
  }
}
take the best of specChange/specKeep if non-empty and append to definite
flush definite

OVERHEAD is some constant to account for the cost of switching over blocks

This is rough, and could likely be improved, but it is a start for analysis if nothing else. Instrument the code for information about what causes a switch, and use that to determine good heuristics indicating that a change might be beneficial (perhaps that the compression ratio has dropped significantly).
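
For instance, a ratio-drop trigger along those lines could look roughly like this (Python; the class name, the 10% threshold and the bits-per-symbol measure are all illustrative choices of mine, not measured values):

class RatioWatcher:
    """Flags a possible block switch when the running compression ratio degrades."""
    def __init__(self, degrade_factor=1.10):
        self.degrade_factor = degrade_factor  # how much worse than the best seen we tolerate
        self.best_cost = None                 # lowest bits-per-symbol seen so far in this block

    def should_switch(self, bits_out, symbols_in):
        # bits_out / symbols_in is the running cost of the current block;
        # suggest a switch when it drifts noticeably above the best seen
        if symbols_in == 0:
            return False
        cost = bits_out / symbols_in
        if self.best_cost is None or cost < self.best_cost:
            self.best_cost = cost
            return False
        return cost > self.best_cost * self.degrade_factor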

This could lead to the building of specChange being done only when the heuristic considered it reasonable. If the heuristic turns out to be a strong indicator, you could then do away with the speculative nature entirely and simply decide to swap at that point no matter what.
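
If you did go that non-speculative route, a greatly simplified version of the sub block idea might look like the sketch below: at each 8K boundary it compares a Shannon-entropy estimate of "keep the current trees" against "fresh trees plus overhead" and decides on the spot. The constants and the use of entropy as a stand-in for real Huffman cost are purely illustrative.

from collections import Counter
from math import log2

SUB_BLOCK_SIZE = 8 * 1024     # decision granularity (the 8K sub block above)
MAX_BLOCK_SIZE = 32 * 1024    # hard cap on symbols per block
OVERHEAD_BITS = 500           # illustrative cost of emitting a new set of trees

def estimated_bits(symbols):
    # Shannon-entropy estimate of coding `symbols` with a single tree
    if not symbols:
        return 0.0
    counts = Counter(symbols)
    total = len(symbols)
    return sum(-c * log2(c / total) for c in counts.values())

def split_into_blocks(symbols):
    # greedy splitter: every SUB_BLOCK_SIZE symbols, either keep the chunk in the
    # current block or start a new block, whichever estimate is cheaper
    blocks = []
    current = []
    for i in range(0, len(symbols), SUB_BLOCK_SIZE):
        chunk = list(symbols[i:i + SUB_BLOCK_SIZE])
        if not current:
            current = chunk
            continue
        keep_cost = estimated_bits(current + chunk)
        change_cost = estimated_bits(current) + estimated_bits(chunk) + OVERHEAD_BITS
        if change_cost < keep_cost or len(current) + len(chunk) > MAX_BLOCK_SIZE:
            blocks.append(current)   # fresh trees (or the size cap) win: close the block
            current = chunk
        else:
            current.extend(chunk)    # shared trees win: keep growing the block
    if current:
        blocks.append(current)
    return blocks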

ShuggyCoUk
A: 

Hmm, I like the idea of some heuristic analysis to try to come up with some "rules" for when ending the block might be beneficial. I will look into your suggested approach tonight and see what I can do with it.

In the meantime, it occurs to me that in order to make a fully informed choice on the issue, I need a better mental picture of the pros and cons of block size decisions. Really quickly, I get that smaller blocks allow you a potentially better-targeted symbol alphabet -- at the cost of increased overhead from defining trees more often. Larger blocks counter their more general symbol alphabet with efficiencies of scale (only one tree to store and decode for lots of encoded data).
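
A toy comparison (my own numbers, using entropy as a stand-in for actual Huffman-coded size) makes that trade-off concrete: if the two halves of the data draw from disjoint alphabets, two blocks with their own trees beat one shared tree as soon as the saving outweighs the cost of the extra tree definitions.

from collections import Counter
from math import log2

TREE_COST_BITS = 500   # illustrative cost of one extra set of tree definitions

def coded_bits(symbols):
    counts = Counter(symbols)
    n = len(symbols)
    return sum(-c * log2(c / n) for c in counts.values())

first_half  = list("ab" * 3000)   # 6000 symbols over {a, b}: ~1 bit per symbol
second_half = list("cd" * 3000)   # 6000 symbols over {c, d}: ~1 bit per symbol

one_block  = coded_bits(first_half + second_half)                               # ~24000 bits
two_blocks = coded_bits(first_half) + coded_bits(second_half) + TREE_COST_BITS  # ~12500 bits
print(one_block, two_blocks)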

Off the top of my head, it's not apparent whether the relative distribution of literal codes vs. length/distance codes would have a specific impact on optimal block size. Good food for thought, though.

David Hay