ansaurus

Question

What are some good strategies for determining block size in a deflate algorithm?

Answer 1

+1 A:

As a suggestion to get you going: a speculative look-ahead, with a buffer large enough that any indication of superior compression is worth the change.

This changes the streaming behaviour (more data must be input before any output occurs) and significantly complicates operations like flush. It also adds considerable extra load to the compression itself.

In the general case it would be possible to ensure that this produced the optimal output simply by branching at each point where it is possible to start a new block, taking both branches and recursing as necessary until all routes are taken. The path with the best behaviour wins. This is not likely to be feasible on non-trivial input sizes, since the choice of when to start a new block is so open.
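To make the branching idea concrete, here is a minimal Python sketch of the exhaustive search. Everything in it is an assumption for illustration: candidate boundaries are only allowed at fixed intervals (`sub_block`), block cost is estimated as the Shannon entropy of the raw bytes plus a made-up constant `overhead_bits` for the tree header, and LZ77 matches are ignored entirely. A real deflate encoder would cost blocks differently.

```python
import math
from collections import Counter
from itertools import product

def block_cost_bits(chunk, overhead_bits=600):
    """Estimated cost of one block: entropy of its bytes plus a fixed header cost.

    overhead_bits is a hypothetical constant, not a figure from any deflate spec.
    """
    total = len(chunk)
    hist = Counter(chunk)
    return overhead_bits - sum(n * math.log2(n / total) for n in hist.values())

def best_split_exhaustive(data, sub_block=8192):
    """Try every combination of candidate boundaries; return (cost, boundaries)."""
    cuts = list(range(sub_block, len(data), sub_block))
    best = None
    # One branch per subset of cuts: 2 ** len(cuts) routes in total.
    for choice in product((False, True), repeat=len(cuts)):
        bounds = [0] + [c for c, on in zip(cuts, choice) if on] + [len(data)]
        cost = sum(block_cost_bits(data[a:b]) for a, b in zip(bounds, bounds[1:]))
        if best is None or cost < best[0]:
            best = (cost, bounds)
    return best
```

With n candidate boundaries the loop visits 2^n routes, so even at one candidate per 8K a 1 MB input already has 127 candidates, which is why this only works as a reference for tiny inputs.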

Simply restricting each block to a minimum of 8K output literals and a maximum of 32K literals would give a relatively tractable basis for trying speculative algorithms. Call 8K a sub block.

The simplest of which would be (pseudo code):

create empty sub block called definite
create empty sub block called specChange
create empty sub block called specKeep
targets = definite
while (incomingData)
{
  compress data into targets
  // every current target has just filled one sub block
  if (targets.length == SUB_BLOCK_SIZE)
  {
    if (targets is definite)
    {
      targets becomes
        specChange assuming new block
        specKeep assuming same block as definite
    }
    else
    {
      if (compression of specChange, less OVERHEAD, is better than specKeep)
      {
        flush definite as a block
        definite = specChange
        specKeep, specChange = empty
        // targets remain specKeep, specChange as before,
        // but update the meta data associated with specChange to be fresh
      }
      else
      {
        definite += specKeep
        specKeep, specChange = empty
        // again update the block meta data
        if (definite.length == MAX_BLOCK_SIZE)
        {
          flush definite as a block
          definite = empty
          targets becomes definite
        }
      }
    }
  }
}
take the better of specChange/specKeep if non-empty and append to definite
  (flushing definite first if specChange wins)
flush definite

OVERHEAD is some constant to account for the cost of switching over blocks. SUB_BLOCK_SIZE and MAX_BLOCK_SIZE are the 8K and 32K limits above.
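As a rough runnable counterpart to the pseudo code, the sketch below makes the same keep-or-change decision once per sub block, but works on byte histograms instead of real compressed output. The assumptions are loud ones: cost is the Shannon entropy of the literal bytes only (no LZ77 matches), `overhead_bits` is a made-up stand-in for OVERHEAD, and the function only proposes block boundaries rather than emitting deflate data. The keep cost is computed over the merged block, since keeping the block also changes the tree used for the data already in it.

```python
import math
from collections import Counter

def entropy_bits(hist):
    """Shannon lower bound, in bits, for coding the symbols counted in hist."""
    total = sum(hist.values())
    return -sum(n * math.log2(n / total) for n in hist.values()) if total else 0.0

def choose_blocks(data, sub_block=8192, max_block=32768, overhead_bits=600):
    """Greedy speculative split; returns (start, end) byte ranges, one per block."""
    blocks = []
    start = 0               # start of the block being built ("definite")
    definite = Counter()    # histogram of that block so far
    for pos in range(0, len(data), sub_block):
        chunk = data[pos:pos + sub_block]
        spec = Counter(chunk)
        if pos == start:    # first sub block of a block: nothing to compare yet
            definite = spec
            continue
        keep_cost = entropy_bits(definite + spec)
        change_cost = entropy_bits(definite) + entropy_bits(spec) + overhead_bits
        too_big = pos + len(chunk) - start > max_block
        if too_big or change_cost < keep_cost:
            blocks.append((start, pos))      # flush definite as a block
            start, definite = pos, spec      # the speculative sub block takes over
        else:
            definite += spec                 # same block, extended by one sub block
    if len(data) > start:
        blocks.append((start, len(data)))
    return blocks
```

For example, an input whose first 16K uses one small alphabet and whose second 16K uses a different one should come back as two blocks split at the 16K mark, while input with uniform statistics is only split when `max_block` forces it.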

This is rough and could likely be improved, but it is a start for analysis if nothing else. Instrument the code to record what causes a switch, and use that to find good heuristics indicating that a change might be beneficial (perhaps that the compression ratio has dropped significantly).

This could lead to specChange being built only when the heuristic considers it reasonable. If the heuristic turns out to be a strong indicator, you could then do away with the speculative nature and simply decide to swap at that point no matter what.

ShuggyCoUk 2009-01-27 19:15:05

Answer 2

A:

Hmm, I like the idea of some heuristic analysis to try to come up with some "rules" for when ending the block might be beneficial. I will look into your suggested approach tonight, and see what I could do with it.

In the meantime, it occurs to me that in order to make a fully informed choice on the issue, I need a better mental picture of the pros and cons of block size decisions. Really quickly, I get that smaller blocks allow you to have a potentially better-targeted symbol alphabet, at the cost of increased overhead from defining trees more often. Larger blocks counter their more general symbol alphabet with efficiencies of scale (only one tree to store and decode for lots of encoded data).
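One rough way to get a feel for the overhead side of that trade-off is to force block boundaries at a fixed interval with Python's `zlib` module and compare output sizes. This is only a proxy, under stated assumptions: `Z_SYNC_FLUSH` ends the current block but also appends an empty stored block, so it overstates the cost of a plain block switch, and zlib may still start new blocks on its own between the forced ones. The function name is made up for this sketch.

```python
import zlib

def deflate_with_forced_blocks(data, block_size, level=9):
    """Raw-deflate data, forcing a block boundary after every block_size input bytes."""
    comp = zlib.compressobj(level, zlib.DEFLATED, -15)  # negative wbits: raw deflate
    out = []
    for pos in range(0, len(data), block_size):
        out.append(comp.compress(data[pos:pos + block_size]))
        # End the current block here (plus an empty stored block as sync marker).
        out.append(comp.flush(zlib.Z_SYNC_FLUSH))
    out.append(comp.flush(zlib.Z_FINISH))
    return b"".join(out)
```

Running it over the same input with several values of `block_size` and printing `len(...)` for each gives a quick picture of how fast the per-block overhead adds up; the output can be checked with `zlib.decompress(blob, -15)`.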

Off the top of my head, it's not apparent whether the relative distribution of literal codes vs. length/distance codes would have a specific impact on optimal block size. Good food for thought though.

David Hay 2009-01-27 19:58:33
