Suppose you have a large file made up of a number of fixed-size blocks. Each of these blocks contains some number of variable-sized records. Each record must fit completely within a single block, so by definition no record is ever larger than a full block. Over time, records are added to and deleted from these blocks as records come and go from this "database".

At some point, especially after many records have been added to the database and several removed, many of the blocks may end up only partially filled.

What is a good algorithm to shuffle the records around in this database to compact out unnecessary blocks at the end of the file by better filling up the partially filled blocks?

Requirements of the algorithm:

  • The compaction must happen in place on the original file, without temporarily extending the file by more than a few blocks beyond its starting size
  • The algorithm should not unnecessarily disturb blocks that are already mainly full
  • Entire blocks must be read from or written to the file at one time, and one should assume the write operation is relatively expensive
  • If records are moved from one block to another, they must be added at their new location before being removed from their starting position, so that in case the operation is interrupted, no records are lost as a result of the "failed" compaction. (Assume that this temporary duplication of such records can be detected at recovery.)
  • The memory available for this operation is only on the order of several blocks, which is a very small percentage of the overall file size
  • Assume that records are on the order of 10 bytes to 1K bytes, with an average size of perhaps 100 bytes; the fixed-size blocks are on the order of 4K or 8K; and the file is on the order of thousands of blocks.
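To make the setup concrete, here is a minimal in-memory model of the layout described above. The class names and the 4K block size are illustrative assumptions, not from any particular system:

```python
# Minimal in-memory model of the file layout described above.
# Block size and class names are illustrative only.

BLOCK_SIZE = 4096  # one fixed-size block (the 4K case)

class Block:
    def __init__(self):
        self.records = []  # record sizes in bytes

    def used(self):
        return sum(self.records)

    def free(self):
        return BLOCK_SIZE - self.used()

    def add(self, size):
        # a record must fit completely within a single block
        if size > self.free():
            raise ValueError("record does not fit in this block")
        self.records.append(size)

b = Block()
b.add(100)
b.add(1024)
print(b.free())  # -> 2972 bytes still free
```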
+2  A: 

If there is no ordering to these records, I'd simply fill the blocks from the front with records extracted from the last block(s). This will minimize movement of data, is fairly simple, and should do a decent job of packing data tightly.

E.g.:

// records should be sorted by size in memory (probably in a balanced BST)
records = read last N blocks on disk;

foreach (block in blocks) // read from disk into memory
{
    if (block.hasBeenReadFrom())
    {
        // we read from this into records already
        // all remaining records are already in memory

        writeAllToNewBlocks(records);

        // this will leave some empty blocks on the disk that can either
        // be eliminated programmatically or left alone and filled during
        // normal operation

        foreach (record in records)
        {
            record.eraseFromOriginalLocation();
        }

        break;
    }

    moveRecords = new Array; // records moved into this block so far

    while (!block.full())
    {
        size = block.availableSpace();
        record = records.extractBestFit(size);
        if (record == null)
        {
            break;
        }

        moveRecords.add(record);
        block.add(record);

        if (records.gettingLow())
        {
            records.readMoreFromDisk();
        }
    }

    if(moveRecords.size() > 0)
    {
        block.writeBackToDisk();
        foreach (record in moveRecords)
        {
            record.eraseFromOriginalLocation();
        }
    }
}
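The `extractBestFit(size)` call above assumes a structure that can cheaply find the largest record not exceeding the available space. A balanced BST gives O(log n) per extraction; as a sketch, here is a Python stand-in using a sorted list and `bisect` (names are illustrative, not part of the pseudocode's API):

```python
import bisect

# Stand-in for the records structure with extractBestFit(size):
# record sizes are kept sorted, and we pop the largest record that
# still fits in the available space.

class RecordPool:
    def __init__(self, sizes):
        self.sizes = sorted(sizes)

    def extract_best_fit(self, space):
        i = bisect.bisect_right(self.sizes, space)  # first size > space
        if i == 0:
            return None  # nothing fits
        return self.sizes.pop(i - 1)  # largest size <= space

pool = RecordPool([10, 120, 300, 950])
print(pool.extract_best_fit(400))  # -> 300
print(pool.extract_best_fit(5))    # -> None
```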

Update: I neglected to maintain the no-blocks-only-in-memory rule. I've updated the pseudocode to fix this. Also fixed a glitch in my loop condition.

Derek Park
This approach is basically where we started, but it turns out that the irregularity of the record sizes often leaves behind sub-optimally compacted blocks. With some willingness to search around more, better fits can be found, but then it's turning into an NP-hard problem and we're now looking for more heuristics..
Tall Jeff
In any case, thanks for the suggestion!
Tall Jeff
You're welcome. I would expect that tuning the number of blocks you keep in memory at once would help. If you keep, say, ten blocks worth of records in memory, I'd expect that you could generally fill most partial blocks pretty well.
Derek Park
You could also hold only smaller records in memory, and compact the leftover large records afterward (so you don't end up filling your entire memory with large records over time).
Derek Park
+2  A: 

This sounds like a variation of the bin packing problem, but where you already have an inferior allocation that you want to improve. So I suggest looking at variations of the approaches which are successful for the bin packing problem.

First of all, you probably want to parameterize your problem by defining what you consider "full enough" (where a block is full enough that you don't want to touch it), and what is "too empty" (where a block has so much empty space that it has to have more records added to it). Then, you can classify all the blocks as full enough, too empty, or partially full (those that are neither full enough nor too empty). You then redefine the problem as how to eliminate all the too empty blocks by creating as many full enough blocks as possible while minimising the number of partially full blocks.

You'll also want to work out what's more important: getting the records into the fewest blocks possible, or packing them adequately but with the least amount of blocks read and written.

My approach would be to make an initial pass over all the blocks, classifying each into one of the three classes defined above. For each block, you also want to keep track of the free space available in it, and for the too empty blocks, you'll want a list of all the records and their sizes. Then, starting with the largest records in the too empty blocks, move them into the partially full blocks. If you want to minimise reads and writes, move them into any of the blocks you currently have in memory. If you want to minimise wasted space, find the block with the least amount of empty space that will still hold the record, reading the block in if necessary. If no block will hold the record, create a new block. If a block in memory reaches the "full enough" threshold, write it out. Repeat until all the records in the too empty blocks have been placed.
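The initial classification pass might be sketched like this; the two thresholds are illustrative assumptions that would be tuned in practice:

```python
# Sketch of the initial classification pass. The 90%/40% thresholds
# are illustrative assumptions, not prescribed values.

BLOCK_SIZE = 4096
FULL_ENOUGH = 0.90  # at or above this fill ratio, leave the block alone
TOO_EMPTY = 0.40    # below this, empty the block out entirely

def classify(used_bytes):
    fill = used_bytes / BLOCK_SIZE
    if fill >= FULL_ENOUGH:
        return "full_enough"
    if fill < TOO_EMPTY:
        return "too_empty"
    return "partial"

print(classify(4000))  # -> full_enough
print(classify(1000))  # -> too_empty
print(classify(2500))  # -> partial
```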

I've skipped over more than a few details, but this should give you some idea. Note that the bin packing problem is NP-hard, which means that for practical applications, you'll want to decide what's most important to you, so you can pick an approach that will give you an approximately best solution in reasonable time.

TimB
Thanks for pointing out the bin packing problem comparison. This is helpful. A tricky part of a solution is that the sheer volume of records to scan and hold stats for in the first pass is prohibitive. Also, because writes are expensive, in a sense you only get a chance or two to re-write a given block.
Tall Jeff
Ideally, I'm thinking some number of passes would be used to target certain blocks and records for consolidation, i.e. find the lowest-hanging fruit in each pass and stop when too much time has elapsed or no significant optimizations remain. Thanks again!
Tall Jeff
A: 

Here's an algorithm you might be able to leverage, although your records within fixed-size blocks might require a little more work.

Heap Defragmentation in Bounded Time

Richard
Thanks. However, as you alluded to, that paper does not address two dynamics of the problem: (1) the bin nature of the fixed-size blocks, and (2) the linear, byte-by-byte variability in the record sizes. I.e., the 2^N nature of the "records" in the heap is a key element of their solution.
Tall Jeff
+2  A: 

A modification of an on-line (to defragment in one pass), bounded-space (to meet the memory requirements) bin packing algorithm could probably work here.

See "Bin Packing Approximation Algorithms: Combinatorial Analysis" by Coffman et al.
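As a rough illustration of the bounded-space idea, where only K open blocks are kept in memory and the fullest one is written out whenever an incoming record fits in none of them, here is a toy first-fit sketch (K, the names, and the sizes are assumptions, not taken from the paper):

```python
# Toy bounded-space first-fit: at most K blocks are open in memory at
# once. If a record fits in none of them, the fullest open block is
# "written out" (closed) and a fresh block is opened.

BLOCK_SIZE = 4096
K = 4  # blocks held in memory at once

def bounded_first_fit(sizes):
    open_blocks, closed = [], []
    for s in sizes:
        for blk in open_blocks:
            if sum(blk) + s <= BLOCK_SIZE:
                blk.append(s)  # first open block it fits in
                break
        else:
            if len(open_blocks) == K:
                fullest = max(open_blocks, key=sum)
                open_blocks.remove(fullest)
                closed.append(fullest)  # write out the fullest block
            open_blocks.append([s])
    return closed + open_blocks

blocks = bounded_first_fit([3000, 2000, 1500, 900, 600, 100])
print(len(blocks))  # -> 3
```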

Rafał Dowgird
Thanks for the reference to the bin packing problem comparison, and for this paper's analysis of the various approaches.
Tall Jeff