Suppose you have a large file made up of a number of fixed-size blocks. Each of these blocks contains some number of variable-sized records. Each record must fit completely within a single block, so by definition no record is ever larger than a full block. Over time, records are added to and deleted from these blocks as records come and go from this "database".

At some point, especially after many records have been added to the database and several removed, many of the blocks may end up only partially filled.

What is a good algorithm to shuffle the records around in this database to compact out unnecessary blocks at the end of the file by better filling up the partially filled blocks?

Requirements of the algorithm:

  • The compaction must happen in place on the original file, without temporarily extending the file by more than a few blocks beyond its starting size
  • The algorithm should not unnecessarily disturb blocks that are already mainly full
  • Entire blocks must be read from or written to the file at one time, and one should assume the write operation is relatively expensive
  • If records are moved from one block to another, they must be added at their new location before being removed from their starting position, so that in case the operation is interrupted, no records are lost as a result of the "failed" compaction. (Assume that this temporary duplication of such records can be detected at recovery.)
  • The memory available for this operation is only on the order of several blocks, which is a very small percentage of the overall file size
  • Assume that records are on the order of 10 bytes to 1K bytes, with an average size of perhaps 100 bytes; the fixed-size blocks are on the order of 4K or 8K; and the file is on the order of thousands of blocks.
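To make the setup concrete, here is a minimal in-memory model of the layout described above. The class names and the 4K block size are illustrative assumptions, not from any particular system:

```python
# Minimal in-memory model of the file layout described above.
# Block size and class names are illustrative only.

BLOCK_SIZE = 4096  # one fixed-size block (the 4K case)

class Block:
    def __init__(self):
        self.records = []  # record sizes in bytes

    def used(self):
        return sum(self.records)

    def free(self):
        return BLOCK_SIZE - self.used()

    def add(self, size):
        # a record must fit completely within a single block
        if size > self.free():
            raise ValueError("record does not fit in this block")
        self.records.append(size)

b = Block()
b.add(100)
b.add(1024)
print(b.free())  # -> 2972 bytes still free
```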
+2  A: 

If there is no ordering to these records, I'd simply fill the blocks from the front with records extracted from the last block(s). This will minimize movement of data, is fairly simple, and should do a decent job of packing data tightly.

E.g.:

// records should be sorted by size in memory (probably in a balanced BST)
records = read last N blocks on disk;

foreach (block in blocks) // read from disk into memory
{
    if (block.hasBeenReadFrom())
    {
        // we read from this into records already
        // all remaining records are already in memory

        writeAllToNewBlocks(records);

        // this will leave some empty blocks on the disk that can either
        // be eliminated programmatically or left alone and filled during
        // normal operation

        foreach (record in records)
        {
            record.eraseFromOriginalLocation();
        }

        break;
    }

    moveRecords = new Array; // records moved into this block so far

    while (!block.full())
    {
        size = block.availableSpace();
        record = records.extractBestFit(size);
        if (record == null)
        {
            break;
        }

        moveRecords.add(record);
        block.add(record);

        if (records.gettingLow())
        {
            records.readMoreFromDisk();
        }
    }

    if(moveRecords.size() > 0)
    {
        block.writeBackToDisk();
        foreach (record in moveRecords)
        {
            record.eraseFromOriginalLocation();
        }
    }
}
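The `extractBestFit(size)` call above assumes a structure that can cheaply find the largest record not exceeding the available space. A balanced BST gives O(log n) per extraction; as a sketch, here is a Python stand-in using a sorted list and `bisect` (names are illustrative, not part of the pseudocode's API):

```python
import bisect

# Stand-in for the records structure with extractBestFit(size):
# record sizes are kept sorted, and we pop the largest record that
# still fits in the available space.

class RecordPool:
    def __init__(self, sizes):
        self.sizes = sorted(sizes)

    def extract_best_fit(self, space):
        i = bisect.bisect_right(self.sizes, space)  # first size > space
        if i == 0:
            return None  # nothing fits
        return self.sizes.pop(i - 1)  # largest size <= space

pool = RecordPool([10, 120, 300, 950])
print(pool.extract_best_fit(400))  # -> 300
print(pool.extract_best_fit(5))    # -> None
```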

Update: I neglected to maintain the no-blocks-only-in-memory rule. I've updated the pseudocode to fix this. Also fixed a glitch in my loop condition.

Derek Park
This approach is basically where we started, but it turns out that the irregularity of the record sizes often leaves behind sub-optimally compacted blocks. With some willingness to search around more, better fits can be found, but then it's turning into an NP-hard problem and we're now looking for more heuristics..
Tall Jeff
In any case, thanks for the suggestion!
Tall Jeff
You're welcome. I would expect that tuning the number of blocks you keep in memory at once would help. If you keep, say, ten blocks worth of records in memory, I'd expect that you could generally fill most partial blocks pretty well.
Derek Park
You could also hold only smaller records in memory, and compact the leftover large records afterward (so you don't end up filling your entire memory with large records over time).
Derek Park
+2  A: 

This sounds like a variation of the bin packing problem, but where you already have an inferior allocation that you want to improve. So I suggest looking at variations of the approaches which are successful for the bin packing problem.

First of all, you probably want to parameterize your problem by defining what you consider "full enough" (where a block is full enough that you don't want to touch it), and what is "too empty" (where a block has so much empty space that it has to have more records added to it). Then, you can classify all the blocks as full enough, too empty, or partially full (those that are neither full enough nor too empty). You then redefine the problem as how to eliminate all the too empty blocks by creating as many full enough blocks as possible while minimising the number of partially full blocks.

You'll also want to work out what's more important: getting the records into the fewest blocks possible, or packing them adequately but with the least amount of blocks read and written.

My approach would be to make an initial pass over all the blocks, classifying each into one of the three classes defined above. For each block, you also want to keep track of the free space available in it, and for the too empty blocks, you'll want a list of all the records and their sizes. Then, starting with the largest records in the too empty blocks, move them into the partially full blocks. If you want to minimise reads and writes, move them into any of the blocks you currently have in memory. If you want to minimise wasted space, find the block with the least amount of empty space that will still hold the record, reading the block in if necessary. If no block will hold the record, create a new block. If a block in memory reaches the "full enough" threshold, write it out. Repeat until all the records in the too empty blocks have been placed.
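The initial classification pass might be sketched like this; the two thresholds are illustrative assumptions that would be tuned in practice:

```python
# Sketch of the initial classification pass. The 90%/40% thresholds
# are illustrative assumptions, not prescribed values.

BLOCK_SIZE = 4096
FULL_ENOUGH = 0.90  # at or above this fill ratio, leave the block alone
TOO_EMPTY = 0.40    # below this, empty the block out entirely

def classify(used_bytes):
    fill = used_bytes / BLOCK_SIZE
    if fill >= FULL_ENOUGH:
        return "full_enough"
    if fill < TOO_EMPTY:
        return "too_empty"
    return "partial"

print(classify(4000))  # -> full_enough
print(classify(1000))  # -> too_empty
print(classify(2500))  # -> partial
```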

I've skipped over more than a few details, but this should give you some idea. Note that the bin packing problem is NP-hard, which means that for practical applications, you'll want to decide what's most important to you, so you can pick an approach that will give you an approximately best solution in reasonable time.

TimB
Thanks for pointing out the bin packing problem comparison. This is helpful. A tricky part of a solution is that the sheer volume of records to scan and hold stats for in the first pass is prohibitive. Also, because writes are expensive, in a sense you only get a chance or two to re-write a given block.
Tall Jeff
Ideally, I'm thinking some number of passes would be used to target certain blocks and records for consolidation, i.e. find the lowest-hanging fruit in each pass and stop when too much time has elapsed or no significant optimizations remain. Thanks again!
Tall Jeff
A: 

Here's an algorithm you might be able to leverage, although your records within fixed-size blocks might require a little more work.

Heap Defragmentation in Bounded Time

Richard
Thanks. However, as you alluded to, that paper does not address two dynamics of the problem: (1) the bin nature of the fixed-size blocks, and (2) the linear, byte-by-byte variability in the record sizes. I.e., the 2^N nature of the "records" in the heap is a key element of their solution.
Tall Jeff
+2  A: 

A modification of an on-line (to defragment in one pass), bounded-space (to meet the memory requirements) bin packing algorithm could probably work here.

See "Bin Packing Approximation Algorithms: Combinatorial Analysis" by Coffman et al.
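As a rough illustration of the bounded-space idea, where only K open blocks are kept in memory and the fullest one is written out whenever an incoming record fits in none of them, here is a toy first-fit sketch (K, the names, and the sizes are assumptions, not taken from the paper):

```python
# Toy bounded-space first-fit: at most K blocks are open in memory at
# once. If a record fits in none of them, the fullest open block is
# "written out" (closed) and a fresh block is opened.

BLOCK_SIZE = 4096
K = 4  # blocks held in memory at once

def bounded_first_fit(sizes):
    open_blocks, closed = [], []
    for s in sizes:
        for blk in open_blocks:
            if sum(blk) + s <= BLOCK_SIZE:
                blk.append(s)  # first open block it fits in
                break
        else:
            if len(open_blocks) == K:
                fullest = max(open_blocks, key=sum)
                open_blocks.remove(fullest)
                closed.append(fullest)  # write out the fullest block
            open_blocks.append([s])
    return closed + open_blocks

blocks = bounded_first_fit([3000, 2000, 1500, 900, 600, 100])
print(len(blocks))  # -> 3
```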

Rafał Dowgird
Thanks for the reference to the bin packing problem comparison, and for this paper's analysis of the various approaches.
Tall Jeff