views: 508 | answers: 6

Hi,

I have the very common problem of creating an index for an on-disk array of strings. In short, I need to store the position of each string in the on-disk representation. For example, a very naive solution would be an index array as follows:

uint64_t idx[] = { 0, 20, 500, 1024, ..., 103434 };

This says that the first string is at position 0, the second at position 20, the third at position 500, and the nth at position 103434.

The positions are always non-negative 64-bit integers in increasing order. Although the gaps between positions could be of any size, in practice I expect the typical gap to fall in the range 2^8 to 2^20. I expect this index to be mmap'ed in memory, and the positions to be accessed randomly (assume a uniform distribution).

I was thinking about writing my own code to do some sort of block delta encoding, or another more sophisticated encoding, but there are so many different trade-offs between encoding/decoding speed and space that I would rather start from a working library, and maybe even settle for something without any customization.

Any hints? A C library would be ideal, but a C++ one would also let me run some initial benchmarks.

A few more details, if you are still following. This will be used to build a library similar to cdb (http://cr.yp.to/cdb/cdbmake.html) on top of the cmph library (http://cmph.sf.net). In short, it is for a large disk-based read-only associative map with a small index in memory.

Since it is a library, I don't have control over the input, but the typical use case I want to optimize for has hundreds of millions of values, an average value size in the few-kilobytes range, and a maximum value size of 2^31.

For the record, if I don't find a ready-to-use library, I intend to implement delta encoding in blocks of 64 integers, with the initial bytes of each block specifying its offset so far. The blocks themselves would be indexed with a tree, giving me O(log(n/64)) access time. There are far too many other options, and I would prefer not to discuss them; I am really looking for ready-to-use code rather than ideas on how to implement the encoding. I will be glad to share with everyone what I did once I have it working.
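For concreteness, here is a rough, untested sketch of the decode path for the block scheme I have in mind (all names are made up, and instead of a tree I use a flat array of per-block byte offsets, which makes finding the block O(1)):

#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 64

/* Decode one unsigned LEB128-style varint starting at p; store the value
 * in *out and return the pointer past the last byte consumed. */
static const uint8_t *varint_decode(const uint8_t *p, uint64_t *out)
{
    uint64_t v = 0;
    int shift = 0;
    for (;;) {
        uint8_t b = *p++;
        v |= (uint64_t)(b & 0x7f) << shift;
        if ((b & 0x80) == 0)
            break;
        shift += 7;
    }
    *out = v;
    return p;
}

/* Return the i-th position. stream holds the encoded blocks: each block
 * starts with its first position as a varint, followed by up to 63 varint
 * gaps. block_start[b] is the byte offset of block b inside stream. */
uint64_t index_get(const uint8_t *stream, const uint64_t *block_start, size_t i)
{
    const uint8_t *p = stream + block_start[i / BLOCK_SIZE];
    uint64_t pos, delta;
    p = varint_decode(p, &pos);          /* block base (absolute position) */
    for (size_t k = i % BLOCK_SIZE; k > 0; --k) {
        p = varint_decode(p, &delta);    /* successive gaps within the block */
        pos += delta;
    }
    return pos;
}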

I appreciate your help, and let me know if you have any questions.

A: 

What exactly are you trying to compress? If it is the total space of the index, is it really worth the effort to save that space?

If so, one thing you could try is to chop each value in half and store it in two tables. The first stores (upper uint, start index, length, pointer to second table) and the second stores (index, lower uint).

For fast searching, the indices could be implemented using something like a B+ tree.
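Because the positions are sorted, entries that share the same upper 32 bits form contiguous runs, so the first table needs only one row per run. Here is a minimal, hypothetical sketch of the lookup side (the names and layout are illustrative, not a finished design):

#include <stddef.h>
#include <stdint.h>

/* One row per run of positions sharing the same upper 32 bits. */
typedef struct {
    uint32_t upper;   /* shared upper 32 bits of the positions in the run */
    uint64_t start;   /* index of the run's first entry in lower[] */
    uint64_t length;  /* number of entries in the run */
} Run;

/* Reconstruct the i-th 64-bit position from the two tables by
 * binary-searching for the last run whose start index is <= i.
 * Assumes runs[0].start == 0 and i is a valid entry index. */
uint64_t position_lookup(const Run *runs, size_t nruns,
                         const uint32_t *lower, size_t i)
{
    size_t lo = 0, hi = nruns;
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (runs[mid].start <= i)
            lo = mid;
        else
            hi = mid;
    }
    return ((uint64_t)runs[lo].upper << 32) | lower[i];
}

The lower halves then cost 4 bytes per entry instead of 8, plus a few bytes per distinct upper half.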

eed3si9n
A: 

You have two conflicting requirements:

  1. You want to compress very small items (8 bytes each).
  2. You need efficient random access for each item.

The second requirement is very likely to impose a fixed length for each item.
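To illustrate the fixed-length compromise, here is a sketch (mine, not part of this answer): pack every position into the minimum fixed bit width needed for the largest offset. Access stays O(1), and for a file smaller than 2^40 bytes you store 40 bits per entry instead of 64:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read the i-th value from an array of fixed width-bit entries packed
 * little-endian. Requires width <= 57 (so the value plus the sub-byte
 * shift fits in one 8-byte load) and the buffer to be padded with at
 * least 7 extra bytes at the end. */
uint64_t packed_get(const uint8_t *base, unsigned width, size_t i)
{
    uint64_t bitpos = (uint64_t)i * width;
    uint64_t word;
    memcpy(&word, base + bitpos / 8, sizeof word);   /* unaligned-safe load */
    return (word >> (bitpos % 8)) & ((UINT64_C(1) << width) - 1);
}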

Mehrdad Afshari
Although I will do random access on this data, it doesn't necessarily need to be O(1). For example, compressing the numbers in blocks of 64 values each and keeping a tree to find which block to decompress may give me significant compression and fast enough access. As I said, the question is more about finding a library with readily available encoding algorithms, such as delta encoding and Elias gamma, and probably the block piece to play with. See the question http://stackoverflow.com/questions/523733/compress-sorted-integers and in particular simmon's comment for another explanation.
Davi
I understand. It *is* possible, but maintaining the tree in memory is itself relatively costly considering the **small size** of the items.
Mehrdad Afshari
A: 
Norman Ramsey
Hi Norman, the number of strings is on the order of hundreds of millions and the average length would be 10k. The maximum length is 2^31. The full set of strings (the values) does not fit in memory, and they cannot be re-ordered. On the practical side, I am using this to build a library, so I don't really have control over the input. These numbers represent use cases I have seen in the past (web pages).
Davi
Yow! OK, your edit makes the problem much clearer. If you find a solution off the shelf I'll be very impressed. I've added a couple of links to my answer.
Norman Ramsey
A: 

I did something similar years ago for a full-text search engine. In my case, each indexed word generated a record which consisted of a record number (document id) and a word number (it could just as easily have stored word offsets) which needed to be compressed as much as possible. I used a delta-compression technique which took advantage of the fact that there would be a number of occurrences of the same word within a document, so the record number often did not need to be repeated at all. And the word offset delta would often fit within one or two bytes. Here is the code I used.

Since it's in C++, the code is probably not going to be useful to you as is, but it can be a good starting point for writing compression routines.

Please excuse the Hungarian notation and the magic numbers strewn through the code. Like I said, I wrote this many years ago :-)

IndexCompressor.h

//
// index compressor class
//

#pragma once

#include "File.h"

const int IC_BUFFER_SIZE = 8192;

//
// index compressor
//
class IndexCompressor
{
private :
   File        *m_pFile;
   WA_DWORD    m_dwRecNo;
   WA_DWORD    m_dwWordNo;
   WA_DWORD    m_dwRecordCount;
   WA_DWORD    m_dwHitCount;

   WA_BYTE     m_byBuffer[IC_BUFFER_SIZE];
   WA_DWORD    m_dwBytes;

   bool        m_bDebugDump;

   void FlushBuffer(void);

public :
   IndexCompressor(void) { m_pFile = 0; m_bDebugDump = false; }
   ~IndexCompressor(void) {}

   void Attach(File& File) { m_pFile = &File; }

   void Begin(void);
   void Add(WA_DWORD dwRecNo, WA_DWORD dwWordNo);
   void End(void);

   WA_DWORD GetRecordCount(void) { return m_dwRecordCount; }
   WA_DWORD GetHitCount(void) { return m_dwHitCount; }

   void DebugDump(void) { m_bDebugDump = true; }
};

IndexCompressor.cpp

//
// index compressor class
//

#include "stdafx.h"
#include "IndexCompressor.h"

void IndexCompressor::FlushBuffer(void)
{
   ASSERT(m_pFile != 0);

   if (m_dwBytes > 0)
   {
      m_pFile->Write(m_byBuffer, m_dwBytes);
      m_dwBytes = 0;
   }
}

void IndexCompressor::Begin(void)
{
   ASSERT(m_pFile != 0);
   m_dwRecNo = m_dwWordNo = m_dwRecordCount = m_dwHitCount = 0;
   m_dwBytes = 0;
}

void IndexCompressor::Add(WA_DWORD dwRecNo, WA_DWORD dwWordNo)
{
   ASSERT(m_pFile != 0);
   WA_BYTE buffer[16];
   int nbytes = 1;

   ASSERT(dwRecNo >= m_dwRecNo);

   if (dwRecNo != m_dwRecNo)
      m_dwWordNo = 0;
   if (m_dwRecordCount == 0 || dwRecNo != m_dwRecNo)
      ++m_dwRecordCount;
   ++m_dwHitCount;

   WA_DWORD dwRecNoDelta = dwRecNo - m_dwRecNo;
   WA_DWORD dwWordNoDelta = dwWordNo - m_dwWordNo;

   if (m_bDebugDump)
   {
      TRACE("%8X[%8X] %8X[%8X] : ", dwRecNo, dwRecNoDelta, dwWordNo, dwWordNoDelta);
   }

   // 1WWWWWWW
   if (dwRecNoDelta == 0 && dwWordNoDelta < 128)
   {
      buffer[0] = 0x80 | WA_BYTE(dwWordNoDelta);
   }
   // 01WWWWWW WWWWWWWW
   else if (dwRecNoDelta == 0 && dwWordNoDelta < 16384)
   {
      buffer[0] = 0x40 | WA_BYTE(dwWordNoDelta >> 8);
      buffer[1] = WA_BYTE(dwWordNoDelta & 0x00ff);
      nbytes += sizeof(WA_BYTE);
   }
   // 001RRRRR WWWWWWWW WWWWWWWW
   else if (dwRecNoDelta < 32 && dwWordNoDelta < 65536)
   {
      buffer[0] = 0x20 | WA_BYTE(dwRecNoDelta);
      WA_WORD *p = (WA_WORD *) (buffer+1);
      *p = WA_WORD(dwWordNoDelta);
      nbytes += sizeof(WA_WORD);
   }
   else
   {
      // 0001rrww
      buffer[0] = 0x10;

      // encode recno
      if (dwRecNoDelta < 256)
      {
         buffer[nbytes] = WA_BYTE(dwRecNoDelta);
         nbytes += sizeof(WA_BYTE);
      }
      else if (dwRecNoDelta < 65536)
      {
         buffer[0] |= 0x04;
         WA_WORD *p = (WA_WORD *) (buffer+nbytes);
         *p = WA_WORD(dwRecNoDelta);
         nbytes += sizeof(WA_WORD);
      }
      else
      {
         buffer[0] |= 0x08;
         WA_DWORD *p = (WA_DWORD *) (buffer+nbytes);
         *p = dwRecNoDelta;
         nbytes += sizeof(WA_DWORD);
      }

      // encode wordno
      if (dwWordNoDelta < 256)
      {
         buffer[nbytes] = WA_BYTE(dwWordNoDelta);
         nbytes += sizeof(WA_BYTE);
      }
      else if (dwWordNoDelta < 65536)
      {
         buffer[0] |= 0x01;
         WA_WORD *p = (WA_WORD *) (buffer+nbytes);
         *p = WA_WORD(dwWordNoDelta);
         nbytes += sizeof(WA_WORD);
      }
      else
      {
         buffer[0] |= 0x02;
         WA_DWORD *p = (WA_DWORD *) (buffer+nbytes);
         *p = dwWordNoDelta;
         nbytes += sizeof(WA_DWORD);
      }
   }

   // update current setting
   m_dwRecNo = dwRecNo;
   m_dwWordNo = dwWordNo;

   // add compressed data to buffer
   ASSERT(buffer[0] != 0);
   ASSERT(nbytes > 0 && nbytes < 10);
   if (m_dwBytes + nbytes > IC_BUFFER_SIZE)
      FlushBuffer();
   CopyMemory(m_byBuffer + m_dwBytes, buffer, nbytes);
   m_dwBytes += nbytes;

   if (m_bDebugDump)
   {
      for (int i = 0; i < nbytes; ++i)
         TRACE("%02X ", buffer[i]);
      TRACE("\n");
   }
}

void IndexCompressor::End(void)
{
   FlushBuffer();
   m_pFile->Write(WA_BYTE(0));
}
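
For what it's worth, a matching decoder for this format might look roughly like the following (a sketch added for illustration, untested and not part of the original code; it assumes the same WA_* typedefs and the encoder's little-endian unaligned stores):

// Typedefs assumed to match the encoder's:
typedef unsigned char  WA_BYTE;
typedef unsigned short WA_WORD;
typedef unsigned int   WA_DWORD;

// Decode one (record, word) pair starting at p, updating the running
// values in *pdwRecNo and *pdwWordNo. Returns the pointer past the
// consumed bytes, or 0 on the end-of-stream marker.
const WA_BYTE *IndexDecode(const WA_BYTE *p, WA_DWORD *pdwRecNo, WA_DWORD *pdwWordNo)
{
   WA_BYTE tag = *p++;
   WA_DWORD dwRecNoDelta = 0, dwWordNoDelta = 0;

   if (tag == 0)
      return 0;                              // terminator written by End()

   if (tag & 0x80)                           // 1WWWWWWW
   {
      dwWordNoDelta = tag & 0x7f;
   }
   else if (tag & 0x40)                      // 01WWWWWW WWWWWWWW
   {
      dwWordNoDelta = (WA_DWORD)(tag & 0x3f) << 8 | *p++;
   }
   else if (tag & 0x20)                      // 001RRRRR WWWWWWWW WWWWWWWW
   {
      dwRecNoDelta = tag & 0x1f;
      dwWordNoDelta = *(const WA_WORD *)p;
      p += sizeof(WA_WORD);
   }
   else                                      // 0001rrww escape
   {
      if (tag & 0x08)      { dwRecNoDelta = *(const WA_DWORD *)p; p += sizeof(WA_DWORD); }
      else if (tag & 0x04) { dwRecNoDelta = *(const WA_WORD *)p;  p += sizeof(WA_WORD); }
      else                 { dwRecNoDelta = *p++; }

      if (tag & 0x02)      { dwWordNoDelta = *(const WA_DWORD *)p; p += sizeof(WA_DWORD); }
      else if (tag & 0x01) { dwWordNoDelta = *(const WA_WORD *)p;  p += sizeof(WA_WORD); }
      else                 { dwWordNoDelta = *p++; }
   }

   if (dwRecNoDelta != 0)
      *pdwWordNo = 0;                        // word numbers restart per record
   *pdwRecNo += dwRecNoDelta;
   *pdwWordNo += dwWordNoDelta;
   return p;
}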
Ferruccio
+3  A: 

I use FastBit (Kesheng Wu, LBL.gov). It seems you need something good, fast, and available now, and FastBit is a highly competent improvement on Oracle's BBC (byte-aligned bitmap code, BerkeleyDB). It's easy to set up and generally very good.

However, given more time, you may want to look at a Gray code solution; it seems optimal for your purposes.

Daniel Lemire has a number of libraries for C/C++/Java released on code.google.com. I've read some of his papers and they are quite nice: several advancements on FastBit, and alternative approaches for column re-ordering with permuted Gray codes.

I almost forgot: I also came across Tokyo Cabinet. Though I do not think it is well suited to my current project, I might have considered it more if I had known about it before ;). It has a large degree of interoperability:

Tokyo Cabinet is written in the C language, and provided as API of C, Perl, Ruby, Java, and Lua. Tokyo Cabinet is available on platforms which have API conforming to C99 and POSIX.

Since you referred to CDB, note that the TC benchmark has a mode (TC supports several operational constraints for varying performance) in which it surpassed CDB by 10 times for read performance and 2 times for write.

With respect to your delta encoding requirement, I am quite confident in bsdiff and its ability to outperform any .exe content-patching system; it may also have some fundamental interfaces for your general needs.

Google's new binary compression tool, Courgette, may be worth checking out; in case you missed the press release, it produced diffs 10x smaller than bsdiff's in the one test case I have seen published.

RandomNickName42
Hi RandomNickName42, thanks for the pointer. It looks like a very promising candidate. For the record, I also found a similar library here: http://code.google.com/p/lemurbitmapindex/. I will give both a try.
Davi
It looks like it is patented: http://www.freepatentsonline.com/6831575.html. That could matter.
chmike
https://codeforge.lbl.gov/projects/fastbit/ is a dev site for FastBit, under the LGPL. I guess not being BSD or MS-PL may be some issue, but the L in LGPL is some comfort. ;)
RandomNickName42
Hi RandomNickName42, thanks for the edit. I am developing exactly that: a read-only competitor to Tokyo Cabinet.
Davi
A: 

Are you running on Windows? If so, I recommend creating the mmap'ed file using the naive solution you originally proposed, and then compressing the file using NTFS compression. Your application code never knows the file is compressed, and the OS does the compression for you. You might not think this would perform well or compress well, but I think you'll be surprised if you try it.
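A file can be marked for transparent compression programmatically with the FSCTL_SET_COMPRESSION ioctl. A minimal sketch (error handling kept short; "index.dat" is a placeholder name):

#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

int main(void)
{
    /* "index.dat" stands in for the index file. */
    HANDLE h = CreateFileA("index.dat", GENERIC_READ | GENERIC_WRITE,
                           0, NULL, OPEN_EXISTING, 0, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "open failed: %lu\n", GetLastError());
        return 1;
    }

    /* Ask NTFS to compress the file transparently from now on. */
    USHORT format = COMPRESSION_FORMAT_DEFAULT;
    DWORD returned = 0;
    if (!DeviceIoControl(h, FSCTL_SET_COMPRESSION, &format, sizeof format,
                         NULL, 0, &returned, NULL)) {
        fprintf(stderr, "FSCTL_SET_COMPRESSION failed: %lu\n", GetLastError());
        CloseHandle(h);
        return 1;
    }

    CloseHandle(h);
    return 0;
}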

brianegge