views: 167
answers: 5

I'm looking for a general-purpose compression library that supports random access during decompression. I want to compress Wikipedia into a single compressed archive, while still being able to decompress/extract individual articles from it.

Of course, I can compress each article individually, but that won't give much of a compression ratio. I've heard that an LZO-compressed file consists of many chunks which can be decompressed separately, but I haven't found the API or documentation for that. I can also use the Z_FULL_FLUSH mode in zlib, but is there a better alternative?
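
For concreteness, this is roughly what I have in mind with Z_FULL_FLUSH, sketched with Python's zlib bindings (the record/offset layout here is made up, not from any particular library): each full flush resets the dictionary and leaves a byte-aligned point where decompression can restart.

import zlib

def compress_with_flush_points(records):
    """One raw-deflate stream with a Z_FULL_FLUSH after every record.
    The flush resets the dictionary, so each record can later be decompressed
    on its own -- which also means records should be sizeable (e.g. a batch
    of articles), or the ratio drops back to per-record compression."""
    comp = zlib.compressobj(9, zlib.DEFLATED, -15)   # -15 = raw deflate
    blob, offsets = bytearray(), []
    for rec in records:                              # rec is bytes
        offsets.append(len(blob))                    # compressed offset of rec
        blob += comp.compress(rec)
        blob += comp.flush(zlib.Z_FULL_FLUSH)        # byte-aligned restart point
    blob += comp.flush()                             # finish the stream
    return bytes(blob), offsets

def decompress_record(blob, offsets, i):
    """Random access: start a fresh raw-deflate decompressor at record i."""
    start = offsets[i]
    end = offsets[i + 1] if i + 1 < len(offsets) else len(blob)
    return zlib.decompressobj(-15).decompress(blob[start:end])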

A: 

You haven't specified your OS. Would it be possible to store your file in a compressed directory managed by the OS? Then you would have the "seekable" portion as well as the compression. The CPU overhead would be handled for you, though access times would be unpredictable.

No Refunds No Returns
I'd prefer a library that is portable across operating systems. A compressed file system is certainly a solution, but does it perform well (in terms of speed and memory) under random access?
Wu Yongzheng
You're trading space for speed. Compression costs.
No Refunds No Returns
+1  A: 

DotNetZip is a zip archive library for .NET.

Using DotNetZip, you can reference particular entries in the zip randomly, decompress them out of order, and get a stream that decompresses an entry as you read it.

With the benefit of those features, DotNetZip has been used to implement a Virtual Path Provider for ASP.NET that does exactly what you describe: it serves all the content for a particular website out of a compressed ZIP file. It also works for sites with dynamic (ASP.NET) pages.

ASP.NET ZIP Virtual Path Provider, based on DotNetZip

The important code looks like this:

namespace Ionic.Zip.Web.VirtualPathProvider
{
    public class ZipFileVirtualPathProvider : System.Web.Hosting.VirtualPathProvider
    {
        ZipFile _zipFile;

        public ZipFileVirtualPathProvider (string zipFilename) : base () {
            _zipFile =  ZipFile.Read(zipFilename);
        }

        ~ZipFileVirtualPathProvider () { _zipFile.Dispose (); }

        public override bool FileExists (string virtualPath)
        {
            string zipPath = Util.ConvertVirtualPathToZipPath (virtualPath, true);
            ZipEntry zipEntry = _zipFile[zipPath];

            if (zipEntry == null)
                return false;

            return !zipEntry.IsDirectory;
        }

        public override bool DirectoryExists (string virtualDir)
        {
            string zipPath = Util.ConvertVirtualPathToZipPath (virtualDir, false);
            ZipEntry zipEntry = _zipFile[zipPath];

            if (zipEntry == null)
                return false;

            return zipEntry.IsDirectory;
        }

        public override VirtualFile GetFile (string virtualPath)
        {
            return new ZipVirtualFile (virtualPath, _zipFile);
        }

        public override VirtualDirectory GetDirectory (string virtualDir)
        {
            return new ZipVirtualDirectory (virtualDir, _zipFile);
        }

        public override string GetFileHash(string virtualPath, System.Collections.IEnumerable virtualPathDependencies)
        {
            return null;
        }

        public override System.Web.Caching.CacheDependency GetCacheDependency(String virtualPath, System.Collections.IEnumerable virtualPathDependencies, DateTime utcStart)
        {
            return null;
        }
    }
}

And the ZipVirtualFile class is defined like this:

namespace Ionic.Zip.Web.VirtualPathProvider
{
    class ZipVirtualFile : VirtualFile
    {
        ZipFile _zipFile;

        public ZipVirtualFile (String virtualPath, ZipFile zipFile) : base(virtualPath) {
            _zipFile = zipFile;
        }

        public override System.IO.Stream Open () 
        {
            ZipEntry entry = _zipFile[Util.ConvertVirtualPathToZipPath(base.VirtualPath,true)];
            return entry.OpenReader();
        }
    }
}
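
As an aside, the property this relies on is not specific to .NET: the ZIP format compresses each entry independently and keeps a central directory, so any language with a zip library gets the same random access. A minimal sketch of the same idea with Python's standard zipfile module (file names made up):

import zipfile

# Write: each entry is compressed independently, so any single entry can be
# read back later without decompressing the rest of the archive.
with zipfile.ZipFile("wikipedia.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("articles/Aardvark.txt", "...article text...")
    zf.writestr("articles/Zebra.txt", "...article text...")

# Read: the central directory lets us seek straight to one entry and stream it.
with zipfile.ZipFile("wikipedia.zip") as zf:
    with zf.open("articles/Zebra.txt") as entry:
        text = entry.read().decode("utf-8")

# Caveat: per-entry compression means very short articles won't compress well
# on their own; packing several articles into each entry mitigates that.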
Cheeso
A: 

I'm using MS Windows Vista, unfortunately, and there I can browse into zip files in Explorer as if they were normal folders. Presumably that still works on Windows 7 (which I'd like to be on). I think I've done the same with the corresponding utility on Ubuntu, but I'm not sure. I could also test it on Mac OS X, I suppose.

David Thornley
A: 

If individual articles are too short to get a decent compression ratio, the next-simplest approach is to tar up a batch of Wikipedia articles -- say, 12 articles at a time, or however many articles it takes to fill up a megabyte. Then compress each batch independently.

In principle, that gives better compression than compressing each article individually, but worse compression than solid compression of all the articles together. Extracting article #12 from a compressed batch requires decompressing the entire batch (and then throwing the first 11 articles away), but that's still much, much faster than decompressing half of Wikipedia.

Many compression programs break up the input stream into a sequence of "blocks", and compress each block from scratch, independently of the other blocks. You might as well pick a batch size about the size of a block -- larger batches won't get any better compression ratio, and will take longer to decompress.
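
A rough sketch of the batch approach (Python, gzip per batch, with a simple offset index; the file names and the 1 MB batch size are just placeholders):

import gzip, json

BATCH_TARGET = 1 << 20   # ~1 MB of raw article text per batch (placeholder)

def build_batches(articles, data_path="wiki.batches", index_path="wiki.index"):
    """Concatenate articles into ~1 MB batches, gzip each batch independently,
    and record where each article lives: (batch offset, start, length)."""
    index = {}                       # title -> [batch_offset, start, length]
    with open(data_path, "wb") as out:
        batch, positions, used = [], [], 0

        def flush_batch():
            nonlocal batch, positions, used
            if not batch:
                return
            offset = out.tell()      # where this compressed batch begins
            out.write(gzip.compress(b"".join(batch)))
            for title, start, length in positions:
                index[title] = [offset, start, length]
            batch, positions, used = [], [], 0

        for title, text in articles:     # articles yields (title, text) pairs
            raw = text.encode("utf-8")
            positions.append((title, used, len(raw)))
            batch.append(raw)
            used += len(raw)
            if used >= BATCH_TARGET:
                flush_batch()
        flush_batch()
    with open(index_path, "w") as f:
        json.dump(index, f)

def read_article(title, data_path="wiki.batches", index_path="wiki.index"):
    """Decompress only the batch containing the article, then slice it out."""
    with open(index_path) as f:
        offset, start, length = json.load(f)[title]
    with open(data_path, "rb") as f:
        f.seek(offset)
        # Read only as many decompressed bytes as needed; stopping early keeps
        # gzip from running on into the next batch stored in the same file.
        raw = gzip.GzipFile(fileobj=f).read(start + length)
    return raw[start:start + length].decode("utf-8")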

I have experimented with several ways to make it easier to start decoding a compressed database in the middle. Alas, so far the "clever" techniques I've applied still have a worse compression ratio and take more operations to produce a decoded section than the much simpler "batch" approach.

For more sophisticated techniques, you might look at

David Cary
+1  A: 

For seekable compression built on gzip, there are dictzip from the dict server and sgzip from The Sleuth Kit.

Note that you can't write to either of these, but seekable access only matters for reading anyway.

Dan D