I have a program that reads and writes very large text files. However, because of their format (they are ASCII representations of what should have been binary data), these files are very easily compressed. For example, some of them are over 10 GB in size, but gzip achieves 95% compression.

I can't modify the program but disk space is precious, so I need to set up a way that it can read and write these files while they're being transparently compressed and decompressed.

The program can only read and write files, so as far as I understand, I need to set up a named pipe for both input and output. Some people are suggesting a compressed filesystem instead, which seems like it would work, too. How do I make either work?

Technical information: I'm on a modern Linux. The program reads from one input file and writes to a separate output file. It reads through the input file in order, though twice, and it writes the output file in order.

+1  A: 

Named pipes won't give you full-duplex operation, so things will be a little more complicated if you need to provide just one filename.

Do you know if your application needs to seek through the file?

Does your application work with stdin/stdout?

Maybe a solution is to create a small compressed filesystem that contains only a directory with your files.

Since you have separate input and output files, you can do the following:

mkfifo readfifo
mkfifo writefifo
# feed decompressed input into the read pipe
zcat yourinputfile.gz > readfifo &
# compress whatever the program writes to the write pipe
gzip < writefifo > youroutputfile.gz &

Then launch your program, pointing it at readfifo and writefifo as its input and output files.
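Putting it all together, a minimal wrapper sketch; the program name and its argument convention are assumptions:

#!/bin/sh
mkfifo readfifo writefifo
zcat yourinputfile.gz > readfifo &
gzip < writefifo > youroutputfile.gz &
./yourprogram readfifo writefifo
wait                      # let the background gzip finish draining the pipe
rm readfifo writefifo     # clean up the fifos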

Now, you will probably run into trouble with reading the input twice in order: a pipe cannot be rewound, so once zcat has finished, your program hits end-of-file on the fifo and a second pass finds nothing left to read.
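If the program closes and re-opens the input file between passes, rather than seeking back to the start, you could feed the fifo once per pass; this sketch assumes that behavior. Each redirection blocks until the program opens the pipe for reading:

for pass in 1 2; do
    zcat yourinputfile.gz > readfifo    # blocks until the program opens the fifo
done &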

The proper solution is probably to use a compressed filesystem like compFUSEd, because then you don't have to worry about unsupported operations like seek.

shodanex
I've edited my question to address your inquiries. The program does not read or write stdin/stdout.
A. Rex
A: 

Disk space is not precious; disk space is in fact incredibly cheap these days.

Please consider the relative costs of adding a 1TB disk to your system versus the cost of coding up something that does the compression on the fly.

In any event, if the program only handles reading and writing one such pair of files at a time, you only need a few spare 10 GB chunks, and then you can simply decompress -> run -> compress.
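A minimal sketch of that cycle, with hypothetical file and program names:

zcat input.gz > input         # decompress a working copy, keeping input.gz
./yourprogram input output    # run against plain files
gzip output                   # recompress the result to output.gz
rm input                      # reclaim the temporary space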

Alnitak
The program runs a breadth-first search on some search space. Every time the depth increases by 1, there are 100 times more nodes. If I run through the 12 GB file, it will create a 1 TB file. If I compress that on the fly, it will be 51 GB instead.
A. Rex
I understand disk space is comparatively cheap, but no matter how you slice or dice this, I *will* be able to search one level farther if I do this compression.
A. Rex
Then you should have said so in the question; you only mentioned 10 GB files there.
Alnitak
+3  A: 

Check out zlibc: http://zlibc.linux.lu/.

Also, if FUSE is an option (i.e. the kernel is not too old), consider compFUSEd: http://www.biggerbytes.be/

EFraim
Can I write with zlibc, too? Being able to write is as crucial as being able to read.
A. Rex
zlibc is mainly for writing new programs that compress, and you said you couldn't touch your program. I voted this one up for the mention of compFUSEd; that sounds like a good fit for your problem.
unwind
zlibc is read-only, but it can definitely be used without recompiling, through the LD_PRELOAD mechanism.
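For example, something like the following; the path and name of zlibc's preload object are assumptions and vary by installation:

# zlibc transparently decompresses input.gz when the program opens "input";
# the output is still written uncompressed, since zlibc is read-only
LD_PRELOAD=/usr/lib/uncompress.so ./yourprogram input output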
EFraim
A: 

Which language are you using?

If you are using Java, take a look at the GZIPInputStream and GZIPOutputStream classes in the API docs.

If you are using C/C++, zlib is probably the best way to go about it.

trshiv
I cannot change the program, so this must work outside of it. I'm fine with any language, but I thought this was more about working with Linux than about programming in any particular language.
A. Rex