I need a solution that allows me to create compressed data files (gzip, zip, tar, etc. - any format could work) and then freely append data to them without having to load the whole file into memory and re-compress it (seeking while decompressing would be awesome as well). Does anyone have a suggestion for .NET?
Have you seen the GZipStream class? You can use it like any other stream.
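For example, a minimal sketch of writing and then reading a .gz file by wrapping a FileStream in GZipStream (the file name and payload here are just placeholders):

    // Minimal sketch: writing and reading a .gz file with GZipStream.
    using System;
    using System.IO;
    using System.IO.Compression;
    using System.Text;

    class GZipStreamDemo
    {
        static void Main()
        {
            byte[] payload = Encoding.UTF8.GetBytes("some log data\n");

            // Compress: wrap a FileStream in a GZipStream and write as usual.
            using (FileStream file = File.Create("data.gz"))
            using (GZipStream gz = new GZipStream(file, CompressionMode.Compress))
            {
                gz.Write(payload, 0, payload.Length);
            }

            // Decompress: wrap the FileStream in a GZipStream in Decompress mode.
            using (FileStream file = File.OpenRead("data.gz"))
            using (GZipStream gz = new GZipStream(file, CompressionMode.Decompress))
            using (StreamReader reader = new StreamReader(gz, Encoding.UTF8))
            {
                Console.WriteLine(reader.ReadToEnd());
            }
        }
    }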
I may have some suggestions for you.
First of all, why are you looking for a programmatic solution of your own?
You can simply split the large log file into chunks, e.g. on a per-hour or even per-minute basis, and gather them in a separate directory per day (so you don't clutter the file system with a huge number of files in one directory). Instead of one considerably large file that you have to process and search, you'll have many small files that can be accessed quickly by a file name built from simple rules. Having one large file is a bad idea (unless you have some kind of index), since you have to scan through it to find the information you need (e.g. by operation datetime), and that scan will take considerably longer.
The situation becomes even worse once compression comes into play, since you'd have to decompress the data to search through it or build some kind of index. There is no need to do that yourself: you can enable folder compression in the OS and get all the compression benefits transparently, without any coding.
So, I would suggest not reinventing the wheel (unless you really need to, see below):
- Split the log data on a regular basis, e.g. per hour, to reduce the compression performance hit
- Enable OS folder compression
That's all, and you'll reduce your storage usage.
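A minimal sketch of that naming scheme, assuming one directory per day and one log file per hour (the "logs" base path is an arbitrary choice for the example):

    // Sketch of the splitting scheme: one directory per day, one log file per hour.
    using System;
    using System.IO;

    class LogPathDemo
    {
        static string GetLogPath(DateTime timestamp)
        {
            string dayDir = Path.Combine("logs", timestamp.ToString("yyyy-MM-dd"));
            Directory.CreateDirectory(dayDir); // no-op if it already exists
            return Path.Combine(dayDir, timestamp.ToString("HH") + ".log");
        }

        static void Append(string message)
        {
            // Each record lands in the chunk for the current hour; old chunks are
            // never rewritten, so OS folder compression stays cheap.
            File.AppendAllText(GetLogPath(DateTime.UtcNow), message + Environment.NewLine);
        }

        static void Main()
        {
            Append("updating region 1");
            Append("compressing data");
        }
    }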
To roll your own (in case you really want to), you can do the same: split the data into chunks, compress each one, and save it in your own kind of storage. To implement something similar, I would think about the following:
- keep one file with raw (uncompressed) data, where you log new info;
- keep and update an index file, e.g. with the stored date range per chunk, to quickly find the position in the compressed data by date;
- keep a file for compressed data storage, where each chunk contains its size followed by the compressed (e.g. with GZipStream) data.
So you write info to the uncompressed part until some condition is met, then compress it and tail-append it to the compressed part, updating the index file. Keeping the index file separate allows fast updates without rewriting the huge compressed part.
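A rough sketch of that chunk-plus-index layout, assuming one storage file of size-prefixed gzip chunks and a plain-text index of offset/date-range lines (the file names and index format are assumptions made for this example):

    // Sketch: append size-prefixed gzip chunks to one data file, track them in an index.
    using System;
    using System.IO;
    using System.IO.Compression;

    class ChunkedLogStore
    {
        const string DataFile = "log.chunks";
        const string IndexFile = "log.index";

        // Compress the accumulated raw data and tail-append it as one chunk.
        public static void AppendChunk(byte[] rawData, DateTime from, DateTime to)
        {
            byte[] compressed;
            using (var buffer = new MemoryStream())
            {
                using (var gz = new GZipStream(buffer, CompressionMode.Compress, leaveOpen: true))
                    gz.Write(rawData, 0, rawData.Length);
                compressed = buffer.ToArray();
            }

            long offset;
            using (var data = new FileStream(DataFile, FileMode.Append, FileAccess.Write))
            {
                offset = data.Position;
                data.Write(BitConverter.GetBytes(compressed.Length), 0, 4); // chunk size prefix
                data.Write(compressed, 0, compressed.Length);
            }

            // The index is a separate file, so appending a line never rewrites the data file.
            File.AppendAllText(IndexFile, $"{offset},{from:o},{to:o}{Environment.NewLine}");
        }

        // Seek straight to a chunk found via the index and decompress only that chunk.
        public static byte[] ReadChunk(long offset)
        {
            using (var data = new FileStream(DataFile, FileMode.Open, FileAccess.Read))
            using (var reader = new BinaryReader(data))
            {
                data.Seek(offset, SeekOrigin.Begin);
                int size = reader.ReadInt32();
                byte[] compressed = reader.ReadBytes(size);

                using (var gz = new GZipStream(new MemoryStream(compressed), CompressionMode.Decompress))
                using (var result = new MemoryStream())
                {
                    gz.CopyTo(result);
                    return result.ToArray();
                }
            }
        }
    }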
I would also suggest thinking about why your log files are so large. You can probably optimize your storage format. E.g. if your logs are text files, you could switch to a binary format, for example by building a dictionary of the original strings and storing just message identifiers instead of the full data, i.e.:
updating region 1;
updating region 2;
compressing data;
store as:
x1 1
x1 2
x2
The strings above are just an example; you can "decompress" them at runtime as needed by remapping the data back. You can save quite a lot of space by switching to binary, maybe enough to forget about compression entirely.
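A small sketch of that remapping idea, assuming a fixed dictionary of message templates and a tiny binary record of template id plus parameter (the templates and record layout are made up for this example):

    // Sketch: store a dictionary of message templates once, then log only
    // the template id plus its parameters instead of the full string.
    using System;
    using System.Collections.Generic;

    class MessageDictionaryDemo
    {
        // "x1", "x2" style templates from the example above.
        static readonly Dictionary<string, byte> Templates = new Dictionary<string, byte>
        {
            ["updating region {0}"] = 1,
            ["compressing data"]    = 2,
        };

        // Encode one message as: template id (1 byte) + optional parameter (4 bytes).
        static byte[] Encode(byte templateId, int? parameter = null)
        {
            if (parameter == null)
                return new[] { templateId };

            var record = new byte[5];
            record[0] = templateId;
            BitConverter.GetBytes(parameter.Value).CopyTo(record, 1);
            return record;
        }

        static void Main()
        {
            // "updating region 2" becomes a 5-byte record instead of an 18-character string.
            byte[] record = Encode(Templates["updating region {0}"], 2);
            Console.WriteLine(BitConverter.ToString(record));
        }
    }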
I have no ready implementation or algorithm. Maybe others can suggest something better, but I hope my thoughts are somewhat helpful.
The reason you basically can't do this the way it's described is that all modern compression algorithms are based on dictionaries that are maintained (added to, removed from) as the compressor moves over the input, and again when it generates the output.
In order to append to a compressed stream (resume compression), you would need the dictionary in the state it had when compression was suspended. Compression algorithms don't persist the dictionary because it would be a waste of space - it's not needed for decompression; it gets built again from the compressed input during the decompression stage.
I would probably split the output into chunks that are compressed separately.
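For example, a minimal sketch of that approach, where each chunk of new data is compressed into its own .gz file so appending never touches earlier data (the chunk naming is an assumption for this example):

    // Sketch: each chunk is compressed on its own, so appending never requires
    // re-reading or re-compressing earlier data.
    using System;
    using System.IO;
    using System.IO.Compression;

    class ChunkedGzipWriter
    {
        readonly string directory;
        int nextChunk;

        public ChunkedGzipWriter(string directory)
        {
            this.directory = directory;
            Directory.CreateDirectory(directory);
            nextChunk = Directory.GetFiles(directory, "chunk-*.gz").Length;
        }

        // Compress one block of new data into its own file; older chunks stay untouched.
        public void Append(byte[] data)
        {
            string path = Path.Combine(directory, $"chunk-{nextChunk++:D4}.gz");
            using (var file = File.Create(path))
            using (var gz = new GZipStream(file, CompressionMode.Compress))
            {
                gz.Write(data, 0, data.Length);
            }
        }
    }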