I need a solution that allows me to create compressed data files (gzip, zip, tar, etc. - any format could work) and then freely append data to them without having to load the whole file into memory and re-compress it (seeking while decompressing would be awesome as well). Does anyone have a suggestion for .NET?

A: 

Have you seen the GZipStream class? You can use it like any other stream.
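
For reference, a minimal sketch of wrapping a FileStream in a GZipStream for writing and then reading it back (the file name is only an example):

    using System;
    using System.IO;
    using System.IO.Compression;

    class GZipStreamExample
    {
        static void Main()
        {
            // Write: compress text into a .gz file.
            using (FileStream file = File.Create("log.gz"))
            using (GZipStream gz = new GZipStream(file, CompressionMode.Compress))
            using (StreamWriter writer = new StreamWriter(gz))
            {
                writer.WriteLine("first log entry");
            }

            // Read: decompress the same file back as text.
            using (FileStream file = File.OpenRead("log.gz"))
            using (GZipStream gz = new GZipStream(file, CompressionMode.Decompress))
            using (StreamReader reader = new StreamReader(gz))
            {
                Console.WriteLine(reader.ReadToEnd());
            }
        }
    }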

Andrew Bezzub
It doesn't work with appending to existing streams. Well, technically it does, but when you try to read the info back it reads only the first chunk and then acts like it has reached EOF even if there is more data in there.
AZ
It could work if I find a way to reliably read the subsequent chunks.
AZ
+1  A: 

I may have some suggestions for you.

First of all, why are you looking for a programmatic solution to implement on your own?

You can simply split the large log file into chunks, e.g. on a per-hour or even per-minute basis, and gather them in a separate directory on a per-day basis (so you don't clutter the FS with a huge number of files in one directory). Instead of having one considerably large file to process and seek through, you'll have many small files that can be accessed quickly by file names built from simple rules. Having one large file is a bad idea (unless you have some kind of index), since you have to seek through it to find the right information (e.g. by operation datetime), and that seek will take considerably longer.
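
As an illustration of "file names built from simple rules", a small sketch that maps a timestamp to a per-day directory and a per-hour file (the root path and naming pattern are assumptions):

    using System;
    using System.IO;

    static class LogPaths
    {
        // Per-day directory, per-hour file, derived purely from the timestamp.
        public static string ForTimestamp(DateTime timestamp)
        {
            string dayDirectory = Path.Combine(@"C:\logs", timestamp.ToString("yyyy-MM-dd"));
            Directory.CreateDirectory(dayDirectory); // no-op if it already exists
            return Path.Combine(dayDirectory, timestamp.ToString("HH") + ".log");
        }
    }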

The situation becomes even worse when compression comes into play, since you'll have to decompress the data to seek through it, or build some kind of index. There is no need to do that yourself: you can enable folder compression in the OS and get all the compression benefits transparently, without any coding.
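
If you prefer to switch folder compression on from a deployment script rather than Explorer, one hedged option is to invoke Windows' compact utility (the log path below is an assumption):

    using System.Diagnostics;

    class EnableFolderCompression
    {
        static void Main()
        {
            // Windows/NTFS only: /C compresses, /S:<dir> recurses into the directory.
            var info = new ProcessStartInfo("compact.exe", @"/C /S:C:\logs")
            {
                UseShellExecute = false
            };
            using (Process process = Process.Start(info))
            {
                process.WaitForExit();
            }
        }
    }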

So, I would suggest not reinventing the wheel (unless you really need to; see below):

  • Split the log data on a regular basis, e.g. per hour, to reduce the compression performance hit
  • Enable OS folder compression

That's all; you'll reduce your storage space.


To roll your own (in case you really want to), you can do the same thing: split the data into chunks, compress each one and save it in your own kind of storage. To implement something like that, I would think about the following:

  • keep one file with raw (uncompressed) data, where you log new info;
  • keep and update an index file, e.g. with the stored date ranges per chunk, to quickly find a position in the compressed data by date;
  • keep a file for compressed data storage, where each chunk contains its size and its compressed (e.g. with GZipStream) data;

So you write info to the uncompressed part until some condition is met, then compress it and append it to the tail of the compressed part, updating the index file. Keeping the index file separate allows fast updates without rewriting the huge compressed part.
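
A minimal sketch of that write path, assuming hypothetical file names ("log.dat", "log.idx") and one tab-separated index line per chunk:

    using System;
    using System.IO;
    using System.IO.Compression;

    static class ChunkedLogStore
    {
        // Compress a finished chunk, append it (length-prefixed) to the storage file,
        // and record its offset, size and date range in a plain-text index.
        public static void AppendChunk(byte[] rawChunk, DateTime from, DateTime to)
        {
            byte[] compressed;
            using (var buffer = new MemoryStream())
            {
                using (var gz = new GZipStream(buffer, CompressionMode.Compress))
                {
                    gz.Write(rawChunk, 0, rawChunk.Length);
                }
                compressed = buffer.ToArray();
            }

            long offset;
            using (var store = new FileStream("log.dat", FileMode.Append, FileAccess.Write))
            using (var writer = new BinaryWriter(store))
            {
                offset = store.Position;          // where this chunk starts in the store
                writer.Write(compressed.Length);  // 4-byte size prefix
                writer.Write(compressed);         // gzip-compressed payload
            }

            File.AppendAllText("log.idx",
                string.Format("{0}\t{1}\t{2:o}\t{3:o}{4}",
                    offset, compressed.Length, from, to, Environment.NewLine));
        }
    }

To read a chunk back, seek to the offset recorded in the index, read the 4-byte size, read that many bytes and decompress them with a GZipStream over a MemoryStream.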


Also, I would suggest thinking about why you have such large log files in the first place. You can probably optimize your storage format. E.g. if your logs are text files, you can switch to a binary format, for example by building a dictionary from the original strings and storing just message identifiers instead of the full data, i.e.:

updating region 1;

updating region 2;

compressing data;

store as:

x1 1

x1 2

x2

The strings above are just an example; you can "decompress" them at runtime as needed by remapping the data back. You can save quite a lot of space by switching to binary - maybe enough to forget about compression.
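
A sketch of that remapping, with hypothetical templates and identifiers matching the example above:

    using System;
    using System.Collections.Generic;

    static class MessageDictionary
    {
        // Hypothetical mapping between message templates and short identifiers.
        static readonly Dictionary<string, string> IdByTemplate = new Dictionary<string, string>
        {
            { "updating region {0}", "x1" },
            { "compressing data",    "x2" }
        };

        static readonly Dictionary<string, string> TemplateById = new Dictionary<string, string>
        {
            { "x1", "updating region {0}" },
            { "x2", "compressing data" }
        };

        // "updating region 1" is stored as "x1 1", "compressing data" as "x2".
        public static string Encode(string template, string argument)
        {
            string id = IdByTemplate[template];
            return argument == null ? id : id + " " + argument;
        }

        public static string Decode(string stored)
        {
            string[] parts = stored.Split(new[] { ' ' }, 2);
            string template = TemplateById[parts[0]];
            return parts.Length == 1 ? template : string.Format(template, parts[1]);
        }
    }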

I have no ready implementation or algorithm. Maybe others can suggest something better, but I hope my thoughts are somewhat helpful.

Nick Martyshchenko
Thanks for the suggestions. I'm aware of all the other options I have for my situation. Nevertheless, few of them are feasible, as the system in question is huge and there are many interconnected parts and tools that use the current log infrastructure. So in this case changing the log format or re-architecting the structure of the files is not an option, as that would impact too much application code. The least intrusive option, in my view, would be to tackle the problem at the read/write level and be transparent to the rest of the code. So my original question still stands.
AZ
If there is no way of achieving this, then of course I'll have to look for other options.
AZ
I think the second suggestion (going for your own storage format with compressed chunks) is less intrusive and can be implemented in your situation. IMHO, using stream compression algorithms directly is not a good idea; they can't help you much, since their goal is somewhat different. You'll need to implement your own approach, and you'd better stick with something known and fast enough, such as GZip.
Nick Martyshchenko
+2  A: 

The reason you basically can't do this the way it's described is that all modern compression algorithms are based on dictionaries that are maintained (added to, removed from) as the compressor moves over the input, and again when it generates the output.

In order to append to a compressed stream (resume compression), you would need the dictionary in the state it had when compression was suspended. Compression algorithms don't persist the dictionary because it would be a waste of space - it's not needed for decompression; it gets built again from the compressed input during the decompression stage.

I would probably split the output into chunks that are compressed separately.
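
One hedged way to do that: give every chunk its own gzip file (the naming scheme below is just an assumption), so appending never touches existing data and reading back never depends on the compressor's dictionary state:

    using System;
    using System.IO;
    using System.IO.Compression;

    static class ChunkedGzipLog
    {
        // Each chunk becomes its own small .gz file.
        public static void WriteChunk(string directory, int chunkNumber, byte[] data)
        {
            string path = Path.Combine(directory, chunkNumber.ToString("D6") + ".gz");
            using (var file = File.Create(path))
            using (var gz = new GZipStream(file, CompressionMode.Compress))
            {
                gz.Write(data, 0, data.Length);
            }
        }

        // Reading back is just decompressing the chunk files in name order;
        // coarse seeking means skipping the chunks you don't need.
        public static void ReadAll(string directory, Stream destination)
        {
            string[] files = Directory.GetFiles(directory, "*.gz");
            Array.Sort(files, StringComparer.Ordinal);
            foreach (string path in files)
            {
                using (var file = File.OpenRead(path))
                using (var gz = new GZipStream(file, CompressionMode.Decompress))
                {
                    gz.CopyTo(destination);
                }
            }
        }
    }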

500 - Internal Server Error