We currently receive several thousand flat files per week, and I have a system that runs reports on these and exports them to PDF for our people to process and reference.

I currently bulk load these into a database, make sure all fields and formatting are valid, export them, and truncate the tables on the next run.

What I'm wondering is what everyone thinks would be the most space-efficient way to store possibly six months of this bulk-loaded plain-text data?

Either in the form of daily SQL backups, or zipped archives, or whatever, so I always have the ability to reload old data for troubleshooting.

Any ideas are welcome, I'm open to any suggestions.

Thanks!

+2  A: 

Use a recent generation compression utility (7z and rar compression are great) and compress into bundles after organizing everything so it's easy to find.

There are SDKs for 7-Zip that work with .NET to make this easy.
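
If pulling in the 7-Zip SDK isn't an option, here's a minimal sketch of the same idea using only the GZipStream class that ships with the .NET Framework (System.IO.Compression, 2.0 and later); the class and paths below are placeholders, not something from the question:

    // Compress one flat file to a .gz file alongside it using only the
    // built-in GZipStream class (no extra SDK or install required).
    using System.IO;
    using System.IO.Compression;

    class FlatFileCompressor
    {
        public static void CompressFile(string sourcePath)
        {
            string targetPath = sourcePath + ".gz";

            using (FileStream input = File.OpenRead(sourcePath))
            using (FileStream output = File.Create(targetPath))
            using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress))
            {
                byte[] buffer = new byte[81920];
                int read;
                while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
                {
                    gzip.Write(buffer, 0, read);
                }
            }
        }
    }

Note that Explorer's built-in Compressed Folders feature only understands .zip, not .gz, which is what the comments below get into.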

Adam Davis
Since I would be limited on what software I can install and I haven't used 7-Zip before, can you easily decompress it with the Windows XP default utility? If not, I probably can't use this method due to restrictions on what I can actually install.
thismat
No; if that's the case, you can use the 7-Zip SDK to compress a regular zip file, which Windows can decompress without additional software.
Adam Davis
+1  A: 

Construct a file hierarchy that organizes the files appropriately, zip the whole directory, and use the -u flag on zip to add new files. After you archive them, you can delete the files, but preserve the directory structure for the next batch to be added.

If the file names encode the version somehow (dates or whatever) or are otherwise unique, it doesn't need to be anything fancier than a single directory. If not, you need to set up your directories to let you recover versions.
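
As a rough sketch of how that could look from a scheduled .NET task, assuming a command-line zip that supports -u (e.g. Info-ZIP's zip.exe) is already available on the machine, and with illustrative directory and archive names:

    // Copy today's batch into a date-stamped subdirectory, then let
    // "zip -r -u" add only the new files to the running archive.
    using System;
    using System.Diagnostics;
    using System.IO;

    class ArchiveUpdater
    {
        public static void AddBatchToArchive(string incomingDir, string archiveRoot)
        {
            // A date-stamped subdirectory keeps otherwise-identical file names unique.
            string datedDir = Path.Combine(archiveRoot, DateTime.Today.ToString("yyyy-MM-dd"));
            Directory.CreateDirectory(datedDir);

            foreach (string file in Directory.GetFiles(incomingDir))
            {
                File.Copy(file, Path.Combine(datedDir, Path.GetFileName(file)), true);
            }

            // -r recurses into the dated subdirectories; -u adds or updates only
            // files that are new or changed since the last run.
            ProcessStartInfo psi = new ProcessStartInfo("zip", "-r -u archive.zip .");
            psi.WorkingDirectory = archiveRoot;
            psi.UseShellExecute = false;

            using (Process p = Process.Start(psi))
            {
                p.WaitForExit();
            }
        }
    }

Once the zip succeeds, the contents of the dated directory can be deleted while the directory structure stays in place for the next batch.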

Charlie Martin
The file names are not encoded to be unique, but I can easily create the structure with date-stamped file names. I think this is a solid solution. Have you ever tried compressing from .NET without an additional SDK?
thismat
No, I'm not particularly a .NET guy. It shouldn't be too hard, but working with the standard zip programs gives you the update and merge functionality for free.
Charlie Martin
+2  A: 

So, you bulk-load flat files of raw data, you use SQL Server 2005 to process them and get a separate bunch of processed flat files, and then dump the data?

Well, if this is correct, SQL backups won't help since you seem to be saying the data doesn't stay in the DB. Your only option is efficient compression of the input and/or output files coupled with good organization of the batches in directories.

I would recommend an aggressive compression program that has scheduled batch functionality, but be careful not to get too esoteric with the program you use, so you avoid being locked in to one tool...

alphadogg
That is a good point. A real problem I've considered with special compression types is dependence on certain platforms, which I hate if I can avoid it. Plus, I have rules to follow and can't be installing special SDKs everywhere.
thismat
The truncating doesn't come until the following day, so backing up at the end of the day or run would preserve what's needed, but I doubt it's a size-efficient solution anyway.
thismat
BTW, I would avoid backing up/compressing from SQL Server. It's another point where you lose the flexibility to do it differently later.
alphadogg
And, I would avoid stuffing esoteric functionality into SQL Server unless absolutely necessary. I do lots of post-processing either by using scheduling + already-written small programs, or by writing my own stand-alone small programs in .NET.
alphadogg
I'm leaning towards just compressing the files. Since I already have a .NET program that runs to pull/delete the files anyway, that would probably be my best bet. Thanks for the input and the good suggestions.
thismat
A: 

You've indicated that you'd like to avoid SDKs and installing software on remote systems.

Your options are pretty limited.

Since you are using Windows computers, why not use a simple script?

This question offers several suggestions on how to use Windows VBScript to compress and decompress files:
http://stackoverflow.com/questions/30211/windows-built-in-zip-compression-script-able

Nothing to 'install', no SDKs. Just copy the script over, call it via the scheduler, and you're all set.

Adam Davis
+2  A: 

There are two types of data post-analysis:

  • original data (usually very big)
  • derived data (usually smaller)

In your case, the derived data might be the data that goes into your reports. For your original data, I'd just make a huge, compressed archive file of it with a systematic name based on the date and the type of data. The value of this is that if some newbie on your team somehow totally obliterates the code that imports your original data into the database, you can recover from it. If the derived data is small, you might think about copying it to either another database table or a separate flat file, because some of your problems could be solved just by getting to the derived data.
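
As a small illustration of the systematic naming, one archive per data type per day might be named like this; the data-type labels are hypothetical, not from the question:

    // Build a predictable archive name such as "claims_2009-03-15.zip"
    // so a given day's raw data is easy to find and reload.
    using System;

    class ArchiveNaming
    {
        public static string ArchiveNameFor(string dataType, DateTime batchDate)
        {
            return string.Format("{0}_{1:yyyy-MM-dd}.zip",
                                 dataType.ToLowerInvariant(), batchDate);
        }
    }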

Backing up your data in general is a tricky problem, because it depends on things like:

  • Amount of data throughput
  • Available space for off-site backups
  • Value of upgrading your backup system versus just resigning yourself to regenerating data if problems happen.

What's your setup like? Will hard drives grow fast enough to hold the compressed version of your data? Have you thought about off-site backups?

James Thompson
Since this is a "bottom line" process that isn't server-worthy (you'll get a kick out of this...), it's actually running on an old laptop with limited resources. It hasn't had issues yet and has proven to be pretty stable, though. I like your idea of copying the derived data into a separate table.
thismat
+1  A: 

Compress them and save them in a binary field in the database. Then you can build a "reload data set" button to bring your data set back in (I'm assuming you keep track of each data set that you import, so you can replace it, etc.).

This way, everything's stored in the database, and backed up with the database, indexed and linked correctly, and compressed at the same time.
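
A rough sketch of that approach: gzip the raw file in memory and insert it with a parameterized command. The ImportArchive table, its columns, and the connection string are hypothetical placeholders:

    // Gzip the raw flat file into a byte array and store it in a
    // varbinary(max) column so it gets backed up with the database.
    using System;
    using System.Data.SqlClient;
    using System.IO;
    using System.IO.Compression;

    class DataSetArchiver
    {
        public static void StoreCompressed(string filePath, string connectionString)
        {
            byte[] compressed;
            using (MemoryStream ms = new MemoryStream())
            {
                using (GZipStream gzip = new GZipStream(ms, CompressionMode.Compress))
                using (FileStream input = File.OpenRead(filePath))
                {
                    byte[] buffer = new byte[81920];
                    int read;
                    while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
                    {
                        gzip.Write(buffer, 0, read);
                    }
                }
                // ToArray still works after the GZipStream has closed the MemoryStream.
                compressed = ms.ToArray();
            }

            using (SqlConnection conn = new SqlConnection(connectionString))
            using (SqlCommand cmd = new SqlCommand(
                "INSERT INTO ImportArchive (FileName, LoadDate, RawData) " +
                "VALUES (@name, @date, @data)", conn))
            {
                cmd.Parameters.AddWithValue("@name", Path.GetFileName(filePath));
                cmd.Parameters.AddWithValue("@date", DateTime.Now);
                cmd.Parameters.AddWithValue("@data", compressed);
                conn.Open();
                cmd.ExecuteNonQuery();
            }
        }
    }

Reloading is just the reverse: select the row, decompress with GZipStream in Decompress mode, and feed the text back through the existing bulk-load path.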

Jas Panesar