Hello,

We are now required by law to digitize all the financial documents in our company and submit them for evaluation every 3 months.

Since this is sensitive data, we decided to take matters into our own hands and build our own digital data archiver. The tool works perfectly, but after 7 months of usage we are beginning to worry about the disk space used by these images.

Here is some info on the number of documents digitized:

  • 15K documents scanned and archived per day, with a final PNG size of about 860KB: 15 000 * 860 kilobits = 1.53779984 gigabytes
  • 30 days of work per month: 1.53779984 gigabytes * 30 = 46.1339952 gigabytes
  • Expectation of disk space usage after 1 year: 46.1339952 gigabytes * 12 = 553.607942 gigabytes
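The estimate above can be reproduced with a short sketch. It assumes binary units (1 kilobit = 1024 bits, 1 gigabyte = 1024^3 bytes), which is what the quoted figures match; the constant and function names are illustrative only. Note that the figures treat 860 as kilobits. If the files are actually 860 kilobytes each, every number is eight times larger, so it is worth checking which unit applies.

```python
KIBI = 1024
GIBI = 1024 ** 3

DOCS_PER_DAY = 15_000
SIZE_PER_DOC = 860        # the "860KB" figure from the post
DAYS_PER_MONTH = 30       # as assumed in the post

def estimate_gib(bytes_per_unit: float) -> dict:
    """Return storage per day, month, and year in binary gigabytes."""
    per_day = DOCS_PER_DAY * SIZE_PER_DOC * bytes_per_unit / GIBI
    per_month = per_day * DAYS_PER_MONTH
    return {"day": per_day, "month": per_month, "year": per_month * 12}

# Interpretation used in the post: 860 kilobits per document.
as_kilobits = estimate_gib(KIBI / 8)
# Alternative interpretation: 860 kilobytes per document.
as_kilobytes = estimate_gib(KIBI)

for label, est in (("kilobits", as_kilobits), ("kilobytes", as_kilobytes)):
    print(f"{label}: {est['day']:.2f} GB/day, "
          f"{est['month']:.1f} GB/month, {est['year']:.1f} GB/year")
```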

So far we're at 424 gigabytes of disk space used, not counting backups. We're using PNG as the image format, but I would like to know if anyone has advice on a better compression algorithm for images, alternative strategies for compressing the PNGs even more, or better ways to archive images so as to save disk space.

Any help would be appreciated, thanks.

+1  A: 

These documents, are they black & white or color?

Esse
Adabada
+2  A: 

Presumably these documents don't need to all be online constantly. If that is the case, from the information you've provided, I don't see any reason why you'd need to change your workflow.

PNG is a widely supported format with lossless (zlib) compression, which I'm guessing you're using. If you don't need lossless compression, good ole JPEG will give you tighter compression at the expense of minor quality loss, provided you tune the quality setting appropriately. JPEG2000 may be another alternative, depending on your software stack. LZW-compressed TIFF offers no major advantages over PNG here: both are lossless, and PNG also supports 16 bits per channel, which you probably don't need anyway. Other options include proprietary specialty codecs (like MrSID) that offer extremely good compression of extremely large files, for a price.
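To make the "lossless (zlib)" point concrete, here is a minimal sketch using only Python's standard `zlib` module on a synthetic stand-in for a bitonal page. It is not a PNG encoder (PNG also applies per-row filters before DEFLATE), and the page dimensions and pattern are made up for illustration. It only shows that the compression level is a tunable knob and that the round trip restores every byte.

```python
import zlib

# Synthetic stand-in for a bitonal scanned page: white rows interleaved
# with rows of short dark runs. Real scans will compress differently.
WIDTH, HEIGHT = 1000, 1400
blank_row = bytes([255]) * WIDTH
text_row = (bytes([255]) * 40 + bytes([0]) * 10) * (WIDTH // 50)
page = b"".join(text_row if y % 20 < 8 else blank_row for y in range(HEIGHT))

def deflate_size(data: bytes, level: int) -> int:
    """Size in bytes of data after zlib compression at the given level."""
    return len(zlib.compress(data, level))

# Compare a fast, a default, and the maximum compression level.
sizes = {level: deflate_size(page, level) for level in (1, 6, 9)}
for level, size in sizes.items():
    print(f"zlib level {level}: {size} bytes ({size / len(page):.2%} of raw)")

# Lossless: decompressing gives back exactly the original bytes.
restored = zlib.decompress(zlib.compress(page, 9))
```

With a lossy codec such as JPEG, the equivalent of `restored == page` would not hold, which is the trade-off described above.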

Since these are scanned documents, I guess I would think of PDF as the "natural" format in which to encode them. PDF offers a variety of compression options depending on the contents of the files. But I wouldn't go to great lengths to fix something that isn't broken.

If you think about how much you're spending on drive space now, 1.5 GB per day is nothing. Drive space is cheap and constantly getting cheaper. Just buy three new 1 TB USB drives (primary / backup / offsite backup) every 6 months at a total cost of $240 or whatever. Even tape backup is not unreasonable.
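As a rough sanity check on that drive schedule, the sketch below works out how long a nominal 1 TB drive lasts at the daily volume quoted in this answer. It assumes decimal gigabytes and ignores filesystem overhead; the function name is illustrative.

```python
def days_until_full(capacity_gb: float, daily_gb: float) -> float:
    """Days of scanning a drive can hold, ignoring filesystem overhead."""
    return capacity_gb / daily_gb

CAPACITY_GB = 1000   # a nominal 1 TB drive
DAILY_GB = 1.5       # daily volume quoted in this answer

days = days_until_full(CAPACITY_GB, DAILY_GB)
months = days / 30   # the question assumes 30 working days per month
print(f"{days:.0f} days, about {months:.1f} months")
```

The result scales inversely with the daily volume, so a larger per-day figure shortens the interval proportionally.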

alexantd
+2  A: 

You'll be better off with DjVu, a relatively new format that was designed expressly to compress scanned documents. It works well for bitonal, grayscale, and color documents. It combines foreground/background separation with a sophisticated wavelet compression scheme. If you get the commercial version I believe you can also get your documents OCR'd so you can search them, but there is a completely open-source version called DjVuLibre.
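For bitonal pages, DjVuLibre ships a command-line encoder called `cjb2`. The sketch below wraps it from Python; it assumes DjVuLibre is installed and that the page has already been converted to PBM or bitonal TIFF, since `cjb2` does not read PNG directly. The function names and the 300 dpi default are illustrative assumptions, not part of DjVuLibre.

```python
import shutil
import subprocess
from pathlib import Path

def cjb2_command(src: Path, dst: Path, dpi: int = 300) -> list[str]:
    """Build the cjb2 argument list for one bitonal page."""
    return ["cjb2", "-dpi", str(dpi), str(src), str(dst)]

def encode_bitonal(src: Path, dst: Path, dpi: int = 300) -> None:
    """Encode one bitonal page to DjVu, failing loudly if cjb2 is missing."""
    if shutil.which("cjb2") is None:
        raise RuntimeError("cjb2 (DjVuLibre) not found on PATH")
    subprocess.run(cjb2_command(src, dst, dpi), check=True)

# Example (hypothetical file names):
# encode_bitonal(Path("page.pbm"), Path("page.djvu"))
```

Grayscale and color pages would need a different DjVuLibre encoder, so it is worth comparing output sizes on a sample of real documents before converting an archive.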

Norman Ramsey
What an annoying website! All the detailed doc is in djvu format. Someone needs a 2x4 upside the head.
ergosys
Norman Ramsey
A: 

500 GB per year is not much, and hard drives are getting cheaper each year.

zed_0xff