views: 91
answers: 4

I'm a physicist who normally deals with large amounts of numerical data generated using C programs. Typically, I store everything as columns in ASCII files, but this has led to very large files. Given that I am limited in space, this is an issue and I'd like to be a little smarter about the whole thing. So ...

  1. Is there a better format than ASCII? Should I be using binary files, or perhaps a custom format from some library?

  2. Should I be compressing each file individually, or the entire directory? In either case, what format should I use?

Thanks a lot!

+4  A: 
  1. If you need the files for a long time, for example because they are important experimental data that prove something for you, don't use binary formats. You may not be able to read them when your architecture changes, and that's dangerous. Stick to text (yes, ASCII) files.

  2. Choose a compression format that fits your needs. Is compression time an issue? Usually not, but check that for yourself. Is decompression time an issue? Usually yes, if you want to do data analysis on it. Under these conditions I'd go for bzip2: it is quite common nowadays, well tested, and foolproof. I'd compress files individually, since the larger your file, the larger the probability of loss (bit flips, etc.).
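
Since the data already comes out of C programs, one option is to write the compressed stream directly rather than compressing afterwards. A minimal sketch, assuming libbz2's high-level `bzlib.h` interface is installed (the file name and column contents are made up for illustration; link with `-lbz2`):

```c
/* Sketch: write a bzip2-compressed text column directly from C.
 * Assumes libbz2's high-level (zlib-style) interface from <bzlib.h>. */
#include <bzlib.h>
#include <stdio.h>

int main(void)
{
    BZFILE *bz = BZ2_bzopen("columns.txt.bz2", "wb");
    if (!bz)
        return 1;

    char line[128];
    for (int i = 0; i < 1000; i++) {
        int n = snprintf(line, sizeof line, "%d %.17g\n", i, i * 0.001);
        BZ2_bzwrite(bz, line, n);   /* compress as the data is produced */
    }

    BZ2_bzclose(bz);
    return 0;
}
```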

Jens Gustedt
Alex Martelli
-1, text format is horrible for storing numerical data; it's next to impossible to reproduce float numbers exactly as a string.
aaa
@aaa: Actually, it's not that hard. The question has come up on SO before, but the easy answer is to dump it as hex.
Steven Sudit
@Steven: how is that much better than a binary dump? You still have to accommodate byte order.
aaa
@aaa: That was the easy answer. Here's the rest: http://stackoverflow.com/questions/3215235/how-do-you-print-the-exact-value-of-a-floating-point-number
Steven Sudit
Also, C99 has the `%a` format specifier for the `*printf()` functions, which takes care of both the writing and the reading problem.
Alok
A conformant implementation of `printf` and `strtod` will ensure round-trip exactness of floating point numbers printed as decimal, as long as you print sufficiently many decimal digits. If you're worried and have C99 at your disposal, using `%a` is a much nicer way to handle it.
R..
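
A quick sketch of the `%a` round trip mentioned above (assuming a C99 compiler; the example value is arbitrary):

```c
/* Round-trip a double through text using C99's %a (hexadecimal float). */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    double x = 0.1;                        /* not exactly representable in binary */
    char buf[64];

    snprintf(buf, sizeof buf, "%a", x);    /* e.g. "0x1.999999999999ap-4" */
    double y = strtod(buf, NULL);          /* C99 strtod parses hex floats */

    printf("%s round-trips exactly: %s\n", buf, x == y ? "yes" : "no");
    return 0;
}
```
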
+2  A: 

A terabyte disk is a hundred bucks. It's hard to run out of space these days. Sure, storing the data in binary saves space, but there's a cost: you'll have far fewer choices for getting the data out of the file again.

Check what your operating system can do. Windows, for example, supports automatic compression on folders; the file contents get compressed by the file system without you having to do anything at all. The compression ratio should compete well with raw binary data.

Hans Passant
Agreed, although I'd suggest RAR.
Steven Sudit
A: 

There's a lot of info you didn't include but should think about:

1.) Are you storing integers or floats? What is the typical range of the numbers? For example, storing small comma-separated integers in ASCII, such as "1,2,4,2,1", will average 2 bytes per datum, but storing them as binary would require 4 bytes per datum.

If your integers are typically 3 digits, then comma-separated vs binary won't matter much.

On the other hand, storing doubles (8-byte values) will almost certainly be smaller in binary format (see the sketch after this list).

2.) How do you need to access these values? If you are not concerned about access time, compress away! On the other hand, if you need speedy random access, then compression will probably hinder you.

3.) Are some values frequently repeated? Then you may consider a Huffman encoding or a table of "short-cut" values.
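
To make the size difference in point 1 concrete, here is a small sketch that writes the same doubles as decimal text and as raw binary (file names and values are made up for illustration):

```c
/* Compare on-disk size: doubles as decimal text vs. raw binary. */
#include <stdio.h>

int main(void)
{
    double data[1000];
    for (int i = 0; i < 1000; i++)
        data[i] = i * 0.123456789;

    FILE *txt = fopen("data.txt", "w");
    for (int i = 0; i < 1000; i++)
        fprintf(txt, "%.17g\n", data[i]);    /* roughly 20 bytes per value */
    fclose(txt);

    FILE *bin = fopen("data.bin", "wb");
    fwrite(data, sizeof data[0], 1000, bin); /* exactly 8 bytes per value */
    fclose(bin);

    return 0;
}
```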

abelenky
+6  A: 

In your shoes, I would consider the standard scientific data formats, which are much less space- and time-consuming than ASCII, but (while maybe not quite as bit-efficient as pure, machine-dependent binary formats) still offer standard, documented, portable, and fast libraries to ease the reading and writing of the data.

If you store data in pure binary form, the metadata is crucial to make any sense out of the data again (are these numbers single or double precision, or integers, and of what length; what are the arrays' dimensions; etc.), and issues with archiving and retrieving data/metadata pairs can, and in practice occasionally do, make perfectly good datasets unusable -- a real pity and waste.

CDF, in particular, is "a self-describing data format for the storage and manipulation of scalar and multidimensional data in a platform- and discipline-independent fashion" with many libraries and utilities to go with it. As alternatives, you might also consider NetCDF and HDF -- I'm less familiar with those (and such tradeoffs as flexibility vs size vs speed issues) but, seeing how widely they're used by scientists in many fields, I suspect any of the three formats could give you very acceptable results.
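
For a flavor of what a self-describing format looks like in practice, here is a rough sketch of writing a 1-D array of doubles with the NetCDF C library (the variable, dimension, and file names are invented; check the `nc_*` signatures against your installed `netcdf.h` and link with `-lnetcdf`):

```c
/* Sketch: store a 1-D double array in a self-describing NetCDF file.
 * Return codes are ignored for brevity. */
#include <netcdf.h>
#include <stdio.h>

int main(void)
{
    int ncid, dimid, varid;
    double data[100];
    for (int i = 0; i < 100; i++)
        data[i] = i * 0.5;

    nc_create("results.nc", NC_CLOBBER, &ncid);            /* create the file  */
    nc_def_dim(ncid, "sample", 100, &dimid);               /* define dimension */
    nc_def_var(ncid, "energy", NC_DOUBLE, 1, &dimid, &varid);
    nc_enddef(ncid);                                       /* leave define mode */
    nc_put_var_double(ncid, varid, data);                  /* write the column */
    nc_close(ncid);

    printf("wrote results.nc\n");
    return 0;
}
```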

Alex Martelli
I worked on a high-performance data acquisition system about 3 years ago and NetCDF was an absolute nightmare. It was "there" when I came in, so it could possibly have been set up wrong. All I know is we replaced it asap. YMMV.
JustBoo