views: 326
answers: 5

I have a variety of time-series data stored on a more-or-less georeferenced grid, e.g. one value per 0.2 degrees of latitude and longitude. Currently the data are stored in text files, so at day-of-year 251 you might see:

251
 12.76 12.55 12.55 12.34 [etc., 200 more values...]
 13.02 12.95 12.70 12.40 [etc., 200 more values...]
 [etc., 250 more lines]
252
 [etc., etc.]

I'd like to raise the level of abstraction, improve performance, and reduce fragility (for example, the current code can't insert a day between two existing ones!). We'd messed around with BLOB-y RDBMS hacks and even replicating each line of the text file format as a row in a table (one row per timestamp/latitude pair, one column per longitude increment -- yecch!).

We could go to a "real" geodatabase, but the overhead of tagging each individual value with a lat and long seems prohibitive. The size and resolution of the data haven't changed in ten years and are unlikely to do so.

I've been noodling around with putting everything in NetCDF files, but think we need to get past the file mindset entirely -- I hate that all my software has to figure out filenames from dates, deal with multiple files for multiple years, etc. The alternative, putting all ten years' (and counting) data into a single file, doesn't seem workable either.
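
For concreteness, here's roughly the single-file layout I've been sketching (Python's netCDF4 module purely for illustration -- any language with NetCDF bindings would do; the variable names are arbitrary, and the 252x204 grid size is inferred from the sample above). An unlimited time axis plus an explicit time coordinate means a late-arriving day is simply appended and found by its date, not by its position in the file or a filename:

    import numpy as np
    from datetime import datetime
    from netCDF4 import Dataset, date2num

    ds = Dataset("grid.nc", "w")
    ds.createDimension("time", None)     # unlimited: grows one day at a time
    ds.createDimension("lat", 252)
    ds.createDimension("lon", 204)

    time = ds.createVariable("time", "f8", ("time",))
    time.units = "days since 2000-01-01"
    value = ds.createVariable("value", "f4", ("time", "lat", "lon"))

    # Appending one day's grid; the date, not the index, identifies it.
    i = len(time)
    time[i] = date2num(datetime(2008, 9, 7), time.units)
    value[i, :, :] = np.zeros((252, 204), dtype="f4")  # real grid goes here
    ds.close()

Readers would then select by time value (or a lat/lon window) instead of parsing filenames.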

Any bright ideas or products?

A: 

I'd definitely change from text to binary, but still keep each day in a separate file. You could name the files in such a way that insertions in between don't cause any strangeness with indices, for example by including the date (and possibly the time) in the filename. You could also reconsider the file structure if you have several fields per location, for example. Is it common to look up a small tile across a large number of timesteps? In that case you might want to store the data as tiles covering several days each. You didn't mention how the data is accessed, which plays a big role in how to organize it efficiently.
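
A minimal sketch of that per-day binary layout (Python purely for illustration -- any of your languages would do; the .f32 extension and the 252x204 grid size are my assumptions):

    import struct

    ROWS, COLS = 252, 204          # assumed from the sample in the question

    def write_day(date_str, grid):
        """grid: ROWS lists of COLS floats; the filename carries the date."""
        with open(date_str + ".f32", "wb") as f:
            for row in grid:
                f.write(struct.pack("<%df" % COLS, *row))

    def read_day(date_str):
        with open(date_str + ".f32", "rb") as f:
            return [list(struct.unpack("<%df" % COLS, f.read(4 * COLS)))
                    for _ in range(ROWS)]

    # write_day("2008-09-07", grid): a missing day later just gets its own
    # file, and lexicographic filename order is date order, so nothing shifts.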

jjrv
Actually, if I have to have files, I prefer text, because then I can just go in and hand-edit when necessary. Small tile/many timesteps is the single most common use; in fact, a single-point time series is the #1. I am more interested in optimizing *my* time than the CPU's, though exec speed is good!
A: 

Clarifications:

I'm surprised you added "database" as one of the tags, and considered it as an option. Why did you do this?

Essentially, you have a 2D, single component floating point image at every time step. Would you agree with this way of viewing your data?

You also mentioned the desire to insert a day between two existing ones - which seems to be a very odd thing to do. Why would you need to do that? Is there a new day between May 4 and May 5 that I don't know about?

Is "compression" one of the things you care about, or are you just sick of flat files?

Would a float or a double be sufficient to store your data, or do you feel you need arbitrary precision?

Also, what programming language(s) do you want to access this data with?

Matt Cruikshank
Why database? (1) Datasets too big to fit in memory w/o writing my own file I/O code; (2) a query language instead of writing code (e.g., a date range for a given 10x10 spatial sub-array); (3) leverage 40 yrs of DB optimization. Languages: Java, Ruby, MATLAB. Insertion: sometimes a day is missing, only to be added later. [and more] Or we might recalc one day's data and replace it. A float is more than sufficient -- 5 sig figs is plenty for everything we do. Re languages: I'm doing a lot of work in Ruby now, but need to support old Java, work in MATLAB, and even accommodate FORTRAN routines! Good questions, Matt!
A: 

The answer to how you should store the data depends entirely on what you're going to do with it. For example, if you only ever need to retrieve a grid by specifying a date or a date range, then storing each day in a database as a BLOB makes some sense. But if you need to find records that have certain values, you'll need to do something different.
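
For example, the BLOB-per-day idea boils down to something like this (SQLite and all the names here are just for illustration):

    import sqlite3, struct

    db = sqlite3.connect("grids.db")
    db.execute("CREATE TABLE IF NOT EXISTS grids"
               " (day DATE PRIMARY KEY, grid BLOB)")

    def store_day(day, values):        # values: one day's grid, flattened
        blob = struct.pack("<%df" % len(values), *values)
        db.execute("INSERT OR REPLACE INTO grids VALUES (?, ?)", (day, blob))
        db.commit()

    def load_range(start, end):        # -> [(day, [floats ...]), ...]
        rows = db.execute("SELECT day, grid FROM grids"
                          " WHERE day BETWEEN ? AND ? ORDER BY day",
                          (start, end))
        return [(d, list(struct.unpack("<%df" % (len(b) // 4), b)))
                for d, b in rows]

Retrieval by date or date range is then one SQL query, but the values inside the blob stay opaque to the database, which is the limitation described above.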

Please describe how you need to be able to access the data.

longneck
Primary use case: Single-point time series. Might need to search for values (e.g. outliers) or run stats on one timestep's raster (e.g., "what's the std deviation/spatial variability from day to day?").
+1  A: 

I've assembled your comments here:

  1. I'd like to do all this "w/o writing my own file I/O code"
  2. I need access from "Java Ruby MATLAB" and "FORTRAN routines"

When you add these up, you definitely don't want a new file format. Stick with the one you've got.

If we can get you to relax your first requirement -- i.e., if you'd be willing to write your own file I/O code -- then there are some interesting options for you. I'd write C++ classes, and I'd use something like SWIG to make your new classes available to the multiple languages you need. (But I'm not sure SWIG can give you access from Java, Ruby, MATLAB and FORTRAN all at once. You might need something else; I'm not really sure how to do it myself.)

You also said, "Actually, if I have to have files, I prefer text because then I can just go in and hand-edit when necessary."

My belief is that this is a misguided statement. If you'd be willing to make your own file I/O routines then there are very clever things you could do... And as an ultimate fallback, you could give yourself a tool that converts from the new file format to the same old text format you're used to... And another tool that converts back. I'll come back to this at the end of my post...

You said something that I want to address:

"leverage 40 yrs of DB optimization"

Databases are meant for relational data, not raster data. You will not leverage anyone's DB optimizations with this kind of data. You might be able to cram your data into a DB, but that's hardly the same thing.

Here's the most useful thing I can tell you, based on everything you've told us. You said this:

"I am more interested in optimizing my time than the CPU's, though exec speed is good!"

This is frankly going to require TOOLS. Stop thinking of it as a text file. Start thinking of the common tasks you do, and write small tools - in WHATEVER LANGUAGE(S) - to make those things TRIVIAL to do.
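
For example, your #1 use case (a single-point time series) could be one small tool along these lines (Python chosen arbitrarily; the file layout is the one shown in the question, and the command-line interface is invented):

    import sys

    def point_series(path, row, col):
        """Return {day-of-year: value at (row, col)} from one text file."""
        series = {}
        with open(path) as f:
            day, r = None, 0
            for line in f:
                fields = line.split()
                if len(fields) == 1:        # a lone integer starts a new day
                    day, r = int(fields[0]), 0
                else:
                    if r == row:
                        series[day] = float(fields[col])
                    r += 1
        return series

    if __name__ == "__main__":
        path, row, col = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])
        for day, value in sorted(point_series(path, row, col).items()):
            print(day, value)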

And if your tools turn out to have lousy performance? Guess what - it's because your flat text file is a cruddy format. But that's just my opinion. :)

Matt Cruikshank
A: 

Matt, thanks very much, and likewise longneck and jjrv.

This post was partly an experiment, testing the quality of stackoverflow discourse. If you guys/gals/alien lifeforms are representative, I'm sold.

And on point, you've clarified my thinking considerably. Mind, I still might not necessarily implement your advice, but know that I will be thinking about it very seriously. >;-)

I may very well leave the file format the same, add to the extant C and/or Ruby routines to tack on the few low-level features I lack (e.g. inserting missing timesteps), and hang an HTTP front end on the whole thing so that the data can be consumed by whatever box needs it, in whatever language is currently hoopy. While it's mostly unchanging legacy software that constructs these data, we're always coming up with new consumers, so the multi-language/multi-computer requirement (gee, did I forget that one?) applies to the reading side, not the writing side. That also obviates a whole slew of security issues.
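
For the record, that front end could start as small as this sketch (Python's standard library here just to illustrate the shape of it -- I'd probably write the real one in Ruby; the URL scheme and the point_series stub are invented):

    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import parse_qs, urlparse

    def point_series(row, col):
        """Stand-in for the real lookup; returns {day: value}."""
        return {251: 12.76, 252: 13.02}

    class SeriesHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # e.g. GET /series?row=10&col=20 -> one "day value" pair per line
            q = parse_qs(urlparse(self.path).query)
            series = point_series(int(q["row"][0]), int(q["col"][0]))
            body = "".join("%d %s\n" % (d, v)
                           for d, v in sorted(series.items()))
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(body.encode())

    if __name__ == "__main__":
        HTTPServer(("", 8000), SeriesHandler).serve_forever()

Consumers then need nothing but an HTTP client.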

Thanks again, folks.