views: 88

answers: 5

I am dealing with large amounts of scientific data stored in tab-separated .tsv files. The typical operations are reading several large files, keeping only certain columns/rows, joining with other sources of data, adding calculated values, and writing the result as another .tsv.
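
For concreteness, a typical single pass looks roughly like this sketch (Python just for illustration; the column positions, threshold and file names are made up):

    import csv

    # Stream a large .tsv, keep two columns, filter rows,
    # add a calculated value and write the result as another .tsv.
    with open("measurements.tsv", newline="") as fin, \
         open("filtered.tsv", "w", newline="") as fout:
        reader = csv.reader(fin, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t")
        header = next(reader)
        writer.writerow([header[0], header[3], "ratio"])
        for row in reader:
            intensity = float(row[3])
            if intensity > 1000.0:                      # keep only strong signals
                writer.writerow([row[0], row[3], intensity / 1000.0])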

The plain text is used for its robustness, longevity and self-documenting character. Storing the data in another format is not an option, it has to stay open and easy to process. There is a lot of data (tens of TBs), and it is not affordable to load a copy into a relational database (we would have to buy twice as much storage space).

Since I am mostly doing selects and joins, I realized I basically need a database engine with a .tsv-based backing store. I do not care about transactions, since my data is all write-once-read-many. I need to process the data in place, without a major conversion step or data cloning.

As there is a lot of data to be queried this way, I need to process it efficiently, utilizing caching and a grid of computers.

Does anyone know of a system that would provide database-like capabilities while using plain tab-separated files as the backend? It seems to me like a very generic problem that virtually all scientists have to deal with in one way or another.

+2  A: 

One of these NoSQL DBs might work. I highly doubt any are configurable to sit on top of flat, delimited files, but you might look at one of the open-source projects and write your own database layer.

Stefan Kendall
I guess it would take some major hacking, but I will look around. I am curious about Drizzle and its microkernel architecture - it might be a way.
Roman Zenka
+1  A: 

You can do this with LINQ to Objects if you are in a .NET environment: streaming/deferred execution, a functional programming model, and all of the SQL operators. The joins work in a streaming model, but one table gets pulled into memory, so this works best when you are joining a large table to a smaller one.

The ease of shaping the data and the ability to write your own expressions would really shine in a scientific application.

LINQ against a delimited text file is a common demonstration of LINQ. You need to provide the ability to feed LINQ a tabular model. Google "LINQ for text files" for some examples (e.g., see http://www.codeproject.com/KB/linq/Linq2CSV.aspx, http://www.thereforesystems.com/tutorial-reading-a-text-file-using-linq/, etc.).

Expect a learning curve, but it's a good solution for your problem. One of the best treatments of the subject is Jon Skeet's C# in Depth; pick up the "MEAP" version from Manning for early access to his latest edition.

I've done work like this before with large mailing lists that needed to be cleansed, deduped and appended. You are invariably I/O bound. Try solid-state drives, particularly Intel's "E" series which has very fast write performance, and RAID them in as parallel a configuration as possible. We also used grids, but had to adjust the algorithms to multi-pass approaches that reduce the data.

Note that I would agree with the other answers that stress loading into a database and indexing, if the data is very regular. In that case you're basically doing ETL, which is a well-understood problem in the warehousing community. If the data is ad hoc, however, with scientists just dropping their results in a directory, a need for "agile/just-in-time" transformations, and most transformations being single-pass select ... where ... join, then you're approaching it the right way.

Rob
Thank you, that definitely sounds like a possibility. Sadly, this is running on Linux, but I could possibly borrow the idea from LINQ while providing my own implementation, roughly along the lines of the sketch below.
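
A rough Python sketch of the streaming-join idea, holding only the small lookup table in memory (the file layout and join column are invented):

    import csv

    def join_tsv(large_path, small_path, out_path):
        """Stream a large .tsv and join it against a small lookup table held in memory."""
        with open(small_path, newline="") as f:
            lookup = {row[0]: row[1:] for row in csv.reader(f, delimiter="\t")}

        with open(large_path, newline="") as fin, \
             open(out_path, "w", newline="") as fout:
            writer = csv.writer(fout, delimiter="\t")
            for row in csv.reader(fin, delimiter="\t"):
                extra = lookup.get(row[0])           # join on the first column
                if extra is not None:
                    writer.writerow(row + extra)     # emit only matching rows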
Roman Zenka
+1  A: 

Scalability begins at a point beyond tab-separated ASCII.

Just be practical - don't academicise it - convention frees your fingers as well as your mind.

I wish I could keep to a convention, but the requirements keep changing, and the data often comes from hardware and software whose schema changes over time. Some of the queries are completely ad hoc (e.g. somebody comes in with a random translation table in Excel and wants to use it as part of their query).
Roman Zenka
+2  A: 

There is a lot of data (tens of TBs), and it is not affordable to load a copy into a relational database (we would have to buy twice as much storage space).

You know your requirements better than any of us, but I would suggest you think about this again. If you have 16-bit integers (0-65535) stored in a .tsv file, your storage efficiency is about 33%: it takes 5 bytes to store most 16-bit integers, plus 1 byte for the delimiter, for 6 bytes in total, whereas the native integers take 2 bytes. For floating-point data the efficiency is even worse.
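
To make the arithmetic concrete, a throwaway Python check (not part of any proposed solution):

    import struct

    value = 54321                         # a typical 16-bit integer
    as_text = f"{value}\t"                # .tsv form: digits plus a tab delimiter
    as_binary = struct.pack("<H", value)  # native unsigned 16-bit integer

    print(len(as_text))    # 6 -> six bytes on disk in ASCII
    print(len(as_binary))  # 2 -> roughly 33% storage efficiency for the text form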

I would consider taking the existing data and, instead of storing it raw, processing it in the following two ways:

  1. Store it compressed in a well-known compression format (e.g. gzip or bzip2) on your permanent archiving media (backup servers, tape drives, whatever), so that you retain the advantages of the .tsv format (see the sketch after this list).
  2. Process it into a database which has good storage efficiency. If the files have a fixed and rigorous format (e.g. column X is always a string, column Y is always a 16-bit integer), then you're probably in good shape. Otherwise, a NoSQL database might be better (see Stefan's answer).
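
For point 1, the compressed archive stays directly readable as tab-separated text; a minimal Python sketch, with the file name invented:

    import csv
    import gzip

    # Stream the gzipped archive as plain tab-separated text,
    # without materialising an uncompressed copy on disk.
    with gzip.open("results_2010.tsv.gz", "rt", newline="") as handle:
        row_count = sum(1 for _ in csv.reader(handle, delimiter="\t"))
    print(row_count)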

This would create an auditable (but perhaps slowly accessible) archive with low risk of data loss, and a quickly-accessible database that doesn't need to be concerned with losing the source data, since you can always re-read it into the database from the archive.

You should be able to reduce your storage space; you should not need twice as much, as you state.

Indexing is going to be the hard part; you'd better have a good idea of what subset of the data you need to be able to query efficiently.

Jason S
I usually have floating-point numbers of varying precision, and strings. The problem is that the data is currently accessed by many different scripts, plus legacy and third-party software. The database would provide additional functionality, allowing users to run large queries, but simple access to the plain-text data is important. I would probably need a FUSE filesystem that provides an uncompressed view of the data.
Roman Zenka
+1  A: 

I would upvote Jason's recommendation if I had the reputation. My only addition is that if you do not store the data in a different format, such as the database Jason suggested, you pay the parsing cost on every operation instead of just once when you initially process it.
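
For what it's worth, even a crude parse-once cache illustrates the difference; a Python sketch, with the column types and cache location made up:

    import csv
    import pickle
    from pathlib import Path

    def load_table(tsv_path):
        """Parse the .tsv once into typed tuples and cache them; later calls skip the text parsing."""
        cache = Path(tsv_path).with_suffix(".pkl")   # cache location chosen only for illustration
        if cache.exists():
            return pickle.loads(cache.read_bytes())
        with open(tsv_path, newline="") as f:
            # assumes a two-column name<TAB>value file, again just for illustration
            rows = [(name, float(value)) for name, value in csv.reader(f, delimiter="\t")]
        cache.write_bytes(pickle.dumps(rows))
        return rows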

SargeATM
I think I could come up with a hybrid solution: compress only the most space-consuming file types, and store only the most time-critical part of the data in the database. It is all about the economics: where it pays to trade ease of access and management for speed.
Roman Zenka