views: 755
answers: 8

Hi, I have large amounts of data (a few terabytes) and accumulating... They are contained in many tab-delimited flat text files (each about 30MB). Most of the task involves reading the data and aggregating (summing/averaging + additional transformations) over observations/rows based on a series of predicate statements, and then saving the output as text, HDF5, or SQLite files, etc. I normally use R for such tasks but I fear this may be a bit large. Some candidate solutions are to

  1. write the whole thing in C (or Fortran)
  2. import the files (tables) into a relational database directly and then pull off chunks in R or Python (some of the transformations are not amenable to pure SQL solutions)
  3. write the whole thing in Python

Would (3) be a bad idea? I know you can wrap C routines in Python, but since there isn't anything computationally prohibitive here (e.g., optimization routines that require many iterative calculations), I think I/O may be as much of a bottleneck as the computation itself. Do you have any recommendations on further considerations or suggestions? Thanks

Edit: Thanks for your responses. There seem to be conflicting opinions about Hadoop, but in any case I don't have access to a cluster (though I can use several unnetworked machines)...

+9  A: 

(3) is not necessarily a bad idea -- Python makes it easy to process "CSV" files (and despite the C standing for Comma, a tab separator is just as easy to handle), and of course it gets just about as much I/O bandwidth as any other language. As for other recommendations: numpy, besides fast computation (which you may not need, per your statements), provides very handy, flexible multi-dimensional arrays, which may be quite useful for your tasks; and the standard library module multiprocessing lets you exploit multiple cores for any task that's easy to parallelize (important since just about every machine these days has multiple cores;-).
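A minimal sketch of what that combination might look like (Python 3; the file glob, column indices, and predicate below are hypothetical placeholders for your own layout and aggregation):

```python
import csv
import glob
from multiprocessing import Pool

def aggregate_file(path):
    """Sum one numeric column over rows matching a simple predicate, for one file."""
    total, count = 0.0, 0
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")   # tab-delimited, not comma
        next(reader)                             # skip header row
        for row in reader:
            if row[0] == "some_group":           # hypothetical predicate on column 0
                total += float(row[3])           # hypothetical numeric column
                count += 1
    return total, count

if __name__ == "__main__":
    files = glob.glob("data/*.txt")              # hypothetical location of the ~30MB files
    with Pool() as pool:                         # defaults to one worker per core
        partials = pool.map(aggregate_file, files)
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    print("overall mean:", total / count if count else float("nan"))
```

Per-file aggregation like this parallelizes cleanly because each worker only touches its own file; only the small (total, count) pairs cross process boundaries.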

Alex Martelli
I think I will give Python, NumPy, and multiprocessing a try... Thanks much.
Stephen
Agreed. Python's supposed performance penalty doesn't make much of an appearance in the real world.
Paul McMillan
+6  A: 

Have a look at Disco. It is a lightweight distributed MapReduce engine, written in about 2,000 lines of Erlang but specifically designed for Python development. It supports not only working on your data, but also storing and replicating it reliably. They've just released version 0.3, which includes an indexing and database layer.
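For a feel of the programming model, this is roughly what Disco's canonical word-count example looks like; the DDFS tag name is hypothetical and the API details may differ between Disco versions, so treat it as an illustrative sketch rather than a definitive recipe:

```python
from disco.core import Job, result_iterator

def map(line, params):
    # emit a (word, 1) pair for every word on the input line
    for word in line.split():
        yield word, 1

def reduce(iter, params):
    # sum the counts collected for each word
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == '__main__':
    job = Job().run(input=["tag://data:mydata"],   # hypothetical DDFS tag for the input files
                    map=map,
                    reduce=reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print(word, count)
```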

Marcelo Cantos
Thanks - I'll have to keep an eye on Disco. I couldn't find much documentation on the database layer though, and perhaps the MapReduce framework is not appropriate for my hardware at the moment...
Stephen
DiscoDB is a poor man's database. It is basically just a distributed hash table that sits on top of DDFS. I don't know much about it myself, since it's so new.
Marcelo Cantos
+2  A: 

If you have a cluster of machines, you can parallelize your application using Hadoop MapReduce. Although Hadoop is written in Java, it can run Python too. You can check out the following link for pointers on parallelizing your code: PythonWordCount
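If you do go down this path, Hadoop Streaming is a common way to run Python on Hadoop: the mapper and reducer are plain scripts that read lines from stdin and write tab-separated key/value pairs to stdout. A minimal mapper sketch (the column indices are hypothetical placeholders):

```python
#!/usr/bin/env python
# mapper.py -- emit (group, value) pairs from tab-delimited rows on stdin
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) > 3:                       # hypothetical layout: key in column 0, value in column 3
        print("%s\t%s" % (fields[0], fields[3]))
```

A matching reducer reads the sorted key/value lines back from stdin and sums or averages per key; the pair is then submitted with the hadoop-streaming jar that ships with Hadoop (the exact jar path varies by version and distribution).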

Snehal
Thanks ~ I don't have a cluster of networked machines for this, though...
Stephen
+1  A: 

Yes, you are right! I/O will take up most of your processing time. I don't suggest using distributed systems like Hadoop for this task.

Your task could be done on a modest workstation. I am not a Python expert, but I think it has support for asynchronous programming. In F#/.NET, the platform has good support for that. I once did an image-processing job where loading 20K images from disk and transforming them into feature vectors took only a few minutes in parallel.

All in all, load and process your data in parallel and save the results in memory (if small) or in a database (if big).

Yin Zhu
Thanks - I hope so. All I have at my disposal are a few modest workstations (and some time)
Stephen
+4  A: 

With terabytes, you will want to parallelize your reads over many disks anyway, so you might as well go straight to Hadoop.

Use Pig or Hive to query the data; both have extensive support for user-defined transformations, so you should be able to implement what you need to do using custom code.

SquareCog
Did not know about Pig or Hive - will keep that in mind...
Stephen
+8  A: 

Ok, so just to be different, why not R?

  • You seem to know R, so you may get to working code quickly
  • 30 MB per file is not large on a standard workstation with a few GB of RAM
  • the read.csv() variant of read.table() can be very efficient if you specify the types of the columns via the colClasses argument: instead of guesstimating types for conversion, these will be handled efficiently
  • the bottleneck here is I/O from the disk, and that is the same for every language
  • R has multicore to set up parallel processing on machines with multiple cores (similar to Python's multiprocessing, it seems)
  • Should you want to employ the 'embarrassingly parallel' structure of the problem, R has several packages that are well-suited to data-parallel problems: e.g., snow and foreach can each be deployed on just one machine, or on a set of networked machines.
Dirk Eddelbuettel
I think it could be done with R, but part of me feels it's not the right tool? Though with all the extra considerations you've included it could certainly be made to work... I've recently started using data.frame(scan(filename,what=list(...))) to try to speed up the input but specifying the colClasses might just be another good idea. Thanks ~
Stephen
@Stephen Don't forget to check this thread http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r
Marek
Thanks! I've been a long-time R user but this is very helpful
Stephen
Glad to be of help. You may also find my ['Intro to HPC with R' tutorial slides](http://dirk.eddelbuettel.com/presentations.html) useful.
Dirk Eddelbuettel
Thanks - yes, I've been aware of your work for some time :)
Stephen
+1  A: 

When you say "accumulating", solution (2) looks most suitable to the problem.
After the initial load into the database, you only update it with new files (daily, weekly? depends on how often you need this).

In cases (1) and (3) you need to process the files each time (which, as stated earlier, is the most time/resource-consuming part), unless you find a way to store the results and update them with new files.

You could use R to process the files from CSV into, for example, an SQLite database.
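A rough sketch of that incremental-load idea (shown here in Python with the standard-library sqlite3 module to match the rest of the thread; the schema, column indices, and file locations are hypothetical): keep a table of already-loaded files so that each run only imports the new ones.

```python
import csv
import glob
import sqlite3

conn = sqlite3.connect("observations.db")
conn.execute("CREATE TABLE IF NOT EXISTS loaded_files (path TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE IF NOT EXISTS obs (grp TEXT, ts TEXT, value REAL)")  # hypothetical columns

already = {row[0] for row in conn.execute("SELECT path FROM loaded_files")}

for path in glob.glob("data/*.txt"):             # hypothetical location of the flat files
    if path in already:
        continue                                 # skip files loaded on a previous run
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)                             # skip header row
        conn.executemany("INSERT INTO obs VALUES (?, ?, ?)",
                         ((r[0], r[1], float(r[3])) for r in reader))
    conn.execute("INSERT INTO loaded_files VALUES (?)", (path,))
    conn.commit()                                # one transaction per file

conn.close()
```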

Marek
Ah yes, I see what you mean... but once I process each text file I will store the results most likely in an SQLite file and work with these results for post-analysis. The aggregating I do is a sort of time-averaging (this is longitudinal data) so I won't have to re-import previous files when analyzing the current one (which is why MapReduce would make sense I guess...).
Stephen
+4  A: 

I've had good luck using R with Hadoop on Amazon's Elastic MapReduce. With EMR you pay only for the compute time you use, and AMZN takes care of spinning the instances up and down. Exactly how to structure the job in EMR really depends on how your analysis workflow is structured. For example, are all the records needed for one job contained completely inside each csv, or do you need bits from each csv to complete an analysis?

Here are some resources you might find useful:

The problem I mentioned in my blog post is more one of being CPU-bound, not I/O-bound. Your issues are more about I/O, but the tips on loading libraries and cache files might be useful.

While it's tempting to try to shove this into and out of a relational database, I recommend carefully considering whether you really need all the overhead of an RDB. If you don't, you may create a bottleneck and a development challenge with no real reward.

JD Long
Thanks so much - I guess a question remains as to whether it's worth the overhead of transferring the data to their infrastructure!
Stephen