views:

151

answers:

2

Hi, I'd like to begin thinking about how I can scale up my algorithms that I write for data analysis so that they can be applied to arbitrarily large sets of data. I wonder what are the relevant concepts (threads, concurrency, immutable data structures, recursion) and tools (Hadoop/MapReduce, Terracota, and Eucalyptus) to make this happen, and how specifically these concepts and tools are related to each other. I have a rudimentary background in R, Python, and bash scripting and also C and Fortran programming, though I'm familiar with some basic functional programming concepts also. Do I need to change the way that I program, use a different language (Clojure, Haskell, etc.), or simply (or not so simply!) adapt something like R/Hadoop (HRIPE)... or write wrappers for Python to enable multi-threading or Hadoop access? I understand this would might involve requirements for additional hardware and I would like some basic idea of what the requirements/options available might be. My apologies for this rather large and yet vague question, but just trying to get started - thanks in advance!

+3  A: 

The main thing for scaling up to large data is to avoid situations where you're reading huge datasets into memory at once. In pythonic terms this generally means using iterators to consume the dataset in manageable pieces.

SpliFF
I love iterators! Can't stop usin'em.
Stephen
+7  A: 

While languages and associated technologies/frameworks are important for scaling, they tend to pale in comparison to the importance of the algorithms, data structure, and architectures. Forget threads: the number of cores you can exploit that way is just too limited -- you want separate processes exchanging messages, so you can scale up at least to a small cluster of servers on a fast LAN (and ideally a large cluster as well!-).

Relational databases may be an exception to "technologies pale" -- they can really clamp you down when you're trying to scale up a few orders of magnitude. Is that your situation -- are you worried about mere dozens or at most hundreds of servers, or are you starting to think about thousands or myriads? In the former case, you can still stretch relational technology (e.g. by horizontal and vertical sharding) to support you -- in the latter, you're at the breaking point, or well past it, and must start thinking in terms of key/value stores.

Back to algorithms -- "data analysis" cover a wide range... most of my work for Google over the last few years falls in that range, e.g. in cluster management software, and currently in business intelligence. Do you need deterministic analysis (e.g. for accounting purposes, where you can't possibly overlook a single penny out of 8-digit figures), or can you stand some non-determinism? Most "data mining" applications fall into the second category -- you don't need total precision and determinism, just a good estimate of the range that your results can be proven to fall within, with, say, 95% probability.

This is particularly crucial if you ever need to do "real-near-time" data analysis -- near-real-time and 100% accuracy constraints on the same computation do not a happy camper make. But even in bulk/batch off-line data mining, if you can deliver results that are 95% guaranteed orders of magnitude faster than it would take for 99.99% (I don't know if data mining can ever be 100.00%!-), that may be a wonderful tradeoff.

The work I've been doing over the last few years has had a few requirements for "near-real-time" and many more requirements for off-line, "batch" analysis -- and only a very few cases where absolute accuracy is an absolute must. Gradually-refined sampling (when full guaranteed accuracy is not required), especially coupled with stratified sampling (designed closely with a domain expert!!!), has proven, over and over, to be a great approach; if you don't understand this terminology, and still want to scale up, beyond the terabytes, to exabytes and petabytes' worth of processing, you desperately need a good refresher course in Stats 201, or whatever course covers these concepts in your part of the woods (or on iTunes University, or the YouTube offerings in university channels, or blip.tv's, or whatever).

Python, R, C++, whatever, only come into play after you've mastered these algorithmic issues, the architectural issues that go with them (can you design a computation architecture to "statistically survive" the death of a couple of servers out of your myriad, recovering to within statistically significant accuracy without a lot of rework...?), and the supporting design and storage-technology choices.

Alex Martelli
Wow, I'm very honored! I have your Python book... Thanks for your detailed response. My desires are still yet abstract but are dealing with earth system measurements... no real-time requirement at the moment, and 100% precision not required (following the human processing model: quick and approximate. And yes, I'm quite familiar with stratified sampling methods!). So sounds like the algorithm is strongly determined by hardware (naturally), and the first step might be to consider data reduction through representative sampling. More thought needed.
Stephen
@Stephen, yes, starting with a coarse grid and refining it smartly and gradually may work well for you (if the geophysicists agree): I'm personally more comfortable with some randomness in the sampling but it's not a requirement unless there are possible artefacts coming from aspects of non-randomness ("sampling bias") which in geophysics, I suspect, may not arise. Once you start scaling up by message passing, reducing needs for synchronization and communication among nodes (esp. continuous "roundtrips" and waits) becomes paramount, and spatial-grid sampling works well for this!
Alex Martelli