tags:

views:

314

answers:

5
+4  A: 

MapReduce is good for scaling the processing of large datasets, but it is not intended to be responsive. In the Hadoop implementation, for instance, the startup overhead alone usually takes a couple of minutes. The idea is to take a processing job that would take days and bring it down to the order of hours, or hours down to minutes, and so on. But you would not start a new job in response to a web request and expect it to finish in time to respond.

To touch on why this is the case, consider the way MapReduce works (general, high-level overview):

  • A bunch of nodes receive portions of the input data (called splits) and do some processing (the map step)

  • The intermediate data (output from the last step) is repartitioned such that data with like keys ends up together. This usually requires some data transfer between nodes.

  • The reduce nodes (which are not necessarily distinct from the mapper nodes - a single machine can do multiple jobs in succession) perform the reduce step.

  • Result data is collected and merged to produce the final output set.

While Hadoop et al. try to keep data locality as high as possible, there is still a fair amount of data shuffling between nodes during processing. This alone should preclude you from backing a responsive web interface with a distributed MapReduce implementation.
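
The phases above can be imitated in plain, in-process Python. This is a toy sketch of the map / shuffle / reduce flow (word count as the example job), not anything Hadoop-specific:

```python
from collections import defaultdict

def map_phase(split):
    # Map: emit (key, value) pairs for each record in an input split
    for line in split:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group intermediate values by key; in Hadoop this is the
    # step that moves data across the network between nodes
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: collapse all values for one key into a final result
    return (key, sum(values))

splits = [["the quick brown fox"], ["the lazy dog"]]
intermediate = [pair for split in splits for pair in map_phase(split)]
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'the': 2, 'quick': 1, ...}
```

Even in this toy version you can see why latency suffers: nothing is returned until every split has been mapped, shuffled, and reduced.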

Edit: as Jan Jongboom pointed out, MapReduce is very good for preprocessing data such that web queries can be fast BECAUSE they don't need to engage in processing. Consider the famous example of creating an inverted index from a large set of webpages.
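
A toy version of that inverted-index idea (the page names and contents here are made up for illustration; a real job would run the same map and reduce logic distributed over many nodes):

```python
from collections import defaultdict

# Hypothetical corpus; in practice these would be crawled webpages.
pages = {
    "a.html": "hadoop scales batch processing",
    "b.html": "batch jobs are not interactive",
}

def mapper(url, text):
    # Map: emit (word, url) for every word on the page
    for word in text.split():
        yield word, url

index = defaultdict(set)
for url, text in pages.items():     # map + shuffle, done in-process here
    for word, u in mapper(url, text):
        index[word].add(u)          # reduce: union of urls per word

# The web app then serves queries with a cheap lookup, no MapReduce needed:
print(sorted(index["batch"]))  # ['a.html', 'b.html']
```

The expensive batch job runs offline; the web query is just a dictionary lookup against its output.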

danben
You can, however, create a MapReduce algorithm and let it preprocess the data, so that your webapp can query the optimized datasets. That way you get responsive queries.
Jan Jongboom
@Jan Jongboom - exactly
danben
I will get the data from that XML file, and it will happen once a day. So it's possible for me to process it and store it in a very digestible format.
Pasha
If you are getting 5GB once a day, you shouldn't need a distributed system for processing.
danben
5GB was just an example; in reality the data might be bigger. Also, parsing a 5GB file linearly is too slow, and the system needs to support a few dozen concurrent users. This is why I started looking into MapReduce and DFS.
Pasha
Ok, then ignore my last comment.
danben
+2  A: 

A distributed implementation of MapReduce such as Hadoop is not a good fit for processing a single 5GB XML file:

  • Hadoop works best on large amounts of data. Although 5GB is a fairly big XML file, it can easily be processed on a single machine.
  • Input files to Hadoop jobs need to be splittable so that different parts of the file can be processed on different machines. Unless your XML is trivially flat, splitting the file at arbitrary byte offsets will break its structure, so you'll need a preprocessing step to reformat the file into a splittable form.
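
One common way to do that preprocessing is to stream-parse the XML and emit one record per line, since newline-delimited text splits cleanly. A sketch, assuming the records are `<record>` elements with flat child fields (the tag name is a placeholder for whatever your file actually uses):

```python
import xml.etree.ElementTree as ET

def flatten(xml_path, out_path, record_tag="record"):
    # Stream-parse the XML and write one record per line, so the output
    # can be split at newline boundaries by TextInputFormat-style readers.
    with open(out_path, "w") as out:
        for event, elem in ET.iterparse(xml_path):
            if elem.tag == record_tag:
                fields = [(c.tag, c.text or "") for c in elem]
                out.write("\t".join(f"{t}={v}" for t, v in fields) + "\n")
                elem.clear()  # free parsed elements so memory stays flat
```

Because `iterparse` streams, this works on files much larger than RAM, and the flattening itself is a single linear pass.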

If you had many 5GB files, then you could use Hadoop to distribute the splitting. You could also use it to merge results across files and store the output in a format optimized for fast querying by your web interface, as other answers have mentioned.

Robert Christie
A: 

It sounds like what you might want is a good old-fashioned database. Not quite as trendy as map/reduce, but often sufficient for small jobs like this. Depending on how flexible your filtering needs to be, you could either import your 5GB file into a SQL database, or implement your own indexing scheme: storing records in different files, keeping everything in memory in a giant hashtable, or whatever is appropriate for your needs.
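
For instance, a one-off load into SQLite (the table and column names here are invented for illustration; real rows would be parsed out of the daily XML file) gives you indexed filtering with no cluster at all:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for persistence
conn.execute("CREATE TABLE records (id INTEGER, category TEXT, value REAL)")
conn.execute("CREATE INDEX idx_category ON records (category)")

# Hypothetical rows; in practice, parsed from the daily XML dump.
rows = [(1, "a", 1.5), (2, "b", 2.5), (3, "a", 3.5)]
conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)

# Filtering is now an indexed lookup, fast enough for a web request.
total = conn.execute(
    "SELECT SUM(value) FROM records WHERE category = ?", ("a",)
).fetchone()[0]
print(total)  # 5.0
```

The daily import is the slow batch step; every user-facing query after that hits the index.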

Peter Recore
I wish I hadn't said 5GB; everyone seems to be latching onto it. The data we will eventually be dealing with is on the order of 100s of GBs a day, and we'll have to process many days of data.
Pasha
Yup, we're latching on because most mapreduce implementations are designed to handle large datasets, not small ones :)
Peter Recore
+3  A: 

MapReduce is a generic term. You probably mean to ask whether a fully featured MapReduce framework with job control, such as Hadoop, is right for you. The answer still depends on the framework, but usually, the job control, network, data replication, and fault tolerance features of a MapReduce framework make it suitable for tasks that take minutes, hours, or longer, and that's probably the short and correct answer for you.

The MapReduce paradigm might be useful to you if your tasks can be split among independent mappers and combined with one or more reducers, and the language, framework, and infrastructure that you have available let you take advantage of that.

There isn't necessarily a distinction between MapReduce and a database. A declarative language such as SQL is a good way to abstract parallelism, as are queryable MapReduce frameworks such as HBase. This article discusses MapReduce implementations of a k-means algorithm, and ends with a pure SQL example (which assumes that the server can parallelize it).

Ideally, a developer doesn't need to know too much about the plumbing at all. Erlang examples like to show off how the functional language features handle process control.

Also, keep in mind that there are lightweight ways to play with MapReduce, such as bashreduce.

Karl Anderson
+2  A: 

I recently worked on a system that processes roughly 120GB/hour with 30 days of history. We ended up using Netezza for organizational reasons, but I think Hadoop may be an appropriate solution depending on the details of your data and queries.

Note that XML is very verbose. One of your main costs will be reading from and writing to disk. If you can, choose a more compact format.

The number of nodes in your cluster will depend on the type and number of disks and CPUs. For a rough calculation, you can assume you will be limited by disk speed. If a 7200rpm disk can scan at 50MB/s and you want to scan 500GB in 10s, then you need 1000 nodes.
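
That back-of-the-envelope estimate, spelled out (using decimal units, 1 GB = 1000 MB, as the answer does):

```python
# Figures from the estimate above
disk_mb_s = 50        # sequential scan rate of one 7200rpm disk, MB/s
data_mb = 500 * 1000  # 500 GB to scan
target_s = 10         # desired total scan time

needed_mb_s = data_mb / target_s   # aggregate throughput required
nodes = needed_mb_s / disk_mb_s    # assuming one disk per node
print(int(nodes))  # 1000
```

More disks per node, faster disks, or a looser time budget all scale the node count down proportionally.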

You may want to play with Amazon's EC2, where you can stand up a Hadoop cluster and pay by the minute, or you can run a MapReduce job on their infrastructure.

Vadim P.