I'm getting into parallel programming and I'm studying MapReduce and other distributed algorithms. Is it best just to learn MapReduce, or is there a more general algorithm that will serve me better?
...
I am asking this because I am wondering whether it could be efficient to run MapReduce queries over a database or a shared key-value store.
For example, could a web crawler, which indexes the internet and counts all the terms on different web pages, be implemented efficiently with a database as a backend?
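For the term-counting part, here is a minimal, framework-agnostic sketch in Python of what the map and reduce could look like. The input, an iterable of (url, page_text) rows, is a hypothetical stand-in for whatever backend stores the crawl; whether that backend is a database or a key-value store only changes how the rows are read and how the counts are written back.

from collections import defaultdict

# Hypothetical input: an iterable of (url, page_text) rows, e.g. read from a
# database table or a key-value store.
def map_terms(url, page_text):
    for term in page_text.lower().split():
        yield term, 1

def reduce_counts(term, counts):
    return term, sum(counts)

def run_word_count(rows):
    grouped = defaultdict(list)
    for url, text in rows:
        for term, one in map_terms(url, text):
            grouped[term].append(one)          # stand-in for the shuffle / group-by-key
    return dict(reduce_counts(term, counts) for term, counts in grouped.items())

if __name__ == "__main__":
    rows = [("http://a.example", "big data big ideas"),
            ("http://b.example", "big crawl")]
    print(run_word_count(rows))   # {'big': 3, 'data': 1, 'ideas': 1, 'crawl': 1}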
...
Could MapReduce be used to implement a web server?
I'm thinking of something like this: when a request comes in, it sits on a queue until a server is free to process it. Or am I missing the point here?
...
I am looking at using a MapReduce system to serve web pages, and I have seen that load balancers are already used for distributing web page requests. Is there any reason that a MapReduce system, Hadoop for example, could not do this?
...
For the purposes of history on Wikipedia, is anyone familiar with the origin of the phrase "embarrassingly parallel"? I've always thought that it may have been coined by a random Google employee who first worked on MapReduce. Does anyone have any concrete info on the origin?
...
I have been learning the MapReduce algorithm and how it can potentially scale to millions of machines, but I don't understand how the sorting of the intermediate keys after the map phase can scale, as there will be 1,000,000 x 1,000,000 potential machines communicating small key/value pairs of the intermediate results with each ...
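For what it's worth, the shuffle is usually far less than all-to-all: intermediate keys are hash-partitioned into R reducer partitions, so each map task exchanges data with at most R reducers rather than with every other machine. A rough sketch of the idea (the numbers are made up):

NUM_REDUCERS = 1000   # hypothetical R, far smaller than the number of map tasks

def partition(key, num_reducers=NUM_REDUCERS):
    # Real frameworks use a stable hash; Python's built-in hash() is only a stand-in.
    return hash(key) % num_reducers

# Each map task writes its output into R locally sorted partition files.
# Reducer r then fetches partition r from every map task and merges the
# already-sorted runs, so no global all-to-all sort is ever performed.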
I'm trying to find out how I can iterate over the final results of a MapReduce operation, so I guess there must be some sort of index into the MapReduce results?
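If this is Hadoop-style MapReduce, there is generally no index over the output: the job leaves a set of part files (part-00000, part-00001, ...), each sorted by key within the file, and you either scan them or load them into a store that supports lookups. A hedged sketch of scanning them, assuming the tab-separated text output that Hadoop Streaming writes by default (the path is made up):

import glob

# Assumes "key<TAB>value" lines in files named part-00000, part-00001, ...
def iter_results(output_dir):
    for path in sorted(glob.glob(output_dir + "/part-*")):
        with open(path) as f:
            for line in f:
                key, _, value = line.rstrip("\n").partition("\t")
                yield key, value

for key, value in iter_results("/tmp/job-output"):   # hypothetical local copy of the output
    print(key, value)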
...
MapReduce is a pattern that seems to be getting a lot of traction lately, and I'm starting to see it manifest in one of my projects, which is focused on an event-processing pipeline (iPhone accelerometer and GPS data). I needed to build a lot of infrastructure for this project; in fact it outweighs the logic code interacting with it by 2x. Some of th...
Hello everyone,
I am looking for a map/reduce function to calculate the status in a Design Document.
Below you can see an example document from my current database.
{
    "_id": "0238f1414f2f95a47266ca43709a6591",
    "_rev": "22-24a741981b4de71f33cc70c7e5744442",
    "status": "retrieved image urls",
    "term": "Lucas Winter",
    "urls"...
Hi,
I am running a Hadoop Streaming job. It got stuck for no apparent reason. I am not sure how to cancel the task so that Hadoop schedules another attempt for the same job. I tried killing the job, but that still doesn't work. Does anyone know how to do this?
Thank you
Bala
...
My CouchDB database has a main document type that looks something like:
{
    "_id" : "doc1",
    "type" : "main_doc",
    "title" : "the first doc"
    ...
}
There is another type of document that stores user information. I want users to be able to tag documents as favorites. Different users can save the same or different documents as fav...
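The post is truncated, but one common modelling approach, sketched with hypothetical field names since the user document is not shown, is to keep the list of favorite document ids on each user document and build a view keyed by document id. This is written in Python view-server syntax (assuming couchdb-python's couchpy); CouchDB's default JavaScript view server would express the same thing in JavaScript.

# Hypothetical user document: {"type": "user", "name": "...", "favorites": ["doc1", ...]}
def favorites_map(doc):
    if doc.get("type") == "user":
        for doc_id in doc.get("favorites", []):
            yield doc_id, doc["name"]

# Querying the view with key="doc1" lists the users who favorited doc1;
# adding the built-in "_count" reduce gives a favorite count per document.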
In many real-life situations where you apply MapReduce, the final algorithms end up being several MapReduce steps.
I.e. Map1, Reduce1, Map2, Reduce2, etc.
So the output of one reduce is needed as the input for the next map.
The intermediate data is something you (in general) do not want to keep once the pipel...
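One way to express such a chain without managing the intermediate paths by hand is a multi-step job. The sketch below assumes the mrjob library, which stores each step's output in a temporary location and, by default, discards it once the whole pipeline has finished; the step bodies here are placeholders, not a specific algorithm.

from mrjob.job import MRJob
from mrjob.step import MRStep

class TwoPassJob(MRJob):
    """Map1/Reduce1 feeds Map2/Reduce2; the intermediate output lives in a
    temporary location and is cleaned up when the pipeline completes."""

    def steps(self):
        return [
            MRStep(mapper=self.map1, reducer=self.reduce1),
            MRStep(mapper=self.map2, reducer=self.reduce2),
        ]

    def map1(self, _, line):          # placeholder first pass: words per line
        for word in line.split():
            yield word, 1

    def reduce1(self, word, counts):
        yield word, sum(counts)

    def map2(self, word, count):      # placeholder second pass: regroup by count
        yield count, word

    def reduce2(self, count, words):
        yield count, sorted(words)

if __name__ == "__main__":
    TwoPassJob.run()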
I am testing jobs in EMR, and each and every test takes a long time to start up. Is there a way to keep the server/master node alive in Amazon EMR? I know this can be done with the API, but I wanted to know if it can be done in the AWS console.
...
I want to write a map/reduce job to select a number of random samples from a large dataset based on a row-level condition. I want to minimize the number of intermediate keys.
Pseudocode:
for each row
    if row matches condition
        put the row.id in the bucket if the bucket is not already large enough
Have you done something like th...
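A sketch of the single-key trick in plain Python (names and sizes are made up): each mapper keeps its own bounded bucket and emits it under one constant key, so there is exactly one intermediate key and at most num_mappers x bucket_size ids cross the shuffle; the reducer trims the union down to the final sample. Since the question asks for random samples, the bucket here is kept as a reservoir sample rather than simply the first matches.

import random

BUCKET_SIZE = 100       # hypothetical per-mapper cap
SAMPLE_KEY = "sample"   # the single intermediate key

def mapper(rows, matches):
    """Keep a bounded reservoir sample of matching row ids, emitted under one key."""
    bucket, seen = [], 0
    for row in rows:
        if not matches(row):
            continue
        seen += 1
        if len(bucket) < BUCKET_SIZE:
            bucket.append(row["id"])
        else:
            j = random.randrange(seen)
            if j < BUCKET_SIZE:
                bucket[j] = row["id"]
    yield SAMPLE_KEY, bucket

def reducer(key, buckets, sample_size=100):
    """Merge the per-mapper buckets and keep a final random subset.
    (Approximate: exact uniform weighting would also need each mapper's match count.)"""
    merged = [row_id for bucket in buckets for row_id in bucket]
    random.shuffle(merged)
    yield key, merged[:sample_size]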
Hi,
I need some ideas for a weekend project about Hadoop and OpenStreetMap.
I have access to an AWS EC2 instance with an OpenStreetMap snapshot in my EBS volume.
The OpenStreetMap data is in a PostgreSQL database.
What kind of MapReduce function can be run on the OpenStreetMap data, assuming I can export it into XML format, and then place...
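One simple starting point once the data is exported to OSM XML: count how often each tag key (highway, amenity, and so on) occurs. The sketch below is a rough Hadoop Streaming-style mapper/reducer pair in Python; it assumes the XML has roughly one element per line, as planet dumps usually do, and the main block only simulates the shuffle locally. In a real streaming job the mapper and reducer would be two separate scripts.

#!/usr/bin/env python
# Rough sketch: count occurrences of each OSM tag key in OSM XML.
import re
import sys
from itertools import groupby

TAG_RE = re.compile(r'<tag k="([^"]+)"')

def mapper(lines):
    for line in lines:
        for key in TAG_RE.findall(line):
            yield key, 1

def reducer(pairs):                       # pairs arrive sorted by key
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(v for _, v in group)

if __name__ == "__main__":
    pairs = sorted(mapper(sys.stdin))     # local stand-in for the shuffle/sort
    for key, count in reducer(pairs):
        sys.stdout.write("%s\t%d\n" % (key, count))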
I'm trying to get the Eclipse plugin for Hadoop development to work; I'm using Hadoop 0.18.3. I installed the old MapReduce plugin (http://www.alphaworks.ibm.com/tech/mapreducetools) on Eclipse v3.5.2 (M20100211-1343) by copying it to /Applications/eclipse/plugins and restarting Eclipse, but that didn't work. I figured it was because it w...
It's a bit complicated to explain, but here we go. Basically, the issue is how to break up problems into subproblems in an efficient way. "Efficient" here means that the resulting subproblems are as big as possible; ideally, I wouldn't have to break up the problems at all. However, because a worker can only work on spec...
What is the key difference between Fork/Join and Map/Reduce?
Do they differ in the kind of decomposition and distribution (data vs. computation)?
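Roughly: fork/join recursively splits the computation and joins partial results as the recursion unwinds (work-stealing schedulers such as Java's ForkJoinPool exploit this), whereas map/reduce partitions the data up front, applies the same map to every chunk, and folds the mapped output in a reduce. A toy, single-process contrast in Python, with both variants summing a list:

from functools import reduce

# Fork/join style: recursively split the computation, then join partial results.
def forkjoin_sum(xs, threshold=4):
    if len(xs) <= threshold:
        return sum(xs)                              # small enough: compute directly
    mid = len(xs) // 2
    left = forkjoin_sum(xs[:mid], threshold)        # "fork" (would run in parallel)
    right = forkjoin_sum(xs[mid:], threshold)
    return left + right                             # "join"

# Map/reduce style: partition the data, map each chunk, reduce the mapped output.
def mapreduce_sum(xs, chunks=4):
    size = max(1, len(xs) // chunks)
    partitions = [xs[i:i + size] for i in range(0, len(xs), size)]
    mapped = [sum(part) for part in partitions]     # map phase: independent chunks
    return reduce(lambda a, b: a + b, mapped, 0)    # reduce phase

data = list(range(100))
assert forkjoin_sum(data) == mapreduce_sum(data) == sum(data)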
...
I am trying to parallelize a classic map-reduce problem (one that parallelizes well with MPI) with OpenCL, namely the AMD implementation. But the result bothers me.
Let me briefly describe the problem first. There are two types of data that flow into the system: the feature set (30 parameters for each) and the sample set (9000+ dimensions for each...
I am currently implementing PageRank on Disco. Since it is an iterative algorithm, the results of one iteration are used as input to the next iteration.
I have a large file which represents all the links, with each row representing a page and the values in the row representing the pages to which it links.
For Disco, I break this file into N chun...
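For reference, one PageRank iteration in map/reduce terms usually looks like the framework-agnostic sketch below; Disco's map and reduce callbacks would carry the same logic. The map shares a page's current rank out to its outlinks, and the reduce sums the contributions per page (a complete job would also re-emit each page's outlink list so the next iteration still has the graph structure).

DAMPING = 0.85   # the usual damping factor

def map_iteration(page, rank, outlinks):
    """Emit this page's share of rank to every page it links to."""
    if outlinks:
        share = rank / len(outlinks)
        for target in outlinks:
            yield target, share
    yield page, 0.0          # ensure pages with no incoming contributions still appear

def reduce_iteration(page, contributions, num_pages):
    """Combine the contributions into the page's new rank."""
    yield page, (1 - DAMPING) / num_pages + DAMPING * sum(contributions)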