mapreduce

MongoDB - how do I turn this group() query into map/reduce?

I have a collection where each document looks like this {access_key:'xxxxxxxxx', keyword: "banana", count:12, request_hour:"Thu Sep 30 2010 12:00:00 GMT+0000 (UTC)"} {access_key:'yyyyyyyyy', keyword: "apple", count:25, request_hour:"Thu Sep 30 2010 12:00:00 GMT+0000 (UTC)", } ..... To achieve this: SELECT keyword, sum(count) FROM ke...
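
The SQL in this excerpt is cut off, so the following is only a minimal sketch. It assumes the goal is to total `count` per `keyword`. It is plain Python, not the MongoDB API, and the names `map_doc`, `reduce_counts` and `map_reduce` are hypothetical; the third sample document is invented for illustration.

```python
from collections import defaultdict


def map_doc(doc):
    """Map step: emit one (keyword, count) pair per document."""
    yield doc["keyword"], doc["count"]


def reduce_counts(keyword, counts):
    """Reduce step: sum every count emitted for a single keyword."""
    return sum(counts)


def map_reduce(docs):
    """Group the emitted values by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    return {key: reduce_counts(key, values) for key, values in grouped.items()}


# The first two documents mirror the excerpt; the third is a made-up extra.
sample_docs = [
    {"access_key": "xxxxxxxxx", "keyword": "banana", "count": 12},
    {"access_key": "yyyyyyyyy", "keyword": "apple", "count": 25},
    {"access_key": "zzzzzzzzz", "keyword": "banana", "count": 3},
]

totals = map_reduce(sample_docs)
```

In a real map/reduce job the map function would emit the keyword as the key and the count as the value, and the reduce function would return the sum of the values it receives.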

MongoDB Stored Procedure Equivalent

Hello, I have a large CSV file containing a list of stores, in which one of the fields is ZipCode. I have a separate MongoDB database called ZipCodes, which stores the latitude and longitude for any given zip code. In SQL Server, I would execute a stored procedure called InsertStore which would do a lookup on the ZipCodes table to get ...

MapReduce returns NaN

Hi, I have an M/R function, and I get NaN as a value for some of the results. I don't have any experience with JS. I'm escaping JS using Java Drivers. String map = "function(){" + " emit({" + "country: this.info.location.country, " + "industry: this.info.industry}, {count : 1}); }"; String reduce = "function(key, ...

Map Reduce count number of documents in each minute MongoDB

I have a MongoDB collection which has a created_at stored in each document. These are stored as a MongoDB date object e.g. { "_id" : "4cacda7eed607e095201df00", "created_at" : "Wed Oct 06 2010 21:22:23 GMT+0100 (BST)", text: "something" } { "_id" : "4cacdf31ed607e0952031b70", "created_at" : "Wed Oct 06 2010 21:23:42 GMT+0100 (BST)",...
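
As a rough sketch of the per-minute count described here: truncate each `created_at` to the minute, emit `(minute, 1)`, and sum per key. This is plain Python rather than MongoDB code. It assumes `created_at` is available as a `datetime`, the helper names are hypothetical, and the third sample document is invented for illustration.

```python
from collections import defaultdict
from datetime import datetime


def minute_bucket(created_at):
    """Truncate a timestamp to the start of its minute."""
    return created_at.replace(second=0, microsecond=0)


def map_doc(doc):
    """Map step: emit (minute, 1) for each document."""
    yield minute_bucket(doc["created_at"]), 1


def count_per_minute(docs):
    """Reduce step: add up the emitted ones for each minute."""
    counts = defaultdict(int)
    for doc in docs:
        for minute, one in map_doc(doc):
            counts[minute] += one
    return dict(counts)


# The first two timestamps mirror the excerpt; the third is a made-up extra.
sample_docs = [
    {"created_at": datetime(2010, 10, 6, 21, 22, 23), "text": "something"},
    {"created_at": datetime(2010, 10, 6, 21, 23, 42)},
    {"created_at": datetime(2010, 10, 6, 21, 23, 5)},
]

per_minute = count_per_minute(sample_docs)
```

The key choice is what matters: once seconds and smaller units are dropped, every document created in the same minute lands under the same key.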

Does CouchDB really split views across servers?

I'm currently delving into CouchDB, and I am puzzled by the distribution of Map-Reduce computations in views. I see a lot of resources mentioning that Map-Reduce is inherently distributed, because you can process one half of your data on server A, the other half on server B, and then reduce both results. One example would be slide 16 of ...

Pipelining Hadoop map reduce jobs

Hi, I have five map reduce jobs that I am running separately. I want to pipeline them all together, so that the output of one job goes to the next job. Currently, I have a shell script that executes them all. Is there a way to write this in Java? Please provide an example. Thanks ...

MongoDB: Terrible MapReduce Performance

Hi all, I have a long history with relational databases, but I'm new to MongoDB and MapReduce, so I'm almost positive I must be doing something wrong. I'll jump right into the question. Sorry if it's long. I have a database table in MySQL that tracks the number of member profile views for each day. For testing it has 10,000,000 rows. C...

Using Hadoop map/reduce for programming language design course project

I need to design an exercise for my students in programming language design. My idea is to help them learn ideas from Lisp, ML and other functional languages by having them implement a mapreduce exercise with Hadoop. Are there any suggestions that would help me flesh out my idea? ...

Can Amazon Auto Scaling Service work with Elastic Map Reduce Service?

Hi, since Amazon Web Services costs money, I want to ask people who have worked with it before I jump in, and confirm some of my understanding. Question one: the Amazon Auto Scaling service says it can scale instances up and down. What does this mean? Does it mean changing the type of instance? Or can it start/stop more/fewer instances bas...

Hadoop last map job stuck - Need help

Hi, I am doing some text processing using Hadoop map-reduce jobs. My job is 99.2% complete and stuck on the last map job. The last few lines of the map output are shown below. Last time, when this problem occurred, I tried printing out the key values emitted from map and noticed that one of the keys has a large number of values associated...

Using AppEngine-MapReduce on Google App Engine, what is the easiest way to analyze entities for a specific date range?

I am trying to use AppEngine-MapReduce. I understand how to perform an operation over all entities of some entity_kind, but what is the easiest way to only operate on entities over a date range when the entity has a date attribute? Is there a simple way to pass parameters to the mapper? For example, what if I only wanted to delete entit...

Help with running Taste Grouplens demo on hadoop

Hi All, I am trying to build a collaborative filtering based Recommendation System as part of an academic project. I think the Mahout project has a lot of potential and I want to use it. I installed Mahout, Hadoop and Java on my Ubuntu 10.1. Hadoop and Java have been checked to be working fine together. (Ran the Hadoop word count example ...

In the python version of Google App Engine mapreduce, how do you access counters from the done_callback?

I am using Google App Engine mapreduce to analyze some data. I am generating a few counters that I would like to create a simple Google chart from in my done_callback. How do I access the resulting counters from the callback? #The map method def count_created_since(entity): now = datetime.datetime.now() delta = now-entity.created ...

What is the simplest way of parallelization over a cluster with SSH and NFS?

I have a lot of trivially parallelizable computations and a lot (100s) of cores distributed over an SSH + NFS network. What is the simplest way to parallelize them? The problem is that I don't know how long each task will take, so I need some kind of queue. Is there something that is very easy to use? ...

Hadoop Pipes: how to pass large data records to map/reduce tasks

Hello, I'm trying to use map/reduce to process large amounts of binary data. The application is characterized by the following: the number of records is potentially large, such that I don't really want to store each record as a separate file in HDFS (I was planning to concatenate them all into a single binary sequence file), and each rec...

Strategies for keeping Map Reduce results around for subsequent queries

I'm using Map Reduce with MongoDB. Simplified scenario: There are users, items and things. Items include any number of things. Each user can rate things. Map reduce is used to calculate the aggregate rating for each user on each item. It's a complex formula using the ratings for each thing in the item and the time of day - it's not ...

How to use Ruby CLI client to launch a JobFlow based on a JSON JobFlow description on Amazon Elastic MapReduce.

I have written a mapreduce application for Hadoop and tested it at the command line on a single machine. My application uses two steps: Map1 -> Reduce1 -> Map2 -> Reduce2. To run this job on AWS mapreduce, I am following this link http://aws.amazon.com/articles/2294. But I am not clear on how to use the Ruby CLI client provided by Amazon to do al...