mapreduce

How do I get the values from the counter after I processed all the records with Google AppEngine MapReduce?

How do I get the values from the counter after I processed all the records with Google AppEngine MapReduce? Or am I missing the use case for counters here? Sample Code from http://code.google.com/p/appengine-mapreduce/wiki/UserGuidePython How would I retrieve the value of counter counter1 when the mapreduce is done? app.yaml handler...

Manipulating iterator in mapreduce

I was trying to find the sum of any given points using Hadoop, but my problem is getting all the values for a given key in a single reducer. It is something like this. I have this reducer: public static class Reduce extends MapReduceBase implements Reducer { public void reduce(Text key, Iterator<IntWritable> values, ...
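A minimal sketch of the pattern the question describes, simulated in plain Python rather than Hadoop: every value for a key reaches one reducer call, and the reducer walks its iterator once to build the sum. The names `reduce_sum` and `run_reducers` and the sample data are hypothetical, chosen only for illustration.

```python
from itertools import groupby
from operator import itemgetter


def reduce_sum(key, values):
    """Consume the values iterator exactly once and return (key, total)."""
    total = 0
    for value in values:  # the iterator is walked a single time
        total += value
    return key, total


def run_reducers(pairs):
    """Simulate the shuffle: sort by key, then give each key's values to one reducer call."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        reduce_sum(key, (value for _, value in group))
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


if __name__ == "__main__":
    mapped = [("a", 1), ("b", 5), ("a", 3), ("b", 2), ("c", 7)]
    print(run_reducers(mapped))
```

The point of the sketch is that the sum is accumulated inside the loop; the iterator is not rewound or read a second time.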

On demand slave generation in Hadoop cluster on EC2

Hi, I am planning to use Hadoop on EC2. Since we pay per instance, it is wasteful to keep a fixed number of instances rather than only as many as the job actually requires. In our application, many jobs are executed concurrently and we do not always know how many slaves are needed. Is it possible to start the Hadoop cluster with mini...

Where do I begin learning Lucene.NET Solr Hadoop and MapReduce?

I'm a .NET developer and I need to learn Lucene so we can run a very large scale search service that removes entries that the end user doesn't have access to. (i.e., a user can search for all documents with clearance level 3 or higher, but not clearance level 2 or 1) Where do I start learning, which products should I consider? To be hon...

MultipleOutputFormat in hadoop

Hi. I'm a newbie in Hadoop. I'm trying out the Wordcount program. Now, to try out multiple output files, I use MultipleOutputFormat. This link helped me in doing it: http://hadoop.apache.org/common/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html In my driver class I had MultipleOutputs.addNamedOutput(conf, "...

Start Amazon Elastic MapReduce Job remotely?

I'm working on a small project to get myself acquainted with the Amazon web services. I'm trying to make a simple web application; when a button is pressed a mapreduce job is launched and the output is returned to the browser. What would be the best way to do this? Also, is there a way to launch an Amazon Elastic MapReduce job via the co...

MapReduce skipping keys?

I'm running a local, single-system test using Qizmt of a simple MapReduce operation. At the end of the 'Map' phase I am calling: output.Add(rKey, rValue); This is called let's say a million times, and the keys are 1,2,3,4,5,6 etc - each unique (I'm just testing, after all). I've checked that this is happening as intended. It is. The f...

Is it possible to run Hadoop in Pseudo-Distributed operation without HDFS?

I'm exploring the options for running a Hadoop application on a local system. As with many applications the first few releases should be able to run on a single node, as long as we can use all the available CPU cores (Yes, this is related to this question). The current limitation is that on our production systems we have Java 1.5 and as...

MapReduce Nutch tutorials

Hi, Could someone give me pointers to tutorials that explain how to write a MapReduce program for Nutch? Thank you. ...

Embeddable open-source key-value storage with liberal license

Is there any open-source document-oriented key-value map/reduce storage that: is easily embeddable (Yes, it is possible to embed, let's say CouchDB, but it might be a pain to take the whole Erlang machine onboard and I just don't feel good about it being bound to some port when my app is running) does not keep the whole map in RAM (Hello, ...

mongodb mapreduce returning inconsistent results

I have a super simple map reduce test... that isn't working consistently. In a nutshell, I'm just looking for duplicate records. I have a collection that has: GiftIdea - site_id - site_key. The site_id + site_key combination should be unique, but currently isn't. So I have the following map reduce code: var map = function() { print(this.s...
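A hedged sketch of the duplicate-counting idea, simulated in plain Python rather than MongoDB. It assumes records are dictionaries with `site_id` and `site_key` fields as in the question; the function names and the `batch_size` parameter are hypothetical. MongoDB may call the reduce function more than once for the same key and feed earlier results back in, so the sketch reduces in small batches to show a reduce step that still gives the same total under those conditions.

```python
def map_record(record):
    """Emit ((site_id, site_key), 1) for one record."""
    return (record["site_id"], record["site_key"]), 1


def reduce_counts(key, values):
    """Sum partial counts; the result has the same shape as an emitted value."""
    return sum(values)


def count_pairs(records, batch_size=2):
    """Group emitted values by key, then reduce each group in batches."""
    batch_size = max(2, batch_size)  # smaller batches would never shrink the list
    grouped = {}
    for record in records:
        key, value = map_record(record)
        grouped.setdefault(key, []).append(value)

    totals = {}
    for key, values in grouped.items():
        # Feed each partial result back into the next batch (re-reduce).
        while len(values) > 1:
            batch, rest = values[:batch_size], values[batch_size:]
            values = [reduce_counts(key, batch)] + rest
        totals[key] = values[0]
    return totals


def find_duplicates(records):
    """Return only the (site_id, site_key) pairs that occur more than once."""
    return {key: n for key, n in count_pairs(records).items() if n > 1}


if __name__ == "__main__":
    sample = [
        {"site_id": 1, "site_key": "a"},
        {"site_id": 1, "site_key": "a"},
        {"site_id": 2, "site_key": "b"},
        {"site_id": 1, "site_key": "a"},
    ]
    print(find_duplicates(sample))
```

A reduce step that returns a different shape than the map step emits (for example, a count built from `values.length`) can give different totals depending on how the values are batched, which is one possible source of inconsistent results.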

Getting started with MapReduce/Hadoop

Hi, Lately, I have been reading a lot about MapReduce/Hadoop and think this is where the industry is currently moving. I want to start learning MapReduce/Hadoop and I thought the best way to start would be to implement some small project. However, I tried to do some googling, but couldn't find anything. Can you guys give me some links or ma...

Sorting large data using MapReduce/Hadoop

Hi, I am reading about MapReduce and the following thing is confusing me. Suppose we have a file with 1 million entries (integers) and we want to sort them using MapReduce. The way I understand it, the approach is as follows: Write a mapper function that sorts integers. So the framework will divide the input file into multiple chunks and...
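A minimal sketch of the approach the question outlines, simulated in plain Python: split the input into chunks, sort each chunk independently (the "map" step), then merge the sorted chunks into one ordered result (the "reduce" step). All function names and the `chunk_size` parameter are hypothetical. Note that Hadoop itself sorts map output by key before it reaches the reducers, so a real job would usually rely on that rather than sorting inside the mapper.

```python
import heapq


def split_into_chunks(values, chunk_size):
    """Divide the input into consecutive chunks, as the framework would split a file."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def map_sort(chunk):
    """Map step: sort one chunk on its own."""
    return sorted(chunk)


def reduce_merge(sorted_chunks):
    """Reduce step: k-way merge of already-sorted chunks."""
    return list(heapq.merge(*sorted_chunks))


def mapreduce_sort(values, chunk_size=4):
    """Run the whole pipeline on an in-memory list."""
    chunks = split_into_chunks(values, chunk_size)
    return reduce_merge([map_sort(chunk) for chunk in chunks])


if __name__ == "__main__":
    print(mapreduce_sort([5, 3, 9, 1, 7, 2, 8], chunk_size=3))
```

The merge is the step that turns many locally sorted chunks into one globally sorted output; sorting the chunks alone does not order the data across chunks.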

Simple way to store data from multiple processes

I have a Python script that does something along the lines of: def MyScript(input_filename1, input_filename2): return val; i.e. for every pair of inputs, I calculate some float value. Note that val is a simple double/float. Since this computation is very intensive, I will be running them across different processes (might be on the s...

Adding multiple files to Hadoop distributed cache?

Hi, I am trying to add multiple files to the Hadoop distributed cache. Actually, I don't know the file names. They will be named like part-0000*. Can someone tell me how to do that? Thanks Bala ...

multi computer map-reduce in C#

Is there a simple Map-Reduce library or implementation for .NET that allows a task to start on one computer and be split amongst multiple worker computers, perhaps using WCF or something else a bit more efficient to manage the inter-machine communication? I looked at Microsoft's Dryad but from the docs it seems it is more intended for lo...

Creating a pagination index in CouchDB?

Hello, I'm trying to create a pagination index view in CouchDB that lists the doc._id for every Nth document found. I wrote the following map function, but the pageIndex variable doesn't reliably start at 1 - in fact it seems to change arbitrarily depending on the emitted value or the index length (e.g. 50, 55, 10, 25 - all start with ...

Map/Reduce in CouchDB with multiple parameters?

I am wondering how to use CouchDB's map/reduce with multiple parameters. For example, if I have teams that have players with ages and genders, I assume I would do this for my map function: "function(doc){ if(doc.team_name) { emit(doc.team_name, doc); } }" However, I am unsure how to write a reduce function to get the oldest m...

Wildcards in Views with CouchDB?

Is it possible to use wildcards in views in CouchDB? For example, let's say I have a database that has teams, ages of players, players' averages, and gender of players. However, the players' ages may not be known - they could be from the Dominican Republic or whatnot. So I want to use a view with a map function that can accept not havi...

Google App Engine : use mapreduce to empty datastore

I am trying to use an early experimental release of the mapper implementation to empty the datastore. This solution was proposed in a similar SO question. This is the AppEngineMapper I am currently using. It just deletes the entity. public class EmptyFixesMapper extends AppEngineMapper<Key, Entity, NullWritable, NullWritable> { publi...