mapreduce

File Processing with Elastic MapReduce - No Reducer Step?

I have a large set of text files in an S3 directory. For each text file, I want to apply a function (an executable loaded through bootstrapping) and then write the results to another text file with the same name in an output directory in S3. So there's no obvious reducer step in my MapReduce job. I have tried using NONE as my reducer,...

Hadoop Map Reduce: Algorithms

Can someone point me to a good web site with a good collection of Hadoop algorithms? For example, the most complex thing that I can do with Hadoop right now is PageRank. Other than that, I can do trivial things like word counting. I want to see a web site that shows me other uses of Hadoop. Thanks! ...

App Engine - Task Queue Retry Count with Mapper API

Hi, here is what I'm trying to do: I set up a MapReduce job with the new Mapper API. This basically works fine. The problem is that the Task Queue retries all tasks that have failed, and I don't want it to do that. Is there a way to delete a task from the queue or tell it that the task was completed successfully? Perhaps pass...

hadoop on vmware, namenode not finding slaves

I set up 3 identical Linux (CentOS) servers on VMware. Basically, I built one and made 2 full clones. I edited each server's hostname (server1, server2, server3) and added them to each other's hosts. Worked with ssh and enabled passwordless ssh. server1 # ssh server2 server2 # So this works. Formatted the dfs on the namenode. Started the d...

CouchDB view composing JSON objects with embedded arrays from two separated documents

Let's say I have two types of documents stored in my CouchDB database. The first has its type property set to contact and the second to phone. Contact documents have another property called name. Phone documents have the properties number and contact_id so that they can reference a contact person. This is a trivial one-to-many scenario where one contact ...
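One way to picture the one-to-many composition described above is a plain-Python simulation of a view with composite keys. This is only a sketch, not CouchDB code: the sample documents, the `map_doc` function, and the final grouping loop are all hypothetical, and only the property names (type, name, number, contact_id) come from the question.

```python
from itertools import groupby

docs = [  # hypothetical sample documents, not from the question
    {"_id": "c1", "type": "contact", "name": "Alice"},
    {"_id": "p1", "type": "phone", "number": "555-0100", "contact_id": "c1"},
    {"_id": "p2", "type": "phone", "number": "555-0101", "contact_id": "c1"},
    {"_id": "c2", "type": "contact", "name": "Bob"},
]

def map_doc(doc):
    """Emit a composite key so each contact sorts just before its phones."""
    if doc["type"] == "contact":
        yield (doc["_id"], 0), doc["name"]
    elif doc["type"] == "phone":
        yield (doc["contact_id"], 1), doc["number"]

# Sorting by key stands in for the ordering a view index would provide.
rows = sorted(row for doc in docs for row in map_doc(doc))

# Group consecutive rows that share a contact id into one object
# with an embedded array of phone numbers.
contacts = []
for contact_id, group in groupby(rows, key=lambda row: row[0][0]):
    entry = {"name": None, "phones": []}
    for (_, kind), value in group:
        if kind == 0:
            entry["name"] = value
        else:
            entry["phones"].append(value)
    contacts.append(entry)

print(contacts)
```

The grouping loop stands in for whatever step assembles the sorted rows into one object per contact; where that step would run in a real deployment is left open here.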

What is the computational complexity of the MapReduce overhead

Given that the complexities of the map and reduce tasks are O(map)=f(n) and O(reduce)=g(n), has anybody taken the time to write down how the Map/Reduce intrinsic operations (sorting, shuffling, sending data, etc.) increase the computational complexity? What is the overhead of the Map/Reduce orchestration? I know that this is a nonsense wh...

mapreduce distance calculation in hadoop

Hi, is there a distance calculation implementation using Hadoop map/reduce? I am trying to calculate the distances between a given set of points. Looking for any resources .. //edited ............ This is a very intelligent solution. I have tried something like the first algorithm, and I get almost what I was looking for. I am not concer...

custom word count using hadoop

Hi all, I'm a beginner in Hadoop. I've understood the WordCount program. Now I have a problem: I don't want the output of all the words. - Words_I_Want.txt - hello echo raj - Text.txt - hello eveyone. I want hello and echo count output should be hello 2 echo 1 raj 0 Now that was an example; my actual data is very large. ...
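The filtered count described above can be sketched in plain Python as a map step and a reduce step. This is a minimal illustration, not Hadoop code: the function names `map_line` and `reduce_counts` are hypothetical, and the sample inputs are taken from the example in the question.

```python
import re
from collections import Counter

def map_line(line, wanted):
    """Map step: emit (word, 1) only for words in the wanted set."""
    for word in re.findall(r"[a-z]+", line.lower()):
        if word in wanted:
            yield word, 1

def reduce_counts(pairs, wanted):
    """Reduce step: sum counts per word; wanted words never seen get 0."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return {word: totals[word] for word in wanted}

wanted_words = ["hello", "echo", "raj"]                  # Words_I_Want.txt
text_lines = ["hello eveyone. I want hello and echo"]    # Text.txt

wanted_set = set(wanted_words)
pairs = [pair for line in text_lines for pair in map_line(line, wanted_set)]
result = reduce_counts(pairs, wanted_words)

for word in wanted_words:
    print(word, result[word])
```

Filtering in the map step keeps unwanted words out of the shuffle entirely, which matters when the data is large. In an actual Hadoop job the wanted-word list would need to be made available to every mapper, which this sketch does not cover.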

Running a standalone Hadoop application on multiple CPU cores.

My team built a Java application using the Hadoop libraries to transform a bunch of input files into useful output. Given the current load a single multicore server will do fine for the coming year or so. We do not (yet) have the need to go for a multiserver Hadoop cluster, yet we chose to start this project "being prepared". When I run...

Why does DistributedCache mangle my file names?

Hi, I have a weird problem: DistributedCache appears to change the names of my files. It uses the original name as the parent folder and adds the file as a child, i.e. folder\filename.ext becomes folder\filename.ext\filename.ext. Any ideas? My code is below. Thanks, Akintayo String paramsLocation="/user/fwang/settings/ecgparams.txt"; D...

Bundling jars when submitting map/reduce jobs via Pig?

I'm trying to combine Hadoop, Pig and Cassandra to be able to work on data stored in Cassandra by means of simple Pig queries. Problem is I can't get Pig to create Map/Reduce jobs that actually work with the CassandraStorage. What I did is I copied the storage-conf.xml file from one of my cluster machines on top of the one in contrib/pi...

Running MRToolkit hadoop jobs on AWS elastic map/reduce

Hi All, Loving MRToolkit -- great to get away from Java while writing Hadoop jobs. It has become apparent that the library was written to interface with an EC2 cluster, and not with Amazon's elastic map/reduce system. Does anybody have insights into running jobs defined using the toolkit on elastic map/reduce servers? It isn't readil...

Is there a canonical problem that provably can't be aided with map/reduce?

I'm trying to understand the boundaries of Hadoop and map/reduce, and it would help to know a non-trivial problem, or class of problems, that we know map/reduce can't assist in. It certainly would be interesting if changing one factor of the problem would allow simplification from map/reduce. Thank you ...

How to implement references in map-reduce databases?

I am starting to study map-reduce databases. How can one implement a reference in a map-reduce database, such as CouchDB or MongoDB? For example, suppose that I have drivers and cars, and I want to mark that some driver drives a car. In SQL it's something like: SELECT person_id, car_id FROM driver, car WHERE driver.car = car.car_id (T...

mongodb map function not called after upgrade to 1.6

I have an example to learn map-reduce in mongo, modified from: http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-ShellExample2 // for timestamps function padZero(number, length) { var str = '' + number; while (str.length < length) { str = '0' + str; } return str; } function printLog(l) { print(l.ts + '...

in mongodb, sharded collections do not accept scope in mapReduce?

I am experimenting with mongodb 1.6 and this thing is new to me. I notice that if I shard a collection, and then do a mapReduce, mapReduce doesn't accept the argument scope anymore: // some line in some example code... res = db.data.mapReduce(m,r,{scope: {log : log, padZero: padZero}}); And I got an error like this: Mon Aug 09 1...

Can DistributedCache copy an entire folder of files?

Hi, I am using Hadoop and working on my map and reduce tasks, where I need to be able to access configuration files that are stored in a named folder. I also have a folder with jars that map must call using process builder, as these jars don't belong to me. At this point I am storing the files in NFS, where all the Hadoop nodes can ac...

How do I use Elastic MapReduce to run an XSLT transformation on millions of small S3 xml files?

More specifically, is there a somewhat easy streaming solution? ...

Requiring external libraries in ruby streaming scripts for Amazon EMR

How do I require external libraries when running Amazon EMR streaming jobs written in Ruby? I've defined my mapper, and am getting this output in my logs: /mnt/var/lib/hadoop/mapred/taskTracker/jobcache/job_201008110139_0001/attempt_201008110139_0001_m_000000_0/work/./mapper_stage1.rb: line 1: require: command not found My ...

How to keep the sequence file created by map in hadoop

Hi, I am using Hadoop and working with a map task that creates files that I want to keep. Currently I am passing these files through the collector to the reduce task. The reduce task then passes these files on to its collector, which allows me to retain the files. My question is: how do I reliably and efficiently keep the files created by...