mapreduce

Disco/MapReduce: Using chain_reader on split data

My algorithm currently uses nr_reduces 1 because I need to ensure that the data for a given key is aggregated. To pass input to the next iteration, one should use "chain_reader". However, the results from a mapper come back as a single result list, and it appears this means that the next map iteration takes place as a single mapper! Is ther...
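A minimal sketch of the chaining pattern, assuming the classic Disco Python API (keyword names have moved between releases: nr_reduces became partitions, and chain_reader lives in disco.worker.classic.func in newer versions; the input URL and the map/reduce bodies here are placeholders):

    from disco.core import Job, result_iterator
    from disco.func import chain_reader  # disco.worker.classic.func in newer releases

    def word_map(line, params):
        for word in line.split():
            yield word, 1

    def chain_map(pair, params):
        # chain_reader delivers (key, value) pairs, not raw lines
        word, count = pair
        yield word, int(count)

    def word_reduce(iter, params):
        from disco.util import kvgroup
        for word, counts in kvgroup(sorted(iter)):
            yield word, sum(counts)

    # First pass: ask for several partitions instead of nr_reduces=1,
    # so the results are stored as one list per partition rather than
    # a single result list.
    first = Job().run(input=['http://example.com/input.txt'],
                      map=word_map,
                      reduce=word_reduce,
                      partitions=8)

    # Second pass: chain the first job's results back in; with 8
    # partitions the next iteration runs 8 mappers, not one.
    second = Job().run(input=first.wait(),
                       map_reader=chain_reader,
                       map=chain_map,
                       reduce=word_reduce,
                       partitions=8)

    for word, count in result_iterator(second.wait()):
        print('%s %d' % (word, count))

The per-key aggregation the question needs still holds with more than one partition: all values for a given key land in the same partition, just as with a single reduce.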

Hadoop 0.2: How to read outputs from TextOutputFormat?

My reducer class produces outputs with TextOutputFormat (the default OutputFormat given by Job). I would like to consume these outputs after the MapReduce job completes, to aggregate them. In addition, I would like to write out the aggregated information in a form that TextInputFormat can read, so that the output from this process can be consumed by the nex...
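TextOutputFormat writes each record as a key<TAB>value text line into part-* files under the job's output directory, so consuming them needs no Hadoop classes at all. A small sketch (paths are placeholders; it assumes the output was copied out of HDFS, e.g. with hadoop fs -getmerge):

    import glob

    # Read every "key<TAB>value" line TextOutputFormat produced.
    totals = {}
    for path in glob.glob('/tmp/job-output/part-*'):
        for line in open(path):
            key, value = line.rstrip('\n').split('\t', 1)
            totals[key] = totals.get(key, 0) + int(value)

    # Write the aggregate back out in the same key<TAB>value shape,
    # which a follow-up job can read with KeyValueTextInputFormat
    # (or with plain TextInputFormat, splitting on the tab itself).
    with open('/tmp/aggregated-input.txt', 'w') as out:
        for key, total in sorted(totals.items()):
            out.write('%s\t%d\n' % (key, total))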

Using Hadoop, are my reducers guaranteed to get all the records with the same key?

I'm running a Hadoop job (using Hive actually) which is supposed to uniq lines in a lot of text files. More specifically, it chooses the most recently timestamped record for each key in the reduce step. Does Hadoop guarantee that every record with the same key, output by the map step, will go to a single reducer, even if there are many r...
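For context, the guarantee asked about here comes from the partitioner: the default HashPartitioner assigns each map output key to a reducer as (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, so identical keys always land on the same reducer no matter which mapper emitted them. A toy illustration of that logic (Python's hash stands in for Java's hashCode):

    # Illustration of Hadoop's default HashPartitioner:
    # partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
    def partition(key, num_reduce_tasks):
        return (hash(key) & 0x7fffffff) % num_reduce_tasks

    # The same key always maps to the same reducer index,
    # regardless of which mapper produced the record.
    assert partition('user-42', 10) == partition('user-42', 10)

The guarantee covers equal keys only, and holds unless a job installs a custom partitioner that breaks it.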

How can I implement MapReduce using shell commands?

How do you execute a Unix shell command (e.g. an awk one-liner) on a cluster in parallel (step 1) and collect the results back to a central node (step 2)? Update: I've just found http://blog.last.fm/2009/04/06/mapreduce-bash-script - it seems to do exactly what I need. ...
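The linked bash-script approach aside, here is a hedged sketch of the two steps in Python, fanning an awk one-liner out over ssh and merging on the central node (host names, the data path, and the awk command are placeholders; assumes passwordless ssh is already set up):

    import subprocess

    NODES = ['node1', 'node2', 'node3']   # hypothetical worker hosts
    MAP_CMD = "awk '{ print $1 }' /data/chunk.txt | sort | uniq -c"

    # Step 1: run the one-liner on every node in parallel.
    procs = [subprocess.Popen(['ssh', node, MAP_CMD],
                              stdout=subprocess.PIPE,
                              universal_newlines=True)
             for node in NODES]

    # Step 2: collect the per-node outputs and merge the counts.
    counts = {}
    for proc in procs:
        for line in proc.stdout:
            n, key = line.split(None, 1)
            counts[key.strip()] = counts.get(key.strip(), 0) + int(n)

    for key, n in sorted(counts.items()):
        print('%s\t%d' % (key, n))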

Configuring Hadoop logging to avoid too many log files

I'm having a problem with Hadoop producing too many log files in $HADOOP_LOG_DIR/userlogs (the Ext3 filesystem allows only 32000 subdirectories), which looks like the same problem as in this question: http://stackoverflow.com/questions/2091287/error-in-hadoop-mapreduce My question is: does anyone know how to configure Hadoop to roll the log...
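One knob worth checking (property name from 0.20-era Hadoop; verify against your release): per-task directories under userlogs are pruned after mapred.userlog.retain.hours, default 24, so lowering it bounds how many subdirectories accumulate:

    <!-- mapred-site.xml: keep per-task logs for 2 hours instead of 24
         (0.20-era property name; check your Hadoop version) -->
    <property>
      <name>mapred.userlog.retain.hours</name>
      <value>2</value>
    </property>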

Hadoop: Iterative MapReduce Performance

Is it correct to say that parallel computation with iterative MapReduce can be justified mainly when the training data size is too large for non-parallel computation of the same logic? I am aware that there is overhead for starting MapReduce jobs. This can be critical for the overall execution time when a large number of iterat...
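A back-of-envelope model makes the trade-off concrete: with k iterations, a fixed job-startup overhead s per iteration, and per-iteration work w spread over p parallel slots, iterative MapReduce only wins when k(s + w/p) < kw, i.e. when s < w(1 - 1/p). The numbers below are made up purely to illustrate:

    # k iterations cancel out, so compare one iteration:
    # parallel time  s + w/p   vs.  serial time  w
    def mapreduce_pays_off(s, w, p):
        return s + float(w) / p < w

    print(mapreduce_pays_off(s=30.0, w=8.0, p=8))    # False: startup dominates
    print(mapreduce_pays_off(s=30.0, w=800.0, p=8))  # True: work large enough to amortize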

Hadoop Map/Reduce - simple use example to do the following...

I have a MySQL database where I store the following: a BLOB (which contains a JSON object) and an ID (for this JSON object). The JSON object contains a lot of different information. Say, "city:Los Angeles" and "state:California". There are about 500k such records for now, but they are growing. And each JSON object is quite big. My goal is to do ...
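Given the truncation, here is only the generic shape such jobs usually take: export the JSON BLOBs one per line, then run a Hadoop Streaming mapper/reducer pair in Python. The city/state fields come from the question; the export format, file names, and counting logic are assumptions:

    import json
    import sys

    def mapper():
        # One JSON object per stdin line; emit "city|state<TAB>1".
        for line in sys.stdin:
            obj = json.loads(line)
            print('%s|%s\t1' % (obj.get('city'), obj.get('state')))

    def reducer():
        # Streaming delivers lines sorted by key, so totals can be
        # accumulated over contiguous runs of the same key.
        current, total = None, 0
        for line in sys.stdin:
            key, value = line.rstrip('\n').split('\t')
            if key != current:
                if current is not None:
                    print('%s\t%d' % (current, total))
                current, total = key, 0
            total += int(value)
        if current is not None:
            print('%s\t%d' % (current, total))

    if __name__ == '__main__':
        mapper() if sys.argv[1:] == ['map'] else reducer()

Locally this can be smoke-tested as python job.py map < export.json | sort | python job.py reduce before handing the two commands to Hadoop Streaming.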

Amazon Elastic MapReduce: Exception from FileSystem

Hi, I run my application using the Ruby client:

    ruby elastic-mapreduce -j j-20PEKMT9BRSUC --jar s3n://sakae55/lib/edu.cit.som.jar --main-class edu.cit.som.hadoop.SOMDriver --arg s3n://sakae55/repository/input/ecoli/ --arg s3n://sakae55/repository/output/ecoli/pl/ --arg s3n://sakae55/repository/data/ecoli/som.txt

Then, I am seeing the follo...

Multiple lines of text to a single map

I've been trying to use Hadoop to send N lines to a single map call. I don't require the lines to be split already. I've tried to use NLineInputFormat; however, that sends N lines of text from the data to each mapper one line at a time [giving up after the Nth line]. I have tried to set the option and it only takes N lines...
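One thing worth noting here: NLineInputFormat does put N lines into each split, but the Java map() method is still invoked once per line; only the split boundary changes. A Hadoop Streaming mapper, by contrast, receives its whole split on stdin, so it genuinely sees the N lines as one batch. A sketch (the linespermap property name is from 0.20-era Hadoop; verify for your release):

    import sys

    # Run with something like:
    #   -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat
    #   -jobconf mapred.line.input.format.linespermap=1000
    # The entire N-line split arrives on stdin, so it can be
    # processed as a single unit instead of line by line.
    batch = [line.rstrip('\n') for line in sys.stdin]
    # ... process all N lines together here ...
    print('lines-in-this-map\t%d' % len(batch))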

key sorting and mapreduce in couchdb

Hi all, I am writing a reduce view to count a user's notes over a date range, and I have run into a problem I cannot quite solve. The date-range select is OK - I create an array key [YYYY,MM,DD,user_id]. Document example:

    {
      "_id": "1",
      "_rev": "1-84d3fa5f259c7b30c74028ca60a45f91",
      "body": "note text",
      "user_id": 666,
      "created_at": "Thu A...
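For reference, a sketch of how such a view and query might look through couchdb-python (database, design-document, and view names are placeholders; the JavaScript strings are the view itself, and the built-in _count reduce does the counting):

    import couchdb

    db = couchdb.Server('http://localhost:5984/')['notes']  # placeholder names

    # Map emits the array key from the question; _count then counts
    # notes per key, or per key prefix when group_level is used.
    db['_design/notes'] = {
        'views': {
            'by_date_user': {
                'map': '''
                    function (doc) {
                      var d = doc.created_at ? new Date(doc.created_at) : null;
                      if (d && doc.user_id) {
                        emit([d.getFullYear(), d.getMonth() + 1,
                              d.getDate(), doc.user_id], 1);
                      }
                    }''',
                'reduce': '_count'
            }
        }
    }

    # Notes per [YYYY, MM, DD, user_id] in January 2010; {} sorts
    # after every user_id, closing the range.
    for row in db.view('notes/by_date_user',
                       startkey=[2010, 1, 1], endkey=[2010, 1, 31, {}],
                       group_level=4):
        print(row.key, row.value)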

Is there an implementation of rapid concurrent syntactical sugar in scala? eg. map-reduce

Passing messages around with actors is great. But I would like to have even easier code. Examples (pseudo-code):

    val splicedList: List[List[Int]] = biglist.partition(100)
    val sum: Int = ActorPool.numberOfActors(5).getAllResults(splicedList, foldLeft(_ + _))

where spliceIntoParts turns one big list into 100 small lists, and the numberOfActors part, ...
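The question asks for Scala sugar, but the pattern in the pseudocode (chunk, fan out over a fixed-size worker pool, fold the partial results) can be shown language-neutrally; here is a sketch with Python's multiprocessing pool, chunk and pool sizes arbitrary:

    from multiprocessing import Pool

    def chunk(xs, n):
        # splice one big list into pieces of size n
        return [xs[i:i + n] for i in range(0, len(xs), n)]

    if __name__ == '__main__':
        biglist = list(range(10000))
        spliced = chunk(biglist, 100)       # the partition(100) step
        pool = Pool(5)                      # the numberOfActors(5) step
        partials = pool.map(sum, spliced)   # one partial result per chunk
        total = sum(partials)               # the foldLeft(_ + _) step
        print(total)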

How to use Cassandra's Map Reduce with or w/o Pig?

Can someone explain how MapReduce works with Cassandra 0.6? I've read through the word-count example, but I don't quite follow what's happening on the Cassandra end vs. the "client" end. https://svn.apache.org/repos/asf/cassandra/trunk/contrib/word_count/ For instance, let's say I'm using Python and Pycassa; how would I load in a new ma...
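For orientation while the excerpt is cut off: in the linked word_count contrib, the MapReduce part is a Java/Hadoop job whose ColumnFamilyInputFormat pulls rows out of Cassandra; that is the "Cassandra end". Pycassa is a plain Thrift client, so from Python you can load the data the job will read (or read back its output), but not submit map functions. A hedged sketch using the later pycassa ConnectionPool API (0.6-era pycassa used pycassa.connect instead; keyspace and column family names are from the stock 0.6 config):

    import pycassa

    # Pycassa talks Thrift: it reads and writes columns, no MapReduce.
    pool = pycassa.ConnectionPool('Keyspace1', ['localhost:9160'])
    cf = pycassa.ColumnFamily(pool, 'Standard1')

    # Load a row that a Java MapReduce job (word_count style,
    # via ColumnFamilyInputFormat) could then process.
    cf.insert('doc1', {'text': 'some words to count'})
    print(cf.get('doc1'))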

Database solution for 200million writes/day, monthly summarization queries

Hello. I'm looking for help deciding on which database system to use. (I've been googling and reading for the past few hours; it now seems worthwhile to ask for help from someone with firsthand knowledge.) I need to log around 200 million rows (or more) per 8-hour workday to a database, then perform weekly/monthly/yearly summary queri...
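Whatever system is chosen, the raw write rate is worth framing first; assuming the writes are spread evenly across the 8-hour day:

    rows = 200e6
    workday_seconds = 8 * 3600
    print('%.0f rows/sec sustained' % (rows / workday_seconds))  # ~6944 rows/sec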

can i use hadoop cloudera without root access?

A bit of a binary question (okay, not exactly), but I was wondering whether one is able to configure Cloudera/Hadoop to run on the nodes without root shell access to the node computers (although I can set up passwordless ssh login)? It appears from their instructions that root access is needed, yet I found a Hadoop wiki page which suggests root ac...

Taskid in MapReduce

I am a newbie to MapReduce and Java programming. I am trying to get the task ID of each map() task. Basically I need to use the task ID of each mapper as an offset for fetching some data from a common file. Please help me get the task ID of an individual map() task. Thanks, Vanamala ...
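In the Java API the identifier comes from the task context or configuration, e.g. conf.get("mapred.task.id"), and the small per-task number suited to use as an offset is "mapred.task.partition" (0.20-era property names). If the job runs under Hadoop Streaming, the same properties appear in each task's environment with dots turned into underscores:

    import os

    # Hadoop Streaming exports job configuration into the task's
    # environment, with '.' replaced by '_':
    task_id = os.environ.get('mapred_task_id')               # e.g. attempt_..._m_000003_0
    partition = int(os.environ.get('mapred_task_partition', '0'))

    # Hypothetical use as an offset into a shared file:
    offset = partition * 1024   # one fixed-size slot per map task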

Is MapReduce one form of Continuation-Passing Style (CPS)?

As the title says. I was reading Yet Another Language Geek: Continuation-Passing Style and I was sort of wondering whether MapReduce can be categorized as one form of Continuation-Passing Style, aka CPS. I am also wondering how CPS can utilise more than one computer to perform complex computation. Maybe CPS makes it easier to work with Actor...
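For readers to whom the analogy is not obvious, a toy illustration: in CPS a function never returns a value, it hands the value to a continuation, and a map stage whose continuation is the reduce stage has the same shape. A sketch only (not anyone's actual MapReduce API):

    # CPS: each function takes an extra continuation argument k
    # instead of returning; reduce acts as map's continuation.
    def map_cps(items, fn, k):
        k([fn(x) for x in items])

    def reduce_cps(values, op, init, k):
        acc = init
        for v in values:
            acc = op(acc, v)
        k(acc)

    map_cps([1, 2, 3], lambda x: x * x,
            lambda squares: reduce_cps(squares, lambda a, b: a + b, 0, print))
    # prints 14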

file to map/reduce program

Hi, I am working on extracting parts of speech (POS) from XML documents, and I have an englishPCFG.ser.gz file which is used in extracting POS from the XML files. I cannot send this .gz file as input in the HDFS directory, but my program uses it for parsing the XML files. The file is in my local directory. I am getting a "File Not Found" error when I ru...
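The usual fix for this class of error is to ship the side file with the job instead of reading a local path: in Java that is the DistributedCache (DistributedCache.addCacheFile), and with Hadoop Streaming the -file flag copies it into every task's working directory. A streaming-side sketch, with the grammar file name taken from the question and everything else assumed:

    import os

    # Launched along the lines of:
    #   hadoop jar hadoop-streaming.jar \
    #     -file englishPCFG.ser.gz \
    #     -mapper mapper.py -input in/ -output out/
    # The shipped file then appears in the task's working directory.
    model_path = os.path.join(os.getcwd(), 'englishPCFG.ser.gz')
    if not os.path.exists(model_path):
        raise IOError('grammar file was not shipped with the job')
    # ... hand model_path to the XML/POS parser ...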

MongoDB: What's the point of using MapReduce without parallelism?

Quoting http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-Parallelism:

    As of right now, MapReduce jobs on a single mongod process are single
    threaded. This is due to a design limitation in current JavaScript
    engines. We are looking into alternatives to solve this issue, but for
    now if you want to parallelize your M...

Amazon Elastic MapReduce: Failed to create a job flow with a large number of instances

Hi, Every time I attempt to create a job flow with more than 20 instances, the creation fails. It works for me most of the time with fewer than 20 instances. Is there any limitation on the number of instances allowed for a job flow? By the way, I use the EMR CLI:

    ruby elastic-mapreduce --create --alive --key-pair key --num-instances 30 ...

Using MongoDB's map/reduce to "group by" two fields

I need something slightly more complex than the examples in the MongoDB docs and I can't seem to wrap my head around it. Say I have a collection of objects of the form

    {date: "2010-10-10", type: "EVENT_TYPE_1", user_id: 123, ...}

Now I want to get something similar to a SQL GROUP BY query, grouping over both date and type. T...
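The usual trick is to emit a document as the map key so it can carry both fields; grouping then happens over (date, type) pairs, like GROUP BY date, type. A sketch through pymongo's Collection.map_reduce (a pymongo 3-era API, removed in pymongo 4; database and collection names are placeholders):

    from pymongo import MongoClient
    from bson.code import Code

    db = MongoClient().test   # placeholder database

    # Composite key: both fields ride along in the key document.
    mapf = Code('''
        function () {
            emit({ date: this.date, type: this.type }, { count: 1 });
        }''')

    reducef = Code('''
        function (key, values) {
            var total = 0;
            values.forEach(function (v) { total += v.count; });
            return { count: total };
        }''')

    out = db.events.map_reduce(mapf, reducef, 'events_by_date_type')
    for doc in out.find():
        print(doc['_id'], doc['value'])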