mapreduce

MongoDB MapReduce: Global variables within map function instance?

I've written a MapReduce in MongoDB and would like to use a global variable as a cache to write to/read from. I know it is not possible to have global variables across map function instances - I just want a global variable within each function instance. This type of functionality exists in Hadoop's MapReduce so I was expecting it to be t...
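The Hadoop pattern being asked for (per-task state initialized once, then reused across calls) can be sketched in plain Python; all class and method names below are illustrative. On the MongoDB side there is no direct equivalent, though the mapReduce command's optional `scope` field can inject read-mostly values into the map function's environment.

```python
# Illustrative sketch: Hadoop-style per-mapper state, simulated in plain Python.
# The cache lives on the mapper instance, so it is shared across all calls to
# map() within one task, but not across tasks.

class ExpensiveLookupMapper:
    """One instance per 'map task'; the cache lives for the task's lifetime."""

    def setup(self):
        # Runs once per task, like Hadoop's Mapper.setup()/configure().
        self.cache = {}

    def lookup(self, key):
        # Simulated expensive computation, memoized in the instance cache.
        if key not in self.cache:
            self.cache[key] = key * key
        return self.cache[key]

    def map(self, record):
        return record, self.lookup(record)

mapper = ExpensiveLookupMapper()
mapper.setup()
results = [mapper.map(r) for r in [2, 3, 2]]  # the repeat hits the cache
```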

Using map/reduce for mapping the properties in a collection

Update: follow-up to MongoDB Get names of all keys in collection. As pointed out by Kristina, one can use MongoDB's map/reduce to list the keys in a collection: db.things.insert( { type : ['dog', 'cat'] } ); db.things.insert( { egg : ['cat'] } ); db.things.insert( { type : [] }); db.things.insert( { hello : [] } ); mr = db.runComm...
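The shape of that key-listing map/reduce can be simulated in plain Python: the map phase emits each top-level field name, and the reduce phase collapses duplicates. The `map_fn` name and in-memory `docs` list are stand-ins for the JS map function and the `things` collection.

```python
# A plain-Python simulation of the key-listing map/reduce from the question:
# map emits each top-level field name, reduce collapses duplicates.
from itertools import groupby

docs = [{"type": ["dog", "cat"]}, {"egg": ["cat"]}, {"type": []}, {"hello": []}]

def map_fn(doc):
    for key in doc:          # the real JS map would also skip _id
        yield key, None

# sort + groupby plays the role of the shuffle; one group per distinct key
emitted = sorted(kv for doc in docs for kv in map_fn(doc))
keys = [k for k, _ in groupby(emitted, key=lambda kv: kv[0])]
```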

Mongo Map Reduce first time

Hello guys, first-time Map/Reduce user here, using MongoDB. I have a lot of page visit data which I'd like to make some sense of by using Map/Reduce. Below is basically what I want to do, but as a total beginner at Map/Reduce, I think this is above my knowledge! Go through all the pages with visits in the last 30 days, and where ex...
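The first part of what's described (count visits per page over the last 30 days) has a standard map/reduce shape, sketched here in plain Python; the field names (`page`, `at`) and dates are invented for illustration.

```python
# Hypothetical sketch in plain Python: the map phase would emit (page, 1) for
# each visit within the last 30 days, and the reduce phase sums the 1s.
from datetime import datetime, timedelta
from collections import defaultdict

now = datetime(2010, 6, 30)
visits = [
    {"page": "/home",  "at": now - timedelta(days=3)},
    {"page": "/home",  "at": now - timedelta(days=10)},
    {"page": "/about", "at": now - timedelta(days=45)},  # too old, filtered out
]

cutoff = now - timedelta(days=30)
counts = defaultdict(int)
for v in visits:                      # "map": emit (page, 1) for recent visits
    if v["at"] >= cutoff:
        counts[v["page"]] += 1        # "reduce": sum the 1s per page
```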

Hadoop 0.18.3 API with java 5

I'm using the Hadoop 0.18.3 version in combination with Java 5, and I'm trying to run the WordCount v1.0 example of the http://hadoop.apache.org/common/docs/r0.20.1/mapred_tutorial.html. But I get the following error: 0/06/10 15:28:10 WARN fs.FileSystem: uri=file:/// javax.security.auth.login.LoginException: Login failed: CreateP...

javax.security.auth.login.LoginException: Login failed

I'm trying to run a Hadoop job (version 0.18.3) on my Windows machine, but I get the following error: Caused by: javax.security.auth.login.LoginException: Login failed: CreateProcess: bash -c groups error=2 at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250) at org.apache.hadoop.s...

Map Reduce job on Amazon: argument for custom jar

Hi all, this is one of my first tries with Map Reduce on AWS in its Management Console. I have uploaded my runnable jar, developed on Hadoop 0.18, to AWS S3, and it works on my local machine. As described in the documentation, I have passed the S3 paths for input and output as arguments of the jar: all right, but the problem is the third argume...

Best way to do one-to-many "JOIN" in CouchDB

There are CouchDB documents that are list elements: { "type" : "el", "id" : "1", "content" : "first" } { "type" : "el", "id" : "2", "content" : "second" } { "type" : "el", "id" : "3", "content" : "third" } There is one document that defines the list: { "type" : "list", "elements" : ["2","1"] , "id" : "abc123" } As you can see th...
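A common CouchDB answer to this shape of problem (assumed here, not taken from the question itself) is to emit one view row per element *from the list document*, keyed by `[list_id, position]`, and query with `include_docs=true` so each row resolves to the element document. The collation-and-resolve step can be simulated in plain Python:

```python
# Sketch: the view emits (list_id, position) -> element_id from the list doc,
# so rows come back in list order; include_docs then pulls each element doc.

docs = [
    {"type": "el", "id": "1", "content": "first"},
    {"type": "el", "id": "2", "content": "second"},
    {"type": "el", "id": "3", "content": "third"},
    {"type": "list", "elements": ["2", "1"], "id": "abc123"},
]

def map_fn(doc):
    if doc["type"] == "list":
        for pos, el_id in enumerate(doc["elements"]):
            yield (doc["id"], pos), el_id   # key sorts by list id, then position

rows = sorted(kv for doc in docs for kv in map_fn(doc))

# include_docs=true equivalent: resolve each emitted element id to its document
by_id = {d["id"]: d for d in docs if d["type"] == "el"}
ordered = [by_id[el_id]["content"] for _, el_id in rows]
```

The element documents themselves never need to know their position; only the list document is re-indexed when the order changes.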

Is there a way to configure timeout for speculative execution in Hadoop?

I have a Hadoop job with tasks that are expected to run for a significant length of time (a few minutes). However, Hadoop starts speculative execution too soon. I do not want to turn speculative execution off completely, but I want to increase the duration of time Hadoop waits before considering a task for speculative execution. Is there a config option...
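For reference, the per-job switches that did exist in the 0.18/0.20 era are the two boolean properties below; a tunable "wait this long before speculating" option is not among the configuration documented for those versions, so disabling speculation per job is the usual fallback.

```xml
<!-- mapred-site.xml, or set the same properties on the JobConf per job -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
```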

C#: Farm out jobs to worker processes on a multi-processor machine

Hi there, I have a generic check that needs to be run on ca. 1000 objects. The check takes about 3 seconds. We have a server with 4 processors (and we also have other multi-processor servers in our network) so we would like to create an exe / dll to do the checking and return the results to the "master". Does anyone know of a framework...

In MongoDB, how can I replicate this simple query using map/reduce in ruby?

Hi, So using the regular MongoDB library in Ruby I have the following query to find average filesize across a set of 5001 documents: avg = 0 total = collection.count() Rails.logger.info "#{total} asset creation stats in the system" collection.find().each {|row| avg += (row["filesize"] * (1/total.to_f)) if row["filesize"]} ...
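An average needs care in map/reduce because the reduce function must be associative: the usual trick is to carry (sum, count) pairs through reduce and divide only at the end (in MongoDB, in the `finalize` function). A plain-Python sketch of that shape, with invented document data:

```python
# Carry (sum, count) through reduce; divide once at the end ("finalize").

docs = [{"filesize": 100}, {"filesize": 300}, {"no_size": True}, {"filesize": 200}]

def map_fn(doc):
    if "filesize" in doc:            # mirror the `if row["filesize"]` guard
        yield "avg", (doc["filesize"], 1)

def reduce_fn(values):
    # Associative: partial (sum, count) pairs can be re-reduced safely.
    total = sum(s for s, _ in values)
    count = sum(c for _, c in values)
    return total, count

emitted = [v for d in docs for _, v in map_fn(d)]
total, count = reduce_fn(emitted)
average = total / count              # the finalize step
```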

Map Reduce: ChainMapper and ChainReducer

Hi all. I need to split my Map Reduce jar file into two jobs in order to get two different output files, one from each reducer of the two jobs. I mean that the first job has to produce an output file that will be the input for the second job in the chain. I read something about ChainMapper and ChainReducer in Hadoop version 0.20 (currently ...

Storage of parsed log data in hadoop and exporting it into relational DB

I have a requirement to parse both Apache access logs and Tomcat logs one after another using map reduce. A few fields are being extracted from the Tomcat log and the rest from the Apache log. I need to merge/map the extracted fields based on the timestamp and export these mapped fields into a traditional relational DB (e.g., MySQL). I can parse and e...
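The merge-by-timestamp step is essentially a reduce-side join, sketched here in plain Python with invented field names; each parser emits (timestamp, partial-record) and the "reduce" merges partials sharing a timestamp into one row ready for a relational insert (Sqoop or a JDBC-based OutputFormat would handle the actual export).

```python
# Shuffle groups records by timestamp; reduce merges the field maps.
from collections import defaultdict

apache_records = [("2010-06-10T15:28", {"ip": "1.2.3.4", "status": 200})]
tomcat_records = [("2010-06-10T15:28", {"servlet": "Search", "ms": 120})]

merged = defaultdict(dict)
for ts, fields in apache_records + tomcat_records:
    merged[ts].update(fields)        # reduce: union of fields per timestamp

rows = [dict(timestamp=ts, **fields) for ts, fields in sorted(merged.items())]
```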

How to use a LotusScript function as a document selection routine

Can we use a LotusScript function as a document selection routine inside a view selection formula? Here is my LotusScript function, which determines the selection criteria: Function MyFilter(doc As NotesDocument) As Boolean 'very complex filtering function '........ End Function and here is the view selection formula that I want to incorpo...

PHP vs. Other Languages in Hadoop/MapReduce implementations, and in the Cloud generally.

I'm beginning to learn some Hadoop/MapReduce, coming mostly from a PHP background, with a little bit of Java and Python. But, it seems like most implementations of MapReduce out there are in Java, Ruby, C++ or Python. I've looked, and it looks like there are some Hadoop/MapReduce in PHP, but the overwhelming body of the literature se...

Adjacency List structure in HBase

I'm trying to implement the following graph reduction algorithm in HBase. The graph is an undirected weighted graph. I want to strip away all nodes with only two neighbors and update the weights. Have a look at the following illustration: the algorithm shall transform the upper graph into the lower one. Eliminate node 2 and update the weig...
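Independent of the HBase storage question, the reduction itself can be sketched on an in-memory adjacency map (the same adjacency-list rows an HBase table would hold); all names here are illustrative, and the behavior when the spliced edge already exists (here: overwrite) is one of several reasonable choices.

```python
# Repeatedly splice out any node with exactly two neighbors, connecting the
# neighbors with the summed weight of the two removed edges.

def reduce_graph(edges):
    # edges: dict mapping node -> {neighbor: weight}; undirected, kept symmetric
    adj = {n: dict(nbrs) for n, nbrs in edges.items()}
    changed = True
    while changed:
        changed = False
        for node in list(adj):
            nbrs = adj[node]
            if len(nbrs) == 2:
                (a, w1), (b, w2) = nbrs.items()
                # connect a-b with the combined weight, drop the middle node
                adj[a][b] = adj[b][a] = w1 + w2
                del adj[a][node], adj[b][node], adj[node]
                changed = True
    return adj

# node 2 sits between 1 and 3, so it gets eliminated: 1-(3)-2-(4)-3 -> 1-(7)-3
g = {1: {2: 3}, 2: {1: 3, 3: 4}, 3: {2: 4}}
reduced = reduce_graph(g)
```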

Can I get individually sorted Mapper outputs from Hadoop when using zero Reducers?

I have a job in Hadoop 0.20 that needs to operate on large files, one at a time. (It's a pre-processing step to get file-oriented data into a cleaner, line-based format more suitable for MapReduce.) I don't mind how many output files I have, but each Map's output can be in at most one output file, and each output file must be sorted. ...

Counting Unique Users using Mapreduce for Java Appengine

I'm trying to count the number of unique users per day on my java appengine app. I have decided to use the mapreduce framework (mapreduce.appspot.com) for java appengine to do this calculation offline. I've managed to create a map reduce job that goes through all of my entities which represent a single users session event. I can use a si...
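The shuffle/reduce shape for this count can be sketched in plain Python (names and sample data are invented; the App Engine MapReduce library's actual API differs): map each session event to (day, user_id), then count the distinct users within each day's group.

```python
# Group user ids by day, then take the set size per day; the set de-duplicates
# repeat sessions from the same user on the same day.
from collections import defaultdict

events = [
    ("2010-06-10", "alice"), ("2010-06-10", "bob"), ("2010-06-10", "alice"),
    ("2010-06-11", "bob"),
]

users_per_day = defaultdict(set)
for day, user in events:          # map + shuffle: collect user ids per day
    users_per_day[day].add(user)

unique_counts = {day: len(users) for day, users in users_per_day.items()}
```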

How to combine multiple Hadoop MapReduce Jobs into one?

I have a massive amount of input data (that's why I use Hadoop) and there are multiple tasks that can be solved with various MapReduce steps of which the first mapper needs all the data as input. My goal: Compute these different tasks as fast as possible. I currently let them run sequentially each reading in all the data. I assume it ...
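One hedged way to share the single expensive input scan across tasks is a combined mapper that tags each emit with the task it belongs to, with the reducer dispatching on the tag; in Hadoop the tag would become part of the key (or select an output via MultipleOutputs). A plain-Python sketch with invented task names:

```python
# One pass over the input serves two "tasks": a sum and a max.
from collections import defaultdict

records = [3, 4, 5]

def combined_map(rec):
    yield ("task_sum", rec)        # task 1: total
    yield ("task_max", rec)        # task 2: maximum

groups = defaultdict(list)         # the shuffle: group values by task tag
for rec in records:
    for tag, value in combined_map(rec):
        groups[tag].append(value)

results = {                        # the "reduce" per task tag
    "task_sum": sum(groups["task_sum"]),
    "task_max": max(groups["task_max"]),
}
```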

Sequence file name being used as key in Hadoop output?

I'm trying to use Dumbo/Hadoop to calculate TF-IDF for a bunch of small text files using this example http://dumbotics.com/2009/05/17/tf-idf-revisited/ To improve efficiency, I've packaged the text files into a sequence file using Stuart Sierra's tool -- http://stuartsierra.com/2008/04/24/a-million-little-files The sequence file uses m...

Reduce-Side join in MapReduce

Hi everyone, can anyone give an illustration of how to write a program for a reduce-side join? The reduce-side join provided by Hadoop is a sort-merge join. How can I write a hash-join algorithm for a reduce-side join? Best ...
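The core of a reduce-side hash join can be sketched in plain Python (tags and data are invented): values arrive in one reduce group tagged by source table; rows from the designated build side are buffered, then each probe-side row is matched against them. Within a single group all join keys are equal, so the buffered list is effectively one hash bucket; with secondary sort in real Hadoop, the build rows arrive first and the probe side can stay streamed.

```python
# Hash join inside one reduce group: buffer the build side ('L'), then pair
# every probe-side row ('R') with every buffered build row.

def hash_join_reduce(values):
    build, probe = [], []
    for tag, row in values:
        (build if tag == "L" else probe).append(row)
    return [(l, r) for r in probe for l in build]

group = [("L", {"dept": "eng"}), ("R", {"user": "a"}), ("R", {"user": "b"})]
joined = hash_join_reduce(group)
```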