mapreduce

MapReduce Distributed Cache

Hi, I am adding a file to the Hadoop distributed cache using Configuration cng = new Configuration(); JobConf conf = new JobConf(cng, Driver.class); DistributedCache.addCacheFile(new Path("DCache/Orders.txt").toUri(), cng); where DCache/Orders.txt is the file in HDFS. When I try to retrieve this file from the cache in c...
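
A minimal sketch of how such a cached file is typically read back inside a mapper with the old JobConf API, assuming the file was added as in the snippet above; the class name and record layout below are illustrative, not the poster's code:

    // Sketch only: reading a DistributedCache file inside a mapper (old JobConf API).
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;

    public class CacheAwareMapper extends MapReduceBase {
        private final Map<String, String> orders = new HashMap<String, String>();

        @Override
        public void configure(JobConf job) {
            try {
                // Local, task-node copies of every file added with addCacheFile().
                Path[] cached = DistributedCache.getLocalCacheFiles(job);
                if (cached == null || cached.length == 0) {
                    return; // nothing was distributed to this task
                }
                BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
                String line;
                while ((line = in.readLine()) != null) {
                    String[] parts = line.split("\t", 2);   // illustrative record layout
                    orders.put(parts[0], parts.length > 1 ? parts[1] : "");
                }
                in.close();
            } catch (IOException e) {
                throw new RuntimeException("Failed to read distributed cache file", e);
            }
        }
        // map() is omitted; it can now look records up in the in-memory 'orders' map.
    }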

Partitioned Data Map/Reduce

Hello everyone, I have written a custom partitioner for partitioning datasets. I want to partition two datasets using the same partitioner and then, in the next MapReduce job, I want each mapper to handle the same partition from the two sources and perform some function on it, such as a join. How can I ensure that one mapper gets the sp...
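
A minimal sketch, assuming Text keys and the old mapred API, of a partitioner both jobs could share; as long as both jobs also run with the same number of reduce tasks, partition i of one dataset lines up with partition i of the other (the class name is illustrative):

    // Sketch only: a partitioner shared by both jobs so identical keys always
    // land in the same numbered partition.
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class SharedPartitioner implements Partitioner<Text, Text> {
        @Override
        public void configure(JobConf job) {
            // No configuration needed for this sketch.
        }

        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            // Same formula in both jobs => the same key maps to the same partition
            // number, so partition i of dataset A lines up with partition i of B.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }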

how to find the id of each map task?

Hello, I want to get the ID of each mapper and reducer task because I want to tag the output of these mappers and reducers according to the mapper and reducer ID. How can I retrieve the IDs of each? Thanks ...
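
One common route with the old JobConf API is to read the task attempt ID that the framework sets for each task; a minimal sketch ("mapred.task.id" is the classic 0.18/0.19-era property name, and the class name here is illustrative):

    // Sketch only: recovering the task id inside configure() (old JobConf API).
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.TaskAttemptID;

    public class TaggingMapper extends MapReduceBase {
        private String attemptId; // e.g. attempt_200901010000_0001_m_000003_0
        private int taskNumber;   // e.g. 3 -- usable as a tag for this task's output

        @Override
        public void configure(JobConf job) {
            attemptId = job.get("mapred.task.id");
            taskNumber = TaskAttemptID.forName(attemptId).getTaskID().getId();
        }
        // map()/reduce() can now prefix every emitted record with taskNumber.
    }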

Tools for optimizing scalability of a Hadoop application?

I'm working with a team of mine on a small application that takes a lot of input (a day's worth of logfiles) and produces useful output after several (now 4, in the future perhaps 10) map-reduce steps (Hadoop & Java). Now I've done a partial POC of this app and run it on 4 old desktops (my Hadoop test cluster). What I've noticed is that if you ...

Compact way of representing all valid "rows" in a tic-tac-toe grid

I've been writing tic-tac-toe in a variety of languages as an exercise, and one pattern that has emerged is that every representation I've come up with for defining the valid winning rows has been disappointingly hard-coded. They've generally fallen into two categories: First, the board is represented as a one- or two-dimensional array...
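
For reference, one compact alternative that avoids listing every triple by hand is to treat the 3x3 board as a 9-bit mask and derive the eight winning lines; a hedged sketch in Java (any language with bit operations works the same way):

    // Sketch only: winning lines as bitmasks over cells 0..8 (row-major).
    public class TicTacToeLines {
        // Returns the 8 winning lines, derived rather than hard-coded one by one.
        public static int[] winningMasks() {
            int[] masks = new int[8];
            int n = 0;
            for (int i = 0; i < 3; i++) {
                masks[n++] = 0b111 << (3 * i);                           // row i
                masks[n++] = (1 << i) | (1 << (i + 3)) | (1 << (i + 6)); // column i
            }
            masks[n++] = (1 << 0) | (1 << 4) | (1 << 8);                 // main diagonal
            masks[n++] = (1 << 2) | (1 << 4) | (1 << 6);                 // anti-diagonal
            return masks;
        }

        // A player (a bitmask of the cells they occupy) has won if any winning
        // mask is fully contained in their cells.
        public static boolean hasWon(int playerCells) {
            for (int mask : winningMasks()) {
                if ((playerCells & mask) == mask) {
                    return true;
                }
            }
            return false;
        }
    }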

getting the partition number in a reducer to which a key-value pair belongs

Hello everyone, when I am processing a given key-{set of values} pair in the reducer function, how can I get the partition number to which this key-{set of values} pair belongs? How is it possible to get this partition number without attaching extra information about the partition number to each key-value pair during partitioning? Cheers ...
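
A minimal sketch of one way this is often handled with the old JobConf API: a reduce task processes exactly one partition, and that partition number is exposed to the task through its configuration, so nothing extra has to travel with the keys ("mapred.task.partition" is the classic property name; the class name is illustrative):

    // Sketch only: a reducer that learns its own partition number in configure().
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;

    public class PartitionAwareReducer extends MapReduceBase {
        private int partition;

        @Override
        public void configure(JobConf job) {
            // Every key/value group this reducer sees belongs to this partition.
            partition = job.getInt("mapred.task.partition", -1);
        }
        // reduce() is omitted; it can use 'partition' directly when processing groups.
    }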

How to rename the output file of each reduce task to the partition number

Hi Everyone, I am having a problem with naming the output file of each reduce task after the partition number. How can I name the output file with that partition number? I have looked at MultipleTextOutputFormat. It can generate a new file with a name of my choice for each key, but I want to name the output file for each ...
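
A hedged sketch of one route with the old API: subclass MultipleTextOutputFormat and rewrite the default leaf name, which already ends in the partition number (the class name and the exact naming scheme below are illustrative):

    // Sketch only: naming each reducer's output file after its partition number.
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    public class PartitionNamedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
        @Override
        protected String generateFileNameForKeyValue(Text key, Text value, String name) {
            // 'name' arrives as the default leaf name, e.g. "part-00007";
            // keep only the numeric partition suffix.
            return name.substring(name.lastIndexOf('-') + 1);
        }
    }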

Which class that looks like MapWritable can be used as the Key in a Hadoop MapReduce program?

I'm writing a program in Java using Hadoop 0.18. Now I would like to use a Map (HashMap/TreeMap/...) kind of data structure as the Key in my MapReduce processing. I haven't yet been able to find an official Hadoop class that is essentially a MapWritableComparable (i.e. implements Map, Writable and Comparable). So for my first tests I ...
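
For what it's worth, Hadoop of that era does not appear to ship such a class, so one workaround is a small hand-rolled wrapper around a TreeMap, whose sorted order makes both serialization and compareTo() deterministic. A minimal sketch (class name and String-to-String layout are illustrative; for production use you would normally also register a faster RawComparator):

    // Sketch only: an illustrative WritableComparable map key (not an official Hadoop class).
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.Iterator;
    import java.util.Map;
    import java.util.TreeMap;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;

    public class TextMapWritableComparable
            implements WritableComparable<TextMapWritableComparable> {

        // TreeMap keeps entries in key order, so write() and compareTo() are deterministic.
        private final TreeMap<String, String> entries = new TreeMap<String, String>();

        public void put(String key, String value) { entries.put(key, value); }
        public Map<String, String> asMap()        { return entries; }

        public void write(DataOutput out) throws IOException {
            out.writeInt(entries.size());
            for (Map.Entry<String, String> e : entries.entrySet()) {
                Text.writeString(out, e.getKey());
                Text.writeString(out, e.getValue());
            }
        }

        public void readFields(DataInput in) throws IOException {
            entries.clear();
            int n = in.readInt();
            for (int i = 0; i < n; i++) {
                entries.put(Text.readString(in), Text.readString(in));
            }
        }

        public int compareTo(TextMapWritableComparable other) {
            // Walk both maps in key order; the first differing key or value decides.
            Iterator<Map.Entry<String, String>> a = entries.entrySet().iterator();
            Iterator<Map.Entry<String, String>> b = other.entries.entrySet().iterator();
            while (a.hasNext() && b.hasNext()) {
                Map.Entry<String, String> ea = a.next();
                Map.Entry<String, String> eb = b.next();
                int c = ea.getKey().compareTo(eb.getKey());
                if (c == 0) c = ea.getValue().compareTo(eb.getValue());
                if (c != 0) return c;
            }
            return Integer.compare(entries.size(), other.entries.size());
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof TextMapWritableComparable
                    && entries.equals(((TextMapWritableComparable) o).entries);
        }

        @Override
        public int hashCode() { return entries.hashCode(); }
    }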

Map-Side Join Algorithm for MapReduce

Hi Everyone, I am trying to use Hadoop's map-side join via CompositeInputFormat, but I get an IOException: "Unmatched ')'". I guess there may be a problem with the format of my input files. I have formatted the input files manually so that the keys are in sorted order in both input files. Is this correct or do I have to pas...
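
For context, the "Unmatched ')'" comes from the join-expression parser rather than from the data itself, so letting compose() build the expression is usually the safer route. A minimal sketch for the old org.apache.hadoop.mapred.join API (paths and input format are illustrative; both inputs still need to be sorted by key and partitioned identically):

    // Sketch only: configuring a map-side join with CompositeInputFormat (old API).
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;
    import org.apache.hadoop.mapred.join.CompositeInputFormat;

    public class MapSideJoinDriver {
        public static void configureJoin(JobConf conf, Path left, Path right) {
            conf.setInputFormat(CompositeInputFormat.class);
            // compose() emits a well-formed join expression, avoiding hand-written
            // parentheses in mapred.join.expr.
            conf.set("mapred.join.expr",
                     CompositeInputFormat.compose("inner",
                                                  KeyValueTextInputFormat.class,
                                                  left, right));
        }
    }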

Where does the Hadoop MapReduce framework send my System.out.print() statements? (stdout)

I want to debug a MapReduce script and, without going into much trouble, tried to put some print statements in my program. But I can't seem to find them in any of the logs. ...

Large scale Machine Learning

I need to run various machine learning techniques on a big dataset (10-100 billion records). The problems are mostly around text mining/information extraction and include various kernel techniques, but are not restricted to them (we use some Bayesian methods, bootstrapping, gradient boosting, regression trees -- many different problems an...

Suggestions for a scalable architecture solution to large data problem

Hi folks, I am in the process of building/architecting a business social network web application that has a component which I think will lead to major scalability issues, and I'd like to get some feedback/thoughts on the best way forward. The application has a User object. The idea is that every time a new user joins the system he ranks...

Partitioner without Mapper

I have, essentially, a series of reduce jobs that I am running on a lot of data using Hadoop Streaming. I am not really using my Mappers for anything, so I am just using Identity Mappers, but I do need the default partitioner Hadoop is giving me to group my data in a different manner for each step of my MR job. I don't know enough about the system we...

MongoDB multi-stage MapReduce

I'm writing a MapReduce script for my MongoDB database. The computation requires two MapReduce stages. Currently I write to an output collection and then run the second stage on that collection. Is it possible to chain MapReduce jobs together without having to manually specify the output collection? ...
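
A minimal sketch of the usual workaround with the 2.x-era MongoDB Java driver, where stage one writes to a named collection and stage two simply reads that collection back in; the map/reduce bodies and collection names are placeholders, not the poster's code:

    // Sketch only: two chained map-reduce stages via an intermediate collection.
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.MapReduceCommand;
    import com.mongodb.MapReduceOutput;
    import com.mongodb.Mongo;

    public class TwoStageMapReduce {
        public static void main(String[] args) throws Exception {
            DB db = new Mongo("localhost").getDB("mydb");

            // Stage 1: raw documents -> intermediate collection "stage1_out".
            DBCollection events = db.getCollection("events");
            MapReduceCommand stage1 = new MapReduceCommand(
                    events,
                    "function() { emit(this.userId, 1); }",
                    "function(key, values) { return Array.sum(values); }",
                    "stage1_out", MapReduceCommand.OutputType.REPLACE, null);
            MapReduceOutput firstPass = events.mapReduce(stage1);

            // Stage 2: run over stage 1's output collection.
            DBCollection intermediate = firstPass.getOutputCollection();
            MapReduceCommand stage2 = new MapReduceCommand(
                    intermediate,
                    "function() { emit(this.value, 1); }",
                    "function(key, values) { return Array.sum(values); }",
                    "stage2_out", MapReduceCommand.OutputType.REPLACE, null);
            intermediate.mapReduce(stage2);
        }
    }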

Which key class is suitable for secondary sort?

In Hadoop you can use the secondary-sort mechanism to sort the values before they are sent to the reducer. The way this is done in Hadoop is that you append the value you want to sort by to the key, and then provide custom grouping and key-comparison methods that hook into the sorting system. So you'll need to have a key that consists essentially of bo...
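
A minimal sketch of the usual shape of such a composite key (names are illustrative): compareTo() orders on both parts, while the partitioner and a separate grouping comparator, not shown here, look only at the natural key so that all of its values reach the same reduce call:

    // Sketch only: a composite key for secondary sort.
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableComparable;

    public class CompositeKey implements WritableComparable<CompositeKey> {
        private String naturalKey;   // what the reducer really groups on
        private long   sortValue;    // what the values should be ordered by

        public CompositeKey() { }
        public CompositeKey(String naturalKey, long sortValue) {
            this.naturalKey = naturalKey;
            this.sortValue = sortValue;
        }

        public String getNaturalKey() { return naturalKey; }

        public void write(DataOutput out) throws IOException {
            out.writeUTF(naturalKey);
            out.writeLong(sortValue);
        }

        public void readFields(DataInput in) throws IOException {
            naturalKey = in.readUTF();
            sortValue = in.readLong();
        }

        public int compareTo(CompositeKey other) {
            int c = naturalKey.compareTo(other.naturalKey);
            return c != 0 ? c : Long.compare(sortValue, other.sortValue);
        }

        @Override
        public int hashCode() { return naturalKey.hashCode(); } // partition on natural key only

        @Override
        public boolean equals(Object o) {
            return o instanceof CompositeKey && compareTo((CompositeKey) o) == 0;
        }
    }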

MongoDB MapReduce returning no data in PHP

Hi all, I'm using a Mongo MapReduce to perform a word-count operation on a bunch of documents. The documents are very simple (just an ID and a hash of words): { "_id" : 6714078, "words" : { "my" : 1, "cat" : 1, "john" : 1, "likes" : 1, "cakes" : 1 } } { "_id" : 6715298, "words" : { "jeremy" : 1, "kicked" : 1, "the" : 1, "ball" : 1 } } ...

query multiple date ranges for a certain date range

Hello, I have a problem with MapReduce and complex date ranges. I have database entries like this: { name: x, ranges: [ {from: 2010-1-1, to: 2010-1-15}, {from: 2010-1-17, to: 2010-1-20}, ] } ... Now I want to query which documents fit into the range 2010-1-10 to 2010-1-18. I am totally stuck on this because every CouchDB exampl...

Text search question about implementation

Hi, can someone explain to me how text-search algorithms work? I understand it's a huge field, but I am trying to understand it at a high level so that I can look up academic papers on it. For example, spelling mistakes are one problem that is tough to solve, and of course Google solves it. When I search for a term and misspell it on Goo...

Hadoop and Eclipse

Hi, I'm trying to implement the PageRank algorithm on the Hadoop platform with Eclipse, but I'm facing some unusual problems :). I tried locally: installed Cygwin, set up Hadoop 0.19.2 (and 0.18.0), started the necessary daemons and installed Eclipse 3.3.1. I uploaded a test .txt file and then tried to run the WordCount example or even a simpl...

Hadoop in Windows: file not found exception

Hi, I'm using Hadoop in Windows and I've configured everything properly (installed Cygwin, passwordless SSH, etc.). I've compiled the WordCount program into WC.jar and tried to run it. It runs perfectly in standalone mode, but in fully distributed mode it gives a FileNotFoundException. Please look into the logs and tell me what is wrong wit...