mapreduce

Where do I start with distributed computing?

I'm interested in learning techniques for distributed computing. As a Java developer, I'd probably start with Hadoop. Could you please recommend some books/tutorials/articles to begin with? ...

Amazon Elastic MapReduce: the number of launched map tasks

Hi, in the "syslog" for a MapReduce job flow step, I see the following: Job Counters: Launched reduce tasks=4, Launched map tasks=39. Does the number of launched map tasks include failed tasks? I am using the NLineInputFormat class as the input format to manage the number of map tasks. However, I get slightly different numbers for exact sa...

Hadoop: Code shipped from master to slave

I launched a Hadoop cluster and submitted a job to the master. The jar file exists only on the master. Does Hadoop ship the jar to all the slave machines at the start of the job? Is there a possibility that a slave machine will run with a previous version of the code shipped during the last run? Thank you, Bala ...

MapReduce programming system in java-actionscript

Just finished reading ch23 in the excellent 'Beautiful Code' http://oreilly.com/catalog/9780596510046 on Distributed Programming with MapReduce. I understand that MapReduce is a programming system designed for large-scale data processing problems, but I have a hard time getting my head around the basic examples given and how I might app...

How does Hadoop perform input splits?

Hi, this is a conceptual question involving Hadoop/HDFS. Let's say you have a file containing 1 billion lines. And for the sake of simplicity, let's consider that each line is of the form <k,v> where k is the offset of the line from the beginning and v is the content of the line. Now, when we say that we want to run N map tasks, doe...

Debugging Hadoop applications

Hi, I tried printing out values using System.out.println(), but they won't appear on the console. How do I print out the values in a map/reduce application for debugging purposes using Hadoop? Thanks, Deepak. ...

Hadoop/MapReduce: Reading and writing classes generated from DDL

Hi, can someone walk me through the basic workflow of reading and writing data with classes generated from DDL? I have defined some struct-like records using DDL. For example: class Customer { ustring FirstName; ustring LastName; ustring CardNo; long LastPurchase; } I've compiled this to get a Customer class ...

Do you know of any Python MapReduce-ready clustering libraries?

Do you know of any Python MapReduce-ready clustering libraries? I have found some good libraries in Java (http://lucene.apache.org/mahout/), but I'd prefer to use Python. http://wiki.github.com/klbostee/dumbo/ (Python MapReduce API) Edit --- I'm looking for MapReduce-ready: Canopy, K-means, Mean-shift, etc. ...

I'm familiar with Python and its data structures. Can someone give me a very basic example of how to use Hadoop MapReduce?

What can I do with MapReduce? Dictionaries? Lists? What do I use it for? Please give a really easy example ...
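For readers coming from Python, here is a minimal sketch of the MapReduce idea using the classic word-count problem. It runs in plain Python on one machine and does not use Hadoop at all; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and the `shuffle` step stands in for the grouping a real framework would do between the map and reduce stages.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key (the framework's job)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the list of counts for one word into a total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog", "the fox"]))
```

The point of the split is that `map_phase` looks at one line at a time and `reduce_phase` looks at one key at a time, so a framework is free to run many copies of each on different machines.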

What's the best way to count unique visitors with Hadoop?

Hey all, I'm just getting started with Hadoop and am curious what the best way in MapReduce would be to count unique visitors if your log files looked like this: DATE siteID action username 05-05-2010 siteA pageview jim 05-05-2010 siteB pageview tom 05-05-2010 siteA pageview jim 05-05-2010 siteB pageview bob 05-05-2010 siteA ...
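One possible approach, sketched in plain Python rather than as an actual Hadoop job: the mapper emits the username keyed by (date, siteID), and the reducer counts the distinct usernames per key. The sample uses only the four complete log lines quoted above. Counting per date and site, filtering on `pageview`, and all function names are assumptions made for illustration, not something stated in the question.

```python
from collections import defaultdict

# The four complete log lines from the excerpt (whitespace-separated fields).
LOG_LINES = [
    "05-05-2010 siteA pageview jim",
    "05-05-2010 siteB pageview tom",
    "05-05-2010 siteA pageview jim",
    "05-05-2010 siteB pageview bob",
]


def map_visit(line):
    """Map: emit ((date, site), username) for each pageview line."""
    date, site_id, action, username = line.split()
    if action == "pageview":  # assumption: only pageviews count as visits
        yield (date, site_id), username


def reduce_unique(key, usernames):
    """Reduce: count the distinct usernames seen for one (date, site) key."""
    return key, len(set(usernames))


def unique_visitors(lines):
    """Simulate map, group-by-key and reduce on a single machine."""
    groups = defaultdict(list)
    for line in lines:
        for key, username in map_visit(line):
            groups[key].append(username)
    return dict(reduce_unique(k, v) for k, v in groups.items())


if __name__ == "__main__":
    for (date, site), count in sorted(unique_visitors(LOG_LINES).items()):
        print(date, site, count)
```

Building a set inside the reducer assumes that the usernames for a single key fit in memory; for very large sites a two-stage job (deduplicate first, then count) may be a better fit.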

Global variables in Hadoop

Hi, my program follows an iterative map/reduce approach, and it needs to stop if certain conditions are met. Is there any way I can set a global variable that can be distributed across all map/reduce tasks and check whether the global variable reaches the condition for completion? Something like this. While(Condition != true){ C...

MongoDB map/reduce counts

The output from MongoDB's map/reduce includes something like 'counts': {'input': I, 'emit': E, 'output': O}. I thought I clearly understood what those mean, until I hit a weird case which I can't explain. According to my understanding, counts.input is the number of rows that match the condition (as specified in query). If so, how is it ...

Getting started with massive data

I'm a math guy and occasionally do some statistics/machine learning analysis consulting projects on the side. The data I have access to is usually on the smaller side, at most a couple hundred megabytes (and almost always far less), but I want to learn more about handling and analyzing data on the gigabyte/terabyte scale. What do I n...

Hadoop mapper static initialisation

Hi, I have a code fragment in which I am using a static code block to initialize a variable. public static class JoinMap extends Mapper<IntWritable, MbrWritable, LongWritable, IntWritable> { ....... public static RTree rt = null; static { String rtreeFileName = "R.rtree"; rt...

MongoDB MapReduce to concatenate strings?

All of the MongoDB MapReduce examples I have seen have dealt with counting/adding numbers. I need to combine strings, and it looks like MapReduce is the best tool for the job. I have a large MongoDB collection in this format: {name: userone, type: typeone} {name: usertwo, type: typetwo} {name: userthree, type: typeone} Each name only ...

Alternative to a large database

Hi, I have a database in which a single table holds billions of rows per month, and I have data for the past 5 years. I tried to optimize the data in all possible ways, but the latency is not decreasing. I know there are some solutions like using horizontal sharding and vertical sharding. But I am not sure about any op...

MapReduce in the cloud

Except for Amazon MapReduce, what other options do I have to process a large amount of data? Thank you! ...

FFT algorithm implementation with Hadoop

Hi, I want to implement the Fast Fourier Transform algorithm with Hadoop. I know the recursive FFT algorithm, but I need guidance on implementing it with a Map/Reduce approach. Any suggestions? Thanks. ...

Need help implementing this algorithm with Hadoop MapReduce

Hi all! I have an algorithm that will go through a large data set, read some text files and search for specific terms in those lines. I have it implemented in Java, but I didn't want to post code so that it doesn't look like I am searching for someone to implement it for me, but it is true I really need a lot of help! This was not planned for my...

Hadoop 0.20.2 API with Java 5

I have started a Maven project trying to implement the MapReduce algorithm in Java 1.5.0_14. I have chosen version 0.20.2 of the Hadoop API. In the pom.xml I'm therefore using the following dependency: <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-core</artifactId> <version>0.20.2</version> </depend...