hadoop

Learning open source technologies

I am good at coding in .NET and build some of my own websites in my free time. I also want to start a small software development organization. Recently I have found many open source technologies, such as NoSQL, HBase, and MapReduce, that help build scalable applications. Is it good to learn completely new techno...

Hadoop last map job stuck - Need help

Hi, I am doing some text processing using Hadoop MapReduce jobs. My job is 99.2% complete and stuck on the last map job. The last few lines of the map output are shown below. The last time this problem occurred, I tried printing out the key values emitted from the map and noticed that one of the keys has a large number of values associated...

Ad Hoc Reports Hadoop

Hey guys, I want to allow people to put in simple text search terms, run a Pig job (if that's best? it's what I know best) and output the results (the TSV file results?) so I can show them in a web interface. Is there anything that approaches this problem? Anything known to link a few disjointed pieces of the flow I am going for, toget...

Hadoop WordCount example - Implementing Sorting

I'm a Hadoop newbie. I have been able to successfully run the WordCount example. I would like to modify this example such that my output is sorted in ascending order of count. I'm unable to figure out where I would need to make the necessary changes. It would be great if someone would give me some direction to implement sorting? ...
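
One common way to get output ordered by count is a second pass that makes the count the key, because Hadoop sorts by key between map and reduce. The following is a minimal local sketch of that idea in plain Python, not the WordCount example's own code; the function names `word_count` and `sort_by_count` are hypothetical, and no Hadoop API is used.

```python
from collections import Counter


def word_count(lines):
    """First pass: count occurrences of each whitespace-separated word."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def sort_by_count(counts):
    """Second pass: swap (word, count) to (count, word) and sort.

    This mimics a follow-up job whose mapper emits the count as the key,
    so that the framework's sort-by-key yields ascending counts.
    """
    swapped = [(count, word) for word, count in counts.items()]
    return sorted(swapped)


result = sort_by_count(word_count(["a b a", "c a b"]))
for count, word in result:
    print(f"{count}\t{word}")
```

On a real cluster, keys are sorted within each reducer, so a globally ordered result would need a single reducer or a total-order partitioner.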

Making sense of R-Hive, Elastic MapReduce, RHIPE and Distributed Text mining with R

After having learned about MapReduce for solving a computer vision problem for my recent internship at Google, I felt like an enlightened person. I had been using R for text mining already. I wanted to use R for large scale text processing and for experiments with topic modeling. I started reading tutorials and working on some of those. ...

Help with running Taste Grouplens demo on hadoop

Hi All, I am trying to build a collaborative filtering based Recommendation System as part of an academic project. I think the Mahout project has a lot of potential and I want to use it. I installed Mahout, Hadoop and Java on my Ubuntu 10.1. Hadoop and Java have been checked to be working fine together. (Ran the Hadoop word count example ...

Hadoop Pipes: how to pass large data records to map/reduce tasks

Hello, I'm trying to use map/reduce to process large amounts of binary data. The application is characterized by the following: the number of records is potentially large, such that I don't really want to store each record as a separate file in HDFS (I was planning to concatenate them all into a single binary sequence file), and each rec...

Hadoop examples halt when running in Pseudo-Distributed mode

Everything runs well in Standalone mode, and after switching to pseudo-distributed mode HDFS works well: I can put files to HDFS and browse it. I also checked that there is one DataNode in the live nodes list. However, when I run bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+', the program just halts there wit...

Hadoop streaming and Amazon EMR

I have been attempting to use Hadoop streaming in Amazon EMR to do a simple word count for a bunch of text files. In order to get a handle on Hadoop streaming and on Amazon's EMR, I also took a very simplified data set. Each text file had only one line of text in it (the line could contain an arbitrarily large number of words). The mapper is...
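
For reference, a streaming word count of the kind described usually consists of a mapper that emits tab-separated `word<TAB>1` lines and a reducer that sums the counts for consecutive identical words. The sketch below is an assumption about the general shape, not the asker's actual mapper (which is cut off above); `mapper` and `reducer` are hypothetical names, and the sort between them is simulated locally.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts for runs of identical words.

    Assumes the input is already sorted by word, which is what the
    streaming framework provides between the map and reduce phases.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation of map -> sort -> reduce on a one-line input file.
for record in reducer(sorted(mapper(["b a b"]))):
    print(record)
```

In an actual streaming job the two functions would live in separate scripts that read from standard input and write to standard output.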

Is Hadoop a good open-source project to join?

I've been learning Java for the last 2 months with a Core Java book. Now I want to write something real, but first I decided that I need to improve my knowledge of algorithms and data structures, so I'm currently reading a book on that. I want to join an open-source project which is mature enough to learn from but is still gro...

Similarity join using Hadoop

I'm new to Hadoop. I'd like to run some approaches that I came up with by you. Problem: 2 datasets: A and B. Both datasets represent songs: some top level attributes, titles (1..), performers (1..). I need to match these datasets either using equality or fuzzy algorithms (such as Levenshtein, Jaccard, Jaro-Winkler, etc.) based on ...
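
To make the fuzzy-matching part concrete, here is a minimal sketch of a Jaccard-based match on title tokens. It is an illustration only: the functions `jaccard` and `fuzzy_join`, the whitespace tokenization, and the 0.5 threshold are all assumptions, and the fields to compare are not specified in the excerpt above.

```python
def jaccard(a, b):
    """Jaccard similarity of the lowercase word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty titles are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def fuzzy_join(titles_a, titles_b, threshold=0.5):
    """Naive all-pairs join: keep pairs whose similarity meets the threshold."""
    return [
        (x, y)
        for x in titles_a
        for y in titles_b
        if jaccard(x, y) >= threshold
    ]


matches = fuzzy_join(["Hey Jude"], ["hey jude live", "Let It Be"])
print(matches)
```

Comparing every pair is quadratic, so on large datasets a distributed version would typically first group candidates by some shared key (for example a common token) and only compare within each group.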

Hadoop Streaming in .NET

Hello All, I am running Hadoop in pseudo-distributed mode and using Hadoop streaming to do my map-reduce operations. The problem is that I keep getting a "Streaming Job Failed" error message. The log follows: stderr logs java.io.IOException: Cannot run program "input/StdInOut.exe": CreateProcess error=2, The system cannot find the fil...

How to use the Ruby CLI client to launch a JobFlow based on a JSON JobFlow description on Amazon Elastic MapReduce.

I have written a MapReduce application for Hadoop and tested it at the command line on a single machine. My application uses two steps: Map1 -> Reduce1 -> Map2 -> Reduce2. To run this job on AWS MapReduce, I am following this link: http://aws.amazon.com/articles/2294. But I am not clear on how to use the Ruby CLI client provided by Amazon to do al...

How to execute Mahout with a Hadoop installation

Hi guys, I'm trying to figure out how to run the Mahout jar examples with Hadoop. I configured Mahout and Hadoop; now I go into the Hadoop dir and type something like this: /Users/hadoop/hadoop-0.20.2/bin/hadoop jar /Users/hadoop/trunk/examples/mahout-examples-0.5-SNAPSHOT-job.jar org.apache.mahout.SpareVectorsFromSequenceFile -w -i rating...