hadoop

Multiple lines of text to a single map

I've been trying to use Hadoop to send N amount of lines to a single mapping. I don't require for the lines to be split already. I've tried to use NLineInputFormat, however that sends N lines of text from the data to each mapper one line at a time [giving up after the Nth line]. I have tried to set the option and it only takes N lines...

How can I load a file into a DataBag from within a Yahoo PigLatin UDF?

I have a Pig program where I am trying to compute the minimum center between two bags. In order for it to work, I found I need to COGROUP the bags into a single dataset. The entire operation takes a long time. I want to either open one of the bags from disk within the UDF, or to be able to pass another relation into the UDF without ne...

Having two sets of input combined on hadoop

I have a rather simple hadoop question which I'll try to present with an example say you have a list of strings and a large file and you want each mapper to process a piece of the file and one of the strings in a grep like program. how are you supposed to do that? I am under the impression that the number of mappers is a result of the ...

How to use Mahout in a Windows environment?

I am trying to use Mahout in an application running on Windows. I want to build clusters from a lucene index using k-means. As soon as I have to create sequence files (creating vectors from a lucene index), I get a Hadoop-Exception, since Hadoop makes command line calls to programs unknown in a Windows environment (e.g. chmod). Running ...

opening lucene index stored in hdfs

How to read a lucene index directory stored over HDFS i.e. How to get IndexReader for the index stored over HDFS. The IndexReader is to opened in a map task. Something like: IndexReader reader = IndexReader.open("hdfs/path/to/index/directory"); Thanks, Akhil ...

can i use hadoop cloudera without root access?

a bit of a binary question (okay, not excatly) - but was wondering if one is able to configure cloudera / hadoop to run at the nodes without root shell access to the node computers (although i can setup ssh passwordless login)? appears from their instructions that root access is needed, at yet i found a hadoop wiki which suggest root ac...

Hadoop on Amazon EC2 : Job tracker not starting properly

We are running Hadoop on Amazon EC2 cluster. We start the master, slaves and attach the ebs volumes and finally waiting for hadoop jobtracker, tasktracker etc to start and we have timeout of 3600 seconds. We are noticing 50% of the time that job tracker is not able to start before the timeout. Reason being, hdfs is not initialized proper...

Classnotfound exception while running hadoop

Hi, I am new to hadoop. I have a file Wordcount.java which refers hadoop.jar and stanford-parser.jar I am running the following commnad javac -classpath .:hadoop-0.20.1-core.jar:stanford-parser.jar -d ep WordCount.java jar cvf ep.jar -C ep . bin/hadoop jar ep.jar WordCount gutenburg gutenburg1 After executing i am getting the f...

Nearmap architecture

Looking at http://www.nearmap.com/, Just wondering if you can approximate how much storage is needed to store the images? (NearMap’s monthly city PhotoMaps are captured at 3cm, 5cm, 7.5cm, or 10cm resolution) And what kind of systems/architecture is suitable to deliver those data/images? (say you are not Google, and want to implement t...

Hadoop safemode recovery - taking lot of time

Hi, We are running our cluster on Amazon EC2. we are using cloudera scripts to setup hadoop. On the master node, we start below services. 609 $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start namenode' 610 $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start secondarynamenode' 611 $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon...

Efficient way to store a graph for calculation in Hadoop

I am currently trying to perform calculations like clustering coefficient on huge graphs with the help of Hadoop. Therefore I need an efficient way to store the graph in a way that I can easily access nodes, their neighbors and the neighbors' neighbors. The graph is quite sparse and stored in a huge tab separated file where the first fie...

Where do I start with distributed computing?

I'm interested in learning techniques for distributed computing. As a Java developer, I'm probably willing to start with Hadoop. Could you please recommend some books/tutorials/articles to begin with? ...

Amazon EC2 - network issues

Hi, We are launching hadoop cluster on amazon ec2 and recently we are having network issues like master unable to connect to slave. We thought the reason is due to amazon throttling the network connections over a limit. So, we tried to establish a connection after a random delay from each slave node. But, that didn't help. Are there an...

Amazon Elastic MapReduce: the number of launched map task

Hi, In the "syslog" for a MapReduce job flow step, I see the following: Job Counters Launched reduce tasks=4 Launched map tasks=39 Does the number of launched map tasks include failed tasks? I am using NLineInputFormat class as input format to manage the number of map tasks. However, I get slightly different numbers for exact sa...

Hadoop : Code shipped from master to slave

I launched a hadoop cluster and submitted a job to the master. The jar file is only contained in the master. Does hadoop ship the jar to all the slave machines at the start of the job? Is there a possibility that slave machine will run with previous version of code shipped during last run? Thank you Bala ...

Hadoop's Map-side join implements Hash join?

I try to implement Hash join in Hadoop. However, Hadoop seems to have already a map-side join and a reduce - side join already implemented. What is the difference between these techniques and hash join? ...

How does Hadoop perform input splits?

Hi, This is a conceptual question involving Hadoop/HDFS. Lets say you have a file containing 1 billion lines. And for the sake of simplicity, lets consider that each line is of the form <k,v> where k is the offset of the line from the beginning and value is the content of the line. Now, when we say that we want to run N map tasks, doe...

Debugging hadoop applications

Hi, I tried printing out values using System.out.println(), but they won't appear on the console. How do i print out the values in a map/reduce application for debugging purposes using Hadoop? Thanks, Deepak. ...

Java or Python distributed compute job (on a student budget)?

I have a large dataset (c. 40G) that I want to use for some NLP (largely embarrassingly parallel) over a couple of computers in the lab, to which i do not have root access, and only 1G of user space. I experimented with hadoop, but of course this was dead in the water-- the data is stored on an external usb hard drive, and i cant load it...

Hadoop/MapReduce: Reading and writing classes generated from DDL

Hi, Can someone walk me though the basic work-flow of reading and writing data with classes generated from DDL? I have defined some struct-like records using DDL. For example: class Customer { ustring FirstName; ustring LastName; ustring CardNo; long LastPurchase; } I've compiled this to get a Customer class ...