I've been trying to use Hadoop to send N lines to a single map call. I don't require the lines to be split already.
I've tried to use NLineInputFormat, but that sends N lines of text from the data to each mapper one line at a time [giving up after the Nth line].
I have tried to set the option and it only takes N lines...
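To make the desired behaviour concrete, here is a minimal plain-Java sketch (no Hadoop APIs involved) of handing lines over in groups of N rather than one at a time. The class name `LineBatcher` and the method `batches` are hypothetical and only illustrate the grouping; this is not how NLineInputFormat works internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LineBatcher {
    // Splits the input into consecutive batches of at most n lines each.
    // The last batch may be shorter if the line count is not a multiple of n.
    static List<List<String>> batches(List<String> lines, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += n) {
            int end = Math.min(i + n, lines.size());
            // Copy the sublist so each batch is independent of the input list.
            out.add(new ArrayList<>(lines.subList(i, end)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "b", "c", "d", "e");
        // Each batch stands in for what a single call would receive.
        for (List<String> batch : batches(lines, 2)) {
            System.out.println(batch);
        }
    }
}
```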
I have a Pig program where I am trying to compute the minimum center between two bags. In order for it to work, I found I need to COGROUP the bags into a single dataset. The entire operation takes a long time. I want to either open one of the bags from disk within the UDF, or to be able to pass another relation into the UDF without ne...
I have a rather simple Hadoop question, which I'll try to present with an example.
Say you have a list of strings and a large file, and you want each mapper to process a piece of the file and one of the strings in a grep-like program.
How are you supposed to do that? I am under the impression that the number of mappers is a result of the ...
I am trying to use Mahout in an application running on Windows. I want to build clusters from a Lucene index using k-means.
As soon as I have to create sequence files (creating vectors from a Lucene index), I get a Hadoop exception, since Hadoop makes command-line calls to programs unknown in a Windows environment (e.g. chmod). Running ...
How do I read a Lucene index directory stored on HDFS, i.e. how do I get an IndexReader for an index stored on HDFS? The IndexReader is to be opened in a map task.
Something like: IndexReader reader = IndexReader.open("hdfs/path/to/index/directory");
Thanks,
Akhil
...
A bit of a binary question (okay, not exactly), but I was wondering whether one can configure Cloudera / Hadoop to run on the nodes without root shell access to the node computers (although I can set up passwordless SSH login)?
It appears from their instructions that root access is needed, and yet I found a Hadoop wiki which suggests root ac...
We are running Hadoop on an Amazon EC2 cluster. We start the master and slaves, attach the EBS volumes, and finally wait for the Hadoop jobtracker, tasktracker, etc. to start, with a timeout of 3600 seconds. We are noticing that 50% of the time the jobtracker is not able to start before the timeout. The reason is that HDFS is not initialized proper...
Hi,
I am new to Hadoop.
I have a file WordCount.java which refers to hadoop.jar and stanford-parser.jar.
I am running the following commands:
javac -classpath .:hadoop-0.20.1-core.jar:stanford-parser.jar -d ep WordCount.java
jar cvf ep.jar -C ep .
bin/hadoop jar ep.jar WordCount gutenburg gutenburg1
After executing, I am getting the f...
Looking at http://www.nearmap.com/,
Just wondering if you can approximate how much storage is needed to store the images?
(NearMap’s monthly city PhotoMaps are captured at 3cm, 5cm, 7.5cm, or 10cm resolution)
And what kind of systems/architecture is suitable to deliver those data/images?
(say you are not Google, and want to implement t...
Hi,
We are running our cluster on Amazon EC2. We are using the Cloudera scripts to set up Hadoop. On the master node, we start the services below.
$AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start namenode'
$AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start secondarynamenode'
$AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon...
I am currently trying to perform calculations like clustering coefficient on huge graphs with the help of Hadoop. Therefore I need an efficient way to store the graph in a way that I can easily access nodes, their neighbors and the neighbors' neighbors. The graph is quite sparse and stored in a huge tab separated file where the first fie...
I'm interested in learning techniques for distributed computing. As a Java developer, I'm probably willing to start with Hadoop. Could you please recommend some books/tutorials/articles to begin with?
...
Hi,
We are launching a Hadoop cluster on Amazon EC2, and recently we have been having network issues, such as the master being unable to connect to a slave. We thought the reason was Amazon throttling network connections over a limit. So we tried to establish a connection after a random delay from each slave node, but that didn't help.
Are there an...
Hi,
In the "syslog" for a MapReduce job flow step, I see the following:
Job Counters
Launched reduce tasks=4
Launched map tasks=39
Does the number of launched map tasks include failed tasks?
I am using NLineInputFormat class as input format to manage the number of map tasks.
However, I get slightly different numbers for exact sa...
I launched a Hadoop cluster and submitted a job to the master. The jar file exists only on the master. Does Hadoop ship the jar to all the slave machines at the start of the job? Is there a possibility that a slave machine will run a previous version of the code, shipped during the last run?
Thank you
Bala
...
I am trying to implement a hash join in Hadoop.
However, Hadoop seems to have a map-side join and a reduce-side join already implemented.
What is the difference between these techniques and hash join?
...
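For reference, the textbook in-memory hash join that the question mentions can be sketched in plain Java as a build phase followed by a probe phase. This is only an illustration of the general technique, not Hadoop's map-side or reduce-side join code; the class `HashJoinSketch`, the method `hashJoin`, and the `{key, value}` row layout are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashJoinSketch {
    // Inner-joins two inputs of {key, value} rows on the key.
    // `small` is the build side (assumed to fit in memory), `large` is the probe side.
    static List<String> hashJoin(List<String[]> small, List<String[]> large) {
        // Build phase: index the smaller input by join key.
        Map<String, List<String>> table = new HashMap<>();
        for (String[] row : small) {
            table.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        // Probe phase: stream the larger input and look up each key in the table.
        List<String> out = new ArrayList<>();
        for (String[] row : large) {
            List<String> matches = table.get(row[0]);
            if (matches == null) {
                continue; // no partner on the build side, so the row is dropped
            }
            for (String m : matches) {
                out.add(row[0] + "\t" + m + "\t" + row[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> users = Arrays.asList(
                new String[]{"1", "alice"}, new String[]{"2", "bob"});
        List<String[]> orders = Arrays.asList(
                new String[]{"2", "book"}, new String[]{"3", "pen"}, new String[]{"2", "lamp"});
        for (String line : hashJoin(users, orders)) {
            System.out.println(line);
        }
    }
}
```

The sketch assumes the build side fits in memory; when neither input does, both sides would first have to be partitioned by key.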
Hi,
This is a conceptual question involving Hadoop/HDFS. Let's say you have a file containing 1 billion lines. And for the sake of simplicity, let's consider that each line is of the form <k,v>, where k is the offset of the line from the beginning and v is the content of the line.
Now, when we say that we want to run N map tasks, doe...
Hi,
I tried printing out values using System.out.println(), but they won't appear on the console. How do I print out the values in a map/reduce application for debugging purposes using Hadoop?
Thanks,
Deepak.
...
I have a large dataset (c. 40G) that I want to use for some NLP (largely embarrassingly parallel) over a couple of computers in the lab, to which I do not have root access, and only 1G of user space.
I experimented with Hadoop, but of course this was dead in the water-- the data is stored on an external USB hard drive, and I can't load it...
Hi,
Can someone walk me through the basic workflow of reading and writing data with classes generated from DDL?
I have defined some struct-like records using DDL. For example:
class Customer {
ustring FirstName;
ustring LastName;
ustring CardNo;
long LastPurchase;
}
I've compiled this to get a Customer class ...