hadoop

Is it possible to run Hadoop in Pseudo-Distributed operation without HDFS?

I'm exploring the options for running a Hadoop application on a local system. As with many applications, the first few releases should be able to run on a single node, as long as we can use all the available CPU cores (yes, this is related to this question). The current limitation is that on our production systems we have Java 1.5 and as...
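
A minimal sketch of the usual configuration answer, assuming the pre-0.21 property names: point the default filesystem at the local FS while keeping a real JobTracker/TaskTracker pair so tasks still run in parallel across cores. The same two values can equally live in the XML config files.

```java
import org.apache.hadoop.mapred.JobConf;

public class LocalFsJobSetup {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Use the local filesystem: no NameNode/DataNode needed.
    conf.set("fs.default.name", "file:///");
    // Keep a real JobTracker so tasks run in parallel across cores;
    // "local" here would select the single-threaded LocalJobRunner instead.
    conf.set("mapred.job.tracker", "localhost:9001");
  }
}
```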

Merging multiple files into one within Hadoop

Hello, I get multiple small files into my input directory which I want to merge into a single file, without using the local file system or writing mapreds. Is there a way I could do it using hadoop fs commands or Pig? Thanks! ...
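
One possibility, sketched below on the assumption that the inputs and the merged result both live on HDFS: the FileUtil.copyMerge helper concatenates a directory's files into one target file without a round trip through the local filesystem. The /input and /merged/output paths are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Concatenate every file under /input into a single /merged/output file,
    // entirely on HDFS.
    FileUtil.copyMerge(fs, new Path("/input"),
                       fs, new Path("/merged/output"),
                       false, // keep the source files
                       conf,
                       null); // no separator string between files
  }
}
```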

Can I set task memory limit higher than 2GB

Hadoop map-reduce configuration provides the mapred.task.limit.maxvmem and mapred.task.default.maxvmem. According to the documentation, both of these are values of type long, that is, a number, in bytes, that represents the default/upper VMEM task-limit associated with a task. It appears that the meaning of "long" in this context is 32-bit and se...
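
For what it's worth, Java's long is 64-bit, so nothing in the property type itself caps the value at 2 GB; whether the runtime enforces a lower cap is a separate question. A small sketch using the property names from the question:

```java
import org.apache.hadoop.mapred.JobConf;

public class VmemSettings {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // The L suffix matters: 3 * 1024 * 1024 * 1024 overflows int arithmetic,
    // which is one easy way to end up "stuck" below 2 GB.
    conf.setLong("mapred.task.default.maxvmem", 3L * 1024 * 1024 * 1024); // 3 GB
    conf.setLong("mapred.task.limit.maxvmem",   4L * 1024 * 1024 * 1024); // 4 GB
  }
}
```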

How do you use a custom comparator with SingleColumnValueFilter on HBase?

I am trying to filter rows from an HBase table using two SingleColumnValueFilter objects to bring back all records that fall within a range of long values for the column. According to the documentation for SingleColumnValueFilter, it does a lexicographic compare of the column value unless you pass it your own comparator. The API shows...
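
A sketch of the range-filter setup, assuming an HBase 0.20-era API and hypothetical cf/qual column names. BinaryComparator compares the raw big-endian bytes of Bytes.toBytes(long), which matches numeric order only while all stored values are non-negative; once negative values enter the picture, a custom comparator really is necessary.

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class LongRangeScan {
  public static Scan buildScan(long min, long max) {
    SingleColumnValueFilter lower = new SingleColumnValueFilter(
        Bytes.toBytes("cf"), Bytes.toBytes("qual"),   // hypothetical column
        CompareOp.GREATER_OR_EQUAL, new BinaryComparator(Bytes.toBytes(min)));
    SingleColumnValueFilter upper = new SingleColumnValueFilter(
        Bytes.toBytes("cf"), Bytes.toBytes("qual"),
        CompareOp.LESS_OR_EQUAL, new BinaryComparator(Bytes.toBytes(max)));

    // Both bounds must pass for a row to be returned.
    FilterList range = new FilterList(FilterList.Operator.MUST_PASS_ALL);
    range.addFilter(lower);
    range.addFilter(upper);

    Scan scan = new Scan();
    scan.setFilter(range);
    return scan;
  }
}
```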

Hadoop ToolRunner fails with NoClassDefFoundError

I am brand new to Linux, Java, and Hadoop. I have created a simple MapReduce driver that implements the Tool interface. But when I try to run the job in Eclipse, I get a NoClassDefFoundError before the run() method is invoked. I am running Hadoop 0.20.2 on Ubuntu 10.04 LTS. The source code and stack trace are provided below. Any...
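
For comparison, the canonical Tool/ToolRunner skeleton looks roughly like this; a NoClassDefFoundError thrown before run() usually means a jar that was on the compile classpath is missing from the Eclipse run configuration, rather than a bug in the driver itself.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Minimal driver skeleton implementing the Tool interface.
public class MyDriver extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // job setup would go here
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner parses generic options (-D, -files, ...) before calling run().
    int exitCode = ToolRunner.run(new Configuration(), new MyDriver(), args);
    System.exit(exitCode);
  }
}
```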

Hadoop run with Java class: cannot make it run successfully

I am following the book Hadoop: The Definitive Guide and am confused by an example (Example 3-1). There is a Java file (class URLCat); I use javac to compile it into URLCat.class, then use jar to package it into a jar. The book says to run it with % hadoop URLCat hdfs://localhost/user/tom/quangle.txt ... But I have tried a lot of different ways, such as ...
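
The example class, reconstructed approximately from memory of the book; the hadoop script can only find it if the directory (or jar) holding URLCat.class is on the classpath, e.g. via the HADOOP_CLASSPATH environment variable, before running % hadoop URLCat hdfs://...

```java
import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

// Reads a file from HDFS through java.net.URL and copies it to stdout.
public class URLCat {
  static {
    // Teach java.net.URL about hdfs:// URLs (can only be set once per JVM).
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
  }

  public static void main(String[] args) throws Exception {
    InputStream in = null;
    try {
      in = new URL(args[0]).openStream();
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
```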

Getting started with MapReduce/Hadoop

Hi, lately I have been reading a lot about MapReduce/Hadoop and think this is where the industry is currently moving. I want to start learning MapReduce/Hadoop and thought the best way to start would be to implement some small project. However, I tried to do some googling but couldn't find anything. Can you guys give me some links or ma...

Sorting large data using MapReduce/Hadoop

Hi, I am reading about MapReduce and the following thing is confusing me. Suppose we have a file with 1 million entries (integers) and we want to sort them using MapReduce. The way I understood to go about it is as follows: write a mapper function that sorts integers. So the framework will divide the input file into multiple chunks and...
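
A clarifying sketch: the mapper does not have to sort anything, because the framework's shuffle phase sorts all intermediate keys before they reach the reducers. The class names here are hypothetical.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: turn each line into an integer key; the shuffle does the sorting.
public class SortMapper extends Mapper<LongWritable, Text, IntWritable, NullWritable> {
  public void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    ctx.write(new IntWritable(Integer.parseInt(line.toString().trim())),
              NullWritable.get());
  }
}

// Reducer: keys arrive already sorted, so just write them back out.
class SortReducer extends Reducer<IntWritable, NullWritable, IntWritable, NullWritable> {
  public void reduce(IntWritable key, Iterable<NullWritable> vals, Context ctx)
      throws IOException, InterruptedException {
    for (NullWritable v : vals) ctx.write(key, NullWritable.get()); // keep duplicates
  }
}
```

With a single reducer the output is totally ordered; with several reducers, something like TotalOrderPartitioner is needed to make the concatenated per-reducer outputs globally ordered.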

Variable/looping sequence of jobs

I'm considering using Hadoop/MapReduce to tackle a project and haven't quite figured out how to set up a job flow consisting of a variable number of levels that should be processed in sequence. E.g.: Job 1: map source data into X levels. Job 2: MapReduce Level1 -> appends to Level2. Job 3: MapReduce Level2 -> appends to LevelN. Job N: Ma...
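
One common pattern, sketched with hypothetical paths and a level count passed in at runtime: a plain driver loop that submits one blocking job per level and feeds each job's output directory to the next.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class LevelDriver {
  public static void main(String[] args) throws Exception {
    int levels = Integer.parseInt(args[0]); // number of passes, decided at runtime
    String input = "levels/level1";         // hypothetical starting directory
    for (int i = 1; i <= levels; i++) {
      JobConf conf = new JobConf(LevelDriver.class);
      conf.setJobName("level-" + i);
      // Each pass reads what the previous pass wrote.
      FileInputFormat.setInputPaths(conf, new Path(input));
      String output = "levels/level" + (i + 1);
      FileOutputFormat.setOutputPath(conf, new Path(output));
      JobClient.runJob(conf); // blocks until this level finishes
      input = output;
    }
  }
}
```

A data-driven stopping condition can be built the same way by inspecting the finished job's counters instead of using a fixed count.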

Adding multiple files to Hadoop distributed cache?

Hi, I am trying to add multiple files to the Hadoop distributed cache. Actually I don't know the file names; they will be named like part-0000*. Can someone tell me how to do that? Thanks, Bala ...
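
A sketch of one approach, assuming the part files sit in a single (hypothetical) directory: expand the glob with FileSystem.globStatus at job-setup time and cache each match individually.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CachePartFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Expand the glob now, so the exact file names never need to be known.
    FileStatus[] parts =
        fs.globStatus(new Path("/prev/job/output/part-0000*")); // hypothetical dir
    for (FileStatus part : parts) {
      DistributedCache.addCacheFile(part.getPath().toUri(), conf);
    }
  }
}
```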

does anyone find Cascading for Hadoop Map Reduce useful?

I've been trying Cascading, but I cannot see any advantage over the classic MapReduce approach for writing jobs. MapReduce jobs give me more freedom, and Cascading seems to be putting a lot of obstacles in the way. It might do a good job of making simple things simple, but complex things... I find them extremely hard. Is there something I'm mi...

Libraries/Tools for Website Parsing

I would like to start working on parsing large numbers of raw HTML pages into semantic data structures. I'm just interested in the community's opinion on the tools available for such a task, particularly useful libraries in any language. So far, I'm planning on using Hadoop to manage a lot of the processing, but am curious about alter...

Hadoop and 3d Rendering of images

Hi, I have to build a project on distributed rendering of a 3D image. I can use standard algorithms. The aim is to learn Hadoop, not image processing. So can anyone suggest which language I should use, C++ or Java, and a standard implementation of a 3D renderer? Any other help would be highly useful. ...

Idle hadoop master - how to make it do some work?

Hi, I have launched a small cluster of two nodes and noticed that the master stays completely idle while the slave does all the work. I was wondering how to let the master run some of the tasks. I understand that for a larger cluster a dedicated master may be necessary, but on a 2-node cluster it seems like overkill. Thanks...
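
For reference, the usual fix is purely configuration: list the master host in conf/slaves as well, so the start scripts launch a DataNode and TaskTracker there too. Hostnames below are hypothetical.

```
# conf/slaves - every host listed here runs a DataNode and a TaskTracker;
# adding the master makes it execute tasks alongside its master duties
master
slave1
```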

How to copy files from HDFS to S3 effectively and programmatically

Hi, my Hadoop job generates a large number of files on HDFS, and I want to write a separate thread which will copy these files from HDFS to S3. Could anyone point me to a Java API that handles it? Thanks ...
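
A sketch using only the generic FileSystem API, on the assumption that the s3n:// native S3 filesystem is acceptable; all hostnames, buckets, and paths are placeholders. (The distcp tool covers the same ground from the command line.)

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class HdfsToS3 {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // s3n:// stores one S3 object per file; credentials go in the conf.
    conf.set("fs.s3n.awsAccessKeyId", "YOUR_KEY");        // placeholder
    conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET"); // placeholder

    FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
    FileSystem s3 = FileSystem.get(URI.create("s3n://my-bucket/"), conf);

    // Recursively copy the job output directory up to S3.
    FileUtil.copy(hdfs, new Path("/jobs/output"),         // hypothetical source
                  s3, new Path("s3n://my-bucket/output"),
                  false, // don't delete the source
                  conf);
  }
}
```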

Hadoop: basic question regarding input to the mapper function

Hi, we can provide input files to the mapper with FileInputFormat.setInputPaths(conf, inputPath); Is it possible to pass a reference to memory, say a DOM tree constructed using a DOM parser after parsing an XML file, as an input to the mapper function of the Hadoop framework? What other possibilities are there? Thanks, L ...
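
Mapper tasks run in separate JVMs, often on other machines, so an in-memory reference cannot be handed across directly. One workaround for a small document, sketched with a hypothetical configuration key, is to ship the raw XML through the job configuration and re-parse it in configure(); for anything larger, writing it to HDFS or the DistributedCache is the usual route.

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

// Rebuilds a DOM in each task from XML stashed in the job configuration.
public class XmlAwareMapper extends MapReduceBase {
  private org.w3c.dom.Document dom;

  @Override
  public void configure(JobConf job) {
    try {
      String xml = job.get("my.xml.payload"); // hypothetical key, set by the driver
      dom = javax.xml.parsers.DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new java.io.ByteArrayInputStream(xml.getBytes("UTF-8")));
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
  // map(...) would consult the rebuilt DOM alongside the normal file input.
}
```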

Hadoop task schedulers: Capacity vs Fair sharing or something else?

Background: My employer is progressively shifting our resource-intensive ETL and backend processing logic from MySQL to Hadoop (DFS & Hive). At the moment everything is still somewhat small and manageable (20 TB over 10 nodes), but we intend to progressively increase the cluster size. Now that Hadoop is being shifted into production...

Create a hadoop jar with external dependencies using Gradle

How do I create a Hadoop jar that includes all dependencies in the lib folder using Gradle? Basically, something similar to what fatjar does. ...
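
A sketch of one approach, assuming a Groovy-DSL build.gradle of that era: hadoop jar adds any jars found under lib/ inside the job jar to the task classpath, so copying the runtime configuration there gives the fatjar-like effect.

```groovy
// build.gradle sketch: bundle runtime dependencies under lib/ inside the jar,
// which is where "hadoop jar" looks for bundled libraries
jar {
    into('lib') {
        from configurations.runtime
    }
}
```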

hadoop + one key to every reducer

Is there a way in Hadoop to ensure that every reducer gets only one key that is output by the mapper? ...
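
There is no built-in guarantee, but when the set of keys is known in advance, a custom Partitioner can pin each key to its own reducer, provided the job is configured with at least that many reduce tasks. A sketch with hypothetical keys:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Routes each distinct key to its own reducer. Requires numReduceTasks to be
// at least the number of distinct keys; the default HashPartitioner can always
// send two keys to the same reducer.
public class OneKeyPerReducer implements Partitioner<Text, Text> {
  public void configure(JobConf job) { }

  public int getPartition(Text key, Text value, int numReduceTasks) {
    String[] keys = { "apple", "banana", "cherry" }; // hypothetical fixed key set
    for (int i = 0; i < keys.length; i++) {
      if (keys[i].equals(key.toString())) return i % numReduceTasks;
    }
    return 0; // unknown keys fall back to reducer 0
  }
}
```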

hadoop + Writable interface + readFields throws an exception in reducer

I have a simple map-reduce program in which my map and reduce primitives look like this: map(K,V) = (Text, OutputAggregator), reduce(Text, OutputAggregator) = (Text, Text). The important point is that from my map function I emit an object of type OutputAggregator, which is my own class that implements the Writable interface. However, my red...
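
Two classic causes of this, shown in the sketch below (a hypothetical reconstruction of such a class): Hadoop instantiates Writables by reflection, so a public no-arg constructor is required, and it reuses a single instance across values, so readFields must reset all state before reading.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Writable;

public class OutputAggregator implements Writable {
  private List<Long> counts = new ArrayList<Long>();

  public OutputAggregator() { } // required: Hadoop constructs this by reflection

  public void write(DataOutput out) throws IOException {
    out.writeInt(counts.size()); // length prefix so readFields knows how much to read
    for (long c : counts) out.writeLong(c);
  }

  public void readFields(DataInput in) throws IOException {
    counts.clear(); // reset: the framework reuses one instance per value
    int n = in.readInt();
    for (int i = 0; i < n; i++) counts.add(in.readLong());
  }
}
```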