hadoop

Hadoop Data Persistence in which format?

Hi, I have some experience with Lucene, and I'm trying to understand how data is actually stored on a slave server in the Hadoop framework. Do we create an index on the slave server with a set of attributes to describe the document we are storing? How does it work in reality? Thanks R ...

When is it an overkill to use Hadoop?

I have an Oracle database (roughly 1.2 billion records, about 400 GB) with a web application sitting on top of it that generates queries (it generates SQL code and returns counts). Basically, you build SQL queries graphically through an AJAX UI, and it performs pretty well. I've been looking at...

Hadoop streaming grep does not work

Grep does not seem to work with Hadoop streaming. For: hadoop jar /usr/local/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-streaming.jar -input /user/root/tmp2/user.data -output /user/root/selected_data -mapper '/bin/grep 1938678460' -reducer 'wc' -jobconf mapred.output.compress=false I get: java.lang.RuntimeException: PipeMapRed.wait...

Hadoop MR source: HDFS vs HBase. Benefits of each?

If I understand the Hadoop ecosystem correctly, I can run my MapReduce jobs sourcing data from either HDFS or HBase. Assuming that is correct, why would I choose one over the other? Is there a benefit in performance, reliability, cost, or ease of use to using HBase as an MR source? The best I've been able to find is th...

Using MapReduce to map multiple unique values not always present on the same lines

I have run into a complex problem with MapReduce. I am trying to match up 2 unique values that are not always present together in the same line. Once I map those out, I need to count the total number of unique events for that mapping. The log files I am crunching are 100GB+ uncompressed and have data broken into 2 parts that I need to ...

How do I use Avro to process a stream that I cannot seek?

I am using Avro 1.4.0 to read some data out of S3 via the Python avro bindings and the boto S3 library. When I open an avro.datafile.DataFileReader on the file-like objects returned by boto, it immediately fails when it tries to seek(). For now I am working around this by reading the S3 objects into temporary files. I would like to be a...

How to run a Hadoop program?

I have set up Hadoop on my laptop and ran the example program given in the installation guide successfully. But, I am not able to run a program. rohit@renaissance1:~/hadoop/ch2$ hadoop MaxTemperature input/ncdc/sample.txt output Exception in thread "main" java.lang.NoClassDefFoundError: MaxTemperature Caused by: java.lang.ClassNotFoun...

Pig's Stream Through PHP

I have a Pig script--currently running in local mode--that processes a huge file containing a list of categories: /root/level1/level2/level3 /root/level1/level2/level3/level4 ... I need to insert each of these into an existing database by calling a stored procedure. Because I'm new to Pig and the UDF interface is a little daunting, I'...

Problem while executing hadoop code

I just started with Hadoop. I wrote sample Hadoop code as given in the book, but exceptions still arise during execution. A snippet of what I get: [harsh@geek hadoop-0.20.2]$ hadoop MaxTemperature input/ncdc/sample.txt output Exception in thread "main" java.lang.NoClassDefFoundError: MaxTemperature Caused by: jav...

Efficient set operations in MapReduce

I have inherited a MapReduce codebase which mainly calculates the number of unique user IDs seen over time for different ads. To me it doesn't look like it is being done very efficiently, and I would like to know if anyone has any tips or suggestions on how to do this kind of calculation as efficiently as possible in MapReduce. We use H...

Basic Hadoop Map/Reduce RunJar Question

I have a four-machine Hadoop cluster set up that I've verified works correctly using the bundled WordCount example running locally from the NameNode machine. I'm now starting to write my own MapReduce classes in Java, which I've bundled into a JAR with the necessary driver class that extends Configured and implements Tool. I'm trying to r...

Hadoop and MS SQL Server Best Practices

Hi, I've been following Hadoop for a while, and it seems like a great technology. Map/Reduce, clustering: it's just good stuff. But I haven't found any article regarding the use of Hadoop with SQL Server. Let's say I have a huge claims table (600 million rows) and I want to take advantage of Hadoop. I was thinking, but correct me if I'm ...

Read a long string into memory

Hi, I have a very large string, and when I read it in Java, I get an out-of-memory error. I need to read the whole string into memory, then split it into individual strings and sort them by value. What is the best way to do this? Thanks ...

Hadoop reducer string manipulation doesn't work

Hi, text manipulation in the Reduce phase does not seem to be working correctly. I suspect the problem is in my code rather than in Hadoop itself, but you never know... If you can spot any gotchas, let me know. I wasted a day trying to figure out what’s wrong with this code. My sample input file, called simple.psv: 12345 [email protected]|m|1975 12346 bbc@...

Distributed, error-handling copying of TBs of data

We have a box that has terabytes of data (10-20TB) each day, where each file on the drive is anywhere from megabytes to gigabytes. We want to send all these files to a set of 'pizza boxes', where they will consume and process the files. I can't seem to find anything that is built to handle this amount of data besides distcp (hadoop). R...

Hadoop 0.20.2 Eclipse plugin not fully functioning - can't 'Run on Hadoop'.

Hi folks, I've just finished installing Hadoop 0.20.2 under Cygwin on Windows 7 with Eclipse Helios (3.6). Hadoop is now fully started, and I'm trying to run a test application within a newly created MapReduce test project in Eclipse. I'm using the Hadoop 0.20.2 plugin from the Hadoop download. The Map/Reduce Location perspective opera...

Hadoop query regarding setJarByClass method of Job class.

In the Hadoop API documentation it's given that setJarByClass public void setJarByClass(Class cls) Set the Jar by finding where a given class came from. What exactly does this explanation mean? Does it create a JAR file from the class argument specified in the method above? And is that JAR file executed for the MapRe...

Pipelining Hadoop MapReduce jobs

Hi, I have five MapReduce jobs that I am running separately. I want to pipeline them all together, so that the output of one job goes to the next job. Currently, I use a shell script to execute them all. Is there a way to write this in Java? Please provide an example. Thanks ...

Using Hadoop map/reduce for programming language design course project

I need to design an exercise for my students in programming language design. My idea is to help them learn ideas from Lisp, ML and other functional languages by having them implement a MapReduce exercise with Hadoop. Are there any suggestions that would help me flesh out this idea? ...

Hadoop Installation: no jobtracker to stop no namenode to stop

Hadoop-0.20.2 Single Node Setup FAIL!!!! The jobtracker and namenode do not start :( Any suggestions would be welcome. As far as I know, I have set core-site.xml, hdfs-site.xml and mapred-site.xml correctly ...