hadoop

Running periodic Hadoop jobs (best practice)

Customers can upload URLs to the database at any time, and the application should process those URLs as soon as possible. So I need periodic Hadoop jobs running, or a way to run a Hadoop job automatically from another application (some script identifies that new links were added, generates the data for the Hadoop job, and runs the job). For a PHP or Python script, I could set up cr...
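
One option besides cron is a small long-running driver that polls and resubmits. Below is a minimal sketch using the old org.apache.hadoop.mapred API; the 15-minute interval, job name, and elided job configuration are all assumptions, not the asker's setup:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class PeriodicJobRunner {
        public static void main(String[] args) {
            ScheduledExecutorService scheduler =
                    Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleWithFixedDelay(new Runnable() {
                public void run() {
                    try {
                        // Build and submit the Hadoop job; mapper, reducer and
                        // input/output paths are application-specific.
                        JobConf conf = new JobConf(PeriodicJobRunner.class);
                        conf.setJobName("process-new-urls");
                        // ... point the input path at wherever new URLs land ...
                        JobClient.runJob(conf); // blocks until the job finishes
                    } catch (Exception e) {
                        e.printStackTrace(); // keep the scheduler alive on failure
                    }
                }
            }, 0, 15, TimeUnit.MINUTES); // poll for new work every 15 minutes
        }
    }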

How to combine multiple Hadoop MapReduce Jobs into one?

I have a massive amount of input data (that's why I use Hadoop), and there are multiple tasks that can be solved with various MapReduce steps, of which the first mapper needs all the data as input. My goal: compute these different tasks as fast as possible. I currently run them sequentially, each reading in all the data. I assume it ...
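
One common pattern for avoiding the repeated read (not necessarily what the asker ended up doing) is a single job whose mapper fans each record out to several named outputs, so the input is scanned once. A sketch with the old-API MultipleOutputs; the task names "taskA"/"taskB" and the pass-through record handling are placeholders:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.TextOutputFormat;
    import org.apache.hadoop.mapred.lib.MultipleOutputs;

    public class MultiTaskJob {
        // Mapper: the input is read once and fanned out to both tasks.
        public static class MultiTaskMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private MultipleOutputs mos;
            private final IntWritable one = new IntWritable(1);

            public void configure(JobConf job) {
                mos = new MultipleOutputs(job);
            }

            @SuppressWarnings("unchecked")
            public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
                    throws IOException {
                mos.getCollector("taskA", reporter).collect(value, one);
                mos.getCollector("taskB", reporter).collect(value, one);
            }

            public void close() throws IOException {
                mos.close(); // flush the named outputs
            }
        }

        // Driver side: declare one named output per logical task.
        public static void configureOutputs(JobConf conf) {
            MultipleOutputs.addNamedOutput(conf, "taskA",
                    TextOutputFormat.class, Text.class, IntWritable.class);
            MultipleOutputs.addNamedOutput(conf, "taskB",
                    TextOutputFormat.class, Text.class, IntWritable.class);
        }
    }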

Sequence file name being used as key in Hadoop output?

I'm trying to use Dumbo/Hadoop to calculate TF-IDF for a bunch of small text files, following this example: http://dumbotics.com/2009/05/17/tf-idf-revisited/ To improve efficiency, I've packaged the text files into a sequence file using Stuart Sierra's tool (http://stuartsierra.com/2008/04/24/a-million-little-files). The sequence file uses m...
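
When debugging this kind of issue, it can help to dump the sequence file and confirm what the keys actually contain before Dumbo sees them. A minimal sketch in Java; the path and the Text/Text key-value types are assumptions based on how Stuart Sierra's tool is described:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SeqFileDump {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Hypothetical path to the packed sequence file.
            SequenceFile.Reader reader =
                    new SequenceFile.Reader(fs, new Path("/input/docs.seq"), conf);
            try {
                Text key = new Text();   // presumably the original file name
                Text value = new Text(); // presumably the file contents
                while (reader.next(key, value)) {
                    System.out.println(key + " => " + value.getLength() + " bytes");
                }
            } finally {
                reader.close();
            }
        }
    }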

How to merge 2 bzip2'ed files?

Hello, I want to merge two bzip2'ed files. I tried appending one to the other: cat file1.bzip2 file2.bzip2 > out.bzip2 which seems to work (the file decompresses correctly), but I want to use the result as a Hadoop input file, and I get errors about corrupted blocks. What's the best way to merge two bzip2'ed files without decompressing them? ...
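
For what it's worth, the bzip2 format itself tolerates concatenated streams (which is why the file decompresses fine), but Hadoop's bzip2 handling of that era appears not to. If re-compression turns out to be acceptable after all, here is a minimal decompress-and-recompress sketch using Hadoop's BZip2Codec, with hypothetical paths:

    import java.io.InputStream;
    import java.io.OutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.io.compress.CompressionCodec;

    public class MergeBzip2 {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            CompressionCodec codec = new BZip2Codec();
            // Write one continuous bzip2 stream covering both inputs.
            OutputStream out =
                    codec.createOutputStream(fs.create(new Path("/data/merged.bzip2")));
            try {
                for (String name : new String[] {"/data/file1.bzip2", "/data/file2.bzip2"}) {
                    InputStream in = codec.createInputStream(fs.open(new Path(name)));
                    try {
                        IOUtils.copyBytes(in, out, conf, false); // false: keep out open
                    } finally {
                        in.close();
                    }
                }
            } finally {
                out.close();
            }
        }
    }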

Tools for optimizing scalability of a Hadoop application?

I'm working with my team on a small application that takes a lot of input (a day's logfiles) and produces useful output after several (now 4, in the future perhaps 10) MapReduce steps (Hadoop & Java). Now I've done a partial POC of this app and run it on 4 old desktops (my Hadoop test cluster). What I've noticed is that if you ...

Making graphs of Hadoop runs

On some websites (like in this PDF: http://sortbenchmark.org/Yahoo2009.pdf) I see very nice graphs that visualize what a Hadoop cluster is doing at what moment. Were these made "manually" (i.e. with some homemade tool), or is there a "ready to run" script/tool that produces something like this for me? ...

What do you recommend for a Hadoop book?

I've started getting into technology books to read. I want to learn Hadoop, and I find that I enjoy just reading books rather than staring at a computer screen all the time. I've found two books: "Hadoop: The Definitive Guide" and "Pro Hadoop". I could go by the Amazon reviews, but I was wondering what the community thought. If there's anothe...

Hadoop NullPointerException

import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;
import java.awt.image.Raster;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.imageio.ImageIO;
import javax.xml.soap.Text;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public c...

Even data distribution on Hadoop/Hive

I am trying a small Hadoop setup (for experimentation) with just 2 machines. I am loading about 13GB of data, a table of around 39 million rows, with a replication factor of 1 using Hive. My problem is that Hadoop always stores all this data on a single datanode. Only if I change the dfs_replication factor to 2 using setrep, hadoop copies dat...
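
As background, with replication 1 HDFS writes each block to the local datanode when the writing client runs on one, which is one plausible reason everything lands on a single machine here. The setrep change mentioned above can also be done from Java; a sketch, with the file path as an assumption:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetRep {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Equivalent of 'hadoop fs -setrep 2 <path>': schedules the
            // extra block copies onto the other datanode.
            boolean ok = fs.setReplication(
                    new Path("/user/hive/warehouse/mytable/data.txt"), (short) 2);
            System.out.println("replication change accepted: " + ok);
        }
    }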

Hadoop map/reduce output file (part-00000) and distributed cache

Hi friends, the value output from my map/reduce is a BytesWritable array, which is written to the output file part-00000 (Hadoop does so by default). I need this array for my next map function, so I wanted to keep it in the distributed cache. Can somebody tell me how I can read from the output file (part-00000), which may not be a text file, and st...
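
A sketch of one way to do this, assuming the previous job wrote a SequenceFile (via SequenceFileOutputFormat) with Text keys and BytesWritable values; the path, the key type, and what happens to the bytes are all assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;

    public class ReadPartFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path part = new Path("/output/part-00000"); // hypothetical path
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
            try {
                Text key = new Text();
                BytesWritable value = new BytesWritable();
                while (reader.next(key, value)) {
                    // Copy out exactly the valid bytes (the backing buffer
                    // may be larger than the logical length).
                    byte[] bytes = new byte[value.getLength()];
                    System.arraycopy(value.getBytes(), 0, bytes, 0, value.getLength());
                    // ... keep 'bytes' for the next map function ...
                }
            } finally {
                reader.close();
            }
            // Register the same file for the next job's distributed cache.
            JobConf nextJob = new JobConf();
            DistributedCache.addCacheFile(part.toUri(), nextJob);
        }
    }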

Which class that looks like MapWritable can be used as the Key in a Hadoop MapReduce program?

I'm writing a program in Java using Hadoop 0.18. Now I would like to use a Map (HashMap/TreeMap/...) kind of data structure as the key in my MapReduce processing. I haven't yet been able to find an official Hadoop class that is essentially a MapWritableComparable (i.e. implements Map, Writable and Comparable). So for my first tests I ...
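
One stopgap for the missing MapWritableComparable is to subclass MapWritable and bolt on an ordering. A sketch where the compareTo logic is purely illustrative; a real key needs a total order consistent with equals, or sorting and partitioning will misbehave:

    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.WritableComparable;

    // Reuses MapWritable's (de)serialization; only the ordering is new.
    public class ComparableMapWritable extends MapWritable
            implements WritableComparable<ComparableMapWritable> {

        public int compareTo(ComparableMapWritable other) {
            // Illustrative only: order by size, then by entry-set hash.
            if (size() != other.size()) {
                return size() < other.size() ? -1 : 1;
            }
            int a = entrySet().hashCode();
            int b = other.entrySet().hashCode();
            return a == b ? 0 : (a < b ? -1 : 1);
        }
    }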

Map-Side Join Algorithm for MapReduce

Hi everyone, I am trying to use Hadoop's map-side join using CompositeInputFormat, but I get an IOException: "Unmatched ')'". I guess there may be a problem in the format of my input file. I have formatted the input files manually so that the keys are in sorted order in both input files. Is this correct, or do I have to pas...
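
The "Unmatched ')'" IOException usually points at the join expression in mapred.join.expr rather than at the data itself; letting compose() build the expression sidesteps hand-written parenthesis mistakes. A sketch of the driver side, with the input format and paths as assumptions:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;
    import org.apache.hadoop.mapred.join.CompositeInputFormat;

    public class JoinDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(JoinDriver.class);
            conf.setInputFormat(CompositeInputFormat.class);
            // compose() generates the join expression string, so the
            // parentheses are guaranteed to balance; paths are hypothetical.
            conf.set("mapred.join.expr", CompositeInputFormat.compose(
                    "inner", KeyValueTextInputFormat.class,
                    new Path("/data/left"), new Path("/data/right")));
            // ... set mapper/reducer and output path, then submit ...
        }
    }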

Where does the Hadoop MapReduce framework send my System.out.print() statements? (stdout)

I want to debug a MapReduce script and, without going to much trouble, tried to put some print statements in my program. But I can't seem to find them in any of the logs. ...
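
In short: System.out from a task does not go to the console that submitted the job. Each task attempt's stdout is captured separately (browsable from the task's page in the JobTracker web UI, and stored under logs/userlogs on the node that ran the task). Counters can be a quicker sanity check because they surface directly on the job page; a tiny old-API fragment, where the group and counter names are made up:

    // Inside map() or reduce(), using the old org.apache.hadoop.mapred API;
    // the counter value appears on the job's page in the JobTracker web UI.
    reporter.incrCounter("Debug", "recordsSeen", 1);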

Distributed cache

I am working with Hadoop 19 on openSUSE Linux. I am not using any cluster; rather, I am running my Hadoop code on my machine itself. I am following the standard technique for putting files in the distributed cache, but instead of accessing the files from the distributed cache again and again, I stored the contents of the file in an array. This part of extra...
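
Reading the cached file once per task, in configure(), rather than once per record is the usual pattern. A sketch with hypothetical input/output types, assuming a single cached text file:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class CachingMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        private final List<String> cachedLines = new ArrayList<String>();

        public void configure(JobConf conf) {
            try {
                // Runs once per task: load the cached file into memory.
                Path[] cached = DistributedCache.getLocalCacheFiles(conf);
                BufferedReader in =
                        new BufferedReader(new FileReader(cached[0].toString()));
                String line;
                while ((line = in.readLine()) != null) {
                    cachedLines.add(line);
                }
                in.close();
            } catch (IOException e) {
                throw new RuntimeException("could not read distributed cache", e);
            }
        }

        public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            // map() can now consult cachedLines without any file I/O.
        }
    }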

Suggestions for a scalable architecture solution to large data problem

Hi folks, I am in the process of building/architecting a business social-network web application that has a component I think will lead to major scalability issues, and I'd like to get some feedback/thoughts on the best way forward. The application has a User object. The idea is that every time a new user joins the system, he ranks...

FileNotFoundException when using Hadoop distributed cache

This time, someone should please reply; I am struggling with running my code using the distributed cache. I already have the files on HDFS, but when I run this code:
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;
import java.awt.image.Raster;
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
impor...
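
A FileNotFoundException here often means the file was never registered on the JobConf before the job was submitted, or was registered with a local path instead of its HDFS URI. A sketch of the registration step in the driver, with the class and path as assumptions:

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheJobDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(CacheJobDriver.class);
            // Register the HDFS file *before* submitting the job.
            DistributedCache.addCacheFile(new URI("/user/me/model.dat"), conf);
            // ... configure mapper, input and output paths, then:
            JobClient.runJob(conf);
        }
    }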

Urgent attention required - Hadoop: BufferedImage and ConvolveFilter (JHLabs)

Sorry to disturb again, but I like learning here. I am using the JHLabs library of filters for buffered images. On running my code I am getting this exception:
java.lang.ArrayIndexOutOfBoundsException: 4
    at com.jhlabs.image.ConvolveFilter.convolveHV(ConvolveFilter.java:175)
    at com.jhlabs.image.ConvolveFilter.convolve(ConvolveFilter.j...

Hadoop Reduce Error

Hi! I keep getting "Exceeded MAX_FAILED_UNIQUE_FETCHES" in the reduce phase, even though I have tried all the solutions I could find online. Please help me; I have a project presentation in three hours and my solution doesn't scale. I have one master that is the NameNode and JobTracker (172.16.8.3) and 3 workers (172.16.8.{11, 12, 13}). Here are...

HadoopDB Java program

Hi Felix Kling, first of all thanks for showing interest. I'm Adarsh Sharma, presently working on Hadoop technologies such as Hive, Hadoop, HadoopDB, HBase, etc. I have configured HadoopDB on a Hadoop cluster of 3 nodes with Postgres as the database layer. I load a table website_master containing 3 columns into HadoopDB in chunked form...

How to use custom pool assignment for FairScheduler in Hadoop?

I am trying to take advantage of multiple pools in the FairScheduler, but all my jobs are submitted by a single agent process and therefore all belong to the same user. I have set mapred.fairscheduler.poolnameproperty to scheduler.pool.name, and then in each job I set "scheduler.pool.name" to a specific pool from pools.xml that I want to use for...
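
Assuming mapred.fairscheduler.poolnameproperty is set to scheduler.pool.name on the JobTracker side, the per-job pool selection would then look like the sketch below; the pool name is a placeholder that must match an entry in pools.xml:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class PooledJobDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(PooledJobDriver.class);
            // The property name must match mapred.fairscheduler.poolnameproperty.
            conf.set("scheduler.pool.name", "pool-a"); // hypothetical pool
            // ... remaining job setup ...
            JobClient.runJob(conf);
        }
    }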