Customers can upload URLs to the database at any time, and the application should process those URLs as soon as possible. So I need either periodic Hadoop jobs, or a way to run a Hadoop job automatically from another application (some script that detects that new links were added, generates the data for the Hadoop job, and runs the job). For a PHP or Python script, I could set up cr...
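For the second option, a minimal sketch of what programmatic submission could look like with the old mapred API (the class name, paths, and key/value types below are assumptions, not from the actual setup):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical launcher that a cron job or another application could invoke.
public class UrlJobLauncher {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(UrlJobLauncher.class);
        conf.setJobName("process-new-urls");
        // Assumed layout: the detecting script drops new-URL files here.
        FileInputFormat.setInputPaths(conf, new Path("/urls/incoming"));
        FileOutputFormat.setOutputPath(conf,
                new Path("/urls/processed-" + System.currentTimeMillis()));
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        // conf.setMapperClass(...); conf.setReducerClass(...);
        JobClient.runJob(conf); // blocks until the job completes
    }
}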
I have a massive amount of input data (that's why I use Hadoop) and there are multiple tasks that can be solved with various MapReduce steps of which the first mapper needs all the data as input.
My goal: Compute these different tasks as fast as possible.
I currently let them run sequentially, each reading in all the data. I assume it ...
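One way to share a single pass over the input is a mapper that tags each emitted record with the task it belongs to, so downstream reducers (or follow-up jobs) filter on the tag. A sketch, where the tags and field extractors are made up:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SharedScanMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        // "taskA:"/"taskB:" are placeholder tags; each task's reducer
        // (or next job) keeps only records with its own prefix.
        out.collect(new Text("taskA:" + extractFieldA(value)), value);
        out.collect(new Text("taskB:" + extractFieldB(value)), value);
    }
    private String extractFieldA(Text t) { return t.toString(); } // placeholder
    private String extractFieldB(Text t) { return t.toString(); } // placeholder
}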
I'm trying to use Dumbo/Hadoop to calculate TF-IDF for a bunch of small text
files using this example: http://dumbotics.com/2009/05/17/tf-idf-revisited/
To improve efficiency, I've packaged the text files into a sequence
file using Stuart Sierra's tool -- http://stuartsierra.com/2008/04/24/a-million-little-files
The sequence file uses m...
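For inspecting such a sequence file, a generic reader sketch; it discovers the key/value classes from the file header rather than assuming the types the packaging tool used:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqFileDump {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader =
                new SequenceFile.Reader(fs, new Path(args[0]), conf);
        // Instantiate whatever key/value types the file declares.
        Writable key = (Writable)
                ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable)
                ReflectionUtils.newInstance(reader.getValueClass(), conf);
        while (reader.next(key, value)) {
            System.out.println(key);
        }
        reader.close();
    }
}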
Hello,
I want to merge 2 bzip2'ed files. I tried appending one to the other: cat file1.bzip2 file2.bzip2 > out.bzip2, which seems to work (the resulting file decompresses correctly), but when I use it as a Hadoop input file, I get errors about corrupted blocks.
What's the best way to merge 2 bzip2'ed files without decompressing them?
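If Hadoop's bzip2 codec turns out not to accept a concatenated (multi-stream) file, one fallback is to stream-decompress both inputs and recompress into a single stream, without ever materializing the uncompressed data on disk. A sketch using Hadoop's built-in BZip2Codec (available in recent releases; file names assumed):

import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.BZip2Codec;

public class MergeBzip2 {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);
        BZip2Codec codec = new BZip2Codec();
        // One compressed output stream; both inputs are piped through it.
        OutputStream out =
                codec.createOutputStream(fs.create(new Path("out.bz2")));
        for (String name : new String[] { "file1.bz2", "file2.bz2" }) {
            InputStream in = codec.createInputStream(fs.open(new Path(name)));
            IOUtils.copyBytes(in, out, 64 * 1024, false); // false: keep out open
            in.close();
        }
        out.close();
    }
}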
...
I'm working with my team on a small application that takes a lot of input (a day's worth of logfiles) and produces useful output after several (currently 4, in the future perhaps 10) map-reduce steps (Hadoop & Java).
Now I've done a partial POC of this app and run it on 4 old desktops (my Hadoop test cluster). What I've noticed is that if you ...
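For reference, the chaining itself can be as simple as feeding each job the previous job's output; a sketch with made-up paths and step names:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class Pipeline {
    public static void main(String[] args) throws Exception {
        Path in = new Path("/logs/day"); // assumed input location
        for (int step = 1; step <= 4; step++) {
            JobConf conf = new JobConf(Pipeline.class);
            conf.setJobName("step-" + step);
            // conf.setMapperClass(...) / conf.setReducerClass(...) per step
            Path out = new Path("/tmp/pipeline/step-" + step);
            FileInputFormat.setInputPaths(conf, in);
            FileOutputFormat.setOutputPath(conf, out);
            JobClient.runJob(conf); // blocks, so the steps run in order
            in = out;               // the next step consumes this output
        }
    }
}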
On some websites (like in this PDF: http://sortbenchmark.org/Yahoo2009.pdf) I see very nice graphs that visualize what a Hadoop cluster is doing at each moment.
Were these made "manually" (i.e. with some homemade tool), or is there a ready-to-run script/tool that produces something like this for me?
...
I've started getting into technology books. I want to learn Hadoop, and I find that I enjoy just reading books rather than staring at a computer screen all the time. I've found two books:
Hadoop: The Definitive Guide
Pro Hadoop
I could go by Amazon reviews, but I was wondering what the community thought. If there's anothe...
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;
import java.awt.image.Raster;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.imageio.ImageIO;
import org.apache.hadoop.io.Text; // fixed: javax.xml.soap.Text is an unrelated class that shadows Hadoop's Text
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public c...
I am trying a small Hadoop setup (for experimentation) with just 2 machines. I am loading about 13GB of data, a table of around 39 million rows, with a replication factor of 1 using Hive. My problem is that Hadoop always stores all this data on a single datanode. Only if I change the dfs replication factor to 2 using setrep does Hadoop copy dat...
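(With replication 1, HDFS places each block on the datanode local to the writing process, which would explain the skew.) A small sketch of raising the replication of existing files from Java, equivalent to hadoop fs -setrep; the warehouse path is an assumption:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Setrep {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Bump replication on every file under the (assumed) table directory.
        for (FileStatus s : fs.listStatus(new Path("/user/hive/warehouse/mytable"))) {
            fs.setReplication(s.getPath(), (short) 2);
        }
    }
}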
Hi friends,
The value output from my map/reduce is a BytesWritable array, which is written to the output file part-00000 (Hadoop does so by default). I need this array for my next map function, so I wanted to keep it in the distributed cache. Can somebody tell me how to read from the output file (part-00000), which may not be a text file, and st...
I'm writing a program in Java using Hadoop 0.18.
Now I would like to use a Map (HashMap/TreeMap/...) kind of data structure as the key in my MapReduce processing. I haven't yet been able to find an official Hadoop class that is essentially a MapWritableComparable (i.e. implements Map, Writable and Comparable).
So for my first tests I ...
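A first-cut sketch of what such a class could look like: extend the stock MapWritable (if your version has it) and bolt on an ordering. This is written against the newer generic WritableComparable; on 0.18 the raw interface works the same way. The ordering shown is deliberately crude; a real key type needs a deterministic entry-by-entry comparison, e.g. via a TreeMap:

import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.WritableComparable;

public class ComparableMapWritable extends MapWritable
        implements WritableComparable<ComparableMapWritable> {
    public int compareTo(ComparableMapWritable other) {
        // Placeholder ordering: only stable if both maps render their
        // entries in the same order. For real use, copy the entries into
        // sorted TreeMaps and compare entry by entry.
        return this.entrySet().toString()
                   .compareTo(other.entrySet().toString());
    }
}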
Hi Everyone,
I am trying to use Hadoop's map-side join with CompositeInputFormat, but I get an IOException: "Unmatched ')'". I guess there may be a problem in the format of my input file. I have formatted the input files manually so that the keys are in sorted order in both input files. Is this correct, or do I have to pas...
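The "Unmatched ')'" error usually points at a malformed mapred.join.expr string rather than the input data; it is easiest to let CompositeInputFormat.compose build the expression. A sketch with made-up paths (note the sources must be identically partitioned as well as sorted):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class JoinSetup {
    public static void main(String[] args) {
        JobConf conf = new JobConf(JoinSetup.class);
        conf.setInputFormat(CompositeInputFormat.class);
        // compose() emits the parenthesized join expression for us.
        conf.set("mapred.join.expr", CompositeInputFormat.compose(
                "inner", KeyValueTextInputFormat.class,
                "/join/input-a", "/join/input-b"));
        // ... set mapper/reducer and output path as usual ...
    }
}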
I want to debug a MapReduce script, and without going to much trouble I tried putting some print statements in my program. But I can't seem to find them in any of the logs.
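For what it's worth, stdout/stderr from map and reduce tasks end up in the per-task logs (under HADOOP_LOG_DIR/userlogs, or on each task's detail page in the JobTracker web UI), not on the console of the submitting process. A sketch of the two usual tricks, stderr prints and counters:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class DebugMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {
    enum Debug { RECORDS } // counter shows up in the web UI and job summary

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> out, Reporter reporter)
            throws IOException {
        System.err.println("saw record: " + value); // lands in the task's stderr log
        reporter.incrCounter(Debug.RECORDS, 1);
        out.collect(new Text("n"), new LongWritable(1));
    }
}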
...
I am working with Hadoop 19 on openSUSE Linux. I am not using a cluster; rather, I am running my Hadoop code on my machine itself. I am following the standard technique for putting files in the distributed cache, but instead of accessing the files from the distributed cache again and again, I stored the contents of the file in an array. This part of extra...
Hi folks,
I am in the process of building/architecting a business social network web application that has a component that I think will lead to major scalability issues, and I'd like to get some feedback/thoughts on the best way forward.
The application has a User object. The idea is that every time a new user joins the system he ranks...
This time someone should please reply.
I am struggling with running my code using the distributed cache. I already have the files on HDFS, but when I run this code:
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;
import java.awt.image.Raster;
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
impor...
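For comparison, a minimal sketch of the DistributedCache round trip with the old (mapred) API; the HDFS path is made up:

import java.io.BufferedReader;
import java.io.FileReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheDemo {
    // At job-setup time, before submitting (the file must already be on HDFS):
    public static void register(JobConf conf) throws Exception {
        DistributedCache.addCacheFile(new URI("/user/me/lookup.dat"), conf);
    }

    // Called from the Mapper's configure(JobConf): read once, keep in memory.
    public static String[] loadCached(JobConf job) throws Exception {
        Path[] cached = DistributedCache.getLocalCacheFiles(job);
        BufferedReader in =
                new BufferedReader(new FileReader(cached[0].toString()));
        List<String> lines = new ArrayList<String>();
        String line;
        while ((line = in.readLine()) != null) {
            lines.add(line);
        }
        in.close();
        return lines.toArray(new String[0]);
    }
}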
Sorry to disturb again, but I like learning here.
I am using the JHLabs filters library on buffered images. On running my code I get this exception:
java.lang.ArrayIndexOutOfBoundsException: 4
at com.jhlabs.image.ConvolveFilter.convolveHV(ConvolveFilter.java:175)
at com.jhlabs.image.ConvolveFilter.convolve(ConvolveFilter.j...
Hi!
I keep getting Exceeded MAX_FAILED_UNIQUE_FETCHES; in the reduce phase, even though I've tried all the solutions I could find online. Please help me; I have a project presentation in three hours and my solution doesn't scale.
I have one master that is the NameNode and JobTracker (172.16.8.3) and 3 workers (172.16.8.{11, 12, 13}).
Here are...
Hi Felix Kling, first of all thanks for showing interest.
I'm Adarsh Sharma, presently working on Hadoop technologies such as Hive, Hadoop, HadoopDB, HBase, etc.
I have configured HadoopDB on a Hadoop cluster of 3 nodes with Postgres as the database layer.
I load a table website_master containing 3 columns into HadoopDB in chunked form...
I am trying to take advantage of multiple pools in the FairScheduler. But all my jobs are submitted by a single agent process and therefore all belong to the same user.
I have set mapred.fairscheduler.poolnameproperty to scheduler.pool.name and then in each job I set "scheduler.pool.name" to a specific pool from pools.xml that I want to use for...
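For reference, with mapred.fairscheduler.poolnameproperty set to scheduler.pool.name on the JobTracker, the per-job side looks like this; the pool name is assumed to match an entry in pools.xml:

import org.apache.hadoop.mapred.JobConf;

public class PoolDemo {
    public static JobConf confFor(String pool) {
        JobConf conf = new JobConf(PoolDemo.class);
        // The FairScheduler reads this property to pick the pool,
        // because poolnameproperty names it on the JobTracker side.
        conf.set("scheduler.pool.name", pool);
        return conf;
    }
}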