hadoop

Available reducers in Elastic MapReduce

I hope I'm asking this in the right way. I'm learning my way around Elastic MapReduce and I've seen numerous references to the "aggregate" reducer that can be used with "streaming" job flows. Amazon's "Introduction to Amazon Elastic MapReduce" PDF states, "Amazon Elastic MapReduce has a default reducer called aggregate." What I wo...
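For context, the streaming "aggregate" reducer is normally driven by having the mapper prefix each output key with the name of an aggregation function. A minimal sketch of such a mapper in Python (the word-count use case is illustrative, not from the question):

```python
# Sketch of a streaming mapper for Hadoop's built-in "aggregate"
# reducer. The aggregate reducer interprets keys of the form
#   <aggregator>:<key>   e.g. LongValueSum:word
# and applies the named aggregation (here: a sum) to the values.
def map_line(line):
    """Turn one input line into 'LongValueSum:<word>\t1' records."""
    return ["LongValueSum:%s\t1" % word for word in line.split()]

# In the actual job, the script would stream stdin to stdout:
#   import sys
#   for line in sys.stdin:
#       for record in map_line(line):
#           print(record)
```

The job would then be submitted with -mapper pointing at this script and -reducer aggregate.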

Scalable Image Storage

Hi, I'm currently designing an architecture for a web-based application that should also provide some kind of image storage. Users will be able to upload photos as one of the key features of the service, and viewing these images (via the web) will be one of the primary usages. However, I'm not sure how to realize such a scalable image sto...

A Servlet Container on top of Hadoop?

I'm in the architectural phase of a big project and I've decided to use HBase as my database, and I will use map/reduce jobs for my processing, so my architecture works entirely on Hadoop. The thing is, I also need to implement some REST and SOAP APIs and some web pages too, so I was thinking: is there any servlet container that runs on top of h...

Hadoop or Hadoop Streaming for MapReduce on AWS

I'm about to start a MapReduce project which will run on AWS, and I am presented with a choice: either use Java or C++. I understand that writing the project in Java would make more functionality available to me; however, C++ could pull it off too, through Hadoop Streaming. Mind you, I have little background in either language. A simi...
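One point worth noting when weighing the two: Hadoop Streaming talks to the mapper and reducer purely over stdin/stdout with tab-separated key/value lines, so the Java vs. C++ choice affects library support far more than the mechanics. A sketch of what a streaming reducer sees (Python here purely for brevity; a C++ binary would read the same sorted lines):

```python
# Sketch of a streaming reducer: Hadoop delivers the mapper output
# sorted by key, one "key\tvalue" line per record. The reducer just
# detects key boundaries and aggregates -- any language can do this.
def reduce_sorted(lines):
    """Sum integer values per key over key-sorted input lines."""
    totals = []
    current_key, current_sum = None, 0
    for line in lines:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current_key:
            if current_key is not None:
                totals.append((current_key, current_sum))
            current_key, current_sum = key, 0
        current_sum += int(value)
    if current_key is not None:
        totals.append((current_key, current_sum))
    return totals

# In the actual job: stream sys.stdin through reduce_sorted and
# print each "key\ttotal" line to stdout.
```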

Converting word docs to pdf using Hadoop

Say I want to convert thousands of Word files to PDF; would using Hadoop to approach this problem make sense? Would using Hadoop have any advantage over simply using multiple EC2 instances with job queues? Also, if there were 1 file and 10 free nodes, would Hadoop split the file and send it to the 10 nodes, or would the file be sent ...

Amazon MapReduce no reducer job

Hi. I am trying to create a mapper-only job via AWS (a streaming job). The reducer field is required, so I am giving it a dummy executable and adding -jobconf mapred.map.tasks=0 to the Extra Args box. In the Hadoop environment (version 0.20) I've installed, no reducer jobs will launch, but on AWS the dummy executable launches and fails. ...
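For what it's worth, the conventional knob for a map-only job in this Hadoop generation is mapred.reduce.tasks=0 (not mapred.map.tasks=0): with zero reduce tasks, map output is written straight to the output directory and no reducer, dummy or otherwise, should launch. A sketch of an identity mapper with the assumed invocation (flag and file names are illustrative):

```python
# Identity mapper for a map-only streaming job. Assumed invocation
# (flag names from the 0.18/0.20-era streaming jar; paths illustrative):
#   hadoop jar hadoop-streaming.jar \
#     -jobconf mapred.reduce.tasks=0 \
#     -input in/ -output out/ \
#     -mapper identity_mapper.py
def identity(lines):
    """Pass input records through unchanged."""
    return [line.rstrip("\n") for line in lines]

# In the actual job: print identity(sys.stdin) records to stdout.
```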

Question on hadoop "java.lang.RuntimeException: java.lang.ClassNotFoundException: "

Here's my source code:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoo...

Error in starting namenode in Hadoop

Hi, when I try to format the namenode, or even start it, I get the error below. What should be done?

$ bin/hadoop namenode -format
Exception in thread "main" java.lang.NoClassDefFoundError:
Caused by: java.lang.ClassNotFoundException:
    at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
    at java.securi...

Hadoop Pig Latin style guide?

Hi, I'm looking to take a shortcut on formatting/style for Pig Latin (hadoop-ay). Does anyone know where I can find a style guide? -daniel ...

Question about using C# to talk to the Hadoop FileSystem

Currently my application uses C# with Mono on Linux to communicate with local file systems (e.g. ext2, ext3). The basic operations are opening a file, writing/reading the file, and closing/deleting the file. For this, I currently use C#'s native APIs (like File.Open) to operate on the file. My question is: if I install the Hadoop file system on my Linux bo...

How do I use a more recent version of a hadoop/lib jar in my map/reduce jobs?

Hadoop currently ships with commons-httpclient-3.0.1.jar in its lib folder. If I have a map/reduce task that requires commons-httpclient-3.1.jar, it does not seem to be sufficient to bundle this jar in the lib folder of my job jar (as one would do with any normal external jar dependency), as Hadoop seems to be loading the previous ...

Remote Java program execution using FTP, very large dataset on remote machine - program to data vs. data to program

Hi all, I am developing a Java-based application; its pertinent requirements are listed below. Large datasets exist on several machines on the network; my program needs to (remotely) execute a Java program to process these datasets and fetch the results. A user on a Windows desktop will need to process datasets (several GBs) on machine A....

Error in using Hadoop MapReduce in Eclipse

When I executed a MapReduce program in Eclipse using Hadoop, I got the error below. It must be some path issue, but I'm not able to figure it out. Any ideas?

16:35:39 INFO mapred.JobClient: Task Id : attempt_201001151609_0001_m_000006_0, Status : FAILED
java.io.FileNotFoundException: File C:/tmp/hadoop-Shwe/mapred/local/taskTracker...

Any scalable OLAP database (web app scale)?

I have an application that requires analytics at different levels of aggregation, and that's the OLAP workload. I also want to update my database fairly frequently. E.g., here is what my update looks like (schema: time, dest, source ip, browser -> visits): (15:00-1-2-2010, www.stackoverflow.com, 128.19.1.1, safari) --> 10...

Very basic question about Hadoop and compressed input files

I have started to look into Hadoop. If my understanding is right, I could process a very big file and it would get split over different nodes; however, if the file is compressed, it cannot be split and would need to be processed by a single node (effectively destroying the advantage of running MapReduce over a cluster of paral...
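The splittability concern can be seen without Hadoop at all: a plain gzip stream has no block index, so decompression can only start at byte 0, which is why a gzipped input ends up in a single split processed by one mapper (block-oriented codecs like bzip2 are what later made splittable compressed input possible). A small stdlib sketch:

```python
# Demonstrates why gzip input is not splittable: the stream must be
# decoded sequentially from the gzip header at byte 0; a "split"
# starting at an arbitrary offset is not a valid gzip stream.
import gzip

def demo():
    data = gzip.compress(b"line1\nline2\n")
    whole = gzip.decompress(data)   # decoding from byte 0 works
    try:
        gzip.decompress(data[10:])  # mid-stream start: no gzip header
        mid_stream_ok = True
    except OSError:
        mid_stream_ok = False
    return whole, mid_stream_ok
```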

Error in Hadoop MapReduce

When I run a MapReduce program using Hadoop, I get the following error.

10/01/18 10:52:48 INFO mapred.JobClient: Task Id : attempt_201001181020_0002_m_000014_0, Status : FAILED
java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
10/01/18 10:52:48 WARN map...

Dynamic Nodes in Hadoop

Is it possible to add new nodes to Hadoop after it is started? I know that you can remove nodes (as the master tends to keep tabs on node state). ...

Run Hadoop job without using JobConf

I can't find a single example of submitting a Hadoop job that does not use the deprecated JobConf class. JobClient, which hasn't been deprecated, still only supports methods that take a JobConf parameter. Can someone please point me to an example of Java code submitting a Hadoop map/reduce job using only the Configuration class (not Jo...

Programming in Mahout

What is the step-by-step procedure for executing a program in Mahout? ...

STREAM keyword in a Pig script that runs on Amazon MapReduce

Hi. I have a Pig script that activates another Python program. I was able to do so in my own Hadoop environment, but I always fail when I run my script on the Amazon MapReduce WS. The log says: org.apache.pig.backend.executionengine.ExecException: ERROR 2090: Received Error while processing the reduce plan: '' failed with exit status: 127...
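As a diagnostic aside, exit status 127 is the POSIX shell's "command not found" code, so this log usually means the streamed Python program (or the python interpreter itself) does not exist on the task nodes, e.g. because the script was not shipped with the job. The convention is easy to reproduce locally (assumes a POSIX shell; the command name below is made up):

```python
# Exit status 127 is what a POSIX shell returns when it cannot find
# the command at all -- the usual meaning behind Pig/streaming logs
# like "failed with exit status: 127".
import subprocess

def shell_status(cmd):
    """Run cmd through the shell and return its exit status."""
    return subprocess.call(cmd, shell=True)

missing = shell_status("definitely_not_a_real_command_xyz")  # -> 127
```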