hadoop

Which Hadoop API version should I use?

In the latest Hadoop Studio, the 0.18 API of Hadoop is labeled "Stable" and the 0.20 API is labeled "Unstable". The distribution that comes from Yahoo is a 0.20 (with Yahoo patches), which is apparently "the way to go". Cloudera states that its 0.20 (with Cloudera patches) is also stable. Now, given the fact that we'll start ...

Managing dependencies with Hadoop Streaming?

Hi all, I have a quick Hadoop Streaming question. If I'm using Python streaming and my mappers/reducers require Python packages that aren't installed by default, do I need to install those on all the Hadoop machines as well, or is there some sort of serialization that ships them to the remote machines? Thanks! ...
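
Streaming does not serialize your Python environment, but pure-Python dependencies can be shipped with the job. A minimal sketch, assuming a hypothetical archive deps.zip passed via the streaming -file option and a hypothetical module mymodule inside it:

    #!/usr/bin/env python
    # mapper.py -- loads a dependency shipped alongside the job,
    # assuming it was launched with:  -file mapper.py -file deps.zip
    import sys
    sys.path.insert(0, 'deps.zip')  # pure-Python modules can be imported from a zip

    import mymodule  # hypothetical package bundled in deps.zip

    for line in sys.stdin:
        key, value = mymodule.process(line)  # hypothetical API
        print('%s\t%s' % (key, value))

C extensions cannot be imported from a zip, so packages with compiled parts do still need to be installed on every node.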

I'm familiar with Python and its data structures. Can someone give me a very basic example of how to use Hadoop MapReduce?

What can I do with MapReduce? Dictionaries? Lists? What would I use it for? Give me a really easy example. ...
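
For illustration, the canonical first example is word count, written as two small scripts that read stdin and write tab-separated key/value lines (the file names here are illustrative):

    #!/usr/bin/env python
    # mapper.py -- emit (word, 1) for every word in the input
    import sys

    for line in sys.stdin:
        for word in line.split():
            print('%s\t1' % word)

    #!/usr/bin/env python
    # reducer.py -- input arrives sorted by key, so all counts for a
    # word are adjacent and can be summed in a single pass
    import sys

    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit('\t', 1)
        if word != current and current is not None:
            print('%s\t%d' % (current, count))
            count = 0
        current = word
        count += int(n)
    if current is not None:
        print('%s\t%d' % (current, count))

Because Streaming simply pipes lines through your scripts, the whole pipeline can be tested locally with: cat input.txt | python mapper.py | sort | python reducer.py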

Converting Python collaborative filtering code to use MapReduce

Using Python, I'm computing cosine similarity across items. Given event data that represents a purchase (user, item), I have a list of all items 'bought' by my users. Given this input data (user, item):

    X,1
    X,2
    Y,1
    Y,2
    Z,2
    Z,3

I build a Python dictionary {1: ['X','Y'], 2: ['X','Y','Z'], 3: ['Z']}. From that dictionary, I generate a...
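
The dictionary-building step translates directly to one job: a mapper inverts each (user, item) pair to (item, user), and the reducer collects the user list per item. A hedged Streaming sketch, assuming the comma-separated input shown above:

    #!/usr/bin/env python
    # mapper.py -- invert "user,item" lines to (item, user) pairs
    import sys

    for line in sys.stdin:
        user, item = line.strip().split(',')
        print('%s\t%s' % (item, user))

    #!/usr/bin/env python
    # reducer.py -- collect the users per item (keys arrive sorted)
    import sys

    current, users = None, []
    for line in sys.stdin:
        item, user = line.strip().split('\t')
        if item != current and current is not None:
            print('%s\t%s' % (current, ','.join(users)))
            users = []
        current = item
        users.append(user)
    if current is not None:
        print('%s\t%s' % (current, ','.join(users)))

A second job can then pair up items from these per-item user lists to accumulate the co-occurrence counts that the cosine computation needs.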

What's the best way to count unique visitors with Hadoop?

Hey all, just getting started on Hadoop and curious what the best way in MapReduce would be to count unique visitors if your log files looked like this:

    DATE        siteID  action    username
    05-05-2010  siteA   pageview  jim
    05-05-2010  siteB   pageview  tom
    05-05-2010  siteA   pageview  jim
    05-05-2010  siteB   pageview  bob
    05-05-2010  siteA   ...
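
One hedged sketch: map every record to a (siteID, username) pair, then count distinct usernames per site in the reducer (field positions are assumed from the sample above):

    #!/usr/bin/env python
    # mapper.py -- emit (siteID, username) for every log record
    import sys

    for line in sys.stdin:
        fields = line.split()
        if len(fields) == 4:
            date, site, action, user = fields
            print('%s\t%s' % (site, user))

    #!/usr/bin/env python
    # reducer.py -- count distinct users per site with a set
    import sys

    current, users = None, set()
    for line in sys.stdin:
        site, user = line.strip().split('\t')
        if site != current and current is not None:
            print('%s\t%d' % (current, len(users)))
            users = set()
        current = site
        users.add(user)
    if current is not None:
        print('%s\t%d' % (current, len(users)))

For sites with very many visitors, emitting a composite 'site,user' key and counting in a second pass avoids holding the whole set in memory.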

Global variables in Hadoop

Hi, my program follows an iterative map/reduce approach, and it needs to stop when certain conditions are met. Is there any way I can set a global variable that is distributed across all map/reduce tasks, and check whether that global variable has reached the completion condition? Something like this:

    while (condition != true) { C...
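
There is no shared mutable global across tasks, but tasks can increment Hadoop Counters that the driver reads after each job to decide whether to launch another iteration. In Streaming, a task bumps a counter by writing a specially formatted line to stderr; a minimal sketch (the group and counter names, and the convergence test, are illustrative):

    #!/usr/bin/env python
    # mapper.py -- flag records that still fail the convergence test
    # by incrementing a Hadoop Counter; assumes one number per line
    import sys

    for line in sys.stdin:
        value = float(line.strip())
        if abs(value) > 0.001:  # illustrative convergence threshold
            sys.stderr.write('reporter:counter:Loop,NotConverged,1\n')
        sys.stdout.write(line)  # pass the record through unchanged

After the job finishes, the driver checks the NotConverged counter and starts another round only if it is nonzero.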

getting close to real-time with hadoop

I need some good references for using Hadoop for real-time systems, such as searching with a short response time. I know Hadoop has the overhead of HDFS, but what's the best way of doing this with Hadoop? ...

Getting started with massive data

I'm a math guy and occasionally do some statistics/machine-learning analysis consulting projects on the side. The data I have access to are usually on the smaller side, at most a couple hundred megabytes (and almost always far less), but I want to learn more about handling and analyzing data on the gigabyte/terabyte scale. What do I n...

which Distribution of Linux is best suited for Nutch-Hadoop?

Hi experts, we are trying to figure out which Linux distribution is best suited for the Nutch-Hadoop integration. We are planning to use clusters for crawling large amounts of content with Nutch. Let me know if you need more clarification on this question. Thank you. ...

Repository organization for Hadoop project

I am starting on a new Hadoop project that will have multiple Hadoop jobs (and hence multiple jar files). Using Mercurial for source control, I was wondering what the optimal way of organizing the repository structure would be. Should each job live in a separate repo, or would it be more efficient to keep them in the same one but break it down into fo...

hadoop mapper static initialisation

Hi, I have a code fragment in which I am using a static code block to initialize a variable:

    public static class JoinMap
            extends Mapper<IntWritable, MbrWritable, LongWritable, IntWritable> {
        .......
        public static RTree rt = null;
        static {
            String rtreeFileName = "R.rtree";
            rt...

strange error in running executable files (linux)

I try to run an executable file on a newly installed Ubuntu and I get this strange error:

    > ./hadoop
    hadoop : Not a directoryh
    > hadoop
    hadoop: command not found

The first error says "directoryh"; what is the reason for these messages? ...

fft algorithm implementation with hadoop

Hi, I want to implement the Fast Fourier Transform algorithm with Hadoop. I know the recursive FFT algorithm, but I need your guidance on how to implement it with a Map/Reduce approach. Any suggestions? Thanks. ...
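
For orientation, the recursive Cooley-Tukey structure already has the shape of a map/reduce round: the even-index and odd-index halves can be transformed independently (the map side), and the butterfly combine is the reduce side. A plain local sketch of that decomposition, not a Hadoop job:

    import cmath

    def fft(xs):
        # Recursive Cooley-Tukey FFT; len(xs) must be a power of two.
        n = len(xs)
        if n == 1:
            return list(xs)
        # "Map" half: the two sub-transforms are independent and could
        # run as separate tasks.
        evens = fft(xs[0::2])
        odds = fft(xs[1::2])
        # "Reduce" half: butterfly combine of the two sub-results.
        out = [0j] * n
        for k in range(n // 2):
            t = cmath.exp(-2j * cmath.pi * k / n) * odds[k]
            out[k] = evens[k] + t
            out[k + n // 2] = evens[k] - t
        return out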

Any Open Source Pregel-like framework for distributed processing of large Graphs?

Google has described a novel framework for distributed processing of massive graphs: http://portal.acm.org/citation.cfm?id=1582716.1582723. I wanted to know whether, as with Hadoop (MapReduce), there are any open-source implementations of this framework. I am actually in the process of writing a pseudo-distributed one using Python and multip...
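
For anyone sketching their own, the core of the Pregel model is small: each active vertex runs a compute function per superstep, messages sent in one superstep are delivered at the start of the next, and the run ends once every vertex has voted to halt. A toy single-process sketch (all names are illustrative; a real implementation would shard vertices across workers):

    def pregel(state, edges, compute, max_supersteps=30):
        # state: {vertex: value}; edges: {vertex: [neighbor, ...]}
        # compute(vertex, value, messages, out_edges, send) returns the
        # new value, or None to vote to halt.
        inbox = {v: [] for v in state}
        active = set(state)
        for _ in range(max_supersteps):
            if not active:
                break
            outbox = {v: [] for v in state}
            def send(target, msg):
                outbox[target].append(msg)
            for v in list(active):
                result = compute(v, state[v], inbox[v], edges.get(v, []), send)
                if result is None:
                    active.discard(v)  # vertex votes to halt
                else:
                    state[v] = result
            inbox = outbox
            # an incoming message reactivates a halted vertex
            active |= {v for v, msgs in inbox.items() if msgs}
        return state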

Need help implementing this algorithm with Hadoop MapReduce

Hi all! I have an algorithm that will go through a large data set, read some text files, and search for specific terms in those lines. I have it implemented in Java, but I didn't want to post the code so that it doesn't look like I am asking someone to implement it for me, though it is true that I really need a lot of help! This was not planned for my...
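
As a starting point, term search fits the classic grep pattern in MapReduce: the mapper scans each line for the terms and emits a count, and the reducer sums the counts per term. A hedged Streaming sketch (the term list is illustrative and would normally be shipped with the job):

    #!/usr/bin/env python
    # mapper.py -- emit (term, 1) for each search term found in a line
    import sys

    TERMS = ['error', 'timeout']  # hypothetical search terms

    for line in sys.stdin:
        for term in TERMS:
            if term in line:
                print('%s\t1' % term)

The reducer is the same summing loop as in the word-count example earlier in this list.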

Hadoop 0.20.2 API with Java 5

I have started a Maven project trying to implement the MapReduce algorithm in Java 1.5.0_14. I have chosen the 0.20.2 Hadoop API version. In the pom.xml I'm thus using the following dependency:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>0.20.2</version>
    </depend...

Problem with copying local data onto HDFS on a Hadoop cluster using Amazon EC2/S3

Hi, I have set up a Hadoop cluster containing 5 nodes on Amazon EC2. Now, when I log into the master node and submit the following command:

    bin/hadoop jar <program>.jar <arg1> <arg2> <path/to/input/file/on/S3>

it throws the following errors (not at the same time). The first error is thrown when I don't replace the slashes with '%2F' ...

Hadoop: Processing large serialized objects

I am working on the development of an application to process (and merge) several large Java serialized objects (on the order of GBs) using the Hadoop framework. Hadoop distributes the blocks of a file across different hosts, but since deserialization requires all the blocks to be present on a single host, it's going to hit performance drasticall...

Hadoop: Mapping binary files

Typically the input file can be partially read and processed by the Mapper function (as with text files). Is there anything that can be done to handle binaries (say images, serialized objects) which require all the blocks to be on the same host before processing can start? ...
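
One common workaround, for this and for the serialized-objects question above, is to give the job a text file of HDFS paths rather than the binaries themselves, so each map task pulls one whole file to itself before processing. A hedged Streaming sketch ('hadoop fs -cat' is the standard HDFS CLI; the rest is illustrative):

    #!/usr/bin/env python
    # mapper.py -- input lines are HDFS paths; fetch each whole file
    # so that all of its bytes are processed on this one host
    import subprocess
    import sys

    for line in sys.stdin:
        path = line.strip()
        if not path:
            continue
        data = subprocess.check_output(['hadoop', 'fs', '-cat', path])
        # ... deserialize / process the complete byte string here ...
        print('%s\t%d' % (path, len(data)))

On the Java side, the equivalent is a FileInputFormat whose isSplitable() returns false, so each file becomes a single split handled by one mapper.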

Hadoop 0.18.3 API with Java 5

I'm using Hadoop version 0.18.3 in combination with Java 5, and I'm trying to run the WordCount v1.0 example from http://hadoop.apache.org/common/docs/r0.20.1/mapred_tutorial.html. But I get the following error:

    0/06/10 15:28:10 WARN fs.FileSystem: uri=file:///
    javax.security.auth.login.LoginException: Login failed: CreateP...