mapreduce

Run Hadoop job without using JobConf

I can't find a single example of submitting a Hadoop job that does not use the deprecated JobConf class. JobClient, which hasn't been deprecated, still only supports methods that take a JobConf parameter. Can someone please point me at an example of Java code submitting a Hadoop map/reduce job using only the Configuration class (not Jo...

The difference between MapReduce and the map-reduce combination in functional programming.

I read the MapReduce article at http://en.wikipedia.org/wiki/MapReduce and understood the example of how to get the count of a "word" in many "documents". However, I did not understand the following line: Thus the MapReduce framework transforms a list of (key, value) pairs into a list of values. This behavior is different from the functional pro...
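A minimal sketch of the distinction the quoted line draws, in plain Python rather than any MapReduce framework. The sample documents and the helper names (map_fn, reduce_fn, map_reduce) are made up for illustration. In the functional style, reduce folds the whole mapped list into a single value; in the MapReduce style, map emits (key, value) pairs, the framework groups them by key, and reduce runs once per key.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical input: document name -> text.
documents = {"doc1": "the quick fox", "doc2": "the lazy dog the end"}

# Functional style: map each element, then fold everything into ONE value.
total_words = reduce(
    lambda acc, n: acc + n,
    map(lambda text: len(text.split()), documents.values()),
    0,
)

# MapReduce style: map emits (key, value) pairs for each input record.
def map_fn(name, text):
    return [(word, 1) for word in text.split()]

# Reduce is called once per key with all values emitted for that key.
def reduce_fn(word, counts):
    return sum(counts)

def map_reduce(inputs, mapper, reducer):
    groups = defaultdict(list)
    for key, value in inputs.items():
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # grouping by key ("shuffle")
    return {k: reducer(k, vs) for k, vs in groups.items()}

word_counts = map_reduce(documents, map_fn, reduce_fn)
```

Here total_words is a single number, while word_counts holds one result per distinct word.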

Mapreduce with Riak

Does anyone have example MapReduce code for Riak that can be run on a single Riak node? ...

STREAM keyword in pig script that runs in Amazon Mapreduce

Hi. I have a Pig script that invokes a Python program. This works in my own Hadoop environment, but it always fails when I run the script in Amazon MapReduce WS. The log says: org.apache.pig.backend.executionengine.ExecException: ERROR 2090: Received Error while processing the reduce plan: '' failed with exit status: 127...

Hadoop searching words from one file in another file

Hi, I want to build a Hadoop application that reads words from one file and searches for them in another file. If the word exists, it has to write to one output file. If the word doesn't exist, it has to write to another output file. I tried a few examples in Hadoop. I have two questions. The two files are approximately 200MB each. Checking ever...
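One possible approach, sketched in plain Python and not as Hadoop code: if the vocabulary of the file being searched fits in memory (an assumption, plausible at roughly 200MB), it can be loaded into a set once so that each lookup is a constant-time membership test instead of a scan. The function name split_by_presence and the sample data are hypothetical.

```python
def split_by_presence(words, corpus_lines):
    """Return (found, missing): words that do / do not occur in the corpus."""
    # Build the lookup set once from the file being searched.
    vocabulary = set()
    for line in corpus_lines:
        vocabulary.update(line.split())

    # Each membership test is O(1) on average.
    found = [w for w in words if w in vocabulary]
    missing = [w for w in words if w not in vocabulary]
    return found, missing


# Hypothetical sample data standing in for the two files.
found, missing = split_by_presence(
    ["hadoop", "riak", "pig"],
    ["hadoop runs map and reduce tasks", "pig compiles to mapreduce jobs"],
)
```

The two returned lists correspond to the two output files described in the question.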

Efficient search in a corpus

I have a few million words that I want to search for in a billion-word corpus. What would be an efficient way to do this? I am thinking of a trie, but is there an open source trie implementation available? Thank you. -- Updated -- Let me add a few more details about what exactly is required. We have a system where we crawled news...

New to Hadoop and dumbo, how to correctly sequence these operations?

Consider the following log file format:

id  v1  v2  v3
1   15  30  25
2   10  10  20
3   50  30  30

We are to calculate the average value frequency (AVF) for each data row on a Hadoop cluster using dumbo. AVF for a data point with m attributes is defined as: avf ...

Is there anything like Hadoop in C++?

What is the closest thing to Hadoop, but in C++? In particular, I want to do distributed computing using MapReduce. Thanks! ...

Efficient MapReduce when dealing with streams of queries to the same dataset

Hi, I have a massive, static dataset and a function to apply to it. f is in the form reduce(map(f, dataset)), so I would use the MapReduce skeleton. However, I don't want to scatter the data at each request (and ideally I want to take advantage of indexing in order to speed up f). There is a MapReduce implementation that addresses th...

Which Map-Reduce library and/or platform to use with Java

I was reading and hearing some stuff about cloud computing and map-reduce techniques lately. I am thinking of playing around with some algorithms to get practical experience in that field and see what is possible right now. Here is what I want to do: I would like to use some public cloud platform (e.g. Google App Engine, Google Map Redu...

Finding matching lines with Hadoop/MapReduce

I am playing around with Hadoop and have set up a two-node cluster on Ubuntu. The WordCount example runs just fine. Now I'd like to write my own MapReduce program to analyze some log data (main reason: it looks simple and I have plenty of data). Each line in the log has this format: <UUID> <Event> <Timestamp> where event can be INIT,...

Map Reduce Algorithms on Terabytes of Data?

This question does not have a single "right" answer. I'm interested in running Map Reduce algorithms, on a cluster, on Terabytes of data. I want to learn more about the running time of said algorithms. What books should I read? I'm not interested in setting up Map Reduce clusters, or running standard algorithms. I want rigorous theor...

Research topic on distributed systems

Hello there. I have a research project on distributed systems. I asked the professor if I could work on MapReduce, and he is giving me a hard time, saying that MapReduce is very broad. He asked me to pick a specific problem about either distributed systems frameworks like MapReduce or something else that has networking and distributed computing in it. ...

Is this architecture possible in Hadoop MR?

Is the following architecture possible in Hadoop MapReduce? A distributed key-value store is used (HBase), so along with the values there would be a timestamp associated with each value. Map and Reduce tasks are executed iteratively. Map, in each iteration, should take in values which were added in the previous iteration to the store (perhaps...

What is the easiest to use distributed map reduce programming system?

What is the easiest to use distributed map reduce programming system? For example, in a distributed datastore containing many users, each with many connections, say I wanted to count the total number of connections: Map: for all records of type "user" do for each user count number of connections return connection_count_for_one_...
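The counting example in the question can be sketched in plain Python as below. This is only an illustration of the map and reduce steps, not code for any particular distributed system; the record layout (a "type" field and a "connections" list) and the function names are assumptions made up for the example.

```python
from functools import reduce

# Hypothetical records; the field names are illustrative only.
records = [
    {"type": "user", "name": "alice", "connections": ["bob", "carol"]},
    {"type": "user", "name": "bob", "connections": ["alice"]},
    {"type": "group", "name": "admins", "connections": []},
]

def map_user(record):
    """Map step: emit the connection count for one user record."""
    if record["type"] == "user":
        yield len(record["connections"])

def count_connections(all_records):
    """Reduce step: sum the per-user counts emitted by the map step."""
    per_user = (count for r in all_records for count in map_user(r))
    return reduce(lambda a, b: a + b, per_user, 0)

total = count_connections(records)
```

Because addition is associative, partial sums could be computed on separate nodes and combined afterwards, which is what makes this job easy to distribute.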

Hadoop one Map and multiple Reduce

We have a large dataset to analyze with multiple reduce functions. All reduce algorithms work on the same dataset generated by the same map function. Reading the large dataset is too costly to do every time; it would be better to read it only once and pass the mapped data to multiple reduce functions. Can I do this with Hadoop? I've se...
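The idea of reading and mapping once, then feeding the same grouped data to several reduce functions, can be sketched in plain Python as follows. This is not Hadoop API code; run_once, parse, and the sample input lines are hypothetical names and data used only to show the shape of the computation.

```python
from collections import defaultdict

def run_once(records, mapper, reducers):
    """Map and group the input a single time, then apply every reducer."""
    groups = defaultdict(list)
    for record in records:  # the dataset is read exactly once
        for key, value in mapper(record):
            groups[key].append(value)
    # Each named reducer sees the same grouped values.
    return {
        name: {key: fn(values) for key, values in groups.items()}
        for name, fn in reducers.items()
    }

def parse(line):
    """Hypothetical mapper: 'key number' -> (key, int(number))."""
    key, value = line.split()
    yield key, int(value)

results = run_once(["a 1", "b 5", "a 3"], parse, {"sum": sum, "max": max})
```

In this sketch the cost of reading and mapping is paid once, and adding another analysis only means adding another entry to the reducers dictionary.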

Computational Linguistics project idea using Hadoop MapReduce

I need to do a project for a Computational Linguistics course. Is there any interesting "linguistic" problem which is data-intensive enough to work on using Hadoop MapReduce? The solution or algorithm should try to analyse and provide some insight in the "linguistic" domain. However, it should be applicable to large datasets so that I can use hadoo...

How to ensure MapReduce tasks are independent from each other?

I'm curious: how do MapReduce, Hadoop, etc. break a chunk of data into independently operated tasks? I'm having a hard time imagining how that can be, considering it is common to have data that is quite interrelated, with state conditions between tasks, etc. Thanks. ...
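A rough sketch of one part of the answer, in plain Python and under simplifying assumptions: map tasks only ever see their own input split, and related data is brought together afterwards by routing every (key, value) pair to a reducer chosen from the key alone. The function names and the CRC32-based partitioner here are illustrative choices, not any framework's actual implementation.

```python
import zlib

def partition(key, num_reducers):
    """Pick a reducer from the key alone, so equal keys always meet."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(pairs, num_reducers):
    """Route (key, value) pairs from all map tasks to reducer buckets."""
    buckets = [dict() for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].setdefault(key, []).append(value)
    return buckets

# Output of two hypothetical, independent map tasks.
map_task_1 = [("a", 1), ("b", 2)]
map_task_2 = [("a", 3)]

buckets = shuffle(map_task_1 + map_task_2, 4)
owners_of_a = [i for i, bucket in enumerate(buckets) if "a" in bucket]
```

Each key ends up in exactly one bucket, so a reducer can process its bucket without coordinating with the others; any state that must be shared has to be expressed through the keys.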

Processing live feed of logs from web server using Hadoop

I want to process the logs from my web server as they come in using Hadoop (Amazon Elastic MapReduce). I googled for help but found nothing useful. I would like to know if this can be done or if there is any alternative way to do this. ...

Project Idea with Hadoop MapReduce

Hello, I learnt Hadoop a few months back and managed to do a very introductory programming project on it. I want to do a small to medium-sized project or a series of small programming assignments with Hadoop. I have seen a lot of ideas around, but I don't see anything that can be finished in about 60-70 hours of work so a pretty small scale pr...