mapreduce

Mappers, Reducers, Filters

I know about the map/reduce algorithm and its use. It uses functions called Mappers and Reducers, but I also see people use the word Filters. Are Filters the same as Mappers, or is there some significant difference? ...
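As a rough illustration of the terms in this question, here is a minimal sketch of the three operations in their general functional-programming sense, using Python built-ins. The sample list is made up for the example. In Hadoop-style MapReduce a mapper may emit zero or more records per input, so a filter can usually be expressed as a mapper that emits nothing for rejected records.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5, 6]  # illustrative input, not from the question

# map: transform every element; one output item per input item
squares = list(map(lambda n: n * n, numbers))

# filter: keep or drop each element unchanged, based on a predicate
evens = list(filter(lambda n: n % 2 == 0, numbers))

# reduce: fold all elements into a single value
total = reduce(lambda acc, n: acc + n, numbers, 0)

print(squares, evens, total)
```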

CouchDB Views: created_at greater than a passed value

I'm trying to write a couchdb view that takes a created_at timestamp in a sortable format (2009/05/07 21:40:17 +0000) and returns all documents that have a greater created_at value. I'm specifically using couch_foo but if I can figure out how to write the view I can create it in futon or in the couch_foo model instead of letting couch_f...

Parallelizing Ruby reducers in Hadoop?

A simple wordcount reducer in Ruby looks like this: #!/usr/bin/env ruby wordcount = Hash.new STDIN.each_line do |line| keyval = line.split("|") wordcount[keyval[0]] = wordcount[keyval[0]].to_i+keyval[1].to_i end wordcount.each_pair do |word,count| puts "#{word}|#{count}" end It receives all the mappers' intermediate values on STDIN. Not f...
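For readers more at home in Python, the aggregation performed by the Ruby reducer above can be sketched as follows. The function name `reduce_counts` and the sample lines are assumptions made for illustration, and the sketch assumes every line is well formed as `word|count`.

```python
from collections import defaultdict


def reduce_counts(lines):
    """Sum the counts per word from 'word|count' lines (hypothetical helper)."""
    wordcount = defaultdict(int)
    for line in lines:
        # Split on the same "|" delimiter the Ruby version uses.
        word, count = line.rstrip("\n").split("|")[:2]
        wordcount[word] += int(count)
    return dict(wordcount)


# Made-up sample standing in for the mappers' intermediate output.
sample = ["apple|1\n", "pear|2\n", "apple|3\n"]
for word, count in reduce_counts(sample).items():
    print(f"{word}|{count}")
```

Like the Ruby version, this keeps every distinct word in memory until the input is exhausted.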

How do I control the output file names and content of a Hadoop streaming job?

Is there a way to control the output filenames of a Hadoop Streaming job? Specifically, I would like my job's output file contents and names to be organized by the key the reducer outputs - each file would only contain values for one key and its name would be the key. Update: Just found the answer - Using a Java class that derives from ...

How is MapReduce a good method to analyse HTTP server logs?

Hello, I've been looking at MapReduce for a while, and it seems to be a very good way to implement fault-tolerant distributed computing. I read a lot of papers and articles on that topic, installed Hadoop on an array of virtual machines, and did some very interesting tests. I really think I understand the Map and Reduce steps. But here...

What are some of the other old / researched techniques that are not in the mainstream yet?

With the recent announcement of Google Wave, I started looking into how it worked. I then found that work and research on Real-time Collaborative Editing Systems has been around for some time (the first work was done in 1989). Google "introduced" MapReduce; however, that had been around for some time in functional programming as well. Are...

MapReduce implementation in Scala

I'd like to find a good, robust MapReduce framework that can be used from Scala. ...

CouchDB: basic grouping question

I have a user document which has a group field. This field is an array of group ids. I would like to write a view that returns (groupid as key) -> (array of user docs as val). This mapping operation seems like a good beginning. function(doc) { var type = doc.type; var groups = doc.groups; if(type == "user" && groups.length > 0) ...
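The desired (groupid) -> (array of user docs) shape can be simulated in plain Python as below. This is only a sketch of the grouping idea, not CouchDB's API: `map_user`, `group_users`, and the sample documents are hypothetical, and it assumes the map step emits one (group id, doc) pair per entry in `groups`, since the original function is cut off.

```python
from collections import defaultdict


def map_user(doc):
    """Yield one (group_id, doc) pair per group of a user doc (assumed emit behaviour)."""
    if doc.get("type") == "user":
        for group_id in doc.get("groups", []):
            yield group_id, doc


def group_users(docs):
    """Collect the emitted pairs into {group_id: [user docs]}."""
    grouped = defaultdict(list)
    for doc in docs:
        for group_id, user in map_user(doc):
            grouped[group_id].append(user)
    return dict(grouped)


# Made-up documents for illustration.
docs = [
    {"_id": "u1", "type": "user", "groups": ["g1", "g2"]},
    {"_id": "u2", "type": "user", "groups": ["g2"]},
    {"_id": "x1", "type": "other"},
]
result = group_users(docs)
print({g: [u["_id"] for u in users] for g, users in result.items()})
```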

Can I run a .NET application (or method from .NET dll) in Amazon Elastic MapReduce?

What I need is a powerful machine that will run my .NET code one hour a day. I can't use EC2 because it will lose all my data on shutdown. I need a virtual PC that I can start at a specific time, and this PC should start my .exe/service/whatever automatically. Can I ask Amazon MapReduce to start a Windows instance and execute my code? ...

How does the MapReduce sort algorithm work?

Hi, One of the main examples that is used in demonstrating the power of MapReduce is the Terasort benchmark. I'm having trouble understanding the basics of the sorting algorithm used in the MapReduce environment. To me sorting simply involves determining the relative position of an element in relationship to all other elements. So sor...

Using map() to get number of times list elements exist in a string in Python

I'm trying to get the number of times each item in a list is in a string in Python: paragraph = "I eat bananas and a banana" def tester(x): return len(re.findall(x,paragraph)) map(tester, ['banana', 'loganberry', 'passion fruit']) Returns [2, 0, 0] What I'd like to do however is extend this so I can feed the paragraph value into t...
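One possible way to pass the text in explicitly rather than reading a global, assuming that is the goal (the question is cut off), is `functools.partial`. The name `count_matches` is made up for this sketch. As in the original snippet, each item is treated as a regular expression; `re.escape` would be needed for items containing special characters.

```python
import re
from functools import partial


def count_matches(pattern, text):
    """Count non-overlapping matches of pattern in text (hypothetical helper)."""
    return len(re.findall(pattern, text))


paragraph = "I eat bananas and a banana"
items = ["banana", "loganberry", "passion fruit"]

# Bind the text once, then map the resulting one-argument function over the items.
counts = list(map(partial(count_matches, text=paragraph), items))
print(counts)  # [2, 0, 0]
```

In Python 3, `map` returns an iterator, hence the `list(...)` call.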

CouchDB: map-reduce in Erlang

How can I write map-reduce functions in Erlang for CouchDB? I am sure Erlang is faster than JavaScript. ...

What is the maximum value for a compound CouchDB key?

I'm using what seems to be a common trick for creating a join view: // a Customer has many Orders; show them together in one view: function(doc) { if (doc.Type == "customer") { emit([doc._id, 0], doc); } else if (doc.Type == "order") { emit([doc.customer_id, 1], doc); } } I know I can use the following query to get a sin...

Is it possible to write map/reduce jobs for Amazon Elastic MapReduce using .NET?

Is it possible to write map/reduce jobs for Amazon Elastic MapReduce (http://aws.amazon.com/elasticmapreduce/) using .NET languages? In particular I would like to use C#. Preliminary research suggests not. The above URL's marketing text suggests you have a "choice of Java, Ruby, Perl, Python, PHP, R, or C++", without mentioning .NET lan...

Which schemaless datastores provide good performance?

I've recently written a web app that uses couchdb. I like couchdb and it suited the app - which has a lot of dynamic behaviour and simply pulls JSON directly from couchdb. Being able to upload images via a browser is nice and it's a snap to do tweaks to document data. The replication also has made deployment a breeze as the app is a couc...

Streaming data and Hadoop? (not Hadoop Streaming)

I'd like to analyze a continuous stream of data (accessed over HTTP) using a MapReduce approach, so I've been looking into Apache Hadoop. Unfortunately, it appears that Hadoop expects to start a job with an input file of fixed size, rather than being able to hand off new data to consumers as it arrives. Is this actually the case, or am I...

Is there a MapReduce library for Delphi?

I recently read this great article which succinctly explains the power of Google's MapReduce: http://www.joelonsoftware.com/items/2006/08/01.html In Mastering Delphi 2009, Marco Cantu shows a multi-threaded for loop using Anonymous functions, which is basically the Map part of MapReduce, but said it wasn't complete and there were other...

How would I get a subset of Wikipedia's pages?

How would I get a subset (say 100MB) of Wikipedia's pages? I've found you can get the whole dataset as XML but it's more like 1 or 2 gigs; I don't need that much. I want to experiment with implementing a map-reduce algorithm. Having said that, if I could just find 100 megs worth of textual sample data from anywhere, that would also be ...

Splitting input into substrings in PIG (Hadoop)

Assume I have the following input in Pig: some And I would like to convert that into: s so som some I've not (yet) found a way to iterate over a chararray in Pig Latin. I have found the TOKENIZE function, but that splits on word boundaries. So can Pig Latin do this, or is this something that requires a Java class? ...
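The transformation being asked for is the list of all prefixes of the input string. A minimal Python sketch of that logic, not a Pig Latin solution (the `prefixes` helper is hypothetical):

```python
def prefixes(word):
    """Return every leading substring of word, shortest first."""
    return [word[:i] for i in range(1, len(word) + 1)]


print(" ".join(prefixes("some")))  # s so som some
```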

Hadoop Distribution Differences

Can somebody outline the various differences between the various Hadoop Distributions available: Cloudera - http://www.cloudera.com/hadoop Yahoo - http://developer.yahoo.net/blogs/hadoop/ using the Apache Hadoop distro as a baseline. Is there a good reason to use one of these distributions over the standard Apache Hadoop distro?...