I know about the map/reduce algorithm and its use. It uses functions called Mappers and Reducers, but I also see people use the word Filters.
Are Filters the same as Mappers, or is there some significant difference?
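One way to see the distinction, as a minimal Python sketch (the sample data and lambdas are made up for illustration): a mapper transforms every input record, a filter only keeps or drops records, and a reducer combines what is left.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5, 6]  # hypothetical input records

# Map: transform every record; the output has one item per input item.
squared = list(map(lambda n: n * n, numbers))

# Filter: keep or drop records unchanged; the output may be shorter.
evens = list(filter(lambda n: n % 2 == 0, numbers))

# Reduce: fold the records into a single value.
total = reduce(lambda acc, n: acc + n, numbers, 0)

print(squared, evens, total)
```

In MapReduce frameworks a filter is often written as a mapper that simply emits nothing for the records it rejects.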
...
I'm trying to write a couchdb view that takes a created_at timestamp in a sortable format (2009/05/07 21:40:17 +0000) and returns all documents that have a greater created_at value.
I'm specifically using couch_foo but if I can figure out how to write the view I can create it in futon or in the couch_foo model instead of letting couch_f...
A simple wordcount reducer in Ruby looks like this:
#!/usr/bin/env ruby
# Sum the counts for each word read from STDIN as "word|count" lines.
wordcount = Hash.new(0)
STDIN.each_line do |line|
  word, count = line.chomp.split("|")
  wordcount[word] += count.to_i
end
# Emit one "word|count" line per distinct word.
wordcount.each_pair do |word, count|
  puts "#{word}|#{count}"
end
It receives all of the mappers' intermediate values on STDIN. Not f...
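For comparison, here is the same aggregation as a minimal Python sketch, assuming the same `word|count` line format. `reduce_counts` is a hypothetical helper name, and the input is an in-memory sample rather than real mapper output.

```python
import io
from collections import defaultdict

def reduce_counts(lines):
    """Sum the counts per word from an iterable of 'word|count' lines."""
    totals = defaultdict(int)
    for line in lines:
        word, _, count = line.rstrip("\n").partition("|")
        # A missing or empty count is treated as 0.
        totals[word] += int(count or 0)
    return dict(totals)

# In a streaming job the lines would come from sys.stdin; a small
# in-memory sample is used here so the sketch runs on its own.
sample = io.StringIO("apple|1\nbanana|2\napple|3\n")
for word, count in reduce_counts(sample).items():
    print(f"{word}|{count}")
```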
Is there a way to control the output filenames of an Hadoop Streaming job?
Specifically, I would like my job's output files to be organized, in both content and name, by the key the reducer outputs: each file would contain only the values for one key, and its name would be the key.
Update:
Just found the answer - Using a Java class that derives from ...
Hello,
I've been looking at MapReduce for a while, and it seems to be a very good way to implement fault-tolerant distributed computing. I read a lot of papers and articles on that topic, installed Hadoop on an array of virtual machines, and did some very interesting tests. I really think I understand the Map and Reduce steps.
But here...
With the recent announcement of Google Wave I started looking into how it worked. I then found that work and research on Real-time Collaborative Editing Systems has been around for some time (the first work was done in 1989).
Google "introduced" MapReduce; however, that had been around for some time in functional programming as well.
Are...
I'd like to find a good, robust MapReduce framework that can be used from Scala.
...
I have a user document which has a groups field. This field is an array of group ids. I would like to write a view that returns (groupid as key) -> (array of user docs as val). This mapping operation seems like a good beginning.
function(doc)
{
  var type = doc.type;
  var groups = doc.groups;
  if(type == "user" && groups.length > 0)
...
What I need is a powerful machine that will run my .NET code one hour a day. I can't use EC2 because it will lose all my data on shutdown. I need a virtual PC that I can start at a specific time, and this PC should start my .exe/service/whatever automatically. Can I ask Amazon MapReduce to start a Windows instance and execute my code?
...
Hi,
One of the main examples that is used in demonstrating the power of MapReduce is the Terasort benchmark. I'm having trouble understanding the basics of the sorting algorithm used in the MapReduce environment.
To me sorting simply involves determining the relative position of an element in relationship to all other elements. So sor...
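A rough sketch of the idea usually described for sorting in MapReduce, not Terasort's actual code: pick split points from a sample of the keys, route each key to the partition whose range contains it, sort each partition independently, and concatenate the partitions in order. All names below are hypothetical, and the "reducers" are plain Python lists.

```python
import bisect
import random

def choose_split_points(keys, num_partitions, sample_size=100, seed=0):
    """Pick num_partitions - 1 boundaries from a random sample of keys."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]

def partitioned_sort(keys, num_partitions=4):
    if not keys:
        return []
    splits = choose_split_points(keys, num_partitions)
    partitions = [[] for _ in range(num_partitions)]
    for key in keys:
        # "Map" side: route each key to the partition covering its range.
        partitions[bisect.bisect_right(splits, key)].append(key)
    # "Reduce" side: each partition is sorted independently. Because the
    # ranges do not overlap, concatenating them in order is globally sorted.
    return [key for part in partitions for key in sorted(part)]

rng = random.Random(1)
data = [rng.randrange(10_000) for _ in range(1_000)]
print(partitioned_sort(data) == sorted(data))
```

The point of the split points is that no element ever has to be compared against elements in another partition, which is what lets the sort run in parallel.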
I'm trying to get the number of times each item in a list is in a string in Python:
import re
paragraph = "I eat bananas and a banana"
def tester(x): return len(re.findall(x, paragraph))
list(map(tester, ['banana', 'loganberry', 'passion fruit']))
Returns [2, 0, 0]
What I'd like to do however is extend this so I can feed the paragraph value into t...
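Assuming the goal is to pass the paragraph in as an argument instead of reading a global (the rest of the question is cut off, so this is a guess), one option is `functools.partial`. `count_in` is a hypothetical name.

```python
import re
from functools import partial

def count_in(paragraph, pattern):
    """Count non-overlapping matches of pattern in paragraph."""
    return len(re.findall(pattern, paragraph))

paragraph = "I eat bananas and a banana"
words = ['banana', 'loganberry', 'passion fruit']

# partial() fixes the first argument, leaving a one-argument function for map().
counts = list(map(partial(count_in, paragraph), words))
print(counts)  # [2, 0, 0]
```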
How can I write map-reduce functions in Erlang for CouchDB? I am sure Erlang is faster than JavaScript.
...
I'm using what seems to be a common trick for creating a join view:
// a Customer has many Orders; show them together in one view:
function(doc) {
  if (doc.Type == "customer") {
    emit([doc._id, 0], doc);
  } else if (doc.Type == "order") {
    emit([doc.customer_id, 1], doc);
  }
}
I know I can use the following query to get a sin...
Is it possible to write map/reduce jobs for Amazon Elastic MapReduce (http://aws.amazon.com/elasticmapreduce/) using .NET languages? In particular I would like to use C#.
Preliminary research suggests not. The above URL's marketing text suggests you have a "choice of Java, Ruby, Perl, Python, PHP, R, or C++", without mentioning .NET lan...
I've recently written a web app that uses couchdb. I like couchdb and it suited the app - which has a lot of dynamic behaviour and simply pulls JSON directly from couchdb. Being able to upload images via a browser is nice and it's a snap to do tweaks to document data. The replication also has made deployment a breeze as the app is a couc...
I'd like to analyze a continuous stream of data (accessed over HTTP) using a MapReduce approach, so I've been looking into Apache Hadoop. Unfortunately, it appears that Hadoop expects to start a job with an input file of fixed size, rather than being able to hand off new data to consumers as it arrives. Is this actually the case, or am I...
I recently read this great article which succinctly explains the power of Google's MapReduce:
http://www.joelonsoftware.com/items/2006/08/01.html
In Mastering Delphi 2009, Marco Cantu shows a multi-threaded for loop using Anonymous functions, which is basically the Map part of MapReduce, but said it wasn't complete and there were other...
How would I get a subset (say 100MB) of Wikipedia's pages? I've found you can get the whole dataset as XML, but it's more like 1 or 2 gigs; I don't need that much.
I want to experiment with implementing a map-reduce algorithm.
Having said that, if I could just find 100 megs worth of textual sample data from anywhere, that would also be ...
Assume I have the following input in Pig:
some
And I would like to convert that into:
s
so
som
some
I've not (yet) found a way to iterate over a chararray in Pig Latin. I have found the TOKENIZE function, but that splits on word boundaries.
So can Pig Latin do this, or does it require a Java class?
...
Can somebody outline the differences between the available Hadoop distributions:
Cloudera - http://www.cloudera.com/hadoop
Yahoo - http://developer.yahoo.net/blogs/hadoop/
using the Apache Hadoop distro as a baseline.
Is there a good reason to use one of these distributions over the standard Apache Hadoop distro?...