MapReduce

How do I tell a multi-core / multi-CPU machine to process function calls in a loop in parallel?

I am currently designing an application that has one module which will load large amounts of data from a database and reduce it to a much smaller set by various calculations depending on the circumstances. Many of the more intensive operations behave deterministically and would lend themselves to parallel processing. Provided I have a ...
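For a CPU-bound loop of deterministic calls like this, one common pattern is to fan the iterations out over a process pool. A minimal sketch in Python, where `reduce_chunk` is a hypothetical stand-in for one of the intensive calculations described above:

```python
from multiprocessing import Pool

def reduce_chunk(chunk):
    # Hypothetical stand-in for one of the deterministic,
    # CPU-intensive reduction calculations.
    return sum(x * x for x in chunk)

def process_in_parallel(chunks, workers=4):
    # Pool.map distributes the loop iterations across worker
    # processes and returns results in the original input order.
    with Pool(processes=workers) as pool:
        return pool.map(reduce_chunk, chunks)

if __name__ == "__main__":
    chunks = [[1, 2], [3, 4], [5, 6]]
    print(process_in_parallel(chunks))  # [5, 25, 61]
```

Because each call is deterministic and independent, no locking is needed; the only real design choice is how to partition the data into chunks large enough to amortize the per-task overhead.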

Real-time parallel processing in Rails

I'm developing a sort of personalized search engine in Ruby on Rails, and I'm currently trying to find the best way to sort results based on a user's record, in real time. Example: items that are searched for can have tags (separate entities with ids); for example, an item has tags=[1, 5, 10, 23, 45]. User, on the other hand, may have fla...
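One simple real-time scoring approach, assuming the user's record reduces to a set of tag ids (the names below are hypothetical, not from the question), is to rank each item by the overlap between its tag set and the user's tags:

```python
def score(item_tags, user_tags):
    # Score an item by how many tags it shares with the user's
    # profile; larger overlap means a better match.
    return len(set(item_tags) & set(user_tags))

def rank(items, user_tags):
    # items: {item_id: [tag ids]}; highest-scoring items first.
    return sorted(items, key=lambda i: score(items[i], user_tags), reverse=True)

items = {"a": [1, 5, 10, 23, 45], "b": [2, 3], "c": [1, 5]}
print(rank(items, [1, 5, 45]))  # ["a", "c", "b"]
```

Set intersection is cheap enough to run per request on a page of results; precomputing per-user scores only becomes necessary when the candidate set is large.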

How do you use MapReduce/Hadoop?

I'm looking for some general information about how other people are using Hadoop or other MapReduce-like technologies. In general, I am curious whether you are writing MR applications to process existing data sets (like web server log files), or whether you are writing applications that generate and process new data sets. Edit: Follow-up Que...

Is there a .Net equivalent to Apache Hadoop?

So, I've been looking at Hadoop with keen interest, and to be honest I'm fascinated; things don't get much cooler. My only minor issue is that I'm a C# developer and it's in Java. It's not that I don't understand the Java so much as that I'm looking for Hadoop.net or NHadoop or the .NET project that embraces the Google MapReduce approach. Do...

Large data - storage and query

We have a huge data set of about 300 million records, which will get updated every 3-6 months. We need to query this data (continuously, in real time) to get some information. What are the options: an RDBMS (MySQL), or some other option like Hadoop? Which will be better? ...

20 Billion Rows/Month - HBase / Hive / Greenplum / What?

Hi, I'd like to use your wisdom to pick the right solution for a data-warehouse system. Here are some details to better understand the problem: data is organized in a star schema structure with one BIG fact table and ~15 dimensions. 20B fact rows per month; 10 dimensions with hundreds of rows (somewhat hierarchical); 5 dimensions with ...

how to implement eigenvalue calculation with MapReduce/Hadoop?

It should be possible, because PageRank is a form of eigenvalue computation, and that is one reason MapReduce was introduced. But there seem to be problems in an actual implementation, such as: does every slave computer have to maintain a copy of the matrix? ...
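Not necessarily: in the usual row-partitioned scheme, each worker holds only its own rows of the matrix, and only the (much smaller) current vector is broadcast to everyone. A single-process Python sketch of one power-iteration step, written to mirror the map (per-row dot product) and reduce (collect and normalize) phases:

```python
def map_row(indexed_row, v):
    # Map phase: a worker holding row i computes one dot product.
    # Only v is broadcast; the matrix stays partitioned by row.
    i, row = indexed_row
    return (i, sum(a * x for a, x in zip(row, v)))

def power_iteration_step(rows, v):
    # Reduce phase: gather per-row results, reassemble the
    # vector in index order, and normalize by the largest entry.
    pairs = [map_row((i, row), v) for i, row in enumerate(rows)]
    w = [val for _, val in sorted(pairs)]
    norm = max(abs(x) for x in w)
    return [x / norm for x in w]

A = [[2.0, 0.0], [0.0, 1.0]]  # toy matrix with dominant eigenvalue 2
v = [1.0, 1.0]
for _ in range(20):
    v = power_iteration_step(A, v)
# v converges toward the dominant eigenvector [1, 0]
```

Repeating the map/reduce step drives `v` toward the dominant eigenvector, which is essentially the structure of a MapReduce PageRank job.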

What is Map/Reduce

I hear a lot of noise about map/reduce, especially in the context of Google's massively parallel compute system. What exactly is it, and why is it "cool"? ...
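The core idea can be sketched without any framework: a map function emits key/value pairs, a shuffle groups them by key, and a reduce function combines each group independently. A minimal word-count sketch in Python:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values independently -- this
    # per-key independence is what makes the model parallelizable.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts == {"the": 2, "quick": 1, "fox": 1, "lazy": 1, "dog": 1}
```

The "cool" part is that the map calls and the per-key reduce calls have no shared state, so a framework like Hadoop can scatter them across thousands of machines and handle failures by simply re-running tasks.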

Map and Reduce in .NET

What scenarios would warrant the use of the "Map and Reduce" algorithm? Is there a .NET implementation of this algorithm? ...

.NET MapReduce Implementation

Does anyone out there know of a .NET (or at least Windows) based implementation of MapReduce? We're looking for an easy way to distribute load for our C#/ASP.NET webservices. We looked at Hadoop, but it says specifically that it hasn't been tested in a production environment on Windows, which is somewhat disconcerting. ...

CouchDB - .NET or Mono Equivalent Technology

Are there any active "document-based" database projects using .NET or Mono? Something similar to CouchDB, SimpleDB, Lotus Notes, etc. Open source preferred. I figure the JScript.NET technology could be used for the Map and Reduce functions over stored JSON documents. ...

Hadoop on windows server

Hello, I'm thinking about using Hadoop to process large text files on my existing Windows 2003 servers (about 10 quad-core machines with 16 GB of RAM). The questions are: Is there any good tutorial on how to configure a Hadoop cluster on Windows? What are the requirements? Java + Cygwin + sshd? Anything else? HDFS, does it play nice ...

Where to look for contributors?

I've recently been confronted with a not-so-typical programming problem: where do I look for contributors? I'm extending an already existing project, Hypertable, and I'm looking for one or two more people to lend a hand in implementing some things. The extension to the project I'm working on is a MapReduce framework which once done will...

Implementing user ratings / favorites on CouchDB

I'm considering using CouchDB for an upcoming site, but I'm a little confused as to how to implement a system of user ratings for the site. Basically, each item of content can be rated by a given user. What way of doing this makes the most sense in the CouchDB model? I would think the DRYest and most logical way would be to ha...

How to easily apply a function to a collection in C++

I'm storing images as arrays, templated based on the type of their elements, like Image<unsigned> or Image<float>, etc. Frequently, I need to perform operations on these images; for example, I might need to add two images, or square an image (elementwise), and so on. All of the operations are elementwise. I'd like to get as close as possibl...

What type of problems can mapreduce solve?

Is there a theoretical analysis available which describes what kind of problems mapreduce can solve? ...

What programming language is Google written in?

I mean google the search engine. ...

What is the use of the 'key K1' in the org.apache.hadoop.mapred.Mapper ?

I'm learning Apache Hadoop and I was looking at the WordCount example org.apache.hadoop.examples.WordCount. I understand this example; however, I can see that the variable LongWritable key is not used in (...) public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, ...
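With Hadoop's default TextInputFormat, that K1 key is the byte offset of the line within the input file, and WordCount simply ignores it because word counting doesn't care where a line came from. A rough Python sketch of the record stream the framework feeds the mapper (a single-split file is assumed for simplicity):

```python
def text_input_format(data):
    # With TextInputFormat, each record handed to map() is
    # (byte offset of the line, line contents) -- the offset is
    # the LongWritable key that WordCount receives and ignores.
    offset = 0
    for line in data.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line)

records = list(text_input_format("hello world\nhello hadoop\n"))
# records == [(0, "hello world"), (12, "hello hadoop")]
```

The key becomes useful with other input formats (e.g. when the key is a record id from a SequenceFile), which is why the Mapper interface carries it even when a particular job discards it.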

Hadoop: map/reduce from HDFS

I may be wrong, but all(?) the examples I've seen with Apache Hadoop take as input a file stored on the local file system (e.g. org.apache.hadoop.examples.Grep). Is there a way to load and save the data on the Hadoop file system (HDFS)? For example, I put a tab-delimited file named 'stored.xls' on HDFS using hadoop-0.19.1/bin/hadoop dfs -put...

Implementing large scale log file analytics

Can anyone point me to a reference or provide a high-level overview of how companies like Facebook, Yahoo, Google, et al. perform the large-scale (e.g. multi-TB range) log analysis that they do for operations and especially web analytics? Focusing on web analytics in particular, I'm interested in two closely related aspects: query perf...