+3  Q: 

MapReduce on Azure

Is there an implementation of MapReduce/Hadoop on Azure?

A: 

I'm not sure there's an out-of-the-box solution in Azure, but AWS might have what you're looking for (in beta): http://aws.amazon.com/elasticmapreduce/

allyourcode
+1  A: 
Rinat Abdullin
In other words, forget Azure: either run Hadoop yourself or use it via Amazon.
gbjbaanb
Another option is to wait for Microsoft to implement Dryad for Azure. They are already planning this (as discovered recently by looking at DryadLinq sources).
Rinat Abdullin
+5  A: 

Microsoft Research has DryadLINQ, which is a powerful LINQ expression distribution engine. I hope they port this to Azure!

Automatic parallelization: from sequential declarative code the DryadLINQ compiler generates highly parallel query plans spanning large computer clusters. For exploiting multi-core parallelism on each machine DryadLINQ relies on the PLINQ parallelization framework.

It has an implementation of MapReduce like this:

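// Extension method: must be declared inside a static class.
public static IQueryable<Rs> MapReduce<Ts, Ms, K, Rs> (
    this IQueryable<Ts> source,
    Expression<Func<Ts, IEnumerable<Ms>>> mapper,
    Expression<Func<Ms, K>> keySelector,
    Expression<Func<IGrouping<K, Ms>, IEnumerable<Rs>>> reducer) {

    // Map: each input record yields zero or more intermediate records.
    IQueryable<Ms> mapped = source.SelectMany (mapper);
    // Shuffle: group the intermediate records by key.
    IQueryable<IGrouping<K, Ms>> groups = mapped.GroupBy (keySelector);
    // Reduce: each group yields zero or more result records.
    return groups.SelectMany (reducer);
}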

Amazingly simple implementation! With the power of DryadLINQ, I don't see why you need to be constrained to MapReduce - you can simply create the exact LINQ query that returns the information you're looking for.

NOTE: this is my approximation of their implementation - the PDF does not contain the exact method signature or implementation
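
To make the comparison concrete, here is a minimal sketch of a word count written both ways: once through the MapReduce method above and once as a direct LINQ query. It is only an illustration under stated assumptions. It runs on an in-memory IQueryable via AsQueryable, not on DryadLINQ or a cluster, and the class name MapReduceSketch and the methods CountWithMapReduce and CountWithLinq are hypothetical names made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

// Hypothetical container class; the name is an assumption for this sketch.
public static class MapReduceSketch
{
    // Same shape as the approximated method quoted above.
    public static IQueryable<Rs> MapReduce<Ts, Ms, K, Rs>(
        this IQueryable<Ts> source,
        Expression<Func<Ts, IEnumerable<Ms>>> mapper,
        Expression<Func<Ms, K>> keySelector,
        Expression<Func<IGrouping<K, Ms>, IEnumerable<Rs>>> reducer)
    {
        return source.SelectMany(mapper).GroupBy(keySelector).SelectMany(reducer);
    }

    // Word count expressed as map / group-by-key / reduce.
    public static Dictionary<string, int> CountWithMapReduce(IEnumerable<string> lines)
    {
        return lines.AsQueryable()
            .MapReduce(
                // Map: split each line into words.
                line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries),
                // Key: the word itself.
                word => word,
                // Reduce: emit one (word, count) pair per group.
                g => new[] { new KeyValuePair<string, int>(g.Key, g.Count()) })
            .ToDictionary(kv => kv.Key, kv => kv.Value);
    }

    // The same computation written directly as a LINQ query,
    // without building intermediate key/value pairs.
    public static Dictionary<string, int> CountWithLinq(IEnumerable<string> lines)
    {
        return lines.AsQueryable()
            .SelectMany(line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            .GroupBy(word => word)
            .ToDictionary(g => g.Key, g => g.Count());
    }

    public static void Main()
    {
        var lines = new[] { "the quick brown fox", "the lazy dog", "the end" };
        foreach (var kv in CountWithMapReduce(lines))
            Console.WriteLine(kv.Key + ": " + kv.Value);
    }
}
```

Both methods return the same counts for the sample input (for example, "the" maps to 3). The direct version simply skips the explicit key/value step, which is the point about not being constrained to the MapReduce shape.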

George Tsiokos
DryadLINQ is for clusters and does not work with Windows Azure. Nor is it available outside Microsoft.
Rinat Abdullin
Just because it's not available outside of Microsoft right now doesn't mean it won't be in the future ;-)
Joel Martinez
I've been using DryadLinq (outside MS) for 3 months now and love it, and I agree with George about not needing to express the problem in MapReduce structure. The first thing I did in DryadLinq was implement an algorithm using the MapReduce function. Then I wrote the same algorithm using Linq. The Linq implementation executed 5x faster because it's not hampered by trying to express everything as key-value pairs (KVPs). It's also not filling the hard disks with massive amounts of KVPs, so there is much less IO overhead. MapReduce is very parallelizable but not very efficient. If you have a better way, use it.
Turbo
+1  A: 

Maybe the lokad-cloud MapReduce sample: http://code.google.com/p/lokad-cloud/wiki/MapReduceSample

slyi
A: 

Just to share my knowledge on this: we have implemented MapReduce on the Amazon cloud platform using cloud services such as queues (Amazon SQS), tables (SimpleDB), and cloud storage (S3). The project is called Cloud MapReduce and it is open source. The open source version only supports the Amazon cloud, but it is fairly easy to port to Azure, as Azure has equivalents of SQS, SimpleDB and S3. Avanade has already ported Cloud MapReduce to Windows Azure, but unfortunately the source code is not open. You will have to contact Avanade to see how to use it. I would be happy to connect if anyone is interested. My contact info is at the Cloud MapReduce site.

Huan Liu