Is there a theoretical analysis available that describes what kinds of problems MapReduce can solve?
MapReduce suits problems that require processing and generating large data sets: for example, running an interest-calculation query over all the accounts a bank holds, or processing audit data for every transaction that happened in the bank over the past year. The best-known use case comes from Google itself: generating the search index for the Google search engine.
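As a rough sketch of how a job like the interest query fits the model (the account records, field names, and flat rate below are all invented for illustration; a real framework would distribute the map and shuffle steps across machines):

```python
# Hypothetical sketch: interest totals per branch via map/reduce.
# Account data, the "branch"/"balance" fields, and RATE are made up.
from collections import defaultdict

accounts = [
    {"branch": "north", "balance": 1000.0},
    {"branch": "north", "balance": 2500.0},
    {"branch": "south", "balance": 400.0},
]

RATE = 0.03  # assumed flat annual rate, for illustration only

def map_account(account):
    # Emit a (key, value) pair: branch -> interest owed on this account.
    return account["branch"], account["balance"] * RATE

def reduce_branch(values):
    # Aggregate all interest amounts emitted for one branch.
    return sum(values)

# The framework normally shuffles mapped pairs by key; simulated here.
grouped = defaultdict(list)
for key, value in map(map_account, accounts):
    grouped[key].append(value)

totals = {branch: reduce_branch(vals) for branch, vals in grouped.items()}
print(totals)  # {'north': 105.0, 'south': 12.0}
```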
Anything that involves performing operations on a large set of data, where the problem can be broken down into smaller independent sub-problems whose results can then be aggregated to produce the answer to the larger problem.
A trivial example would be calculating the sum of a huge set of numbers. You split the set into smaller sets, calculate the sums of those smaller sets in parallel (which can involve splitting those into yet even smaller sets), then sum those results to reach the final answer.
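Here is a minimal runnable sketch of that idea in Python, using a process pool to stand in for a cluster (the chunk size and input are arbitrary choices for the example):

```python
# Map/reduce-style parallel sum: split, sum chunks in parallel, combine.
from multiprocessing import Pool

def chunks(data, size):
    # Split the input into independent sub-problems.
    for i in range(0, len(data), size):
        yield data[i:i + size]

if __name__ == "__main__":
    numbers = list(range(1_000_000))
    with Pool() as pool:
        # Map: sum each chunk in parallel across worker processes.
        partial_sums = pool.map(sum, chunks(numbers, 10_000))
    # Reduce: aggregate the partial results into the final answer.
    total = sum(partial_sums)
    print(total)  # 499999500000
```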
A comparison between MapReduce and SQL (read the comments too):
http://www.data-miners.com/blog/2008/01/mapreduce-and-sql-aggregations.html
Criticism of MapReduce:
http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html
http://www.databasecolumn.com/2008/01/mapreduce-continued.html
More debate:
http://scienceblogs.com/goodmath/2008/01/databases_are_hammers_mapreduc.php
Many problems that are "Embarrassingly Parallel" (great phrase!) map naturally onto MapReduce. http://en.wikipedia.org/wiki/Embarrassingly_parallel
From this article: http://www.businessweek.com/magazine/content/07_52/b4064048925836.htm
Doug Cutting, founder of Hadoop (an open-source implementation of MapReduce), says: "Facebook uses Hadoop to analyze user behavior and the effectiveness of ads on the site,"
and: "the tech team at The New York Times rented computing power on Amazon's cloud and used Hadoop to convert 11 million archived articles, dating back to 1851, to digital and searchable documents. They turned around in a single day a job that otherwise would have taken months."
In Map-Reduce for Machine Learning on Multicore, Chu et al. describe how "algorithms that fit the Statistical Query model can be written in a certain 'summation form,' which allows them to be easily parallelized on multicore computers." They implement ten algorithms in this form, including weighted linear regression, k-means, Naive Bayes, and SVM, on a map-reduce framework.
The Apache Mahout project has released a recent Hadoop (Java) implementation of some methods based on the ideas from this paper.
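To make the "summation form" idea concrete, here is a hedged sketch for ordinary least squares (not the paper's or Mahout's actual code; the synthetic data and split count are invented): each map step computes partial sums of the sufficient statistics over its data split, and the reduce step just adds them, after which the solve is cheap.

```python
# Summation-form OLS: per-split partial sums of x x^T and x y,
# added in the reduce step, then a single small linear solve.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

def map_split(X_part, y_part):
    # Per-split sufficient statistics: (sum of x x^T, sum of x y).
    return X_part.T @ X_part, X_part.T @ y_part

def reduce_stats(stats):
    # Statistics from independent splits simply add together.
    A = sum(s[0] for s in stats)
    b = sum(s[1] for s in stats)
    return A, b

splits = zip(np.array_split(X, 4), np.array_split(y, 4))
A, b = reduce_stats([map_split(xp, yp) for xp, yp in splits])
theta = np.linalg.solve(A, b)
print(theta)  # recovers approximately [2, -1, 0.5]
```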
You can also watch the videos @ Google; I'm watching them myself and find them very educational.