I'm using map-reduce with MongoDB. Simplified scenario: there are users, items and things. Items include any number of things. Each user can rate things. Map-reduce is used to calculate the aggregate rating for each user on each item. It's a complex formula using the ratings for each thing in the item and the time of day. It's not something you could ever index on, so map-reduce is an ideal approach to calculating it.

The question is: having calculated the results using map-reduce, what strategies do people use to maintain these per-user results collections in their NoSQL databases?

1) On demand with automatic deletion: Keep them around for some set period of time and then delete them; regenerate them as necessary when the user makes a new request?

2) On demand, never delete: Keep them around indefinitely. When the user makes a request and the collection is past its use-by date, regenerate it.

3) Scheduled: Regular process running to update all results collections for all users?

4) Other?

+1  A: 

The best strategy depends on the nature of your map-reduce job.

If you're using a separate map-reduce call for each individual user, I would go with the first or second strategy. The advantage of the second strategy over the first strategy is that you always have a result ready. So when the user makes a request and the result is outdated, you can still present the old result to the user, while running a new map-reduce in the background to generate a fresh result for the next requests. This has the following advantages:

  • The user doesn't have to wait for the map-reduce to complete, which is important if the map-reduce may take a while to complete. The exception is of course the very first map-reduce call; at this point there is no old result available.
  • You're automatically running map-reduce only for the active users, reducing the load on the database.
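To make the second strategy concrete, here is a minimal sketch of the "serve the old result, refresh in the background" idea. It's plain Python with an in-memory dict standing in for the per-user results collection, and the `compute` callback standing in for the per-user map-reduce call. The class name, `max_age_seconds`, and `wait_for_refresh` are all made up for illustration; this is not a MongoDB API.

```python
import threading
import time


class StaleWhileRevalidateCache:
    """Hypothetical in-memory stand-in for per-user result collections."""

    def __init__(self, compute, max_age_seconds, clock=time.monotonic):
        self._compute = compute          # assumed: runs the per-user map-reduce
        self._max_age = max_age_seconds  # the "use-by" period
        self._clock = clock
        self._entries = {}               # user_id -> (result, computed_at)
        self._refreshing = {}            # user_id -> background Thread
        self._lock = threading.Lock()

    def _refresh(self, user_id):
        result = self._compute(user_id)
        with self._lock:
            self._entries[user_id] = (result, self._clock())
            self._refreshing.pop(user_id, None)

    def get(self, user_id):
        with self._lock:
            entry = self._entries.get(user_id)
        if entry is None:
            # Very first request: no old result exists, so the user has to wait.
            self._refresh(user_id)
            with self._lock:
                return self._entries[user_id][0]
        result, computed_at = entry
        if self._clock() - computed_at > self._max_age:
            with self._lock:
                # Start at most one background refresh per user.
                if user_id not in self._refreshing:
                    worker = threading.Thread(
                        target=self._refresh, args=(user_id,), daemon=True
                    )
                    self._refreshing[user_id] = worker
                    worker.start()
        # Return the old result immediately, even if it is outdated.
        return result

    def wait_for_refresh(self, user_id):
        """Block until a pending refresh finishes (handy in tests)."""
        with self._lock:
            worker = self._refreshing.get(user_id)
        if worker is not None:
            worker.join()
```

Only users who actually call `get` ever trigger a recomputation, which is where the reduced database load comes from.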

If you're using a single, application-wide map-reduce call for all users, the third strategy is the best approach. You can easily achieve this by specifying an output collection. The advantages of this approach:

  • You can easily control the freshness of the result. If you need more up-to-date results, or need to reduce the load on the database, you only have to adjust the schedule.
  • Your application code isn't responsible for managing the map-reduce calls, which simplifies your application.
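A rough sketch of the scheduled variant, again in plain Python rather than against a real database. The rating documents (`user`, `item`, `value`), the `ResultsStore` class, and especially the plain average are assumptions: the average is only a placeholder for the real formula, which isn't shown in the question. The point is the shape: one job computes results for all users and replaces the output in a single step, and the application only ever reads it.

```python
from collections import defaultdict


def map_rating(rating):
    # Emit one (key, value) pair per rating; the key is (user, item).
    yield (rating["user"], rating["item"]), (rating["value"], 1)


def reduce_ratings(values):
    # Combine partial (sum, count) pairs into one pair.
    total = sum(v for v, _ in values)
    count = sum(c for _, c in values)
    return total, count


def run_job(ratings):
    """Application-wide job: aggregate every user's ratings per item."""
    grouped = defaultdict(list)
    for rating in ratings:
        for key, value in map_rating(rating):
            grouped[key].append(value)
    results = {}
    for key, values in grouped.items():
        total, count = reduce_ratings(values)
        results[key] = total / count  # placeholder for the real formula
    return results


class ResultsStore:
    """Hypothetical stand-in for the output collection."""

    def __init__(self):
        self.current = {}

    def refresh(self, ratings):
        # Build the new results first, then swap them in with one assignment,
        # so readers never see a half-written result set.
        self.current = run_job(ratings)
```

Whatever scheduler you use (cron, a timer thread, a job queue) just calls `refresh` at the interval you want; changing freshness means changing that interval and nothing else.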

If a user can only see his or her own ratings, I'd go with strategy one or two, or include a lastActivity timestamp in user profiles and run an application-wide scheduled map-reduce job on the active subset of the users (strategy 3). If a user can see any other user's ratings, I'd go with strategy 3 as well, as this greatly reduces the complexity of the application.
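For the "active subset" idea, the selection step could look something like the following. It assumes user profiles are dicts with an `_id` and an optional `lastActivity` datetime; the function name and the 7-day window in the usage lines are arbitrary choices for the sake of the example.

```python
from datetime import datetime, timedelta


def active_user_ids(users, now, window):
    """Return the ids of users whose lastActivity falls inside the window."""
    cutoff = now - window
    return [
        user["_id"]
        for user in users
        # Users who have never been active have no timestamp and are skipped.
        if user.get("lastActivity") is not None
        and user["lastActivity"] >= cutoff
    ]


users = [
    {"_id": "alice", "lastActivity": datetime(2024, 1, 10)},
    {"_id": "bob", "lastActivity": datetime(2023, 11, 1)},
    {"_id": "carol"},
]
active = active_user_ids(users, now=datetime(2024, 1, 12), window=timedelta(days=7))
```

The scheduled job would then restrict its input to ratings from the users in `active`, so inactive users cost nothing until they come back.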

Niels van der Rest