tags:

views:

15

answers:

0

Hi,

I have a distributed database (google app engine) which stores user ratings for products.

I want to provide a service where I can give a user a histogram of the ratings all their friends have made for a particular product.

A user may have a large number of friends (hundreds to a couple thousand). Selecting all the rating records of all your friends can return a large number of matches. In app engine, getting more than ~30 'records' in a single http request is not going to work well.

I was thinking of keeping some mysql instances on the side just to serve the histogram requests. There would only be a single table in the database:

// ratings
-----------------------------
username | productid | rating
-----------------------------

and when a user wants to use this service (will be done from a browser in an ajax connection) they'll already have their list of friend ids and would do something like:

// provided: a list of friend usernames.
select username from ratings where 
  productid = 'abc' and 
  username = 'friend1' or
  username = 'friend2' or
  username = 'friendN';

this could potentially return thousands of records in the worst case. (whenever a user submits a rating, I write it to the google app engine instance, then to the mysql instance on another machine).

My question is, what kind of performance can I expect with such a setup? I can get multiple commodity machines running mysql to split the load for the histogram requests. Maybe assign 100 products to each instance etc. How realistic is it to select 10,000 records from a mysql table that has a million records in it during an http request (has to return within ~3 seconds?)

I have come up with more questionable solutions, wondering if this one has any gaping holes in it other than its inherit complexity,

Thanks