Hi folks,

I am in the process of building/architecting a business social network web application. It has a component that I think will lead to major scalability issues, and I'd like some feedback/thoughts on the best way forward.

The application has a User object. The idea is that every time a new user joins the system, they rank every other user's "usefulness" to them based on a set of factors. Similarly, every other user on the system ranks them.

However, I'm worried about the scalability implications of this approach. For example, if 10,000 users join the system, we are talking about 10,000^2 calculations to be stored in the database. That is 100 million records, which clearly becomes problematic both in terms of the time taken to calculate the rankings and in terms of storing them in a database.
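To make the scale concrete, here is the rough shape of the calculation I have in mind. This is only a sketch: User, scoreUsefulness and saveRanking are placeholders for my real profile object, factor-based scoring and persistence code.

    import java.util.List;

    // Rough shape of the naive approach: score and store every ordered pair.
    public class NaiveRanker {
        interface User { long id(); }  // stand-in for the real User object

        void rankEveryone(List<User> allUsers) {
            for (User a : allUsers) {
                for (User b : allUsers) {
                    if (a == b) continue;                 // no self-ranking
                    double score = scoreUsefulness(a, b); // placeholder calculation
                    saveRanking(a.id(), b.id(), score);   // placeholder persistence
                }
            }
            // n users => n * (n - 1) scores, i.e. ~10^8 rows at 10,000 users
        }

        double scoreUsefulness(User a, User b) { return 0; }   // stub
        void saveRanking(long from, long to, double score) {}  // stub
    }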

Thus, I'm looking for help/inspiration :)

My background is in Java, and I've been looking at Hadoop/MapReduce as a possible way to run the calculations in parallel. However, I'm really not sure whether this problem is a good fit for MapReduce, or what the best approach is in general.

So, I suppose there are two specific parts to my query:

1) To do the actual calculations, should I do them in parallel? That is, is MapReduce a good approach for this problem?

2) To store the rankings, what should I be using? Is a standard relational database a bad idea, i.e. is this not a good fit for MySQL? Should I look at something like Cassandra, HBase, or some other NoSQL solution?

Any help/ideas are greatly appreciated.

cheers, Brian

A: 

I'd suggest only storing "real" values (those entered by a user). That way, users rank the other users that have value to them, and all the rest are assumed to be "useless" ;). Therefore, you'll store maybe a couple hundred values for each user. I'm assuming you're not really going to make each new user go through the entire list of other users and rank them individually, right?

You could also cut your space requirements down by making bidirectional associations that store both users' evaluations (one record links user A with user F and notes that A ranks F as a 5, and F ranks A as a 3). Cuts your space requirements in half, roughly, but it's still a lot of records. Plus you'll want indexes on both user keys, since you'll have to search both to find all records for a single user.
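To make that concrete, here's a minimal sketch of what such a pair record might look like (the names are illustrative; the key point is that storing the lower user id first gives each pair a single canonical record):

    // One row per user pair, holding both directions of the ranking.
    // Because (A, F) and (F, A) normalize to the same key, each pair
    // is stored exactly once.
    public class MutualRanking {
        private final long lowUserId;   // smaller of the two user ids
        private final long highUserId;  // larger of the two user ids
        private int lowRanksHigh;       // e.g. A ranks F as a 5
        private int highRanksLow;       // e.g. F ranks A as a 3

        public MutualRanking(long userA, long userB) {
            this.lowUserId = Math.min(userA, userB);
            this.highUserId = Math.max(userA, userB);
        }
        // Getters/setters omitted. Index both lowUserId and highUserId
        // so "all rankings involving user X" stays a cheap query.
    }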

TMN
I'm afraid your first idea is not a runner for this application: the system has to rank the other users based on each user's profile. I like the second idea though; it definitely cuts the space requirement in half. Thanks for that!
Brian
Is there any segregation/segmentation you can use to split up/categorize the rankings? Say by group, division, project? At least that way you would have several smaller tables, at the cost of somewhat more complex queries. Or you could use that approach to take advantage of your database's partitioning capabilities to divide a single table among multiple disks/servers/nodes.
TMN
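A minimal sketch of the application-level version of that idea, assuming rankings can be keyed by something like a division name (the naming scheme is purely illustrative; a database's native partitioning would achieve the same split declaratively):

    // Illustrative application-level partitioning: route each ranking
    // to a per-category table so no single table holds all the pairs.
    public final class RankingShards {
        private RankingShards() {}

        // Derive the target table from a category key such as a division
        // or project. A real scheme would validate the key against a
        // fixed list of known shards.
        public static String tableFor(String category) {
            return "rankings_" + category.toLowerCase();
        }

        public static void main(String[] args) {
            System.out.println(tableFor("Sales"));   // rankings_sales
            System.out.println(tableFor("Finance")); // rankings_finance
        }
    }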
@Brian, if I understand the problem correctly: if the ranks are based on the profile, they have to be calculated in real time, since a user can update their profile at any time. In that case, TMN's first suggestion makes sense, because you only have to store user-entered ranks; you don't need to store system-calculated ranks. That way you can reduce your storage.
devcoder
It's definitely something I've been thinking about, and I'm coming to the conclusion that I'll probably have to do this. However, even within a single category I group by, I'd like to be able to handle large datasets...
Brian
A: 

While 100m rows is surely big, it may not be as big as you think. I deal with a MySQL db that has a table with more than 10m rows joining to other tables with more than 100k rows without too many problems. The important point is to get your indexes right and make your queries efficient. Perhaps before spending too much time on a super-architecture, fill a play table with the number of rows you think you'll have, write some of the queries you think you'll be writing, and see if it's manageable.
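If it helps, a JDBC sketch along these lines is enough to fill a play table and time a typical lookup. The connection details, table name, and column names are all made up for illustration:

    import java.sql.*;

    // Fill a throwaway table with synthetic rankings, then time the kind
    // of query the application will actually run.
    public class PlayTableTest {
        public static void main(String[] args) throws SQLException {
            try (Connection c = DriverManager.getConnection(
                    "jdbc:mysql://localhost/playdb", "user", "pass")) {
                c.setAutoCommit(false);
                String insert =
                    "INSERT INTO rankings (ranker_id, ranked_id, score) VALUES (?, ?, ?)";
                try (PreparedStatement ps = c.prepareStatement(insert)) {
                    for (int i = 0; i < 1000000; i++) {      // synthetic rows
                        ps.setLong(1, i % 10000);
                        ps.setLong(2, (i * 31) % 10000);
                        ps.setInt(3, i % 10);
                        ps.addBatch();
                        if (i % 10000 == 9999) ps.executeBatch();
                    }
                    ps.executeBatch();
                }
                c.commit();

                // Time the query you expect to run most often.
                long start = System.nanoTime();
                try (Statement s = c.createStatement();
                     ResultSet rs = s.executeQuery(
                         "SELECT ranked_id, score FROM rankings WHERE ranker_id = 42")) {
                    while (rs.next()) { /* consume rows */ }
                }
                System.out.printf("query took %d ms%n",
                        (System.nanoTime() - start) / 1000000);
            }
        }
    }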

JamesG
Hi James, that makes total sense. However, my worry is that if the application scales at all, I'm in big trouble: 10,000 users means 100 million rows, but if I get 100,000 users then I'm looking at 10 billion rows.
Brian
+1  A: 

Before throwing the brute force of MapReduce at the problem, I'd try to reduce the search space. In a social network of even 10K users, most other users are unknown to any particular user, and thus not useful.

I would therefore try to limit the set of users to evaluate, based on criteria that fit your social network. For instance, limiting the search to local users might be applicable (or limit it to them initially and do a more exhaustive search later). What "local" means in practice depends on your users; the idea is to apply optimizations grounded in the real world.
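As a sketch of that pre-filtering idea, with the isLocalTo predicate standing in for whatever "local" turns out to mean in your network:

    import java.util.*;
    import java.util.stream.Collectors;

    // Illustrative pre-filter: score only the candidates that pass a
    // cheap locality test, instead of all n - 1 other users.
    public class CandidateFilter {
        // Minimal stand-in for the application's User object.
        interface User {
            boolean isLocalTo(User other); // hypothetical locality test
        }

        static List<User> candidatesFor(User u, Collection<User> everyone) {
            return everyone.stream()
                    .filter(other -> !other.equals(u))   // skip self
                    .filter(other -> other.isLocalTo(u)) // cheap filter first
                    .collect(Collectors.toList());
        }
    }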

wishihadabettername