I'm building a Python/Pylons webapp that has been served by a single server so far. Now I want to investigate how it would scale across several servers with some kind of load balancer in front.

The main concern is server-side state, of course. It includes user session data, user-uploaded data (pictures and the like), and cache. I want the app servers to share the cache, so that one server doesn't have to redo work another has already done. Scaling is probably not going to be an issue anytime soon, but this seems like a big architectural decision, so I'd rather get it semi-right at the beginning.

For sessions, I could use cookie-based sessions: http://beaker.groovie.org/sessions.html#cookie-based
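
To illustrate why cookie-based sessions need no shared server-side state, here is a minimal sketch of a signed session cookie. It is not Beaker's implementation; the names `SECRET_KEY`, `encode_session` and `decode_session` are made up for this example, and the data is only signed, not encrypted.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical secret; every app server must be configured with the same value.
SECRET_KEY = b"change-me"


def encode_session(data, key=SECRET_KEY):
    """Serialize session data and append an HMAC signature."""
    payload = base64.urlsafe_b64encode(
        json.dumps(data, sort_keys=True).encode("utf-8")
    )
    sig = hmac.new(key, payload, hashlib.sha256).hexdigest().encode("ascii")
    return (payload + b"." + sig).decode("ascii")


def decode_session(cookie, key=SECRET_KEY):
    """Return the session data, or None if the cookie is malformed or tampered with."""
    try:
        payload, sig = cookie.encode("ascii").rsplit(b".", 1)
    except (ValueError, UnicodeEncodeError):
        return None
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest().encode("ascii")
    # Constant-time comparison, checked before the payload is decoded.
    if not hmac.compare_digest(sig, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(payload).decode("utf-8"))
```

Any server that knows the key can validate the cookie, so the load balancer is free to send each request to any node.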

For user-uploaded data and cache (both currently stored on the local filesystem) I need a different approach, and I'm not sure which one would be the best fit. Some of the options I've considered:

  • Distributed filesystem
    • Amazon S3 in particular, since I'm targeting Amazon as the cloud provider. However, I'd like to keep my code from becoming overly vendor-specific, so that changing cloud providers later stays feasible.
  • A (distributed) key-value store. This would require rewriting or abstracting out the parts of my code that assume all data goes on the filesystem.
  • Somehow avoid sharing data at all: the load balancer could be clever enough to direct requests to the nodes that have the necessary user data / cache locally. Wait, this is called sharding, right?
  • Network-accessible filesystem, NFS in particular: an NFS directory exported from one (possibly dedicated) node, with all the others mounting it. Possible problems I can think of:
    • Bandwidth to the NFS host could become a bottleneck
    • Race conditions when several clients try to access the same files at the same time
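
Whichever option wins, the "abstract out the filesystem assumptions" step can be sketched as a small storage interface. This is only an illustration: `BlobStore`, `LocalFileStore` and `InMemoryStore` are hypothetical names, and an S3 or key-value backend would be one more subclass.

```python
import hashlib
import os
import tempfile


class BlobStore:
    """Minimal interface the app codes against instead of raw file paths."""

    def put(self, key, data):
        raise NotImplementedError

    def get(self, key):
        raise NotImplementedError


class LocalFileStore(BlobStore):
    """Backend matching the current single-server setup."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key):
        # Hash the key so arbitrary keys map to safe, flat file names.
        return os.path.join(self.root, hashlib.sha1(key.encode("utf-8")).hexdigest())

    def put(self, key, data):
        # Write to a temp file, then rename over the target, so that on a
        # local filesystem readers never observe a half-written file.
        fd, tmp = tempfile.mkstemp(dir=self.root)
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, self._path(key))

    def get(self, key):
        try:
            with open(self._path(key), "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None


class InMemoryStore(BlobStore):
    """Stand-in backend, handy for tests."""

    def __init__(self):
        self._data = {}

    def put(self, key, data):
        self._data[key] = data

    def get(self, key):
        return self._data.get(key)
```

With this in place the rest of the app only calls `store.put(key, data)` and `store.get(key)`, so swapping the backend later does not touch the calling code.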

I'm currently leaning towards NFS, since it seems to be the simplest thing that could possibly work. But then again, maybe there are more caveats that I'm not aware of that would make this a short-sighted decision? What is your experience: what forms of data storage and sharing have you used for apps that are hosted in the cloud and are expected to scale horizontally?

+1  A: 

Caching is easily accomplished using standard memcached, which can be distributed over multiple servers. NFS sounds like a bad idea, since you'll need to implement your own locking mechanism to avoid race conditions. I would go for one of the distributed NoSQL solutions like Cassandra.
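
As a rough sketch of what "distributed over multiple servers" means for memcached: the servers do not talk to each other, and the client decides which server owns each key. The function name and server addresses below are made up for illustration, and real client libraries often use consistent hashing instead of a plain modulo so that adding a server remaps fewer keys.

```python
import hashlib


def pick_server(key, servers):
    """Map a cache key to one server using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(servers)
    return servers[index]


# Hypothetical addresses; every app server must use the same list in the same order.
SERVERS = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]
```

Because every app server computes the same mapping, a value cached by one of them is found by all the others.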

ozk
I was considering memcached and other projects that use the memcached protocol too. My main problem with memcached is that it uses memory for the cache. I'm fine with slightly slower disk storage, and I might have more data to cache than fits in memory. I'm cautious about NFS too. Beaker, the Python library I use for caching, has locking logic in place, but I've read that out-of-sync issues are still quite possible because of the aggressive caching employed by NFS clients.
Pēteris Caune
If disk storage is what you want, I would advise against NFS (exactly because of the client caching, which can play tricks on the integrity of your data) and opt for something like SAN storage or another "serious" external storage solution.
ozk
+1  A: 

I'd strongly recommend that you look at a distributed key/value store rather than NFS.

I'd probably use Redis rather than Cassandra, since you are currently on one system and want to scale up to two. Cassandra, while cool, is designed for systems with more writes than reads, and works best when you have three or more nodes. Redis, on the other hand, works very well as a single-node daemon: essentially like memcached, but with fallible persistence.

Redis is trivially easy to use from Python and it is very performant, so until you are doing millions of requests you shouldn't need to worry about sharding or scaling Redis itself. Failover is likely to be the biggest issue. I've not deployed it personally, so I'm not sure how effective or easy it is to recover all the data if a node ever fails and you fail over to another one. If you think that's likely, I'd investigate it.
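
For a feel of the usage, here is a small cache-aside sketch. It assumes a client object with redis-py style `get(name)` and `set(name, value, ex=seconds)` methods; `cached` and `FakeClient` are hypothetical names, and the fake client exists only so the example runs without a Redis server.

```python
import json


def cached(client, key, compute, ttl_seconds=300):
    """Return the cached value for key, computing and storing it on a miss."""
    raw = client.get(key)
    if raw is not None:
        return json.loads(raw)
    value = compute()
    # ex= sets an expiry in seconds, so stale entries eventually disappear.
    client.set(key, json.dumps(value), ex=ttl_seconds)
    return value


class FakeClient:
    """In-memory stand-in for a Redis client; it ignores the expiry."""

    def __init__(self):
        self.data = {}

    def get(self, name):
        return self.data.get(name)

    def set(self, name, value, ex=None):
        self.data[name] = value
```

With the real library you would pass something like `redis.Redis(host=..., port=...)` as `client`, and all app servers pointed at the same Redis instance would then share the cache.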

If you want to store more complex data structures, I'd look into MongoDB or one of its equivalents.

Michael Brunton-Spall