views: 16

answers: 2

I pre-generate 20+ million gzipped html pages, store them on disk, and serve them with a web server. Now I need this data to be accessible by multiple web servers. Rsync-ing the files takes too long. NFS seems like it may take too long.

I considered using a key/value store like Redis, but Redis only stores strings as values, and I suspect it will choke on gzipped files.

My current thinking is to use a simple MySQL/Postgres table with a string key and a binary value. Before I implement this solution, I wanted to see if anyone else had experience in this area and could offer advice.
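For what it's worth, here is the kind of table I have in mind, as a minimal sketch only: it assumes Postgres with a BYTEA column and the psycopg2 driver, and the table name, column names, and connection string are just placeholders.

    import psycopg2

    # Hypothetical schema:
    #   CREATE TABLE pages (path TEXT PRIMARY KEY, body_gz BYTEA NOT NULL);
    conn = psycopg2.connect("dbname=pages")  # placeholder connection string

    def put_page(path, gz_bytes):
        # Upsert the gzipped page under its URL path.
        with conn, conn.cursor() as cur:
            cur.execute(
                "INSERT INTO pages (path, body_gz) VALUES (%s, %s) "
                "ON CONFLICT (path) DO UPDATE SET body_gz = EXCLUDED.body_gz",
                (path, psycopg2.Binary(gz_bytes)),
            )

    def get_page(path):
        # Return the raw gzipped bytes, or None if the page is missing.
        with conn, conn.cursor() as cur:
            cur.execute("SELECT body_gz FROM pages WHERE path = %s", (path,))
            row = cur.fetchone()
            return bytes(row[0]) if row else None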

A: 

I've heard good things about Redis; that's one option.

I've also heard extremely positive things about memcached. It is suitable for binary data as well.
Take Facebook, for example: they use memcached even for images, and images are, of course, binary data.

So, get memcached, get a machine to run it, a binding for PHP or whatever you use for your sites, and off you go! Good luck!
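To make the binary point concrete, here is a minimal sketch assuming Python and the pymemcache client; the host, port, key, and file path are placeholders:

    from pymemcache.client.base import Client

    # memcached values are plain bytes, so gzipped pages can be stored as-is.
    mc = Client(("localhost", 11211))  # placeholder host/port

    with open("/var/pages/index.html.gz", "rb") as f:  # hypothetical file
        mc.set("page:/index.html", f.read())

    gz_body = mc.get("page:/index.html")  # raw gzipped bytes, or None on a miss

Keep in mind memcached's default item size limit of 1 MB; a gzipped HTML page should normally fit comfortably.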

Poni
My problem with memcached is that if the power goes out I lose my data, so in addition to recovering from a power outage, I also have to rebuild my cache.
Scott
It depends on what exactly you're looking for, and what the budget is. Consider having "mirrors" of the data, so that even if one machine fails, the other(s) can still serve it. Additionally, if the server fails and it's the only one, you could write a script that re-uploads the data from disk on startup. There are many options. I wouldn't go for an ACID database, since it has a lot of overhead that isn't needed for this purpose, and as far as I know it doesn't cache data the way memcached does; memcached is designed specifically for your use case and optimized for it.
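A rough sketch of that startup re-upload idea, again assuming pymemcache and a placeholder directory of pre-generated .gz files:

    import os
    from pymemcache.client.base import Client

    PAGES_DIR = "/var/pages"  # hypothetical directory of pre-generated .gz files
    mc = Client(("localhost", 11211))  # placeholder host/port

    # On startup, walk the on-disk pages and push each one into memcached.
    for root, _dirs, files in os.walk(PAGES_DIR):
        for name in files:
            full = os.path.join(root, name)
            # memcached keys are limited to 250 characters.
            key = "page:/" + os.path.relpath(full, PAGES_DIR)
            with open(full, "rb") as f:
                mc.set(key, f.read())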
Poni
@Poni I agree about the database, but I've been given the luxury of exploring a few options, so I figured, what the heck.
Scott
A: 

First off, why cache the gzips at all? Network latency and transmission time are orders of magnitude higher than the CPU time spent compressing the file, so compressing on the fly may be the simplest solution.
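For illustration, a minimal on-the-fly sketch using Python's standard-library WSGI server; the document root and port are placeholders, and a real handler would also check the client's Accept-Encoding header and sanitize the path:

    import gzip
    from wsgiref.simple_server import make_server

    DOC_ROOT = "/var/www/pages"  # hypothetical location of the uncompressed HTML

    def app(environ, start_response):
        # Gzip the page per request; for typical page sizes this is cheap
        # compared to network latency and transfer time.
        rel = environ.get("PATH_INFO", "/").lstrip("/") or "index.html"
        try:
            with open(f"{DOC_ROOT}/{rel}", "rb") as f:
                body = gzip.compress(f.read())
        except OSError:
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"not found"]
        start_response("200 OK", [
            ("Content-Type", "text/html"),
            ("Content-Encoding", "gzip"),
            ("Content-Length", str(len(body))),
        ])
        return [body]

    if __name__ == "__main__":
        make_server("", 8000, app).serve_forever()

In practice a front-end server such as nginx or Apache can do the same thing with its built-in gzip/deflate module.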

However, if you definitely have a need, I'm not sure a central database is going to be any quicker than a file share (of course, you should be measuring these things, not guessing!). A simple approach would be to host the original files on an NFS share and let each web server gzip and cache them locally on demand. memcached (as Poni suggests) is also a good alternative, but it adds a layer of complexity.
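Here is a sketch of that gzip-and-cache-locally idea, with the NFS mount and local cache directory as placeholder paths:

    import gzip
    import os

    NFS_ROOT = "/mnt/nfs/pages"      # hypothetical NFS mount holding the originals
    CACHE_ROOT = "/var/cache/pages"  # hypothetical per-server local cache

    def cached_gz_path(rel_path):
        # Return a local gzipped copy of the page, building it from the NFS
        # original the first time it is requested.
        local = os.path.join(CACHE_ROOT, rel_path + ".gz")
        if not os.path.exists(local):
            os.makedirs(os.path.dirname(local), exist_ok=True)
            with open(os.path.join(NFS_ROOT, rel_path), "rb") as src:
                with gzip.open(local, "wb") as dst:
                    dst.write(src.read())
        return local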

Paolo
I'm in the process of benchmarking the filesystem, Postgres, and Tokyo Cabinet. I'll update my question with the results.
Scott