views: 79
answers: 3

My main aim is to serve a large number of XML files (> 1bn, each < 1 KB) via a web server. The files can be considered static, since they are modified by external code at a relatively very low frequency (about 50k updates per day), while they are requested at a high frequency (> 30 req/sec).

The current suggestion from my team is to create a dedicated Java application that implements the HTTP protocol, use memcached to speed things up, keep all file data in an RDBMS, and get rid of the file system.

On the other hand, I think a tweaked Apache web server or lighttpd should be enough, with caching left to the OS or to the web server's default caching. There is no point in keeping the data in a DB when the same output is required and it is only ever queried by file name. I am also not sure how memcached would work here, and updating an external cache (memcached) whenever the external code updates a file would add complexity.

Another question: if I choose to use files, is it possible to store them in a directory layout like \a\b\c\d.xml and still access them as abcd.xml? Or should I put all 1bn files in a single directory (I am not sure whether the OS will allow that)?

This is NOT a website but an application API on a closed network, so a cloud/CDN is of no use.

I am planning to use CentOS + Apache/lighttpd. Please suggest any alternatives and the best possible solution.

This is the only public note I found on this topic, and it is a little old too.

A: 

I believe that a dedicated application with everything feeding off a memcache db would be the best bet.

Josh K
Why? Give us some reasons...
Codo
I don't think memcached would do wonders: the cache hit ratio will be very low once the data starts to exceed the cache size (limited by available RAM), and on top of that there is the burden of keeping the cache in sync with persistent storage and of checking every request for a hit or miss. I am more inclined towards lighttpd and a structured file system, as suggested by Codo, but I might be wrong. The advantage there is that the OS keeps its own cache and drops a file from it when the file is updated by an external process. I need to test both approaches.
Ashish Patil
+3  A: 

1bn files at 1KB each, that's about 1TB of data. Impressive. So it won't fit into memory unless you have very expensive hardware. It can even be a problem on disk if your file system wastes a lot of space for small files.

30 requests a second is far less impressive. It's certainly not the limiting factor for the network, nor for any serious web server out there. It might be a bit of a challenge for a slow hard disk.

So my advice is: put the XML files on a hard disk and serve them with a plain vanilla web server of your choice. Then measure the throughput and optimize it if you don't reach 50 files a second. But don't invest in anything until you have shown it to be a limiting factor.
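
To make "plain vanilla" concrete, here is a minimal sketch of an Apache virtual host that does nothing but serve static files from disk; the server name and the document root /srv/xml are placeholders, and lighttpd or nginx can be configured just as simply:

# Minimal virtual host: static XML files served straight from disk
<VirtualHost *:80>
    ServerName xml-api.internal
    # Hypothetical location of the XML tree
    DocumentRoot /srv/xml
    # Let the kernel stream file contents directly to the socket
    EnableSendfile On
    <Directory /srv/xml>
        # No directory listings needed for an API
        Options -Indexes
        # Apache 2.4 syntax; on 2.2 use the older Order/Allow directives
        Require all granted
    </Directory>
</VirtualHost>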

Possible optimizations are:

  • Find a better layout in the file system, i.e. distribute your files over enough directories so that no single directory holds too many files (more than 5,000 or so).
  • Distribute the files over several hard disks so that they can be accessed in parallel.
  • Use faster hard disks.
  • Use solid state disks (SSD). They are expensive, but can easily serve hundreds of files a second.

If a large number of the files are requested several times a day, then even a slow hard disk should be enough, because your OS will keep them in the file cache. And with today's file cache sizes, a considerable share of your daily deliveries will fit into that cache, because at 30 requests a second you serve at most about 0.25% of all files per day (30 req/s × 86,400 s ≈ 2.6 million requests against 1 billion files).

Regarding distributing your files over several directories, you can hide this with an Apache RewriteRule, e.g.:

RewriteRule ^/xml/(.)(.)(.)(.)(.*)\.xml /xml/$1/$2/$3/$4/$5.xml
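
For example, a request for a hypothetical flat name abcd0001.xml would then be mapped like this, with the first four characters becoming directory levels (mod_rewrite must be loaded and RewriteEngine On set for the rule to take effect):

    Requested:  /xml/abcd0001.xml
    Served as:  /xml/a/b/c/d/0001.xml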
Codo
@Codo: I didn't translate "bn" to "billion."
Josh K
Exactly what I was thinking - building an app on top is just going to make it slower. A very important question, though, is latency: there is **no** "web server's default caching", but the more latency you can tolerate in publishing updates, the more load can be served from caches in front of the web server.
symcbean
You could also use NginX, and then use its built-in regex handling to map requests into subdirectories. At this quantity of files, splitting them over multiple levels of subdirectories is almost a requirement on a stock file system (for example ext3).
Alister Bulman
I am presently testing Codo's suggestion. As of now there are no issues, but only very little data is present yet; still, I think this will work. Along with the RewriteRule, I have also configured ErrorDocument 404 to send a canned reply for requests where the data is not present. I am not sure how I can change the response code from 404 to 200 in this case. Let me know if this is possible purely via Apache configuration, without involving PHP.
Ashish Patil
To change the error code from 404 to 200, you basically use two directives in your Apache configuration: "RewriteCond %{REQUEST_FILENAME} !-f" and "RewriteRule ^.*+ /dummy_reply.xml". The first makes sure that the second only applies if the request could not be resolved to an existing file; the second rewrites such requests to a canned reply, which is then served normally.
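
Putting the pieces together, here is a minimal sketch of how the two directives can sit next to the layout rule in a per-directory context, e.g. an .htaccess in DocumentRoot/xml with mod_rewrite enabled and AllowOverride FileInfo (note that the patterns are relative there, unlike the server-level rule above); the file names, the /xml location, and the existence check and [^/] character classes that keep per-directory rewriting from looping are illustrative assumptions:

RewriteEngine On

# Map a flat name such as the hypothetical abcd0001.xml onto the nested
# layout a/b/c/d/0001.xml, but only if that nested file actually exists.
RewriteCond %{DOCUMENT_ROOT}/xml/$1/$2/$3/$4/$5.xml -f
RewriteRule ^([^/])([^/])([^/])([^/])([^/]+)\.xml$ $1/$2/$3/$4/$5.xml [L]

# Anything that still does not resolve to an existing file is rewritten to the
# canned reply, which is then served with a normal 200 status instead of a 404.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule \.xml$ dummy_reply.xml [L]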
Codo
Thanks Codo, now the system is working fine.
Ashish Patil
+1  A: 

Another thing you could look at is Pomegranate, which seems very similar to what you are trying to do.

Josh K