views: 66
answers: 3

I am building a website that depends on serving lots of little mp3 files (approx. 10-15 KB each) quite quickly. Each file contains a word pronunciation, and each user will download 20-30 of them every minute they are using the site. Each user might download 200 a day, and I anticipate 50 simultaneous users. There will be approx. 15,000 separate files eventually.

What would be the best way to store, manage, call and play these files as required? Will I need specialist hosting to deal with all the little files, or will they behave happily in one big folder (using a standard host)? Any delays will ruin the feel.


Update

Having done a bit more searching, I think the problem could be solved with either:

  1. A service like Photobucket but for audio instead, with its own API
  2. Some other sort of 'bucket hosting' solution where you can upload thousands of files at a reasonable cost, and call for them easily

Does anyone know of such a product?

A: 

I would serve these from an in-memory database. 15 KB * 15,000 files = 225 MB of raw data, so even with significant overhead it will easily fit in a medium hosting plan. A disk-backed cache might be elegant here, e.g. memcachedb, Ehcache or similar; then you only have one API and some configuration.

You should warm up the cache on startup, though.

The metadata can go in MySQL or similar. You might keep a master copy there too, for easier management and as a backend for the cache.
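
A rough sketch of that setup (not from the original answer), assuming the python-memcached client (the memcache module) talking to a memcached instance on its default port; AUDIO_DIR and the helper names are placeholders for wherever the master copies live:

    # Rough sketch: serve clips from memcached, falling back to the disk
    # master copy. AUDIO_DIR and the function names are placeholders.
    import os
    import memcache

    AUDIO_DIR = "/var/data/audio"             # hypothetical master-copy folder
    mc = memcache.Client(["127.0.0.1:11211"])

    def get_clip(filename):
        """Return the mp3 bytes, preferring the cache over the disk copy."""
        data = mc.get(filename)
        if data is None:
            with open(os.path.join(AUDIO_DIR, filename), "rb") as f:
                data = f.read()
            mc.set(filename, data)            # populate the cache for next time
        return data

    def warm_cache():
        """Pre-load every clip at startup so the first requests hit memory."""
        for name in os.listdir(AUDIO_DIR):
            if name.endswith(".mp3"):
                with open(os.path.join(AUDIO_DIR, name), "rb") as f:
                    mc.set(name, f.read())

At 10-15 KB per file the clips sit comfortably under memcached's default 1 MB item limit, so the whole collection can live in the cache.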

Peter Tillemans
Thanks Peter. Is it possible to run things like memcached on commercial hosts, or are those kinds of things generally only for servers people maintain themselves?
Patrick Beardmore
Yes, you can run these on virtual private servers. Many hosts provide preconfigured software packages which you can select and drop in. Ehcache is a Java library, so it is part of the application itself. I am sure other languages have similar libraries.
Peter Tillemans
+1  A: 

If you want (or need) to store the files on disk instead of as BLOBs in a database, there are a couple of things you need to keep in mind.

Many (but not necessarily all) file systems don't cope too well with folders containing many files, so you probably don't want to store everything in one big folder - but that doesn't mean you need specialist hosting.

The key is to distribute the files into a folder hierarchy, based on some hash function. As an example, we'll use the MD5 of the filename here, but it's not particularly important which algorithm you use or what data you are hashing, as long as you're consistent and have the data available when you need to locate a file.

The output of a hash function is conventionally written as a hexadecimal string: for example, the MD5 of "foo.mp3" is 10ebb1120767e9de166e0f5905077cb1.

You can create 16 folders, one for each possible hexadecimal character - so you have a directory named 0, one named 1, and so on up to f.

In each of those 16 folders, repeat this structure, so you have two levels. (0/0/, 0/1/,... , f/f/)

You then simply place each file in the folder dictated by its hash: the first character determines the first folder, and the second character determines the subfolder. Using that scheme, foo.mp3 would go in 1/0/, bar.mp3 goes in b/6/, and baz.mp3 goes in 1/b/.
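
A minimal sketch of that scheme (my own illustration, not part of the original answer); the "audio" base folder and the function name are placeholders:

    # Hash the filename with MD5 and use the first two hex characters
    # as the two folder levels. Base folder name is illustrative only.
    import hashlib
    import os

    def shard_path(filename, base="audio"):
        digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
        return os.path.join(base, digest[0], digest[1], filename)

    # With the MD5 quoted above, shard_path("foo.mp3") -> "audio/1/0/foo.mp3"

The same function also answers the lookup question further down: given a filename, recompute the hash and you have its location.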

Since these hash functions are intended to distribute their values evenly, your files will be spread fairly evenly across these 256 folders, which keeps the number of files in any single folder down; statistically, 15,000 files work out to roughly 59 per folder on average, which should be no problem.

If you're unlucky and the hash function you chose ends up clumping too many of your files in one folder anyway, you can extend the hierarchy to more than 2 levels, or you can simply use a different hash function. In both cases, you need to redistribute the files, but you only need to do that once, and it shouldn't be too much trouble to write a script to do it for you.

For managing your files, you will likely want a small database indexing what files you currently have, but this does not necessarily need to be used for anything other than managing them - if you know the name of the file, and you use the filename as input to your hash function, you can just calculate the hash again and find its location that way.

Michael Madsen
I love the detail in this answer, and I think I will try this method. Is there an easy way to upload my files in bulk into these folders (assuming I have written the appropriate code)? I'm not looking forward to sorting them individually to start with.
Patrick Beardmore
An additional benefit of your system is that the files are well and truly hidden, which is helpful as I want to try and make downloading them as hard as possible for someone trying to steal the collection.
Patrick Beardmore
@Patrick: When uploading many files at once, you'll probably want to upload everything to a temporary folder using FTP and then write a small script which goes through that folder and sorts them - or you can pre-sort them locally and use a good FTP program (e.g. FileZilla) to upload the entire directory structure. If it's possible on your server, you could also put everything in an archive and upload that through a form; then, on the server, you extract the archive and distribute the files as necessary.
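
A rough sketch of such a server-side sorting script (again my own illustration), reusing the shard_path() helper from the earlier sketch; INCOMING and the destination base folder are placeholder paths:

    # Move everything from a flat upload folder into the hashed hierarchy.
    # Assumes the shard_path() helper defined earlier; folder names are
    # placeholders.
    import os
    import shutil

    INCOMING = "incoming"        # flat folder the FTP upload lands in

    def sort_uploads():
        for name in os.listdir(INCOMING):
            if not name.endswith(".mp3"):
                continue
            dest = shard_path(name)
            os.makedirs(os.path.dirname(dest), exist_ok=True)  # create 1/0/ etc.
            shutil.move(os.path.join(INCOMING, name), dest)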
Michael Madsen
+1  A: 

15k files in one directory should not be a problem for any modern file system. It certainly isn't for NTFS. What you don't want to do is open a folder containing 100k+ files in Explorer or something similar, because populating the list box (GUI) is a killer. You also wouldn't want to iterate over the contents of such a folder repeatedly. However, simply accessing a file when you know the filename (path) is still very fast, and a server usually does just that.

The frequency doesn't sound too scary either. 50 users * 30 requests/minute/user is 1,500 requests per minute, i.e. 25 requests per second. That's not something you can ignore completely, but any decent web server should be able to serve files at that rate. I also see no need for a specialized in-memory server/database/data store: every OS has a file cache, and that should take care of keeping frequently accessed files in memory.

If you must guarantee low (worst-case) latency, you might still need an in-memory data-store. But then again if you must guarantee latency, things become complicated anyway.

One last thing: think about reverse proxies. I find it very convenient to be able to store/update data in just one place (of my choosing) and have reverse proxies take care of the rest. If your files never change (i.e. the same URL always means the same data), this is a very easy way to get really good scalability. If the files can indeed change, just make it so that they can't :) e.g. by encoding the change date into the filename (and deleting the old versions).
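
A small sketch of that "encode the change date into the filename" idea (illustrative only, not from the original answer), using the file's modification time as the version:

    # Build a filename that changes whenever the content does, so a reverse
    # proxy (or the browser) can safely cache each URL forever.
    import os

    def versioned_name(path):
        base, ext = os.path.splitext(os.path.basename(path))
        version = int(os.path.getmtime(path))   # last-modified time as version
        return "{0}.{1}{2}".format(base, version, ext)

    # e.g. a foo.mp3 last changed at Unix time 1286900000 is served
    # as foo.1286900000.mp3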

pgroke
I'm trying to square the circle between your analysis and Michael Madsen's answer (http://stackoverflow.com/questions/3843747/serving-lots-of-small-files/3843845#3843845). You're quite right, I don't need to guarantee zero latency, but the site will be useless if the delay is too long. You say that it is only the GUI aspect of file systems that causes delays when dealing with lots of files. Occasionally, I'll want to look at these files via FTP. Will that cause the same problems?
Patrick Beardmore
It's not only the GUI part, but the GUI part is usually the worst. However, if you want to look at the files manually, you should probably organize them into multiple directories anyway, for clarity's sake. And by organize I don't mean sharding by MD5 like Michael suggested, but maybe by using the first character of the filename for the first directory level and the second character for the second level, e.g. "data-store/w/h/what.mp3". Might be more convenient if you want to inspect the files manually.
pgroke