Hi,

I'm in need of a distributed file system that must scale to very large sizes (about 100 TB realistic max). File sizes are mostly in the 10-1500 KB range, though some files may peak at about 250 MB.

I very much like the idea of systems like GFS with built-in redundancy for backup, which would, statistically, render file loss a thing of the past.

I have a couple of requirements:

  • Open source
  • No SPOFs
  • Automatic file replication (that is, no need for RAID)
  • Managed client access
  • Flat namespace of files (preferably)
  • Built in versioning / delayed deletes
  • Proven deployments

I've looked seriously at MogileFS as it does fulfill most of the requirements. It does not have any managed clients, but it should be rather straightforward to port the Java client. However, there is no versioning built in. Without versioning, I will have to do normal backups in addition to the file replication built into MogileFS.

Basically I need protection from a programming error that suddenly purges a lot of files it shouldn't have. While MogileFS does protect me from disk & machine errors by replicating my files over X number of devices, it doesn't save me if I do an unwarranted delete.

I would like to be able to specify that a delete operation doesn't actually take effect until after Y days. The delete will logically have taken place, but I can restore the file state for Y days until it's actually deleted. MogileFS also does not have the ability to check for disk corruption during writes, though again, this could be added.
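
To make the delayed-delete idea concrete, here is a rough in-memory sketch of the semantics I'm after (all names are made up for illustration; a real implementation would of course persist the tombstones):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of delayed deletes: a delete only records a tombstone, the file stays
// restorable until the grace period (Y days) has elapsed, and only a purge
// pass actually removes the bytes.
public class TombstoneStore {
    private final Map<String, byte[]> files = new ConcurrentHashMap<>();
    private final Map<String, Instant> tombstones = new ConcurrentHashMap<>();
    private final Duration gracePeriod;

    public TombstoneStore(Duration gracePeriod) {
        this.gracePeriod = gracePeriod;
    }

    public void put(String key, byte[] content) {
        tombstones.remove(key);          // overwriting revives the key
        files.put(key, content);
    }

    // Logical delete: the key disappears from reads, but the bytes remain.
    public void delete(String key) {
        tombstones.put(key, Instant.now());
    }

    // Reads hide logically deleted keys.
    public byte[] get(String key) {
        return tombstones.containsKey(key) ? null : files.get(key);
    }

    // Undo an accidental (or programmatic) delete within the grace period.
    public boolean restore(String key) {
        return tombstones.remove(key) != null && files.containsKey(key);
    }

    // Physical purge: only removes data once the tombstone is old enough.
    public void purgeExpired() {
        Instant cutoff = Instant.now().minus(gracePeriod);
        tombstones.forEach((key, deletedAt) -> {
            if (deletedAt.isBefore(cutoff)) {
                files.remove(key);
                tombstones.remove(key);
            }
        });
    }
}
```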

Since we're a Microsoft shop (Windows, .NET, MSSQL) I'd optimally like the core parts to be running on Windows for easy maintainability, while the storage nodes run *nix (or a combination) due to licensing.

Before I even consider rolling my own, do you have any suggestions for me to look at? I've also checked out HadoopFS, OpenAFS, Lustre & GFS, but none of them seems to match my requirements.

A: 

You could try running a source control system on top of your reliable file system. The problem then becomes how to expunge old check-ins after your timeout. You can set up an Apache server with DAV_SVN, and it will commit each change made through the DAV interface. I'm not sure how well this will scale with the large file sizes you describe.

Brian C. Lane
I am rather sure this will not scale to what I need. Furthermore, there would be a lot of extra coding work needed to have the deletes actually performed; the files must not be stored forever.
Mark S. Rasmussen
+1  A: 

Do you absolutely need to host this on your own servers? Much of what you need could be provided by Amazon S3. The delayed delete feature could be implemented by recording deletes to a SimpleDB table and running a garbage collection pass periodically to expunge files when necessary.
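
A minimal sketch of what that garbage-collection pass might look like; DeleteLog and BlobStore are hypothetical stand-ins for the SimpleDB domain recording pending deletes and the S3 bucket holding the objects, since the real calls would go through the AWS SDK:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Periodic garbage collection over recorded deletes: expunge only those
// objects whose logical delete is older than the grace period.
public class DeleteCollector {

    public interface DeleteLog {
        // Pending deletes whose recorded timestamp is older than the cutoff.
        List<String> keysDeletedBefore(Instant cutoff);
        void clear(String key);          // drop the log entry once expunged
    }

    public interface BlobStore {
        void deleteObject(String key);   // physically remove the object
    }

    private final DeleteLog log;
    private final BlobStore store;
    private final Duration gracePeriod;

    public DeleteCollector(DeleteLog log, BlobStore store, Duration gracePeriod) {
        this.log = log;
        this.store = store;
        this.gracePeriod = gracePeriod;
    }

    // Run periodically, e.g. from a scheduled task.
    public void runPass() {
        Instant cutoff = Instant.now().minus(gracePeriod);
        for (String key : log.keysDeletedBefore(cutoff)) {
            store.deleteObject(key);
            log.clear(key);
        }
    }
}
```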

There is still a single point of failure if you rely on a single internet connection. And of course you could consider Amazon themselves to be a point of failure, but their failure rate is always going to be far lower because of their scale.

And hopefully you realize the other benefits: the ability to scale to any capacity, no need for IT staff to replace failed disks or systems, and usage costs that will continually drop as disk capacity and bandwidth get cheaper (while disks you purchase depreciate in value).

It's also possible to take a hybrid approach: use S3 as a secure backend archive, cache "hot" data locally, and find a caching strategy that best fits your usage model. This can greatly reduce bandwidth usage and improve I/O, especially if data changes infrequently.
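
As a rough illustration of the hybrid idea, a small local LRU cache in front of the remote archive could look something like this (the fetch function is a stand-in for the actual S3 download call):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Read-through LRU cache of "hot" objects kept locally; misses fall back to
// the remote archive via the supplied fetch function.
public class HotCache {
    private final Map<String, byte[]> lru;
    private final Function<String, byte[]> fetchFromArchive;

    public HotCache(int maxEntries, Function<String, byte[]> fetchFromArchive) {
        this.fetchFromArchive = fetchFromArchive;
        // Access-ordered LinkedHashMap evicts the least recently used entry.
        this.lru = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > maxEntries;
            }
        };
    }

    public synchronized byte[] get(String key) {
        // Cache miss triggers a fetch from the remote archive.
        return lru.computeIfAbsent(key, fetchFromArchive);
    }

    // On overwrite, drop the stale copy so the next read refetches it.
    public synchronized void invalidate(String key) {
        lru.remove(key);
    }
}
```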

Downsides:

  • Files on S3 are immutable; they can only be replaced entirely or deleted. This is great for caching, but not so great for efficiency when making small changes to large files.
  • Latency and bandwidth are those of your network connection. Caching can help, but you'll never get the same level of performance as local storage.

Versioning would also be a custom solution, but it could be implemented using SimpleDB along with S3 to track sets of revisions to a file. Overall, it really depends on your use case whether this would be a good fit.
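
One possible shape for that, with RevisionIndex and BlobStore as hypothetical stand-ins for the SimpleDB table and the S3 bucket:

```java
import java.time.Instant;
import java.util.UUID;

// Sketch of versioning on immutable storage: every save writes a new object
// under a fresh revision key and records the revision in an index; reads
// resolve the newest revision through the index.
public class VersionedStore {

    public interface RevisionIndex {
        void recordRevision(String fileKey, String revisionKey, Instant at);
        String latestRevisionKey(String fileKey);   // null if none
    }

    public interface BlobStore {
        void putObject(String key, byte[] content);
        byte[] getObject(String key);
    }

    private final RevisionIndex index;
    private final BlobStore store;

    public VersionedStore(RevisionIndex index, BlobStore store) {
        this.index = index;
        this.store = store;
    }

    // "Replacing" a file never overwrites: it adds a new revision object.
    public String put(String fileKey, byte[] content) {
        String revisionKey = fileKey + "/" + UUID.randomUUID();
        store.putObject(revisionKey, content);
        index.recordRevision(fileKey, revisionKey, Instant.now());
        return revisionKey;
    }

    // Normal reads resolve to the newest revision via the index.
    public byte[] getLatest(String fileKey) {
        String revisionKey = index.latestRevisionKey(fileKey);
        return revisionKey == null ? null : store.getObject(revisionKey);
    }
}
```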

Mark Renouf
A: 

@tweakt
I've considered S3 extensively as well, but I don't think it'll be satisfactory for us in the long run. We have a lot of files that must be stored securely, not through file ACLs, but through our application layer. While this can also be done through S3, we have a bit less control over our file storage. Furthermore, there will be a major downside in the form of latency when we do file operations, both for initial saves (which can be done asynchronously) and later when we read the files and have to perform operations on them.

As for the SPOF, that's not really an issue. We do have redundant connections to our datacenter and while I do not want any SPOFs, the little downtime S3 has had is acceptable.

Unlimited scalability and no need for maintenance is definitely an advantage.

Regarding a hybrid approach: if we are to host directly from S3, which would be the case unless we want to store everything locally anyway (and just use S3 as backup), the bandwidth prices are simply too steep once we add S3 + CloudFront (CloudFront would be necessary as we have clients all around the world). Currently we host everything from our datacenter in Europe, and we have our own reverse Squids set up in the US for low-budget CDN functionality.

While it's very domain dependent, immutability is not an issue for us. We may replace files (that is, key X gets new content), but we will never make minor modifications to a file. All our files are blobs.

Mark S. Rasmussen