I'd like advice on the following design. Is it reasonable? Is it stupid/insane?

Requirements:

  • We have some distributed calculations that work on chunks of data that are sometimes up to 50 MB in size.
  • Because the calculations take a long time, we parallelize them across a small grid (around 20 nodes).
  • We "produce" around 10,000 of these "chunks" of binary data each day and want to keep them around for up to a year. Most of the items aren't 50 MB in size, so the total daily space requirement is more like 5 GB... but we'd like to keep stuff around for as long as possible (a year or more). But hey, you can get 2 TB hard disks nowadays.
  • Although we'd like to keep the data around, this is essentially a "cache". It's not the end of the world if we lose data - it just has to get recalculated, which takes some time (an hour or two).
  • We need to be able to efficiently get a list of all "chunks" that were produced on a particular day.
  • From a support point of view, we often need to delete all chunks created on a particular day, or remove all chunks created within the last hour.
  • We're a Windows shop - we can't easily switch to Linux/some other OS.
  • We use SQL Server for our existing database requirements.
  • However, it's a large and reasonably bureaucratic company with policies that limit our options: for example, conventional database space on SQL Server is charged internally at extremely high prices. Allocating 2 terabytes of SQL Server space would be prohibitively expensive. This is mainly because our SQL Server instances are backed up, archived for 7 years, and so on. We don't need this "gold-plated" functionality, because we can just recreate the stuff if it goes missing. At heart, it's just a cache that can be recreated on demand.
  • Running our own SQL Server instance on a machine that we maintain is not allowed (all SQL Server instances must be managed by a separate group).
  • We do have a fairly small transactional requirement: if a process that is producing a chunk dies halfway through, we'd like to be able to detect such "failed" transactions.

I'm thinking of the following solution, mainly because it seems like it would be very simple to implement:

  • We run a web server on top of a Windows filesystem (NTFS).
  • Clients "save" and "load" files by using HTTP requests, and when processes need to send blobs to each other, they just pass the URLs.
  • Filenames are allocated using GUIDs, with a directory for each date. So all of the files created on 12 November 2010 would go in a directory called "20101112" or something like that. This way, by listing the directory for a date, we can find all of the files produced on that date using normal file operations.
  • Indexing is done by a traditional SQL Server table, with a "URL" column instead of a "varbinary(max)" column.
  • To preserve the transactional requirement, a process that is creating a blob only inserts the corresponding "index" row into the SQL Server table after it has successfully finished uploading the file to the web server. So if it fails or crashes halfway through, the file effectively "doesn't exist" yet, because the row used to find it is never written (see the sketch after this list).
  • I like the fact that the large chunks of data can be produced and consumed over a TCP socket.
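
To make the transactional trick concrete, here's a minimal sketch of the save/load flow. It assumes a hypothetical web server that accepts PUT requests to a /{yyyymmdd}/{guid} path and a hypothetical index table called BlobIndex; neither name comes from the design above, and the Python libraries used (requests, pyodbc) are just one convenient way to exercise HTTP and SQL Server from a script, not part of the design itself.

    import datetime
    import uuid

    import pyodbc    # SQL Server access via ODBC
    import requests  # any HTTP client would do

    BLOB_SERVER = "http://blobserver"                      # hypothetical web server
    SQL_CONN_STR = "DSN=BlobIndex;Trusted_Connection=yes"  # hypothetical connection string


    def save_chunk(data: bytes) -> str:
        """Upload a chunk, then record it in the index table.

        The index row is inserted only after the upload succeeds, so a process
        that dies halfway leaves an orphaned file but no visible chunk.
        """
        created = datetime.datetime.utcnow()
        date_dir = created.strftime("%Y%m%d")              # e.g. "20101112"
        url = f"{BLOB_SERVER}/{date_dir}/{uuid.uuid4()}"

        # 1. Upload the blob; the web server stores it under the dated directory.
        response = requests.put(url, data=data)
        response.raise_for_status()

        # 2. Only now does the chunk "exist": insert its index row.
        with pyodbc.connect(SQL_CONN_STR) as conn:
            conn.execute(
                "INSERT INTO BlobIndex (Url, CreatedUtc) VALUES (?, ?)",
                url, created,
            )
        return url


    def load_chunk(url: str) -> bytes:
        """Fetch a chunk back by the URL that other processes pass around."""
        response = requests.get(url)
        response.raise_for_status()
        return response.content

Orphaned files left by crashed uploads could presumably be swept up later by diffing a dated directory against the index rows for that day.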

In summary, we implement "blobs" on top of SQL Server much the same way they are implemented internally - but in a way that does not use very much actual space on an actual SQL Server instance.
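
The daily listing and deletion requirements then fall out of the same two structures (the index table and the dated directories). Another sketch under the same assumptions as above, plus a hypothetical D:\blobcache root that the web server serves from:

    import datetime
    import pathlib
    import shutil

    import pyodbc

    SQL_CONN_STR = "DSN=BlobIndex;Trusted_Connection=yes"  # hypothetical connection string
    CACHE_ROOT = pathlib.Path(r"D:\blobcache")             # hypothetical NTFS root behind the web server


    def chunks_for_day(day: datetime.date) -> list[str]:
        """Return the URL of every chunk produced on a particular day."""
        with pyodbc.connect(SQL_CONN_STR) as conn:
            rows = conn.execute(
                "SELECT Url FROM BlobIndex WHERE CreatedUtc >= ? AND CreatedUtc < ?",
                day, day + datetime.timedelta(days=1),
            ).fetchall()
        return [row.Url for row in rows]


    def delete_day(day: datetime.date) -> None:
        """Support operation: drop a whole day of chunks.

        Delete the index rows first so the chunks stop being visible,
        then remove the dated directory behind the web server.
        """
        with pyodbc.connect(SQL_CONN_STR) as conn:
            conn.execute(
                "DELETE FROM BlobIndex WHERE CreatedUtc >= ? AND CreatedUtc < ?",
                day, day + datetime.timedelta(days=1),
            )
        shutil.rmtree(CACHE_ROOT / day.strftime("%Y%m%d"), ignore_errors=True)

Removing the last hour's chunks would look similar, except those files share a directory with the rest of the day, so they would have to be deleted individually using the URLs returned by the index query.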

So my questions are:

  1. Does this sound reasonable? Is it insane?
  2. How well do you think this would work on top of a typical Windows NTFS filesystem? (5,000 files per "dated" directory, several hundred directories, one for each day.) There would eventually be many hundreds of thousands of files, but not too many directly underneath any one particular directory. Would we start to have to worry about hard disk fragmentation, etc.?
  3. What if 20 processes are all, via the one web server, trying to write 20 different "chunks" at the same time? Would that start thrashing the disk?
  4. What web server would be best to use? It needs to be rock solid, run on Windows, and be able to handle lots of concurrent users.

As you might have guessed, outside of the corporate limitations I would probably just set up a SQL Server instance and have a table with a "varbinary(max)" column... But given that this is not an option, how well do you think this approach would work?

This is all somewhat out of my usual scope, so I freely admit I'm a bit of a noob in this department. Maybe this is an appalling design... but it seems like it would be very simple to understand, maintain, and support.

A: 

Your reasons behind the design are insane, but they're not yours :)

NTFS can handle what you're trying to do; this shouldn't be much of a problem. Yes, you might eventually see fragmentation if you run low on disk space, but as long as you keep copious amounts of free space you shouldn't have a problem. If you're a Windows shop, just use IIS.

I really don't think you will have much of a problem with this architecture. Just keep it simple like you're doing and things should be fine.

Dave Markle

Paul Hollingsworth: Thanks for the vote of confidence.

Paul Hollingsworth: "Unaccepted" the answer - no offence, but hoping this might get more views/comments... I'd really like to get some more advice on this.

Dave Markle: What, are you having people talking smack about your solution at the office?

Paul Hollingsworth: Heheh, no, we are going ahead with it. Marking it accepted again.