views:

148

answers:

3

I have a design decision to make regarding documents uploaded to my web site: I can either store them on my file server somewhere, or I can store them as a blob in my database (MSSQL 2005). If it makes any difference to the design decision, these documents are confidential and must have a certain degree of protection.

The considerations I've thought of are:

  1. Storing on the file server makes for HUUUUUUUGE numbers of files all dumped in a single directory, and therefore slower access, unless I can work out a reasonable semantic definition for a directory tree structure
  2. OTOH, I'm guessing that the file server can handle compression somewhat better than the DB... or am I wrong?
  3. My instincts tell me that the DB's security is stronger than the file server's, but I'm not sure if that's necessarily true.
  4. Don't know how having terabytes of blobs in my DB will affect performance.

I'd very much appreciate some recommendations here. Thanks!

+3  A: 

In SQL Server 2005, you only have the choice of using VARBINARY(MAX) to store the files inside the database table, or then keep them outside.

The obvious drawback of leaving them outside the database is that the database can't really control what happens to them; they could be moved, renamed, deleted.....

SQL Server 2008 introduces the FILESTERAM attribute on VARBINARY(MAX) types, which allows you to leave the files outside the database table, but still under transactional control of the database - e.g. you cannot just delete the files from the disk, the files are integral part of the database and thus get copied and backed up with it. Great if you need it, but it could make for some huge backups! :-)

The SQL Server 2008 launch presented some "best practices" as to when to store stuff in the database directly, and when to use FILESTREAM. These are:

  • if the files are typically less than 256 KB in size, the database table is the best option
  • if the files are typically over 1 MB in size, or could be more than 2 GB in size, then FILESTREAM (or in your case: plain old filesystem) is your best choice
  • no recommendation for files between those two margins

Also, in order not to negatively impact performance of your queries, it's often a good idea to put the large files into a separate table alltogether - don't have the huge blobs be part of your regular tables which you query - but rather create a separate table, which you only ever query against, if you really need the megabytes of documents or images.

So that might give you an idea of where to start out from!

marc_s
+ answer credit: among some other very good answers, this was the most balanced one with the biggest picture. Thanks!
Shaul
+1  A: 

There's a LOT of "it depends" behind this popular subject. Since you say the documents are sensitive and confidential, off the cuff I'd go with storing in the database. Here are a few reasons:

  • Potentially better security. It is often easier to hack a file system than a database.
  • Better volume control. Thousands of files in one folder can strain an OS, where a database can take millions of rows in one table without blinking.
  • Better searching and scanning. Add categorizing columns when you load the data, or try out full text indexing to scan the actual documents.
  • Backups may be more efficient -- just add another database to your backup plan, and you're covered (once you work out space details, of course). And those backup files are another layer of obfuscation on anyone trying to get at your sensitive documents.
  • SQL Server 2008 has data compression options that may help here. That, or have the application do it? (More security through obfuscation, perhaps)

SQL Server 2008 also has the filestream data type, which may help here, but I'm not familiar enough with it to give a recommendation for your situation.

Philip Kelley
+1  A: 

I strongly suggest you to consider the filesystem solution. The reasons are:

  • you have better access to the files (precious in case of debugging), meaning that you can use regular console-based tools
  • you can quickly and easily take advantage of the OS to distribute the load, for example using a distributed filesystem, add redundancy via a hardware RAID etc.
  • you can take advantage of the OS access control lists to enforce permissions.
  • you don't clog your database

If you are worried about large amounts of entries in your directories, you can always create a branching schema. for example:

filename : hello.txt
filename md5: 2e54144ba487ae25d03a3caba233da71
final filesystem position: /path/2e/54/hello.txt
Stefano Borini