One of our teams is going to be developing an application to store records in a SQL2008 database, and each of these records will have an associated PDF file. There is currently about 340 GB of files; most (70%) are about 100 KB, but some are several megabytes in size. Data is mostly inserted and read, but the files are updated on occasion. We are debating between the following options:

  1. Store the files as BLOBs in the database.

  2. Store the files outside the database and store the paths in the database.

  3. Use SQL2008's Filestream feature to store the files.

We have read the Microsoft best practices regarding filestream data, but since the files vary in size, we are not sure which path to choose. We are leaning toward option 3 (filestream), but have some questions:

  1. Which architecture would you choose given the amount of data and file sizes noted above?

  2. Data access will be done using SQL authentication, not Windows authentication, and the web server will likely not be able to access the files using the Windows API. Would this make filestream perform worse than the other two options?

  3. Since the SQL backups include the filestream data, this would lead to very large database backups. How do others handle backing up databases with a large amount of filestream data?

A: 

Store the files outside the database and store the paths in the database.

because storing the files in the database takes too much space.

Beth
What about the filestream option?
NYSystemsAnalyst
Yeah. That really is worth it. Good post, Beth - bad argument. Good thing you don't work for me.
TomTom
Seen the whitepaper? It discusses limitations: http://msdn.microsoft.com/en-us/library/cc949109%28SQL.100%29.aspx
Beth
@TomTom- ditto.
Beth
A: 

I would definitely recommend (3) - this is the sort of scenario that this feature is specifically built to handle, and it is handled very well in my opinion.

This white paper has lots of useful information - http://msdn.microsoft.com/en-us/library/cc949109(SQL.100).aspx - and from a security point of view mentions that...

There are two security requirements for using the FILESTREAM feature. Firstly, SQL Server must be configured for integrated security. Secondly, if remote access will be used, then the SMB port (445) must be enabled through any firewall systems.

With regard to Backups, see the accepted answer to this question - http://stackoverflow.com/questions/1419546/sql-server-filestream-limitation
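For reference, a minimal sketch of what enabling FILESTREAM and defining a FILESTREAM-backed table looks like - all database, path, and table names here are invented for illustration, not taken from the question:

```sql
-- Enable FILESTREAM T-SQL access at the instance level (level 2 also
-- allows Win32 streaming access; the Windows-service side must be
-- enabled in SQL Server Configuration Manager as well).
EXEC sp_configure 'filestream access level', 2;
RECONFIGURE;

-- A database with a FILESTREAM filegroup; the FILENAME for that
-- filegroup is a directory that SQL Server creates and manages.
CREATE DATABASE DocStore
ON PRIMARY (NAME = DocStoreData,  FILENAME = 'C:\Data\DocStore.mdf'),
FILEGROUP DocStoreFS CONTAINS FILESTREAM
           (NAME = DocStoreBlobs, FILENAME = 'C:\Data\DocStoreBlobs')
LOG ON     (NAME = DocStoreLog,   FILENAME = 'C:\Data\DocStore.ldf');

-- A FILESTREAM column requires a uniqueidentifier ROWGUIDCOL
-- with a UNIQUE constraint on the same table.
CREATE TABLE dbo.PdfDocuments
(
    DocumentId UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE
               DEFAULT NEWSEQUENTIALID(),
    FileName   NVARCHAR(260)  NOT NULL,
    Content    VARBINARY(MAX) FILESTREAM NULL
);
```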

barrylloyd
+2  A: 

OK, here we go. 1 is a really bad idea - you end up with untestable integrity constraints and backups that are not guaranteed to be consistent, by definition, because you cannot take point-in-time backups. Not a problem in MOST scenarios, but it turns into one the moment you need a more complicated (point-in-time) recovery.

2 and 3 are pretty equal, albeit with some implications.

  • Filestream can use a lot more disc space. Basically, every version has a GUID; if you make updates, the old files stay around until the next backup.
  • OTOH the files do not count toward the db size (Express Edition - not against the 10 GB limit, should you use it), and access is additionally possible via a file share. This is added flexibility.

  • In-database storage has the most limited access options (no way for the web server to just open the file after getting the path from SQL - it has to funnel the complete file through the SQL protocol layer), but it has the advantage of fewer files on disk. Putting the blobs into a separate table, and that table on a separate set of spindles, may be strategically a good idea.

Regarding your questions:

1: I would go with in-database storage. Try out both - filestream and not. As you use the same API either way, it is a simple change in the table definition.

2: Yes, performance would be worse than with direct file access, but it would also be more protected than direct file access. Otherwise I do not think filestream and blob make a significant difference.

3: Where do you have a huge backup here? Sorry to ask, but your 340 GB is not exactly a large database. And you need to back it up ANYWAY. Better to do it in one consistent state, which is what you achieve with db storage. Plus integrity (no one accidentally deleting unused documents without cleaning up the database). The DB is not significantly larger than with the split approach, and it is a simple one-place backup.

In the end, the question is db integrity and ease of backing things up. Win for SQL Server unless you get large - and this means 360 terabytes of data.
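To illustrate the "simple change in the table definition" point in (1), a sketch with invented table names - the only difference between a plain in-database blob and FILESTREAM is the column definition, so the INSERT/SELECT code stays identical:

```sql
-- Plain in-database blob: the PDF bytes live in normal data pages.
CREATE TABLE dbo.Docs_Blob
(
    DocId   INT IDENTITY PRIMARY KEY,
    RowGuid UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
    Pdf     VARBINARY(MAX) NULL
);

-- Same logical schema with FILESTREAM: the bytes live in the
-- FILESTREAM container on the file system instead.
CREATE TABLE dbo.Docs_FileStream
(
    DocId   INT IDENTITY PRIMARY KEY,
    RowGuid UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
    Pdf     VARBINARY(MAX) FILESTREAM NULL
);
```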

TomTom
In terms of backup size... for a normal file system, the files are completely backed up the first time around. For each subsequent backup, only changed files are backed up, and unchanged files are just persisted on the backup media. Therefore, 340 GB requires one large initial load, then just small backups after that. However, if we do a full backup on this database even weekly, that's 340 GB+ of data backed up each week and retained for a few months (as per our backup retention policy). That will add up to several TB of backups for one DB - which is a lot for us.
NYSystemsAnalyst
This is pretty much not true - or also true for SQL Server. You can do differential backups with SQL Server, too, you know. The behavior is totally the same for both. Plus, seriously, several TB is peanuts. Put in a RAID of 2 TB drives and 20 TB is just a thousand euro in drives.
TomTom
Are you sure about the point of option 1 (BLOBs) giving inconsistent view and no consistent recovery option? I would say that these are related to option 2 (file system storage)
Remus Rusanu
Blob and filestream are the same here, as you do the backup once, with one tool only.
TomTom
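The differential-backup behaviour discussed in the comments above can be sketched as follows (the database name and backup paths are invented for illustration); the differential writes only the extents - and FILESTREAM files - changed since the last full backup:

```sql
-- Weekly full backup: includes the FILESTREAM container.
BACKUP DATABASE DocStore
    TO DISK = 'E:\Backups\DocStore_full.bak';

-- Daily differential: only data changed since the last full backup,
-- so unchanged PDF files are not written again.
BACKUP DATABASE DocStore
    TO DISK = 'E:\Backups\DocStore_diff.bak'
    WITH DIFFERENTIAL;
```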
A: 

I've used an index/content method that you haven't listed, but it might help. You have a table of files stored as a blob of binary data, each with a unique id or row number. A second SQL table provides the index: the name of the file, the path to it, keywords, file type, file size, checksum... whatever you need. This is the best approach I have seen for storing thousands of uploaded documents. The index is required to view the file, as it would just be binary data to the user if they have no idea what the file type is. We store the data in two separate databases, which allows the index on one server and the file store on multiple servers for easy expansion. At that point the index table/database contains the name or key of the server the file is on. If the user has access to read that particular index table, then they have access to the file.
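A rough sketch of the two-table layout described above - all table and column names are invented for illustration, and in practice the two tables could live in separate databases or on separate servers, as the answer suggests:

```sql
-- "Index" table: metadata plus the location of the content.
CREATE TABLE dbo.FileIndex
(
    FileId      INT IDENTITY PRIMARY KEY,
    FileName    NVARCHAR(260) NOT NULL,
    FileType    NVARCHAR(20)  NOT NULL,  -- e.g. 'pdf'
    FileSize    BIGINT        NOT NULL,
    Checksum    BINARY(20)    NULL,      -- e.g. a SHA-1 of the content
    StoreServer NVARCHAR(128) NOT NULL   -- which content server holds it
);

-- "Content" table, potentially on another server: id plus raw bytes.
CREATE TABLE dbo.FileContent
(
    FileId  INT PRIMARY KEY,             -- matches FileIndex.FileId
    Content VARBINARY(MAX) NOT NULL
);
```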

Justin
A: 

Have you looked at the RBS (Remote Blob Storage) solution? If you use the Filestream RBS provider, it will internally keep your blobs as Filestream files or varbinary(max) values, depending on which gets better performance for the blob size.

Remote BLOB Store Provider Library Implementation Specification

SQL Remote Blob Storage Team Blog

Pawel Marciniak