views: 316
answers: 2
I need to store binary files in a varbinary(max) column on SQL Server 2005 like this:

FileInfo

  • FileInfoId int, PK, identity
  • FileText varchar(max) (can be null)
  • FileCreatedDate datetime etc.

FileContent

  • FileInfoId int, PK, FK
  • FileContent varbinary(max)

FileInfo has a one to one relationship with FileContent. The FileText is meant to be used when there is no file to upload, and only text will be entered manually for an item. I'm not sure what percentage of items will have a binary file.
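A hypothetical DDL sketch of the two-table layout described above (the table and column names come from the question; the constraint names and NOT NULL choices are assumptions):

```sql
CREATE TABLE FileInfo (
    FileInfoId      int IDENTITY(1,1) NOT NULL,
    FileText        varchar(max) NULL,          -- used when no file is uploaded
    FileCreatedDate datetime NOT NULL,
    CONSTRAINT PK_FileInfo PRIMARY KEY (FileInfoId)
);

CREATE TABLE FileContent (
    FileInfoId  int NOT NULL,                   -- PK and FK: enforces the 1-1 relationship
    FileContent varbinary(max) NOT NULL,
    CONSTRAINT PK_FileContent PRIMARY KEY (FileInfoId),
    CONSTRAINT FK_FileContent_FileInfo FOREIGN KEY (FileInfoId)
        REFERENCES FileInfo (FileInfoId)
);
```

Making FileInfoId both the primary key and the foreign key in FileContent guarantees at most one content row per FileInfo row.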

Should I create the second table? Would there be any performance improvement with the two-table design? Are there any logical benefits?

I've found this page, but I'm not sure if it applies in my case.

+1  A: 

There is no real logical advantage to the two-table design: since the relationship is 1-1, you could keep all the information in the FileInfo table. However, there are serious operational and performance advantages, particularly if your binary data is more than a few hundred bytes in size, on average.

EDIT: As pointed out by Remus Rusanu, some DBMS implementations such as SQL 2005 transparently store the large object types apart from the row, effectively alleviating the practical drawback of having big records. The introduction of this feature implicitly confirms the weakness of the [true] single-table approach.

I only scanned the SO posting referenced in this question. While that posting makes a few valid points, such as intrinsic data integrity (since all CRUD actions on a given item are atomic), on the whole, and barring relatively atypical use cases (such as using the item table as a repository queried mostly for single items at a time), the performance advantage lies with the two-table approach: indexes on the "header" table will be more efficient, queries that do not require the binary data will return much more quickly, etc.

And the two-table approach has further benefits if the design evolves to supply different types of binary objects in different contexts. For example, say these items are images (GIFs, JPGs etc.). At a later date you may also want to provide a small preview version of these images (and/or a high-resolution version), with the choice driven by context (user preference, low-bandwidth clients, subscriber vs. visitor etc.). In such a case, not only do the operational issues associated with the single-table approach become more acute, but the two-table model also becomes more versatile.

mjv
I can only see a problem if you tend to use `SELECT *` habitually instead of `SELECT`ing only the columns you need.
Joey
Which is bad practice in itself. The fix for that is to stop using SELECT *, not to change the database design.
GilaMonster
One can argue that the single-table design is good because the slow queries associated with [unnecessary] SELECT * will eventually force the developer to drop the * habit. :-) I think, however, that my opinion favoring the two-table approach holds independently of the SELECT * practice.
mjv
I agree about SELECT *, and I never use it. I guess what I was aiming at is exactly this: with regard to indexes and reading the other columns, would the server consume less CPU and memory if we had the varbinary column in a separate table? Let's assume I don't include the varbinary column in the select statement.
Dragoljub
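To make that question concrete, here is a sketch of the two query shapes under the two-table design (column names come from the question; the variable name is an assumption):

```sql
-- Listing query: touches only the narrow FileInfo table, so the
-- varbinary(max) pages in FileContent are never read at all.
SELECT FileInfoId, FileCreatedDate
FROM FileInfo
ORDER BY FileCreatedDate DESC;

-- Detail query: the binary content is joined in only when a single
-- item is actually requested.
SELECT fi.FileInfoId, fi.FileText, fc.FileContent
FROM FileInfo fi
LEFT JOIN FileContent fc ON fc.FileInfoId = fi.FileInfoId
WHERE fi.FileInfoId = @FileInfoId;
```

The LEFT JOIN covers items that have only manually entered text and no binary file.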
+4  A: 
Remus Rusanu
Thanks! This is exactly the info I was looking for. It's clear now. But I wonder... perhaps the 1-1 separation method dates back to the days of SQL Server 2000, when varbinary(max) didn't yet exist in the form it does in SQL Server 2005, and text, ntext and image were used for storing binary files? Why else would people have that idea? Or is it really only a principle to prevent people from making the SELECT * mistake? But they can still use a join :)
Dragoljub
+1 Remus, I forgot about this 2005-and-up feature. Regarding FILESTREAM in 2008, I wouldn't rush to use it for any respectably sized repository. At the cost of having to maintain the referential integrity of the "header data" against a plain file-system repository, that type of approach does offer several operational advantages in terms of scaling, backup, mirroring, TCO...
mjv
@mjv: FILESTREAM has some advantages if the file has to be accessed from a Win32 API (e.g. document integration like SharePoint). Backup/restore does include filestream content, but it's true that it does not work with mirroring. As with any new feature, the tricks of how to scale it up will take a while to percolate into common knowledge, but I know it *can* be scaled. On TCO, I'm not sure I understand what cost you are referring to. Deployment setup cost and training, perhaps?
Remus Rusanu
@Dragoljub: Frankly, I'm not sure how that advice about separating LOBs got started. Pre-2005, the text/ntext/image fields were already stored outside the table (a mess of a hairball storage, but not part of the clustered leaf pages, so not part of a scan/seek). I can only think of DBAs trying to avoid the effect of * in projection lists.
Remus Rusanu
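As a side note on the in-row vs. out-of-row behavior discussed above: in SQL Server 2005 and later, even a single-table design can push its LOB columns off the data pages with `sp_tableoption`. A sketch, assuming a hypothetical single-table variant named dbo.FileInfo that holds the varbinary(max) column inline:

```sql
-- With 'large value types out of row' enabled, varchar(max)/nvarchar(max)/
-- varbinary(max) values are stored on separate LOB pages, leaving only a
-- 16-byte pointer in the row, so scans over the other columns do not
-- read the binary data.
EXEC sp_tableoption N'dbo.FileInfo', 'large value types out of row', 1;
```

By default these types are stored in-row when the value fits (up to roughly 8000 bytes), which is what makes wide rows expensive to scan in the single-table design.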