views: 370
answers: 8

I have an application that creates records in a table (rocket science, I know). Users want to associate files (.doc, .xls, .pdf, etc...) to a single record in the table.

  • Should I store the contents of the file(s) in the database? Wouldn't this bloat the database?

  • Should I store the file(s) on a file server, and store the path(s) in the database?

What is the best way to do this?

+2  A: 

Use the database for data and the filesystem for files. Simply store the file path in the database.

In addition, your webserver can probably serve files more efficiently than your application code can (your code would otherwise have to stream the file from the DB back to the client).

cherouvim
+2  A: 

Store the paths in the database. This keeps your database from bloating, and also allows you to separately back up the external files. You can also relocate them more easily; just move them to a new location and then UPDATE the database.

One additional thing to keep in mind: In order to use most of the filetypes you mentioned, you'll end up having to:

  • Query the database to get the file contents in a blob
  • Write the blob data to a disk file
  • Launch an application to open/edit/whatever the file you just created
  • Read the file back in from disk to a blob
  • Update the database with the new content

All that as opposed to:

  • Read the file path from the DB
  • Launch the app to open/edit/whatever the file

I prefer the second set of steps, myself.
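A minimal sketch of that second set of steps, using SQLite and a hypothetical `attachments` table (the schema and path here are illustrative, not part of the original answer):

```python
import sqlite3

# Hypothetical schema: one row per attached file, path only -- no blob column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE attachments (record_id INTEGER, file_path TEXT)")
conn.execute("INSERT INTO attachments VALUES (?, ?)", (42, "/files/report.pdf"))

# Step 1: read the file path from the DB.
(path,) = conn.execute(
    "SELECT file_path FROM attachments WHERE record_id = ?", (42,)
).fetchone()

# Step 2: hand the path to the OS/editor (e.g. os.startfile(path) on Windows).
print(path)  # prints /files/report.pdf
```

Note there is no blob round-trip: the application never copies the file contents through the database.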

Ken White
+2  A: 

You should only store files in the database if you're reasonably sure you know that the sizes of those files aren't going to get out of hand.

I use our database to store small banner images, whose sizes I always know in advance. The database stores a pointer to the data inside the row and keeps the data itself elsewhere, so it doesn't necessarily impact speed.

If there are too many unknowns though, using the filesystem is the safer route.

Kevin Laity
+2  A: 

The best solution would be to put the documents in the database. This simplifies all the linking, backup, and restore issues - but it might not fit the basic 'we just want to point to documents on our file server' mindset the users may have.

It all depends (in the end) on actual user requirements.

But my recommendation would be to put it all together in the database so you retain control of the files. Leaving them in the file system leaves them open to being deleted, moved, ACL'd, or any one of hundreds of other changes that could render your links to them pointless or even damaging.

Database bloat is only an issue if you haven't sized for it. Do some tests and see what effects it has. 100GB of files on a disk is probably just as big as the same files in a database.

Brody
The only problem with storing them in the database is that, to use them in any way, you end up having to store them on disk anyway. The process of getting from a blob column to a disk file can be complicated, depending on the DBMS you're using.
Ken White
+1  A: 

I would try to store it all in the database. I haven't done it myself, but if you don't, there is a small risk that the file names in the database get out of sync with the files on disk. Then you have a big problem.

Flinkman
+4  A: 

I think you've accurately captured the two most popular approaches to solving this problem. There are pros and cons to each:

Store the Files in the DB

Most RDBMSs have support for storing blobs (binary file data: .doc, .xls, etc.) in a db, so you're not breaking new ground here.

Pros

  • Simplifies backup of the data: back up the db and you have all the files.
  • The linkage between the metadata (the other columns ABOUT the files) and the file itself is solid and built into the db; so it's a one-stop shop to get data about your files.

Cons

  • Backups can quickly blossom into a HUGE nightmare as you're storing all of that binary data with your database. You could alleviate some of the headaches by keeping the files in a separate DB.
  • Without the DB or an interface to the DB, there's no easy way to get to the file content to modify or update it.
  • In general, it's harder to code and coordinate the upload and storage of data to a DB vs. the filesystem.

Store the Files on the FileSystem

This approach is pretty simple: you store the files themselves in the filesystem, and your database stores a reference to the file's location (as well as all of the metadata about the file). One helpful hint here is to standardize your naming scheme for the files on disk (don't use the file name the user gives you; generate one of your own and store theirs in the db).

Pros

  • Keeps your file data cleanly separated from the database.
  • Easy to maintain the files themselves: if you need to swap out or update a file, you do so in the file system directly. You can just as easily do it from the application via a new upload.

Cons

  • If you're not careful, your database records about the files can get out of sync with the files themselves.
  • Security can be an issue (again if you're careless) depending on where you store the files and whether or not that filesystem is available to the public (via the web I'm assuming here).
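The naming hint above (generate your own file name, keep the user's name in the db) can be sketched like this, assuming a simple `attachments` table and a UUID-based scheme; all names here are illustrative:

```python
import sqlite3
import uuid
from pathlib import Path

def save_upload(conn, record_id, original_name, data):
    """Store the bytes under a generated name; keep the user's name in the DB."""
    stored_name = uuid.uuid4().hex + Path(original_name).suffix
    # (storage_root / stored_name).write_bytes(data)  # actual disk write, elided
    conn.execute(
        "INSERT INTO attachments (record_id, original_name, stored_name) "
        "VALUES (?, ?, ?)",
        (record_id, original_name, stored_name),
    )
    return stored_name

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE attachments "
    "(record_id INTEGER, original_name TEXT, stored_name TEXT)"
)
name = save_upload(conn, 1, "Quarterly Report.xls", b"...")
print(name)  # e.g. 'f3a1...9c.xls' -- no user-supplied characters reach the disk
```

Because the on-disk name is generated, two users uploading files called "report.doc" can never collide, and path-traversal characters in user input never touch the filesystem.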

At the end of the day, we chose to go the filesystem route. It was easier to implement quickly, easy on the backup, and pretty secure once we locked down any holes and streamed the file out (instead of serving it directly from the filesystem). It's been operational in pretty much the same format for about 6 years in two different government applications.

J

Jay Stevens
+3  A: 

How well you can store binaries, or BLOBs, in a database will be highly dependent on the DBMS you are using.

If you store binaries on the file system, you need to consider what happens in the case of a file name collision, where you try to store two different files with the same name - and whether or not this is a valid operation. So, along with the reference to where the file lives on the file system, you may also need to store the original file name.

Also, if you are storing a large amount of files, be aware of possible performance hits of storing all your files in one folder. (You didn't specify your operating system, but you might want to look at this question for NTFS, or this reference for ext3.)

We had a system that had to store several thousands of files on the file system, on a file system where we were concerned about the number of files in any one folder (it may have been FAT32, I think).

Our system would take a new file to be added, and generate an MD5 checksum for it (in hex). It would take the first two characters and make that the first folder, the next two characters and make that the second folder as a sub-folder of the first folder, and then the next two as the third folder as a sub-folder of the second folder.

That way, we ended up with a three-level set of folders, and the files were reasonably well scattered so no one folder filled up too much.

If we still had a file name collision after that, then we would just add "_n" to the file name (before the extension), where n was just an incrementing number, until we got a name that didn't exist (and even then, I think we did atomic file creation, just to be sure).
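That hashing scheme can be sketched in a few lines (a hypothetical helper, not the original system's code):

```python
import hashlib
from pathlib import Path

def shard_path(data: bytes) -> Path:
    """Derive a three-level folder path from the file's MD5 checksum."""
    digest = hashlib.md5(data).hexdigest()
    # The first three pairs of hex characters become nested folder names,
    # scattering files across up to 256**3 folders.
    return Path(digest[0:2]) / digest[2:4] / digest[4:6] / digest

p = shard_path(b"example document contents")
print(p)  # e.g. ab/cd/ef/abcdef... (three folder levels, then the file)
```

Collision handling (the "_n" suffix) would sit on top of this, renaming only when the target path already exists.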

Of course, then you need tools to do the occasional comparison of the database records to the file system, flagging any missing files and cleaning up any orphaned ones where the database record no longer exists.

Evan
A: 

And now for the completely off-the-wall suggestion - you could consider storing the binaries as attachments in a CouchDB document database. This would avoid the file name collision issues, as you would use a generated UID as each document ID (which is what you would store in your RDBMS), and the actual attachment's file name is kept with the document.

If you are building a web-based system, then the fact that CouchDB uses REST over HTTP could also be leveraged. And, there's also the replication facilities that could prove of use.

Of course, CouchDB is still in incubation, although there are some who are already using it 'in the wild'.

Evan