We are building a solution for document storage and for each document we need to store a lot of extra metadata with it to comply with local regulations, ranging from basic data like title or description to dates of relevant events or disposition and classification rules.

I've seen different types of solutions, but none convinces me:

  1. Tables that grow in columns when a new metadata slot is added (so they have as many columns as metadata associated with the documents)
  2. Tables with a lot of spare generic columns. Very similar to 1, but the tables don't grow (so fewer permissions are needed).
  3. A table of document ids, metadata keys and metadata values.
  4. A table with metadata definitions, where the metadata keys in 3 are replaced by metadata ids. We used this solution in the past. The tables end up with millions of rows.
  5. A text field in the document table or an associated table that stores XML or another structured format with all the metadata as key-value pairs.

I'm biased towards number 5, providing a parallel full-text index (Lucene.Net? Other?) to search by relevant metadata (not everything has to be "searchable").

Any suggestions? Similar experiences?

+1  A: 

Table 1: Document information (PK is document ID)

Table 2: Metadata definitions (PK is metadata definition ID)

Table 3: Document ID, metadata definition ID, metadata value

The biggest drawback to this is that you'd either have to have a single type (varchar, presumably), or you'd have to have n columns (where n is the number of data types you're willing to store), and use a column in the metadata definitions table to identify which column in table 3 to pull the value from.
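A minimal sketch of the three-table layout with one value column per data type, using Python's built-in `sqlite3` module. All table and column names (`document`, `metadata_definition`, `document_metadata`, `value_column` and so on) are hypothetical, chosen only for illustration; the three supported types are an assumption too.

```python
import sqlite3

# Hypothetical schema: every table and column name here is illustrative.
SCHEMA = """
CREATE TABLE document (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE metadata_definition (
    id           INTEGER PRIMARY KEY,
    name         TEXT NOT NULL UNIQUE,
    -- Tells readers which typed column of document_metadata holds the value.
    value_column TEXT NOT NULL
        CHECK (value_column IN ('text_value', 'int_value', 'date_value'))
);
CREATE TABLE document_metadata (
    document_id   INTEGER NOT NULL REFERENCES document(id),
    definition_id INTEGER NOT NULL REFERENCES metadata_definition(id),
    text_value    TEXT,
    int_value     INTEGER,
    date_value    TEXT,  -- ISO 8601 date string
    PRIMARY KEY (document_id, definition_id)
);
"""

VALUE_COLUMNS = {"text_value", "int_value", "date_value"}


def set_metadata(conn, document_id, name, value):
    """Store a value in the typed column named by the metadata definition."""
    row = conn.execute(
        "SELECT id, value_column FROM metadata_definition WHERE name = ?",
        (name,),
    ).fetchone()
    if row is None:
        raise KeyError(f"unknown metadata definition: {name}")
    definition_id, column = row
    if column not in VALUE_COLUMNS:  # whitelist before interpolating into SQL
        raise ValueError(f"unexpected value column: {column}")
    conn.execute(
        f"INSERT OR REPLACE INTO document_metadata "
        f"(document_id, definition_id, {column}) VALUES (?, ?, ?)",
        (document_id, definition_id, value),
    )


def get_metadata(conn, document_id):
    """Return {name: value}, reading each value from its declared column."""
    rows = conn.execute(
        "SELECT d.name, d.value_column, m.text_value, m.int_value, m.date_value "
        "FROM document_metadata m "
        "JOIN metadata_definition d ON d.id = m.definition_id "
        "WHERE m.document_id = ?",
        (document_id,),
    ).fetchall()
    result = {}
    for name, column, text_value, int_value, date_value in rows:
        typed = {
            "text_value": text_value,
            "int_value": int_value,
            "date_value": date_value,
        }
        result[name] = typed[column]
    return result


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO document (id, title) VALUES (1, 'Contract')")
    conn.executemany(
        "INSERT INTO metadata_definition (name, value_column) VALUES (?, ?)",
        [
            ("classification", "text_value"),
            ("retention_years", "int_value"),
            ("disposition_date", "date_value"),
        ],
    )
    set_metadata(conn, 1, "classification", "confidential")
    set_metadata(conn, 1, "retention_years", 10)
    set_metadata(conn, 1, "disposition_date", "2030-01-01")
    return get_metadata(conn, 1)
```

Adding a new metadata slot is then an insert into the definitions table rather than a schema change, at the cost of a join and a per-type column lookup on every read.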

My opinions on the 5 solutions listed:

  1. Growing tables is a pain, and could cause issues down the line (particularly if you want/need a non-nullable metadata value).
  2. I hate 'spare generic columns' with a passion (even though they're popular).
  3. Close, but this limits your metadata flexibility even more than my solution. If your metadata keys and values are fairly basic, it might work.
  4. I'm not really sure what you mean by this one - is it the same as I'm proposing, or something else?
  5. I don't like storing structured XML in an RDBMS - you lose most of the power of the RDBMS by doing this IMHO.

Those are my thoughts - I've never designed a system like this, but I have dealt with commercial systems that have used several of these schemes.

Harper Shelby
Yes, number 2 is popular (e.g. SharePoint), but I agree with you, it is an awkward solution.
Marc Climent
I accept this as the answer. Number 4 is what Harper proposes and it's a good solution from the RDBMS point of view. I think I'll combine that (which is what we actually have) with an index and search engine that takes care of the relevant metadata.
Marc Climent
+1  A: 

Why not use CouchDB? It's designed precisely to address this type of requirement.

If that is not an option, consider using Lua or JSON (per your option 5) as the metadata descriptor.
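A rough sketch of the JSON variant, assuming the metadata is serialized into a single text column and only a chosen subset of keys is copied into a side table for lookups. The side table is a plain SQL stand-in for the full-text index mentioned in the question, not Lucene.Net; the names `SEARCHABLE_KEYS`, `metadata_json` and `metadata_index` are hypothetical.

```python
import json
import sqlite3

# Hypothetical: only these keys are copied into the lookup table.
SEARCHABLE_KEYS = {"title", "classification"}


def open_store():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE document (
            id            INTEGER PRIMARY KEY,
            metadata_json TEXT NOT NULL      -- all metadata, serialized
        );
        CREATE TABLE metadata_index (
            document_id INTEGER NOT NULL REFERENCES document(id),
            meta_key    TEXT NOT NULL,
            meta_value  TEXT NOT NULL
        );
        CREATE INDEX idx_metadata_lookup
            ON metadata_index (meta_key, meta_value);
        """
    )
    return conn


def save_document(conn, document_id, metadata):
    """Store the full metadata as JSON and refresh the searchable subset."""
    conn.execute(
        "INSERT OR REPLACE INTO document (id, metadata_json) VALUES (?, ?)",
        (document_id, json.dumps(metadata)),
    )
    conn.execute(
        "DELETE FROM metadata_index WHERE document_id = ?", (document_id,)
    )
    conn.executemany(
        "INSERT INTO metadata_index (document_id, meta_key, meta_value) "
        "VALUES (?, ?, ?)",
        [
            (document_id, key, str(value))
            for key, value in metadata.items()
            if key in SEARCHABLE_KEYS
        ],
    )


def find_documents(conn, key, value):
    """Return ids of documents whose searchable key equals the given value."""
    rows = conn.execute(
        "SELECT document_id FROM metadata_index "
        "WHERE meta_key = ? AND meta_value = ? ORDER BY document_id",
        (key, str(value)),
    ).fetchall()
    return [row[0] for row in rows]


def load_metadata(conn, document_id):
    """Return the full metadata dict, or None if the document is unknown."""
    row = conn.execute(
        "SELECT metadata_json FROM document WHERE id = ?", (document_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Keys outside `SEARCHABLE_KEYS` still round-trip through the JSON column but cannot be queried without parsing every document, which is the trade-off of this option.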

I see your point, but this project should not rely on third parties to store the information. It's a bit of NIH, but it's a core business function of our product range.
Marc Climent
+1  A: 

Maybe you can take a look at JCR (Java Content Repository). JCR is a standard for content repositories that captures the common requirements of content management, such as versioning, full-text search and editing. It also provides a level of abstraction over content storage, which means you can use one API to put content into any kind of storage system, such as a database or XML files. You can add metadata to a document by adding properties to the document node through the JCR API, so you don't have to worry about how the document and its metadata are stored; JCR takes care of it. Jackrabbit is the reference implementation of JCR. Give it a try.

yanky
Actually JCR is very interesting but I haven't found anything like that in the .NET world (and porting it is not an option).
Marc Climent