views:

395

answers:

3

I've seen a few questions on this topic already, but I'm looking for some insight into the performance differences between these two techniques.

For example, let's say I am recording a log of events which come into the system with a dictionary of key/value pairs specific to each event. I will record an entry in an Events table with the base data, but I also need a way to link the additional key/value data. I will never know what kinds of keys or values will come in, so any sort of predefined enum table seems out of the question.

This event data will be constantly streaming in, so insert times are just as important as query times.

When I query for specific events I will be using some fields on the Event as well as the key/value data. For the XML approach I would simply use an Attributes.exist('xpath') call as part of the WHERE clause to filter the records.
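
Something like this is what I have in mind (the table, element and attribute names here are just illustrative, not my actual schema):

    -- Illustrative schema: base event fields plus an XML bag of attributes
    CREATE TABLE dbo.Events (
        EventId    INT IDENTITY(1,1) PRIMARY KEY,
        EventType  VARCHAR(50) NOT NULL,
        OccurredAt DATETIME    NOT NULL,
        Attributes XML         NULL
    );

    -- Filter on base columns and on the presence of a key/value inside the XML
    SELECT EventId, EventType, OccurredAt
    FROM dbo.Events
    WHERE EventType = 'Login'
      AND Attributes.exist('/attributes/attribute[@key="UserName"][@value="bob"]') = 1;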

The normalized way would be a table with basically Key and Value fields and a foreign key link to the Event record. This seems clean and simple, but I worry about the amount of data involved.
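
Roughly, the normalized shape I have in mind would be something like this (again, the names are just illustrative):

    -- One row per key/value pair, linked back to the owning event
    CREATE TABLE dbo.EventAttributes (
        EventId INT           NOT NULL REFERENCES dbo.Events (EventId),
        [Key]   VARCHAR(100)  NOT NULL,
        [Value] NVARCHAR(400) NOT NULL,
        CONSTRAINT PK_EventAttributes PRIMARY KEY (EventId, [Key])
    );

    -- Equivalent filter to the XML version, but via a join
    SELECT e.EventId, e.EventType, e.OccurredAt
    FROM dbo.Events AS e
    JOIN dbo.EventAttributes AS a
        ON a.EventId = e.EventId
    WHERE e.EventType = 'Login'
      AND a.[Key] = 'UserName'
      AND a.[Value] = N'bob';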

+3  A: 

You've got three major options for a 'flexible' storage mechanism.

  • XML fields are flexible but put you in the realm of blob storage, which is slow to query. I've seen queries against small data sets of 30,000 rows take 5 minutes when they were digging stuff out of the blobs with XPath queries. This is by far the slowest option, but it is also the most flexible.

  • Key/value pairs are a lot faster, particularly if you put a clustered index on the event key. This means that all attributes for a single event will be physically stored together in the database, which will minimise the I/O. The approach is less flexible than XML but substantially faster. The most efficient queries to report against it involve pivoting the data (i.e. a table scan to make an intermediate flattened result); joining to get individual fields will be much slower. There is a rough sketch of this after the list.

  • The fastest approach is to have a flat table with a set of user-defined fields (Field1 - Field50) and hold some metadata about the contents of the fields. This is the fastest to insert into and the fastest and easiest to query, but the contents of the table are opaque to anything that does not have access to the metadata.
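
As a rough sketch of the second option (the key names and pivot column list are just an example - you would drive them from your own data):

    -- Clustered on the event key so all attributes of an event sit together on disk
    CREATE TABLE dbo.EventKVP (
        EventId INT           NOT NULL,
        [Key]   VARCHAR(100)  NOT NULL,
        [Value] NVARCHAR(400) NOT NULL
    );
    CREATE CLUSTERED INDEX CIX_EventKVP
        ON dbo.EventKVP (EventId, [Key]);

    -- Pivot the pairs into a flattened intermediate result rather than self-joining
    SELECT EventId, [UserName], [IPAddress], [Browser]
    FROM (SELECT EventId, [Key], [Value] FROM dbo.EventKVP) AS src
    PIVOT (MAX([Value]) FOR [Key] IN ([UserName], [IPAddress], [Browser])) AS pvt;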

ConcernedOfTunbridgeWells
+2  A: 

The problem I see with the key/value table approach is the data types - if a value could be a datetime, a string, a unicode string or an integer, how do you define the column? This dilemma means the value column has to be a data type that can contain all the different kinds of data, which then raises questions about efficiency and ease of querying. Alternatively, you have multiple columns of specific data types, but I think this is a bit clunky.
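
To illustrate the two shapes of that trade-off (the column names are just examples): either one catch-all value column that everything gets converted into, or one nullable column per possible type.

    -- Option A: a single catch-all value column (everything stored as text here)
    CREATE TABLE dbo.EventAttributesLoose (
        EventId INT            NOT NULL,
        [Key]   VARCHAR(100)   NOT NULL,
        [Value] NVARCHAR(4000) NULL
    );

    -- Option B: one nullable column per data type (the "a bit clunky" option)
    CREATE TABLE dbo.EventAttributesTyped (
        EventId     INT           NOT NULL,
        [Key]       VARCHAR(100)  NOT NULL,
        StringValue NVARCHAR(400) NULL,
        IntValue    INT           NULL,
        DateValue   DATETIME      NULL
    );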

For a truly flexible schema, I can't think of a better option than XML. You can index XML columns.
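
For example, something along these lines (assuming an Attributes xml column like the one sketched in the question; the index names are just placeholders):

    -- The primary XML index persists the shredded node structure;
    -- secondary indexes can then target PATH / VALUE / PROPERTY lookups
    CREATE PRIMARY XML INDEX PXML_Events_Attributes
        ON dbo.Events (Attributes);

    CREATE XML INDEX IXML_Events_Attributes_Path
        ON dbo.Events (Attributes)
        USING XML INDEX PXML_Events_Attributes
        FOR PATH;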

This article on MSDN discusses XML storage in more detail.

AdaTheDev
sql_variant can contain different types; BOL suggests using it for entity-value tables
KM
@KM - yes it can, but it doesn't support all types and is limited to ~8000 bytes - though that's probably not an issue in this instance. My main concern with it would be what it's like for query performance, and possible type conversion problems - I don't know about those.
AdaTheDev
Have you ever actually worked with a non-trivially complex application that stored data as XML columns? I'm working with an insurance policy administration system that does this as we speak. 5 minutes to query 29000 records is an actual system benchmark.
ConcernedOfTunbridgeWells
Yes, before the XML type existed (XML data stored as NTEXT, eek), with millions of rows of data. For reporting purposes, this required splitting the data out into a correctly typed set of denormalised tables. It's a trade-off - XML gives the greatest flexibility and I would have thought careful indexing of the XML would result in reasonable performance. However, I'd be interested to see a like-for-like comparison with the key/value pair approach - my main concern with that is the data typing, as I mentioned above.
AdaTheDev
XML columns are slow (ntext columns even more so) on batch jobs, at least when the document is of non-trivial complexity. The system I am working with at the moment does have a shredder that spits out a relational view of the data. The vendor claims about 5-10 seconds per record to do the shred, which is mostly CPU bound. A KVP table with a clustered index minimises I/O for a batch job, but the best way to use it is to pivot the data rather than joining the table against itself with filters. Different types can be handled either as a textual representation or with a placeholder column for each possible type.
ConcernedOfTunbridgeWells
None of the solutions are elegant. KVPs or flat tables are the most efficient but they are not really elegant. XML requires post-processing or XPath and CROSS APPLY queries.
ConcernedOfTunbridgeWells
+1  A: 

I'd assume the normalized way would be faster for both INSERT and SELECT operations, if only because that's what any RDBMS is optimized for. The "amount of data involved" part might be an issue too, but a more solvable one - how long do you need that data immediately on hand? Can you archive it after a day, a couple of weeks, or 3 months? SQL Server can handle an awful lot.

This event data will be constantly streaming in, so insert times are just as important as query times.

Option 3: If you really have a lot of data constantly streaming in - create a separate queue (in shared memory, in-process SQLite, a separate DB table, or even its own server) to store the incoming raw event and attributes, and have another process (scheduled task, Windows service, etc.) parse that queue into whatever format you prefer, tuned for speedy SELECTs. Optimal input, optimal output, ready to scale in either direction, everyone's happy.
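
A bare-bones sketch of that idea using a plain staging table and a scheduled drain job (the names and the 1000-row batch size are just placeholders; the same pattern applies to an out-of-process queue):

    -- Landing table: cheap, append-only inserts for the incoming stream
    CREATE TABLE dbo.EventQueue (
        QueueId    BIGINT IDENTITY(1,1) PRIMARY KEY,
        ReceivedAt DATETIME      NOT NULL DEFAULT GETDATE(),
        RawEvent   NVARCHAR(MAX) NOT NULL  -- event + attributes as received
    );

    -- Run by the scheduled task / service: pull a batch off the queue
    -- and load it into whatever SELECT-optimised tables you settled on
    DECLARE @batch TABLE (QueueId BIGINT, RawEvent NVARCHAR(MAX));

    DELETE TOP (1000)
    FROM dbo.EventQueue
    OUTPUT deleted.QueueId, deleted.RawEvent INTO @batch;

    -- ...parse @batch and insert into the Events / attribute tables here...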

tadamson