views:

254

answers:

1

I have a growing set of Excel spreadsheets none with the same data structure. I need a mechanism to query each of these spreadsheets (DataTables) using a single interface. So essentially, you choose the DataTable from a dropdownlist and then perform your search.

My initial thought was to handle it like this.

  • Create a generic data structure in SQL. This would be comprised of four tables. 1) Study 2) StudyProperty 3) StudyPropertyType (a static list of datatypes) and 4) StudyData.
  • The first step would be to define a study. Meaning that you create an entry in the Study table, to give it a name and description. Joined to that you would create a StudyProperty record for each corresponding field in the Excel spreadsheet, and select the appropriate StudyPropertyType.
  • So if you had a spreadsheet with two columns (Name and Date of Birth), you'd get one Study record, and two StudyProperty records (Name with the StudyPropertyType = string) and (Date of Birth with the StudyPropertyType = DateTime)
  • Once the Study has been defined, I can import the data from the Excel spreadsheet into the StudyData table.
  • The StudyData table has an Id and a column for each possible datatype. Depending on the Study definition, the imported data is stored in the appropriate field in the Study table based on dataType. So Date Of Birth would be stored in the ValueDateTime field of the Study table.

So with all that said and done, I can have all the spreadsheets stored in SQL, but now I am stumped on how to build a generic query mechanism.

I feel like I have made this much more complicated than need be, and I'm hoping there's a better way to handle unknown data structures.

+1  A: 

SQL (and RDBMS in general) do not handle unknown data structures well at all. Typically, they violate all the traditional definitions of a relational database.

The ever-so-tempting schema you are talking about with flexible properties is known as EAV (Entity-Attribute-Value) or database-within-a-database and can be used successfully in SQL/databases (when used very carefully) but most often is simply a recipe for disaster. There are plenty of questions about EAV here on StackOverflow.

Cases where I've use it successfully are not for ad hoc querying, but are for settings where I want to have arbitrary settings on an entity and their non-existence will fall back to a default (and possible a default-default - you see, that's why EAV is dangerous!) Typically, however, there is extra code in the app (or stored procedures) which is aware of the settings, but the database is unaware. That phrase alone should give you a clue as to why this is not a good (database) practice. When I have used it, there has been an overriding architectural concern. Using it prevents the database from being able to manage its data (particularly a weak point is data types) and ensure integrity/perimeter contracts because it knows very little about it. So typically, I pair it with SPs/views/UDFs to give the database back as much control as I can.

There is a close relative of EAV which is related to data warehousing and statistical performance. In these cases, there are usually several dimensions - business unit, time, geographical region, G/L division, etc. and then a measurement code and a measurement value (typically MONEY). So for 1/1/2001 for a particular business unit etc., you might have an expense measurement with code 1 for Expenses, code 2 for Revenue, etc. This suffers from many of the same drawbacks as EAV and the justification of it is that you can add measurements by adding rows instead of changing a schema (on a table which is likely to have billions of rows). Also metrics can come and go over time, and this is a good representation, it handles roll-up well etc.

I would strongly consider before going for an EAV implementation - going back to your requirements/use cases and see if there is an alternative - even analyzing the spreadsheet (using the Excel object model) and building tables in the database to match and then allowing ad hoc queries against those individual tables might be easier.

Cade Roux
Thank you very much for our thorough explanation. I think I may go with your suggestion of creating database tables to match the imported spreadsheet.
Adam