I'd appreciate some opinions on a concern I have.

I have a [User] table in my database, with the basic stuff you'd expect, like username, password, etc...

This application requires that I track a vast number of attributes for each user: so many, in fact, that I will likely run out of columns (row storage space).

I'm tempted to add a UserProperties table with UserID, PropertyKey and PropertyValue columns. This approach fits well with the requirements.
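
Roughly, the table I have in mind would look something like this (column names and types are just placeholders at this point):

    CREATE TABLE UserProperties
    (
        UserID        INT           NOT NULL,  -- points back to the [User] table
        PropertyKey   VARCHAR(100)  NOT NULL,  -- e.g. 'FavoriteColor'
        PropertyValue NVARCHAR(400) NULL       -- every value ends up stored as text
    );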

My concern is that if each user has, say, 100 properties, then when the database has a million users in it, we'll have 100,000,000 property rows.

I would think that with a clustered index on UserID, access will still be screaming fast, and you are really storing about the same amount of data as you would with the mega-columns approach.

Any ideas or thoughts on performance concerns? Ideas for a better DB design?

Thanks!

UPDATE:

First, thanks so much for all the great responses!

I have been toying around with the possibilities, and one thing keeps bothering me. I need to query on some of these attributes pretty frequently, and worse yet, these queries could involve finding all users who match criteria on as many as 10 of these attributes at the same time.

As a result, I am now leaning towards the mega-column approach, but possibly splitting the data off into one (or more) separate tables, forming a one-to-one relationship keyed on the UserID.

I'm using LinqToSql, and while I think tables with this many columns are inelegant, I think that, considering all the challenges and trade-offs, it is probably the right approach. I'm still eager to hear other opinions.

+2  A: 

The UserProperties table approach is how I would model it. As you suggested, a clustered index on UserID will mean range searches on UserID will be fast (i.e. for all properties relating to a single user). You might also add a non-clustered index on UserID and PropertyKey for single key-to-value selects per user.
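
Something along these lines, assuming the UserProperties sketch from the question (the INCLUDE clause needs SQL Server 2005 or later):

    -- Clustered index on UserID keeps all of a user's properties physically together,
    -- so "all properties for user X" is a single range scan.
    CREATE CLUSTERED INDEX IX_UserProperties_UserID
        ON UserProperties (UserID);

    -- Non-clustered index for single key-to-value lookups per user;
    -- INCLUDE makes it covering, so the lookup never touches the base rows.
    CREATE NONCLUSTERED INDEX IX_UserProperties_UserID_PropertyKey
        ON UserProperties (UserID, PropertyKey)
        INCLUDE (PropertyValue);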

Mitch Wheat
A: 

I doubt you'd have so many 1-to-1 data values in the Users table that you'll run out of row space. You should only offload 1-to-many values into another table, using the user ID as the foreign key. I find it unlikely that your user table will require so many VARCHAR() fields that they can't somehow be turned into FKs from a master table of values. What kind of user attributes are you maintaining?
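
For example, a genuinely repeating value would get its own table rather than a column per occurrence (names here are hypothetical, assuming your [User] table's key is UserID):

    -- One user, many email addresses: a 1-to-many table keyed back to the user.
    CREATE TABLE UserEmailAddress
    (
        UserEmailAddressID INT IDENTITY(1,1) PRIMARY KEY,
        UserID             INT NOT NULL REFERENCES [User] (UserID),
        EmailAddress       VARCHAR(255) NOT NULL
    );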

HardCode
A: 

Any way to logically group the properties? You may not always need to access every property. Also, if they're logically grouped it'll be easier to make sense of what properties are available, where new properties fit, etc...

The groupings can have a one to one or one to many relationship with the user...

Arnshea
Grouping could be added by adding a GroupID column to the UserProperties table
Mitch Wheat
Mitch, I don't think that was the point. I think it was to cut down on table size.
Mark
The 'right' indexes make table size less of an issue
Mitch Wheat
One problem with dumping them into a single properties table is finding relationships among them. You're stuck with self-joins or nested subqueries. Another is making sense of the model after the 200th property has been added or renamed. Entities should be modeled as tables, not flattened into name/value pairs.
Arnshea
I don't mean to say a UserProperties table won't work. It will. Just like you don't absolutely need objects; you could use coding conventions to get the job done. The problem is that us humans make mistakes, forget the conventions, etc...
Arnshea
For practical problems: how do I figure out what to name the 201st property (naming collisions)? What if there's code expecting the 195th property, but nothing has caused it to be inserted into the table when I run my query to figure out what names are in use?
Arnshea
A: 

We have implemented the UserProperties strategy in several projects. This is a common pattern and with appropriate indexes we have never run into a performance problem.

Another advantage is that you can have two or more properties tables if needed to manage user access. For instance, general properties could be in a PublicUserProps table, while sensitive information (I don't know what you are storing, but SSNs, payroll information, etc.) could be in a ControlledUserProps table to which only some users would have read or edit rights.
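
A sketch of how the access split might look (the role names are just examples):

    -- The general application role can read everyone's public properties...
    GRANT SELECT ON PublicUserProps TO AppUserRole;

    -- ...but the sensitive table is locked down to a narrower role.
    DENY  SELECT ON ControlledUserProps TO AppUserRole;
    GRANT SELECT ON ControlledUserProps TO PayrollAdminRole;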

Gary.Ray
A: 

I like the meta-table approach that Mitch Wheat and you described. But if you have a few fields that will be used more frequently than others (such as name, etc.), you might find it beneficial to keep those in the User table and link the User table to UserProperties for the rest. I guess it all depends on the exact details of your design.

Joe Philllips
The right indexes and caching should take care of any performance issues relating to frequently used data
Mitch Wheat
A: 

A few options I can think of:

  • Bit fields: you can pack many flag-style values into a single column, and you can add more fields as needed or even use a separate table
  • Put the most common settings in the user table, and the settings that not every user will have in a second table
  • Only store the settings that are different from the default (see the sketch below)
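
For that last option, a defaults table lets reads fall back when no per-user override exists; a minimal sketch (table and column names assumed):

    -- SettingDefaults has one row per setting; UserSettings holds only overrides.
    SELECT d.SettingKey,
           COALESCE(u.SettingValue, d.DefaultValue) AS EffectiveValue
    FROM   SettingDefaults AS d
    LEFT JOIN UserSettings AS u
           ON  u.SettingKey = d.SettingKey
           AND u.UserID     = @UserID;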
Darryl Hein
A: 

Given the stated constraints, I don't think you really have any other choice!

OK, you could split the user attributes across multiple tables sharing the same UserID as their primary key (and clustered index), but this may or may not improve performance.

If you're only talking about 100 attributes, this could be handled by one table (in MS SQL the max is 1023 non-key columns); if the attributes are only sparsely populated, then a user-attribute table may potentially be more space-efficient (only your profiler knows for sure).

Steven A. Lowe
+8  A: 

What you're describing is an Entity-Attribute-Value database, which is often used for exactly the situation you describe: sparse data tied to a single entity.

An E-A-V table is easy to search. The problem isn't finding rows, it's finding related rows.
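
For example, the question's "all users matching criteria on 10 attributes" becomes one self-join per attribute against the same table; here is a sketch with just three criteria (the key names are made up):

    SELECT p1.UserID
    FROM   UserProperties AS p1
    JOIN   UserProperties AS p2 ON p2.UserID = p1.UserID
    JOIN   UserProperties AS p3 ON p3.UserID = p1.UserID
    WHERE  p1.PropertyKey = 'Country'    AND p1.PropertyValue = 'US'
      AND  p2.PropertyKey = 'AgeGroup'   AND p2.PropertyValue = '25-34'
      AND  p3.PropertyKey = 'Newsletter' AND p3.PropertyValue = 'Yes';

Ten criteria means ten joins, and nothing in the schema tells you which keys even exist.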

Having different tables for different entities provides domain modeling, and those tables also provide a weak form of metadata. In E-A-V there are no such abstractions. (The Java analogy to E-A-V would be declaring that all functions' formal arguments are of type Object -- you'd get no type-checking.)

We can easily look up the property keys, but nothing groups these property keys.

Wikipedia has a very good article on E-A-V, but read it now -- it's mostly the work of one author, and is slated for "improvement".

tpdi
I couldn't agree more. The entity-relationship model is powerful once you get a handle on it. I feel for the poor guy who inherits a database with entities flattened into name/value pairs and has to make heads or tails of it.
Arnshea
A: 

"I would think that with a clustered index on UserID, access will still be screaming fast, and you are really storing about the same amount of data as you would with the mega-columns approach."

I think that with a properties table you will end up storing much more than you actually need: the extra index on the user ID, the property key column, and a property value column that has to handle values of a generic type, which makes it difficult to optimise.

My advice is to try to put it all in one table and then fire it up with some test data. If it doesn't work, then you can go down the path of a multiple-table solution, or even a non-DB solution (they aren't silver bullets, after all).

CurtainDog
+4  A: 

I recommend that you consider the approach known as vertical partitioning. This means that you keep defining tables keyed on UserID; you could call them User1, User2, etc., and start a new table when you hit the maximum row size for your database. The benefit of this approach is that the values are still true database attributes, which will wind up saving time when working with this data, e.g. for data binding.
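
A rough sketch of the idea, using the question's [User] table and placeholder attribute names:

    -- Each partition table shares the User table's primary key (one-to-one).
    CREATE TABLE User1
    (
        UserID     INT NOT NULL PRIMARY KEY REFERENCES [User] (UserID),
        Attribute1 VARCHAR(50) NULL,
        Attribute2 INT NULL
        -- ...more real, named columns, up to the row-size limit
    );

    -- Reassembling a user is a simple one-to-one join:
    SELECT u.UserID, p.Attribute1, p.Attribute2
    FROM   [User] AS u
    JOIN   User1  AS p ON p.UserID = u.UserID;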

The key question to answer is: are these really attributes? Do they represent the structure of the information that you must collect about the user? If so, the best way to model them is to make them columns. The only reason you must resort to vertical partitioning is the row size limit of the database.

If, on the other hand, a flexible attribute system is called for, then by all means go with the property key/property value system. For example, if users were allowed to define their own attributes dynamically, then you'd definitely want the key/value system. However, I would say key/value is not the best way if you understand the structure of your data and have legitimately identified hundreds of attributes for users.

As a side note, I must say that you should question entities with large numbers of attributes. They may be valid, but it's also quite likely that you're missing some entities at the conceptual level. In other words, maybe all of these attributes don't relate to the user per se, but to some other entity that is related to users.

Paul Keister