i'm faced with having to add 64 new columns to a table that already has 32 columns. For example's sake:

Customers
(
    CustomerID int
    Name        varchar(50)
    Address     varchar(50)
    City        varchar(50)
    Region      varchar(50)
    PostalCode  varchar(50)
    Country     varchar(2)
    Telephone   varchar(20)

    ...
    NewColumn1  int null
    NewColumn2  uniqueidentifier null
    NewColumn3  varchar(50)
    NewColumn4  varchar(50)
    ...
    NewColumn64 datetime null

    ...
    CreatedDate datetime
    LastModifiedDate datetime
    LastModifiedWorkstation varchar(50)
    LastModifiedUser varchar(50)
)

Most of the time the majority of these new columns will contain null.

It is also a given that if i vertically partition off these 64 new columns into a new table, then every time i SELECT from Customers:

SELECT ...
FROM Customers

will have to be converted to a join to get the partitioned values (i.e. there is never a performance gain to be had, even in queries that don't require the new columns):

SELECT ...
FROM Customers
    INNER JOIN Customers_ExtraColumns
    ON Customers.CustomerID = Customers_ExtraColumns.CustomerID

So that's one con to partitioning off the columns.

The other con is that i have to manage inserting rows into two tables simultaneously, rather than just one.
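
For example, every new customer would have to be written roughly like this (a sketch, using the same Customers_ExtraColumns table as the join above and only a handful of columns):

BEGIN TRANSACTION

-- base row
INSERT INTO Customers (CustomerID, Name, City, Country)
VALUES (1, 'Acme Ltd', 'Winnipeg', 'CA')

-- supplemental row for the new columns
INSERT INTO Customers_ExtraColumns (CustomerID, NewColumn1, NewColumn3)
VALUES (1, 42, 'some value')

COMMIT TRANSACTION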

The final con i can think of is that SQL Server now has to perform an INNER JOIN any time i want to access "Customers". There will now and forever be a waste of CPU and I/O to join tables that really are one table - except that i had decided to split them up.

So my question is: why would i split them up?

Is there any value in vertically partitioning out 64 columns to a separate table when they will mostly be null? Nulls take up very little space....

What are the pros?

Edit: Why am i even considering partitioning? It's mostly null data that will triple the number of columns in the table. Surely it must be bad!

+1  A: 

If these values are a) unique to a record (a given customer should only have one value which would go in NewColumn1) and b) not used by any other record (at least, no other record that doesn't also require the base customer information) I'd say leave them as one table. Just don't forget to name your specific columns in any queries you write against the table.

I come from an EDI background, and sometimes you have to deal with flat files that contain 30+ columns of data per row. As you mention, NULL doesn't take up much room, and if you're NEVER going to be grabbing the columns independently (and you'll never be able to grab the base customer data independently), I'd say you've got it right.
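
For instance, a query would name just the columns it needs rather than using SELECT * (a sketch using a few of the columns from the question):

SELECT CustomerID, Name, City, Country
FROM Customers
WHERE Country = 'CA'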

AllenG
+1  A: 

The answer is in details that were omitted from the question. The number of columns is irrelevant, it is the nature of the data that matters.

  • First, remember that a given row in any table can never exceed 8060 bytes. So if the new columns are sized such that that limit can theoretically be exceeded, you will have built a time-bomb into the database. Sometime when it is least convenient, a data insert or update will throw an error and/or data will be lost.

    To guard against this, you may need to use more than one table; it's just a limitation of most editions of SQL Server.

  • The other important consideration is data modeling. Do the new columns have a one-to-one relationship with CustomerID? Something like eyeColor, for example?

    Because of the number of columns and the fact that you omitted their names, I suspect that a non-normalized design is being contemplated. If the new columns are something like WebPage1, WebPage2, WebPage3, etc., then these need to be split into a separate, normalized table.

But, if the columns really are unique items, unrelated to each other and with a 1-to-1 relationship to CustomerID (or whatever the primary-key of that table is), and the size limit cannot be busted, then having everything in one table is perfectly fine.
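
One rough way to check whether the defined columns could bust that limit is to total the declared column lengths (a sketch against SQL Server 2000's syscolumns view; it ignores per-row overhead, and newer versions expose sys.columns instead):

-- total declared (maximum) bytes per row for the table
SELECT SUM(length) AS MaxDefinedRowBytes
FROM syscolumns
WHERE id = OBJECT_ID('Customers')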

Brock Adams
My tables have plenty of time-bombs already: `Comments varchar(8000)`. The real situation is a `Transactions` table, which currently joins to `Customers` (a list of customers). This violates a business rule that requires per-transaction customer information, in addition to a master list of Customers. The 64 new fields i'm adding are a clone of the Customers table, mashed into `Transactions`. It's not normalizable, because the same physical customer can have no name, different names, different eye colors, no eye colors, no eyes, etc. - and all different from what their "master" file says.
Ian Boyd
But the possibility of the hard 8060-byte limit actually being hit is a **pro** that i'll have to investigate. The hard part will be trying to figure out the current minimum/average/maximum row size in both tables...
Ian Boyd
@Ian Boyd: Use `sp_help` to determine the max size of a row. It will output a list of columns and their max length, in bytes. You are strongly advised to adjust these column sizes so that the row can never exceed 8060 bytes. You can use `SELECT LEN(ColumnName)`, `SELECT MAX(LEN(ColumnName))`, etc. to determine current size statistics.
Brock Adams
@Ian Boyd: RE: `The 64 new fields i'm adding are a clone of the Customers table, mashed into Transactions. It's not normalizable, because the same physical customer have no name, different names, different eye colors, no eye colors, no eyes, etc - and all different from what their "master" file says.` This sounds like a pending disaster. Post **exact details** as a new question and I'd bet the fine folks at SO can help you sort it out.
Brock Adams
@Brock Adams You'd have to give me a hint about how it's a disaster waiting to happen. i can't really ask a question, "What do you think about this?", when i don't know what i'm asking.
Ian Boyd
@Brock Adams i used a `SELECT MAX(ISNULL(DATA_LENGHT(a), 0)+ISNULL(DATA_LENGHT(b), 0)...)` to get the largest row size; 450 bytes.
Ian Boyd
@Ian Boyd: `DATA_LENGHT()` or `DATA_LENGTH()` is not a function in SQL Server 2000. Are you using MySQL? Or did you use `DATALENGTH()`? Also note that `text`, `image`, and `ntext` columns only use 16 bytes out of your 8060 limit -- they're like pointers. But `DATALENGTH()` will report the (potentially huge) data length even though it is stored elsewhere and doesn't count against your row limit.
Brock Adams
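
A cleaned-up version of the kind of row-size query being discussed, using `DATALENGTH` (a sketch; a real check would need to list every column in the table):

-- largest actual row size currently in the table, in bytes
SELECT MAX( ISNULL(DATALENGTH(Name), 0)
          + ISNULL(DATALENGTH(Address), 0)
          + ISNULL(DATALENGTH(City), 0)
          + ISNULL(DATALENGTH(PostalCode), 0) ) AS MaxActualRowBytes
FROM Customers
-- note: for text/image/ntext columns DATALENGTH reports the full data length,
-- not the 16-byte pointer that actually counts against the 8060-byte row limit
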
@Ian: The pending disaster includes all the bugaboos of non-normalized design: having to maintain the same data in more than one spot and, worse, having to duplicate schemas. You'll get invalid data and probably have to start debugging triggers. Designing and maintaining T-SQL for the table(s) will become a convoluted nightmare (just look at the questions posted here where the design was not normalized). You also **will** eventually have someone enter data that busts that 8060 limit. Anywho, you certainly **can** post your exact details and the current problem and ask "What do you think?"
Brock Adams
@Brock Adams In 12 years i've never managed to remember the name of the function to get the length of data in a column (`LENGHT()`? `DATALEN()`? `DATA_LENGTH()`? ...). If you can forgive the *obvious* typo, i'll own up to not having the right function name.
Ian Boyd
@Brock The question would seem to have to be, "How do i store customer information with a transaction? **Note:** The customer information must be allowed to be different for each transaction." i realize you might not like the business requirements, but they're not really my call. If i have a single `Customer` table, and they change their address, then that change would affect old transactions. i *need* to store data that was entered at the time of the transaction.
Ian Boyd
And this is why i rarely like to give real-world details. Rather than focusing on the question, it turns into an intellectual discussion about the merits of a business requirement, and how relational databases often conflict with real world demands.
Ian Boyd
@Ian: A transaction table, or a history table, is a common need and does not violate normal forms if done well. 64 columns seems like a lot for that, but if the question had been clearer on the actual requirements, the normalization issues might not have come up. Real-world details would have saved discussion, and maybe helped you get a better-tuned answer, faster.
Brock Adams
+2  A: 

For simplicity of data model, without further information, I would probably not partition, but you haven't indicated the nature of the data in these new columns (perhaps some columns are arrays which should be normalized instead).

However, some points:

If you do vertically partition and have a FK constraint on the supplemental table, that may help eliminate the join in some scenarios, since the optimizer knows that one and only one row will exist. Obviously it will be indexed on the same unique key, which helps eliminate the need to check whether the join could multiply rows, since there can only be 0 or 1 matching rows.
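
A sketch of what the supplemental table might look like under this approach (table and column names follow the question's example; the constraint names are made up):

CREATE TABLE Customers_ExtraColumns
(
    CustomerID  int NOT NULL
        CONSTRAINT PK_Customers_ExtraColumns PRIMARY KEY
        CONSTRAINT FK_Customers_ExtraColumns_Customers
            REFERENCES Customers (CustomerID),
    NewColumn1  int NULL,
    NewColumn2  uniqueidentifier NULL,
    NewColumn3  varchar(50) NULL
    -- ... NewColumn4 through NewColumn64
)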

You can have a single updatable view which joins the two tables, with an INSTEAD OF trigger on the view that inserts into both underlying tables. You could also decide to do a left join and only create a supplemental row at all when at least one of the columns needing it is non-NULL.
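
A rough sketch of that idea (the view and trigger names here are made up; on a view, an insert trigger must be an INSTEAD OF trigger, which then writes to both underlying tables):

CREATE VIEW CustomersFull
AS
SELECT c.CustomerID, c.Name, c.City,
       x.NewColumn1, x.NewColumn3
FROM Customers c
    LEFT JOIN Customers_ExtraColumns x
    ON x.CustomerID = c.CustomerID
GO

CREATE TRIGGER trg_CustomersFull_Insert
ON CustomersFull
INSTEAD OF INSERT
AS
BEGIN
    -- the base row is always created
    INSERT INTO Customers (CustomerID, Name, City)
    SELECT CustomerID, Name, City
    FROM inserted

    -- the supplemental row is created only when one of the new columns has a value
    INSERT INTO Customers_ExtraColumns (CustomerID, NewColumn1, NewColumn3)
    SELECT CustomerID, NewColumn1, NewColumn3
    FROM inserted
    WHERE NewColumn1 IS NOT NULL OR NewColumn3 IS NOT NULL
END
GO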

You can also use a sparsely joined set of supplemental tables. Obviously this would also need joins, but you could use similar techniques with multiple supplemental tables as you would with one.
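
Querying such a set of supplemental tables might look roughly like this (the GroupA/GroupB table names are made up; rows exist in them only when there is something to store, hence the left joins):

SELECT c.CustomerID, c.Name,
       a.NewColumn1,
       b.NewColumn64
FROM Customers c
    LEFT JOIN Customers_Extra_GroupA a ON a.CustomerID = c.CustomerID
    LEFT JOIN Customers_Extra_GroupB b ON b.CustomerID = c.CustomerID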

Cade Roux
i'm generally horrified of updatable views, notably because i believe (perhaps mistakenly) that they have to be schema-bound. And i'm not schema-binding a view ever again. Also, in the past i've had cases where i don't create the adjoining row - but maintenance of that becomes a nightmare (Insert: insert the master row, only create the supplemental row if needed. Update: check if the supplemental row exists; if it exists, update, else insert. Delete: check if the supplemental row exists; if it exists, delete it). It was an interesting intellectual exercise that had no value beyond making the system more complicated.
Ian Boyd
Is it best, do you think, for the partitioned table to have a clustering key that matches the primary table - though different from the primary key? Seems to me that if the tables are laid out in the same physical order, then that's as close to having one large table as i can get (i'm trying to think about how to best help out SQL Server).
Ian Boyd
@Ian Boyd I would not partition until performance or the data profile shows you should (assuming it's still normalized with all these columns). 96 columns in a table is not a design smell unless it looks unnormalized.
Cade Roux
@Ian Boyd Clustered indexes are only good for retrieving large amounts of data in a particular order - clustered index scans aren't significantly more performant than table scans - a clustered index is misnamed. A table is either a heap or a clustered table, really. I would cluster on what looks to me like the same surrogate identity, CustomerID (narrow and increasing), to avoid page splits as data is added. Otherwise, a heap table with several non-clustered indexes with appropriate included columns so they are covering is just as good (the row bookmark is effectively stored in the indexes).
Cade Roux
@Cade Roux: But if the two tables have the same physical order (no matter what terminology you prefer), i would think it would help SQL Server re-join them. The idea would be to keep them as close as physically possible to being one table
Ian Boyd
@Ian Boyd I doubt it - you'll probably have different numbers of rows per page (far more rows per page in the supplemental table), so by the time the first scan gets to the second page of primary rows, the scan on the supplemental data is still somewhere in the first page. Physical ain't really physical... It's still a logical order of pages...
Cade Roux
@Ian Boyd If you do decide to partition and are retrieving many rows in the same order, I would cluster the same way. But for lookup-style usage where you aren't retrieving millions of rows, I wouldn't think it would make a huge difference. Especially if you aren't really retrieving them at all, because you've partitioned them off precisely because they are rarely used in the first place.
Cade Roux
