ansaurus

Question

How to store the following SQL data optimally in SQL Server 2008

Answer 1

A:

I'd suggest instead of two tables create one table Poster.
It's ok to have some fields empty if they are not applicable to one kind of poster.

Poster:
ID (guid), type, name, phone, email, password

where type is 1 for company, 2 - for unregistered user.

OR

Keep your users and companies separate, but require each company to have a user in users table. That table should have a CompanyID field. I think it would be more logical and elegant.

z-boss 2010-02-22 18:14:44

That could be an alternative, the only problem that i see is the excess use of space. There will be a lot more persons posting articles then companies. Also, i forgot to mention, the company table contains a bit more information then the unregistered user version. It also contains, street, zip, city, www, description and a few more fields that i forgot (I am not at my normal computer)This would mean a lot of extra space wasted.

Patrick 2010-02-22 18:55:55

Honestly, it won't matter. Having empty columns for street, zip, city etc. will not affect performance. Btw, you may want to allow authors that are people to enter that information at some point. You are "wasting" that space. If it were a key column on which you were filtering with regularity, that might be different.

Thomas 2010-02-27 15:55:10

Answer 2

A:

My preference is to use a table that acts like a super table to both.

ArticleOwner = (ID (guid), company name, phone, email)
company = (ID, password)
unregistereduser = (ID)
article = (ID (int/guid/short guid), headline, content, published date, owner)

Then querying the database will require a JOIN on the 3 tables but this way you do not have the null fields.

Vincent Ramdhanie 2010-02-22 18:17:51

Thats true, but then you will always query 3 tables when you want to query the articles and that also reduces performance.

Patrick 2010-02-22 18:52:56

Answer 3

A:

An interesting approach would be to use the Node model followed by Drupal, where everything is effectively a Node and all other data is stored in a secondary table. It's highly flexible and as is evidenced by the widespread use of Drupal in large publishing and discussion sites.

The layout would be something like this:

Node

ID Type (User, Guest, Article) TypeID (PKey of related data) Created Modified

Article

ID Field1 Field2 Etc.

User

ID Field1 Field2 Etc.

Guest

ID Field1 Field2 Etc.

It's an alternative option with some good benefits. The greatest being flexibility.

Nissan Fan 2010-02-22 20:55:29

But doesn't it reduce performance?

Patrick 2010-02-24 11:36:20

Answer 4

A:

I have solved similar problems by an approach similar to this:

Company -> Company
Articles User -> UserArticles

Articles

CompanyArticles contains a mapping from Company to an Article UserArticles contains a mapping from User to Article

Article doesn't know anything about who created it.

By inverting the dependencies here you end up not overloading the meaning of foreign keys, having unused foreign keys, or creating a super table.

Getting all articles and contact information would look like:

SELECT name, phone, email FROM 
user 
JOIN userarticles on user.user_id = userarticles.user_id
JOIN articles on userarticles.article_id = article.article_id

UNION

SELECT name, phone, email FROM 
company
JOIN companyarticles on company.company_id = companyarticles.company_id
JOIN articles on companyarticles.article_id = article.article_id

John Sonmez 2010-02-25 19:20:01

What happens if you only have an article number? How would you then query the database? By first doing a search to see what table that matches and then query against it?

Patrick 2010-02-26 09:50:22

I would probably create a view that unions the common data between the user and company tables with the article data. Or you can select from the union of the two mapping tables.

John Sonmez 2010-02-27 08:10:33

Answer 5

+2 A:

This is a common OO programming problem that should not be solved in the SQL domain. It should be handled by your ORM

Make two classes in your program code as required and let you ORM map them to a suitable SQL representation. For performance a single table with nulls will do, the only overhead is the discriminator column

Some examples hibernate inheritance

TFD 2010-02-25 21:02:48

The problem with this solution is that the database is always just an application storage. Consider reporting and sharing of data between application. In those situations the data needs to have it's full meaning exist in the database, not in the application logic.

John Sonmez 2010-02-28 21:49:35

Answer 6

+3 A:

Ideally if you could use ORM (as mentioned by TFD), I would do so. Since you have not commented on that as well as you always come back with the "performance" question, I assume you would not like to use one.

Using pure SQL, the approach I would suggest would be to have table structure as below:

ActicleOwner [ID (guid)]
Company [ID (guid) - PK as well as FK to ActicleOwner.ID, 
    company name, phone, email, password, street, zip, ...]
UnregisteredUser [ID (guid) - PK as well as FK to ActicleOwner.ID, 
    name, phone, email]
Article = [ID (int/guid/short guid), headline, content, published date, 
    ArticleOwnerID - FK to ActicleOwner.ID]

Lets see usages:

INSERT: overhead is the need to add a row to ActicleOwner table for each Company/UU. This is not the operation that happens so often, there is no need to optimize performance

SELECT:

Company/UU: well, it is easy to search for both UU and Company, since you do not need to JOIN to any other table, as all the info about the required object is in one table
Acticles of one Company/UU: again, you just need to filter on the GUID of the Company/UU, and there you go: SELECT (list fields) FROM Acticle WHERE ArticleOwnerID = @AOID

Also think that one day you might need to support multiple Owners in the Article. With the parent table approach above (or mentioned by Vincent) you will just need to introduce relation table, whereas with 2 NULL-able FK constraints to each Owner table is solution you are kind-of stuck.

Performance:
Are you sure you have performance problem? What is your target?

One thing I can recommend looking at you model regarding performance is not to use GUIDs as clustered index (which is the default for a PK). Because basically your INSERT statements will be inserting data randomly into the table.
Alternatives are:

use Sequential GUID instead (see: What are the performance improvement of Sequential Guid over standard Guid?)
use both INTEGER and GUID. This is someone complicated approach and might be an overkill for a simple model you have, but the result is that you always JOIN tables in SELECTs on INTEGER instead of GUID, which is much faster.

So if you are so hot on performance, you might try to do the following:

ActicleOwner (ID (int identity) - PK, UID (guid) - UC)
Company [ID (int) - PK as well as FK to ActicleOwner.ID,
         UID (guid) - UC as well as FK to ActicleOwner.UID, company name, ...]
...
Article = [ID (int/guid/short guid), headline, content, published date, 
    ArticleOwnerID - FK to ActicleOwner.ID (int)]

To INSERT a user (Company/UU) you do the following:

Having a UID (maybe sequential one) from the code, you do INSERT into ActicleOwner table. You get back the autogenerated integer ID.
you insert all the data into Company/UU, including the integer ID that you have just received.

ActicleOwner.ID will be integer, so searching on it will be faster then on UID, especially when you have an index on it.

van 2010-02-27 09:21:40

Answer 7

+1 A:

I would suggest the super-type Author for Person and Organization sub-types.

alt text

Note that AuthorID serves as the primary and the foreign key at the same time for Person and Organization tables.

So first let's create tables:

CREATE TABLE Author( 
   AuthorID integer IDENTITY NOT NULL
  ,AuthorType char(1)
  ,Phone varchar(20)
  ,Email varchar(128) NOT NULL
  );
ALTER TABLE Author ADD CONSTRAINT pk_Author PRIMARY KEY (AuthorID);

CREATE TABLE Article ( 
   ArticleID integer IDENTITY NOT NULL
  ,AuthorID integer NOT NULL
  ,DatePublished date
  ,Headline varchar(100)
  ,Content varchar(max)
  );
ALTER TABLE Article ADD 
   CONSTRAINT pk_Article PRIMARY KEY (ArticleID)
  ,CONSTRAINT fk1_Article FOREIGN KEY (AuthorID) REFERENCES Author(AuthorID) ;

CREATE TABLE Person ( 
   AuthorID integer NOT NULL
  ,FirstName varchar(50)
  ,LastName varchar(50)
  );
ALTER TABLE Person ADD 
   CONSTRAINT pk_Person PRIMARY KEY (AuthorID)
  ,CONSTRAINT fk1_Person FOREIGN KEY (AuthorID) REFERENCES Author(AuthorID);

CREATE TABLE Organization ( 
   AuthorID integer NOT NULL
  ,OrgName varchar(40)
  ,OrgPassword varchar(128)
  ,OrgCountry varchar(40)
  ,OrgState varchar(40)
  ,OrgZIP varchar(16)
  ,OrgContactName varchar(100)
  );
ALTER TABLE Organization ADD 
   CONSTRAINT pk_Organization PRIMARY KEY (AuthorID)
  ,CONSTRAINT fk1_Organization FOREIGN KEY (AuthorID) REFERENCES Author(AuthorID);

When inserting into Author you have to capture the auto-incremented id and then use it to insert the rest of data into person or organization, depending on AuthorType. Each row in Author has only one matching row in Person or Organization, not in both. Here is an example of how to capture the AuthorID.

-- Insert into table and return the auto-incremented AuthorID
INSERT  INTO Author ( AuthorType, Phone, Email )
    OUTPUT INSERTED.AuthorID
VALUES  ( 'P', '789-789-7899', '[email protected]' );

Here are a few examples of how to query authors:

-- Return all authors (org and person)
SELECT  *
FROM    dbo.Author AS a
        LEFT JOIN dbo.Person AS p ON a.AuthorID = p.AuthorID
        LEFT JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID ; 

-- Return all-organization authors
SELECT  *
FROM    dbo.Author AS a
        JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID ; 

-- Return all person-authors
SELECT  *
FROM    dbo.Author AS a
        JOIN dbo.Person AS p ON a.AuthorID = p.AuthorID

And now all articles with authors.

-- Return all articles with author information
SELECT  *
FROM    dbo.Article AS x
        JOIN dbo.Author AS a ON a.AuthorID = x.AuthorID
        LEFT JOIN dbo.Person AS p ON a.AuthorID = p.AuthorID
        LEFT JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID ;

There are two ways to return all articles belonging to organizations. The first example returns only columns from the Organization table, while the second one has columns from the Person table too, with NULL values.

-- (1) Return all articles belonging to organizations
SELECT  *
FROM    dbo.Article AS x
        JOIN dbo.Author AS a ON a.AuthorID = x.AuthorID
        JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID;

-- (2) Return all articles belonging to organizations
SELECT  *
FROM    dbo.Article AS x
        JOIN dbo.Author AS a ON a.AuthorID = x.AuthorID
        LEFT JOIN dbo.Person AS p ON a.AuthorID = p.AuthorID
        LEFT JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID
WHERE AuthorType = 'O';

And to return all articles belonging to a specific organization, again two methods.

-- (1) Return all articles belonging to a specific organization
SELECT  *
FROM    dbo.Article AS x
        JOIN dbo.Author AS a ON a.AuthorID = x.AuthorID
        JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID
WHERE c.OrgName = 'somecorp';

-- (2) Return all articles belonging to a specific organization
SELECT  *
FROM    dbo.Article AS x
        JOIN dbo.Author AS a ON a.AuthorID = x.AuthorID
        LEFT JOIN dbo.Person AS p ON a.AuthorID = p.AuthorID
        LEFT JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID
WHERE c.OrgName = 'somecorp';

To make queries simpler, you could package some of this into a view or two.

Just as a reminder, it is common for an article to have several authors, so a many-to-many table Article_Author would be in order.

Damir Sudarevic 2010-02-27 14:51:13

Answer 8

A:

I'm not convinced you need to distinguish between companies and persons; only registered and unregistered authors.

I added this for clarity. You could simply use a check constraint on the Authors table to limit the values to U and R.

Create Table dbo.AuthorRegisteredStates
(
      Code char(1) not null Primary Key Clustered
      , Name nvarchar(15) not null
      , Constraint UK_AuthorRegisteredState Unique ( [Name])
)
Insert dbo.AuthorRegisteredState(Code, Name) Values('U', 'Unregistered')
Insert dbo.AuthorRegisteredState(Code, Name) Values('R', 'Registered')
GO

The key in any database system is data integrity. So, we want to ensure that usernames are unique and, perhaps, that Names are unique. Do you want to allow two people with the same name to publish an article? How would the reader differentiate them? Notice that I don't care whether the Author represents a company or person. If someone is registering a company or a person, they can put in a first name and last name if they want. However, what is required is that everyone enter a name (think of it as a display name). We would never search for authors based on anything other than name.

Create Table dbo.Authors
(
      Id int not null identity(1,1) Primary Key Clustered
      , AuthorStateCode char(1) not null
      , Name nvarchar(100) not null
      , Email nvarchar(300) null
      , Username nvarchar(20) not null
      , PasswordHash nvarchar(50) not null
      , FirstName nvarchar(25) null
      , LastName nvarchar(25) null
      ...
      , Address nvarchar(max) null
      , City nvarchar(40) null
      ...
      , Website nvarchar(max) null
      , Constraint UK_Authors_Name Unique ( [Name] )
      , Constraint UK_Authors_Username Unique ( [Username] )
      , Constraint FK_Authors_AuthorRegisteredStates
           Foreign Key ( AuthorStateCode )
           References dbo.AuthorRegisteredStates ( Code )

      -- optional. if you really wanted to ensure that an author that was unregistered
      -- had a firstname and lastname. However, I'd recommend enforcing this in the GUI
      -- if anywhere as it really does not matter if they 
      -- enter a first name and last name.
      -- All that matters is whether they are registered and entered a name.
      , Constraint CK_Authors_RegisteredWithFirstNameLastName
           Check ( State = 'R' Or ( State = 'U' And FirstName Is Not Null And LastName Is Not Null ) )
)

Can a single author publish two articles on the same date and time? If not (as I've guessed here), then we add a unique constraint. The question is whether you might need to identify an article. What information might you be given to locate an article besides the general date it was published?

Create Table dbo.Articles
(
      Id int not null identity(1,1) Primary Key Clustered
      , AuthorId int not null
      , PublishedDate datetime not null
      , Headline nvarchar(200) not null
      , Content nvarchar(max) null
      ...
      , Constraint UK_Articles_PublishedDate Unique ( AuthorId, PublishedDate )
      , Constraint FK_Articles_Authors
           Foreign Key ( AuthorId )
           References dbo.Authors ( Id )
)

In addition, I would add an index on PublishedDate to improve searches by date.

Create Index IX_Articles_PublishedDate dbo.Articles On ( PublishedDate )

I would also enable free text search to search on the contents of articles.

I think concerns about "empty space" are probably premature optimization. The effect on performance will be nil. This is a case where a small amount of denormalizing costs you nothing in terms of performance and gains you in terms of development. However, if it really concerned you, you could move the address information into 1:1 table like so:

Create Table dbo.AuthorAddresses
(
    AuthorId int not null Primary Key Clustered
    , Street nvarchar(max) not null
    , City nvarchar(40) not null
    ...
    , Constraint FK_AuthorAddresses_Authors
        Foreign Key ( AuthorId )
        References dbo.Authors( Id )
)

This will add a small amount of complexity to your middle-tier. As always, the question is whether the elimination of some empty space exceeds the cost in terms of coding and testing. Whether you store this information as columns in your Authors table or in a separate table, the effect on performance will be nil.

Thomas 2010-02-27 15:42:20

Answer 9

+9 A:

Theory

The problem you described is called Table Inheritance in data modeling theory. In Martin Fowler's book the solutions are:

single table inheritance: a single table that contains all fields.
class table inheritance: one table per class, with table for abstract classes.
concrete table inheritance: one table per non-abstract class, abstract members are repeated in each concrete table

So from a theory and industry practice point of view all three solutions are acceptable: one table Posters with columns NULLable columns (ie. single table), three tables Posters, Companies and Persons (ie. class inheritance) and two tables Companies and Persons (ie. concrete inheritance).

Now, to pros and cons.

Cost of NULL columns

The record structure is discussed in Inside the Storage Engine: Anatomy of a record:

NULL bitmap

two bytes for count of columns in the record

variable number of bytes to store one bit per column in the record, regardless of whether the column is nullable or not (this is different and simpler than SQL Server 2000 which had one bit per nullable column only)

So if you have at least one NULLable column, you pay the cost of the NULL bitmap in each record, at least 3 bytes. But the cost is identical if you have 1 or 8 columns! The 9th NULLable column will add a byte to the NULL bitmap in each record. the formula is described in Estimating the Size of a Clustered Index: 2 + ((Num_Cols + 7) / 8)

Peformance Driving Factor

In database system there is really only one factor that drives performance: amount of data scanned. How large are the record scanned by a query plan, and how many records does it have to scan. So to improve the performance you need to:

narrow the records: reduce the data size, covering include indexes, vertical partitioning
reduce the number of records scanned: indexes
reduce the number of scans: eliminate joins

Now in order to analyze these criteria, there is something missing in your post: the prevalent data access pattern, ie. the most common query that the database will be hit with. This is driven by how you display your posts on the site. Consider these possible approaches:

posts front page: like SO, a page of recent posts with header, excerpt, time posted and author basic information (name, gravatar). To get this page displayed you need to join Posts with authors, but you only need the author name and gravatar. Both single table inheritance and class table inheritance would work, but concrete table inheritance would fail. This is because you cannot afford for such a query to do conditional joins (ie. join the articles posted to either Companies or Persons), such a query will be less than optimal.
posts per author: users have to login first and then they'll see their own posts (this is common for non-public post oriented sites, think incident tracking for instance). For such a design, all three table inheritance schemes would work.

Conclusion

There are some general performance considerations (ie. narrow the data) to consider, but the critical information is missing: how are you going to query the data, your access pattern. The data model has to be optimized for that access pattern:

Which fields from Companies and Persons will be displayed on the landing page of the site (ie. the most often and performance critical query) ? You don't want to join 5 tables to show those fields.
Are some Company/Person information fields only needed on the user information page? Perhaps partition the table vertically into CompaniesExtra and PersonsExtra tables. Or use a index that will cover the frequently used fields (this approach simplifies code and is easier to keep consistent, at the cost of data duplication)

PS

Needless to say, don't use guids for ids. Unless you're building a distributed system, they are a horrible choice for reasons of excessive width. Fragmentation is also a potential problem, but that can be alleviated by use of sequential guids.

Remus Rusanu 2010-02-27 19:06:08

+1 Love the Fowler's reference -- I go back to it before every modeling task.

Damir Sudarevic 2010-02-28 16:15:59

ansaurus

tags:

views:

answers:

How to store the following SQL data optimally in SQL Server 2008

Theory

Cost of NULL columns

Peformance Driving Factor

Conclusion

PS

related questions