I have a routine that will be creating individual tables (SQL Server 2008) to store the results of reports generated by my application (ASP.NET 3.5). Each report will need its own table, as the columns for the table would vary based on the report settings. A table will contain somewhere between 10 and 5,000 rows, rarely more than 10,000.

The following usage rules will apply:

  • Once stored, the data will never be updated.
  • Whenever results for the table are accessed, all data will be retrieved.
  • No other table will need to perform a join with this table.

Knowing this, is there any reason to create a PK index column on the table? Would doing so aid retrieval performance in any way, and if so, would that outweigh the extra load of updating the index when inserting data? (I know that 10K records is a relatively small amount, but this solution needs to be able to scale.)

Update: Here are some more details on the data being processed, which goes into the current design decision of one table per report:

  • Tables will record a set of numeric values (set at runtime based on the report settings) that correspond to a different set of reference varchar values (also set at runtime based on the report settings).
  • Whenever data is retrieved, some post-processing on the server will be required before the output can be displayed to the user (thus I will always be retrieving all values).

I would also be suspicious of someone claiming that they had to create a new table each time the report was run. However, given that different columns (in number, name, and datatype) could conceivably be needed for every run of the report, I don't see a great alternative.

The only other thing I can think of is to have an ID column (identifying the ReportVersionID, corresponding to another table), a ReferenceValues column (a varchar field containing all reference values, in a specified order, separated by some delimiter) and a NumericValues column (the same as ReferenceValues, but for the numbers). Then, when I retrieve the results, I would put everything into specialized objects in the system, separating the values based on the defined delimiter. Does this seem preferable?
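
For concreteness, here is a minimal T-SQL sketch of this alternative layout. The table name, the pipe delimiter, and the sample values are assumptions for illustration only:

```sql
-- Hypothetical sketch of the delimiter-based alternative described above.
CREATE TABLE dbo.ReportResults (
    ReportVersionID int          NOT NULL,  -- identifies the run, corresponds to another table
    ReferenceValues varchar(max) NOT NULL,  -- all reference values, delimiter-separated
    NumericValues   varchar(max) NOT NULL   -- the matching numbers, same order and delimiter
);

-- One row per report run; the app splits the strings back apart on retrieval.
INSERT INTO dbo.ReportResults (ReportVersionID, ReferenceValues, NumericValues)
VALUES (42, 'North|South|East|West', '12.5|3.1|44.0|7.8');
```

The obvious trade-off is that the database can no longer type-check or constrain the individual values; all of that moves into the application code that splits the strings.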

+3  A: 

Primary keys are not a MUST for any and all data tables. True, they are usually quite useful, and to abandon them is unwise. However, in addition to the primary mission of speed (which I agree is unlikely to be positively affected here), a key also has the mission of uniqueness. To that end, and valuing the consideration you've already obviously taken, I would suggest that the only need for a primary key would be to govern the expected uniqueness of the table.

Update: You mentioned in a comment that if you did a PK, it would include an Identity column that presently does not exist and is not needed. In this case, I would advise against the PK altogether. As @RedFilter pointed out, surrogate keys add nothing toward uniqueness.

Brad
+1: And, of course, a surrogate `PK` does not help in providing uniqueness, so only a natural `PK` should be considered, if any.
RedFilter
What RedFilter said. Also, what defenses do you have in place to make sure the same data doesn't get entered twice, in two different rows? What are the consequences if this does occur?
Walter Mitty
To judge if the speed will be positively affected, we should know if the data is going to be read many times and how. For now we know that the data will not be updated, which suggests single writes and many reads. This suggests that the index might be useful; however, the OP also states that it will always be necessary to read all of the data (which will result in scans, so the index will not be used). So we don't know...
Unreason
@Unreason - correct: single write and many reads. And all of the data will always be read (since it will need some post-processing involving all rows - on the server before being displayed to the user)
Yaakov Ellis
@Walter - the defense to make sure the same data doesn't get entered twice in two different rows is to ensure via app logic that the table only has data inserted once, ever. After that it is read-only, for all intents and purposes. And simply adding a PK wouldn't prevent duplicate data - you would need to add a Unique constraint covering the significant columns that define uniqueness (not necessarily the same as a PK); a sketch of that follows below.
Yaakov Ellis
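
A sketch of what this comment describes - a UNIQUE constraint over the significant columns, with no PK at all. Table and column names are hypothetical:

```sql
-- One per-run report table; uniqueness enforced without a PK.
CREATE TABLE dbo.ReportRun42 (
    RefValue varchar(100)  NOT NULL,  -- the significant column defining uniqueness
    Metric1  decimal(18,4) NULL,
    Metric2  decimal(18,4) NULL,
    CONSTRAINT UQ_ReportRun42_RefValue UNIQUE (RefValue)
);
```
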
@Yaakov Ellis: A PK prevents duplicate data as long as it is defined on the right columns. In this respect it is exactly the same as any other uniqueness constraint. If you wish to prevent duplicate data then a key is usually the most efficient and effective way to do it. Enforcing uniqueness only in the app is less satisfactory because you may eventually have other apps, developers, DBA maintenance that touch the table as well. Also the key is visible to other data consumers and software tools and can be used by the database engine to improve the efficiency of queries.
dportas
@dportas - I know about uniqueness, etc. It is your last sentence that I really wanted to investigate "Also the key is visible to other data consumers and software tools and can be used by the database engine to improve the efficiency of queries" - the only query that will ever be done is to select all columns, all rows. Data will only ever be entered once. In such a scenario, does a PK column provide any benefit? Or is it just that one should be put in, because "you always need a PK column"?
Yaakov Ellis
You need a key (to ensure uniqueness). You don't need to create a new column for that if you are never going to use it (obviously).
dportas
@Yaakov, I think it's now pretty clear to all that--provided that your table creation scripts are guaranteed to not duplicate rows--the cost of extra measures (i.e. adding a `PK`) would outweigh any possible benefit.
Brad
@Yaakov Ellis, just a comment on 'simply adding a PK wouldn't prevent duplicate data'... well, that depends on your primary key; the initial idea behind a primary key is to use it to guarantee row uniqueness (and yes, it can be any of the candidate keys, and it can be composite), and it is your job, as database designer, to decide which column(s) will make up the primary key. In this sense a primary key *does* guarantee uniqueness of the entity instance (so that no contradicting facts can be stored in your relations).
Unreason
A: 

Will it be one table for every run of a given report, or one table for all runs of a given report? In other words, if you have Report #1 and you run it 5 times, over different ranges of data, will you produce 5 tables, or will all 5 runs of the report be stored in the same table?

If you are storing all 5 runs of the report in the same table, then you'll need to filter the data so that it is appropriate to the run in question. In this case, having a primary key will make the WHERE clause used for the filter much faster.
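
As a sketch of that single-table layout, assume a RunID column identifies each run (names are hypothetical; a clustered index on RunID is shown here as one way to turn the per-run filter into a seek rather than a scan):

```sql
-- All runs of Report #1 in one table, keyed by run.
CREATE TABLE dbo.Report1Results (
    RunID    int           NOT NULL,
    RefValue varchar(100)  NOT NULL,
    Metric   decimal(18,4) NOT NULL
);
CREATE CLUSTERED INDEX IX_Report1Results_RunID ON dbo.Report1Results (RunID);

-- Retrieving one run is now a range seek on the clustered index:
SELECT RefValue, Metric
FROM dbo.Report1Results
WHERE RunID = 5;
```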

If you are creating a new table for every run of the report, then you don't need a primary key. However, you are going to run into other performance problems as the number of tables in your system grows... assuming you don't have something in place to drop old data/tables.

Derick Bailey
Creating a new table for every run of the report (since the report settings - and therefore columns - can change between runs). There will eventually be some archiving functionality put into place to remove old tables (a sketch follows below) - do you have any idea what number of tables would start to affect performance? (I am not worried about putting in too many tables, since SQL Server allows 2,147,483,647 objects in each DB.)
Yaakov Ellis
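
Regarding the archiving functionality mentioned above, here is a sketch of one way such cleanup might work. The `ReportRun%` naming convention and the 90-day retention window are assumptions:

```sql
-- Build and run DROP TABLE statements for per-run report tables
-- older than 90 days (SQL Server 2008 syntax).
DECLARE @sql nvarchar(max) = N'';

SELECT @sql += N'DROP TABLE ' + QUOTENAME(s.name) + N'.' + QUOTENAME(t.name) + N';'
FROM sys.tables t
JOIN sys.schemas s ON s.schema_id = t.schema_id
WHERE t.name LIKE 'ReportRun%'
  AND t.create_date < DATEADD(DAY, -90, GETDATE());

EXEC sp_executesql @sql;
```
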
Just read this comment, which changes the picture very much - if you are creating a table for every run of a report, then why are you creating a table at all? Why can't you just run the report as a query and not store the results in a table?
Unreason
Running the report can take hours (can involve literally hundreds of queries, hitting tables with millions of rows). For usability and site stability, this is done in a background process, and the user is informed when it is done running. The actual output of the report (which basically is a large pivot table over multiple columns, summarizing the results of these many queries) may only be a table of relatively small dimensions (20-50 columns, 50-100 rows) compared to the data set, but it is impractical to have the user wait for the report to finish. Thus storing the results in the interim.
Yaakov Ellis
@Yaakov Ellis, your answer changes the picture again :) Did you do your homework on the concepts of http://en.wikipedia.org/wiki/Multidimensional_database? Interestingly (but probably largely depending on the usage of the OLAP), there are papers that promote high levels of normalization for OLAP work.
Unreason
A: 

If you are really not using the tables for anything other than as a chunk of read-only data, you could just as well store all the reports in a single table, as XML values.
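
A minimal sketch of this single-table XML approach (table and column names are assumptions):

```sql
-- One row per report run; the entire result set is stored as one XML value.
CREATE TABLE dbo.ReportResultsXml (
    ReportVersionID int      NOT NULL PRIMARY KEY,
    ResultData      xml      NOT NULL,
    CreatedAt       datetime NOT NULL DEFAULT GETDATE()
);
```

Retrieval then becomes a single-row lookup, with all shredding and assembly of the XML happening in the application.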

Guffa
I would think that processing the data to and from XML would be a pretty significant performance hit.
Cyrena
@Cyrena: Yes, but creating a table is also a significant performance hit...
Guffa
@Cyrena, if the XML operations are always atomic (he always reads it all) and if you relieve the server of the encoding/decoding, you actually might get better performance.
Unreason
A: 

What column or columns would the PK index be built on? If just a surrogate identity column, you'll have no performance hit when inserting rows, as they'd be inserted "in order". If it is not a surrogate key, then you have the admittedly minor but still useful assurance that you don't have duplicate entries.

Is the primary key used to control the order in which report rows are to be printed? If not, then how do you ensure proper ordering of the information? (Or is this just a data table that gets summed one way and another whenever a report is generated?)

If you use a clustered primary key, you wouldn't use as much storage space as you would with a non-clustered index.

By and large, I find that while not every table requires a primary key, it does not hurt to have one present, and since proper relational database design requires primary keys on all tables, it's good practice to always include them.
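
For reference, a sketch of the conventional layout described here - a surrogate identity column as a clustered PK, so new rows simply append in order (names are hypothetical):

```sql
-- Per-run report table with a surrogate clustered PK.
CREATE TABLE dbo.ReportRun99 (
    RowID    int IDENTITY(1,1) NOT NULL,
    RefValue varchar(100)  NOT NULL,
    Metric   decimal(18,4) NOT NULL,
    CONSTRAINT PK_ReportRun99 PRIMARY KEY CLUSTERED (RowID)
);
```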

Philip Kelley
PK not used to control print order, since all results are retrieved and need to be processed on the server before being output to the user. If I used any PK column, it would be a clustered int column with autoincrement.
Yaakov Ellis
@Philip Kelley: "If you use a clustered primary key, you wouldn't use as much storage space as you would with a non-clustered index" That isn't true. A clustered index usually takes up more space than a nonclustered one and often it also increases the size of any nonclustered indexes on the same table because the cluster key is included in each of the other indexes. So clustered indexes almost invariably increase the size of indexes in total and require more storage than nonclustered indexes on a heap.
dportas
My statement was in response to *this* question, and not a general-purpose comment. The table @Yaakov proposed was indexless (or so I interpreted), and he was thinking of adding just the one primary key. If there is only one index, a nonclustered index + the table will require more storage space than a clustered index + the table, since the nonclustered index will require the "additional" leaf level of the index's search tree. As you say, things would be different if there were multiple indexes present.
Philip Kelley
+1  A: 

I would keep it simple: just store the report results, converted to JSON or XML, in a VARCHAR(MAX) column.

AlexKuznetsov
+1  A: 

One of the most useful and least (explicitly) emphasized benefits of data integrity (primary keys and foreign key references, to start with) is that it forces a 'design by contract' between your data and your application(s), which stops quite a few classes of bugs from doing any damage to your data. This is a huge win, and one that is implicitly taken for granted (it is not 'the database' that protects the data, but the integrity rules you specify; forsake the rules and you expose your data to various levels of degradation).

This seems unimportant to you (judging from the fact that you did not even discuss what a possible primary key would be), and your data seems quite unrelated to other parts of the system (judging from the fact that you will not do joins to any other tables); but still - all things being equal, I would model the data properly, and then, if the primary keys (or other data integrity rules) go unused and you are chasing every last bit of performance, I would consider dropping them in production (and testing for any actual gains).

As for the comments that creating tables is a performance hit - that is true, but you did not tell us how temporary these tables are. Once created, will they be heavily used before being scrapped? Or do you plan to create tables for just a dozen read operations?

If you will use these tables heavily and provide a clean mechanism for managing them (removing them when not used, selecting them, etc.), I think that dynamically creating the tables would be perfectly fine (you could have shared more details on the tables themselves; a use case would be nice).

Notes on other solutions:

EAV model

is horrible unless very specific conditions are met (for example: flexibility is paramount and automating DDL is too much of a hassle). Keep away from it (or be very, very good at anticipating what kinds of queries you will have to deal with, and rigorous in validating data on the front end).
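
For readers who have not met it, the EAV shape being warned against looks roughly like this (hypothetical names). Note that the would-be column names and datatypes stop being enforced by the schema:

```sql
-- Every value of every report lands in one generic table, typed as a string.
CREATE TABLE dbo.ReportValuesEAV (
    RunID     int          NOT NULL,  -- which report run
    RowNo     int          NOT NULL,  -- which logical row of the report
    Attribute varchar(100) NOT NULL,  -- the would-be column name, demoted to data
    Value     varchar(max) NULL,      -- every datatype squeezed into a string
    CONSTRAINT PK_ReportValuesEAV PRIMARY KEY (RunID, RowNo, Attribute)
);
```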

XML/BLOB approach

might be the right thing for you if you will consume the data as XML/BLOBs at the presentation layer (you always read all of the rows, you always write the whole 'object', and, finally, your presentation layer likes XML/BLOBs).

EDIT: Also, depending on the usage patterns, having a primary key can indeed increase the speed of retrieval, and if I can read the fact that the data will not be updated as 'it will be written once and read many times', then there is a good chance that this will indeed outweigh the cost of updating the index on inserts.

Unreason