views: 115
answers: 3
This is not a question about using another tool. It is not a question about using a different data structure. It is a question about WHY I see what I see -- please read to the end before answering. Thank you.

THE STORY

I have a single table with one rule: records are never deleted. Instead, a record is marked as not active (there is a field for that), and in that case all fields (except the identifiers and this isActive field) are considered irrelevant.

More on identifiers -- there are two fields:

  • id -- int, primary key, clustered
  • name -- varchar, unique, non-clustered ("external") index

How an update is done, for example (I use C#/LINQ/MSSQL 2005): I fetch the record by name, change the required fields, and commit the changes, so the UPDATE is executed (the UPDATE uses id, not name).
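For concreteness, the setup above might look like this in T-SQL (a sketch; the payload column names and sizes are my own placeholders -- only id, name, and isActive come from the question):

```sql
-- Hypothetical mono-table layout
CREATE TABLE Record (
    id       INT IDENTITY PRIMARY KEY CLUSTERED,
    name     VARCHAR(100) NOT NULL,
    isActive BIT NOT NULL,
    payload1 VARCHAR(200) NULL,  -- "data" fields; irrelevant when isActive = 0
    payload2 VARCHAR(200) NULL
);
CREATE UNIQUE NONCLUSTERED INDEX IX_Record_name ON Record (name);

-- The described flow: locate by name, then update by id
-- (roughly what LINQ to SQL emits after the entity is fetched and changed)
SELECT id, name, isActive, payload1, payload2
FROM   Record WHERE name = 'some-name';

UPDATE Record
SET    isActive = 0, payload1 = 'changed'
WHERE  id = 42;
```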

However, there is a problem with storage. So why not break this table into a dual structure -- a "header" table (id, name, isActive) and a data table (id, rest of the fields)? In case of a storage problem we can then really delete records from the data table (for isActive = false).
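Sketched in T-SQL (again, the payload columns are placeholders of my own), the dual structure and the storage-reclaiming delete could look like:

```sql
CREATE TABLE Header (
    id       INT IDENTITY PRIMARY KEY CLUSTERED,
    name     VARCHAR(100) NOT NULL,
    isActive BIT NOT NULL
);
CREATE UNIQUE NONCLUSTERED INDEX IX_Header_name ON Header (name);

CREATE TABLE Data (
    id       INT NOT NULL PRIMARY KEY CLUSTERED
                 REFERENCES Header (id),   -- 1:1 with Header
    payload1 VARCHAR(200) NULL,
    payload2 VARCHAR(200) NULL
);

-- When storage runs low: really delete the data of inactive records
DELETE d
FROM   Data AS d
JOIN   Header AS h ON h.id = d.id
WHERE  h.isActive = 0;
```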

edit (by Shimmy): header + data are not retrieved by LINQ with a join. Data records are loaded on demand (and this always happens, because of how the code works).

comment (by poster): AFAIR there is no join, so this is irrelevant. Data for headers were loaded manually. See below.

Performance -- (my) Theory

Now, what about performance? Which one will be faster? Let's say you have 10,000 records in each table (single, header, data) and you update them one by one (in all 3 tables) -- the isActive field and some field from the "data" fields.

My calculation was/is:

  • mono table -- seek using the non-clustered index on name, jump to the row, fetch all the data, then update using the primary key.

  • dual tables -- seek using the non-clustered index on name, jump to the header row, fetch all the data, seek using the primary key on the data table (no jumping here, it is the clustered index), fetch all the data, then update both tables using their primary keys.

So, by my reckoning the mono structure should be faster, because in the dual case I perform the same operations plus some extras.
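The per-record step count above corresponds roughly to these statements (a sketch; the table and column names follow my reading of the question, not actual code from it):

```sql
-- Mono table: one index seek by name, one update by id
SELECT * FROM Record WHERE name = 'some-name';  -- non-clustered seek + bookmark lookup
UPDATE Record SET isActive = 1, payload1 = 'x' WHERE id = 42;

-- Dual tables: the same operations plus an extra seek and an extra update
SELECT * FROM Header WHERE name = 'some-name';  -- non-clustered seek + bookmark lookup
SELECT * FROM Data   WHERE id = 42;             -- clustered index seek, no lookup
UPDATE Header SET isActive = 1   WHERE id = 42;
UPDATE Data   SET payload1 = 'x' WHERE id = 42;
```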

The results

No matter what I do -- update, select, insert -- the dual structure is anywhere from slightly faster to 30% faster. And now I am puzzled -- I would understand it if I were inserting/updating/selecting only header records, but in every case the data records are used as well.

The question -- why/how can the dual structure be faster?

A: 

Have you looked at the two query plans? That often gives it away.

As for speculation: the size of a row in a table affects how fast you can scan it. Smaller rows mean more rows fit into a data page. The brunt of a query is usually I/O time, so using two smaller tables greatly reduces the amount of data you have to sift through in the indexes.

Also, the locks are more granular -- the first update could write to table1, and then the second update could write to table1 while you're finishing up table2.
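One way to compare the two designs under SQL Server 2005 is to look at actual I/O and the plan text (these are standard session options; the query itself is just an example against the hypothetical schema above):

```sql
SET STATISTICS IO ON;    -- logical/physical reads per statement
SET STATISTICS TIME ON;  -- CPU and elapsed time
GO

-- run the same workload against each schema and compare the read counts

SET SHOWPLAN_TEXT ON;    -- show the estimated plan instead of executing
GO
SELECT * FROM Record WHERE name = 'some-name';
GO
SET SHOWPLAN_TEXT OFF;
GO
```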

Jimmy
I looked at the estimated execution plan and didn't find anything suspicious. Fetching a record from the header table is surely the fastest, but because there are a lot of fields in the data table, fetching a record from there and from the mono table are on par.
macias
+1  A: 

I think this all boils down to how much data is being fetched, inserted, and updated.

SELECT case - in the dual-table configuration you're fetching less data. Database runtime is heavily dominated by I/O time, so having the "header" fields replicated on each row in the single-table configuration means you have to read that same data over and over again. In the dual-table configuration you only read the header data once.

INSERT case - similar to the above, but related to writing the data instead of reading it.

UPDATE case - your code updates the "isActive" field, which if I've read it correctly is part of the "header" fields. In the single-table configuration you're forcing many rows to be updated for each "isActive" change. In the dual-table configuration you're only updating a single header row for each "isActive" change.

I think this is a case of premature optimization. I get the feeling that you understood that according to data normalization rules the dual-table configuration was "better" - but because the single-table case seemed like it would provide better performance you wanted to go with that design. Thankfully you took the time to test what would happen and found that observed performance did not match your expectations. GOOD JOB! I wish more people would take the time to test things out like this. I think the lesson to learn here is that data normalization is a Good Thing.

Remember, the best time to optimize something is NEVER! The second-best time to optimize things is when you have an observed performance problem. The worst time to optimize is during analysis.

I hope this helps.

Bob Jarvis
+1 for the advice and for mentioning his scientific approach!
BlueTrin
You didn't read my post. I explicitly mentioned that in all cases all (!) 3 tables are used. That means on insert, records are added to the header, data, and mono tables; the same for update and select. So for every operation more data is transferred with the dual structure, because of the duplication of id.
macias
+1  A: 

Assumption: Sql Server for database.

SQL Server tends to perform better on narrow tables than on wide ones, while this might not be true for something such as a mainframe.

This really points to normalizing until you decide NOT to for performance reasons, and in this case the assumption that the de-normalized table would be more efficient is incorrect. Normalized structures can be managed better within the available resources than de-normalized ones in this environment. I suspect (with no citable basis for this) that the resources (hardware, multiple processors, threading, etc.) make the normalized structure faster because more work gets done at the same time.

Mark Schultheiss
Good assumption :-) -- I wrote this in the post: MSSQL 2005 :-). Thank you for the answer; the "problem" is that I don't have a normalization case here. You normalize tables to avoid redundancy -- for example, when keeping data about people you would set up a separate table for first names, because they repeat. Here I have a 1:1 relation: in the dual structure, for each record in the header table there is (in _this_ case) a record in the data table and vice versa. Redundancy is exactly the same as before (in the mono table).
macias
Some of the performance here comes from the platform's ability to manage the tables: even with a 1:1 relation, it is better able to handle narrow tables within the available resources.
Mark Schultheiss