I was wondering what the best approach would be for the following situation:

I have an Orders table in a database that contains literally ALL orders, including the complete/finished ones, which are just flagged as 'complete'. From the open orders I want to calculate some things (like open amount, open items, etc.). Which would be better performance-wise:

Keep one Orders table with ALL orders, including the complete/archived ones, and do the calculations by filtering on the 'complete' flag?

Or should I create another table, e.g. 'Orders_Archive', so that the Orders table would only contain open orders that I use for the calculations?

Is there any (clear) performance difference in these approaches?

(B.T.W. I'm on a PostgreSQL db.)
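For concreteness, the kind of calculation I mean looks roughly like this (column names are simplified placeholders for my real schema):

-- 'complete' and 'amount' are placeholders for my real columns
SELECT count(*)    AS open_items,
       sum(amount) AS open_amount
FROM   orders
WHERE  complete = false;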

+2  A: 

Or should I create another table, e.g. 'Orders_Archive', so that the Orders table would only contain open orders that I use for the calculations?

Yes. This is called data warehousing. Folks do it because removing the rarely-used history speeds up the transactional system. First, the tables are physically smaller and process faster. Second, a long-running history report doesn't interfere with transactional processing.

Is there any (clear) performance difference in these approaches?

Yes. Bonus. You can restructure your history so that it's no longer in 3NF (for updating) but in a Star Schema (for reporting). The advantages are huge.

Buy Kimball's book, The Data Warehouse Toolkit, to learn more about star-schema design and about migrating history out of active tables into warehouse tables.
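As a rough illustration (all table and column names below are invented for the example, not a prescribed design), a star schema keeps one central fact table surrounded by small denormalized dimension tables:

-- illustrative sketch only, not a complete design
CREATE TABLE dim_date (
    date_key  integer PRIMARY KEY,
    full_date date NOT NULL
);

CREATE TABLE dim_customer (
    customer_key integer PRIMARY KEY,
    name         text NOT NULL
);

-- each row records one order event at the chosen grain
CREATE TABLE fact_order (
    date_key     integer REFERENCES dim_date,
    customer_key integer REFERENCES dim_customer,
    amount       numeric(12,2),
    item_count   integer
);

Reports then join the fact table to a few small dimensions instead of walking the normalized transactional schema.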

S.Lott
Well, just archiving old records is still a far cry from a data warehouse. It's a start though...
sleske
I agree with this answer. But you are talking massive, massive, massive amounts of data before you really need to deploy something like this. An order table with 100,000 orders is probably not going to need it.
Sam Saffron
I thought that was called table partitioning; you use a view to emulate the complete table.
Sam Saffron
@Sam Saffron: Massive? No. You'll find good results moving just thousands of rows of history out of active tables into star-schema "reporting" tables. Data Warehouse doesn't mean big. It means optimized for reporting. Which is a distinctive architecture. A "data mart" is what some people like to call them so no one gets scared off by sizing issues.
S.Lott
@Sam Saffron: There are a variety of techniques you can use. You can partition tables by actually setting up separate tables; then you can combine them into a view if you want. You can also let the DBMS handle the partitioning internally; that is called value-based partitioning.
sleske
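A minimal sketch of the view-based emulation described above (table names are illustrative):

-- one logical table over two physical ones
CREATE VIEW all_orders AS
    SELECT * FROM orders          -- open orders
    UNION ALL
    SELECT * FROM orders_archive; -- completed orders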
@Sam Saffron: Table Partitioning is unrelated to creating a proper star schema. Some folks like it because it reduces lock contention and can optimize queries with minimal work. It doesn't dramatically improve report performance the way a star schema does. But it's low-cost.
S.Lott
@sleske: "just archiving old records" can be short-sighted. Don't start there. Start with history in a star-schema, re-normalized for reporting. That does three things: archive old records and speed up transactional performance and speed up historical reporting.
S.Lott
Thank you very much for this answer. I will look into Kimball's book and implement it that way. I want it to be 'robust' for a long time, since there could be lots of orders in the future. Creating warehouse tables sounds like the better thing to do.
Sem Tukenmez
+5  A: 

This is a common problem in database design: The question of whether to separate or "archive" records that are no longer "active".

The most common approaches are:

  • Everything in one table, mark orders as "complete" as appropriate. Pros: Simplest solution (both code- and structure-wise), good flexibility (e.g. easy to "resurrect" orders). Cons: Tables can get quite large, a problem both for queries and for e.g. backups.
  • Archive old stuff to a separate table. This solves the problems of the first approach, at the cost of greater complexity.
  • Use a table with value-based partitioning. That means that logically (to the application) everything is in one table, but behind the scenes the DBMS puts the data into separate areas depending on the value(s) of some column(s). You'd probably partition on the "complete" column, or on the "order completion date".

The last approach kind of combines the good parts of the first two, but needs support in the DBMS and is more complex to set up.
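For illustration only: on a modern PostgreSQL (10 or later; older versions emulate this with table inheritance and triggers), the third approach can be declared directly. Column names are invented for the sketch:

-- value-based (list) partitioning on the "complete" flag
CREATE TABLE orders (
    id       bigint,
    complete boolean NOT NULL,
    amount   numeric(12,2)
) PARTITION BY LIST (complete);

CREATE TABLE orders_open     PARTITION OF orders FOR VALUES IN (false);
CREATE TABLE orders_complete PARTITION OF orders FOR VALUES IN (true);

The application keeps querying orders as one table; the DBMS routes rows to the right partition automatically.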

Note:

Tables that only store "archived" data are commonly referred to as "archive tables". Some DBMSs even provide a special storage engine for them (e.g. MySQL's ARCHIVE engine), optimized for good storage efficiency, at the cost of restricted or slow changes (MySQL's, for instance, only supports inserting and reading rows).

sleske
+1  A: 

Since you are using PostgreSQL, you can take advantage of partial indexes. Suppose that for unfinished orders you often query by orderdate; you can then define an index like this:

-- index only the rows of orders that are not yet complete
create index order_orderdate_unfinished_ix on orders ( orderdate )
  where completed is null or completed = 'f';

With that condition, PostgreSQL will not index the completed orders. This saves disk space and makes the index much faster, because it contains only a small amount of data. So you get the benefit without the hassle of table separation.
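A query whose WHERE clause implies the index predicate can then use it, for example (the date range here is arbitrary):

-- the planner can match this against the partial index
select * from orders
where orderdate >= current_date - 7
  and (completed is null or completed = 'f');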

When you separate the data into ORDERS and ORDERS_ARCHIVE, you will have to adjust your existing reports. If you have lots of reports, that can be painful.

See the full description of partial indexes on this page: http://www.postgresql.org/docs/9.0/static/indexes-partial.html

EDIT: for archiving, I prefer to create another database with an identical schema, then move the old data from the transaction db to that archive db.
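If the archive lives in the same database instead (say, in a separate schema named archive; moving across databases would need dblink, postgres_fdw, or dump-and-restore), the move can be a single atomic statement on PostgreSQL 9.1 or later. The cutoff below is an arbitrary example:

-- move completed orders older than a year into the archive
with moved as (
    delete from orders
    where completed = 't'
      and orderdate < current_date - interval '1 year'
    returning *
)
insert into archive.orders select * from moved;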

endy_c
+3  A: 

Never split off or separate current/archived data. It is simply incorrect. It may be called "data warehousing" or a bucket of fish, but it is wrong and unnecessary, and it creates problems that were not otherwise present. The result is:

  • everyone who queries the data now has to look for it in two places rather than one
  • worse, they have to combine aggregated values from the two sources manually (in Excel or whatever)
  • you introduce anomalies in the key and lose integrity (uniqueness that would otherwise be enforced by a single db constraint)
  • when a Completed Order (or many) needs to be changed, you have to fish it out of the "warehouse" and put it back in the "database"

If, and only if, the response on the table is slow, then address that and enhance the speed. Only that. Nothing else. In every case I have seen, this is an indexing error: a missing index, the incorrect columns, or the incorrect sequence of columns. Generally, all you will need is the IsComplete column in an index, along with whatever your users search on most frequently, to in/exclude Open/Complete Orders.
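One possible shape of such an index (is_complete and customer_id are hypothetical names standing in for your flag column and your users' most frequent search column):

-- hypothetical column names: the flag first, the common search column second
create index orders_complete_customer_ix
    on orders ( is_complete, customer_id );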

Now, if your DBMS platform cannot handle large tables or large result sets, that is a different problem, and you have to use whatever methods are available in the tool. But as a database design issue it is simply wrong; there is no need to create a duplicate, populate it, and maintain it (with all the ensuing problems) unless you are limited by your platform.

Both last year and this year, as part of ordinary performance assignments, I have consolidated such split tables with billions of rows (and had to resolve all the duplicate-row problems that allegedly "did not exist"; yeah right, two days just for that). The consolidated tables with the corrected indices were faster than the split tables; the excuse that "billions of rows slowed the table down" was completely false. The users love me because they no longer have to use two tools and query two "databases" to get what they need.

PerformanceDBA
Actually, it is not a data warehouse by definition. This is one of the many wrong definitions people throw around. A data warehouse is, or contains, reporting copies; it says nothing about deleting the original.
TomTom
@Tom. You're right. Too many loose and fluffy toys in the computer room these days.
PerformanceDBA