I have a database that contains a history of product sales. For example, the following table:

CREATE TABLE SalesHistoryTable (
    OrderID   INT,            -- Order number, unique across all orders
    ProductID INT,            -- Product ID; can be used as a key to look up product info in another table
    Price     DECIMAL(10,2),  -- Price of the product per unit at the time of the order
    Quantity  INT,            -- Quantity of the product for the order
    Total     DECIMAL(12,2),  -- Total cost of the order for the product (Price * Quantity)
    Date      DATE,           -- Date of the order
    StoreID   INT,            -- The store that created the order
    PRIMARY KEY (OrderID));

The table will eventually have millions of transactions. From this, profiles can be created for products in different geographical regions (based on the StoreID). Creating these profiles can be very time-consuming as a database query. For example:

SELECT ProductID, StoreID,
       SUM(Total) AS Total,
       SUM(Quantity) AS QTY,
       SUM(Total) / SUM(Quantity) AS AvgPrice
FROM SalesHistoryTable
GROUP BY ProductID, StoreID;

The above query could be used to get information about products for any particular store. You could then determine which store has sold the most, which has made the most money, and which on average sells for the most/least. This would be very costly to run as a normal ad hoc query. What are some design decisions that would allow these types of queries to run faster, assuming storage size isn't an issue?

For example, I could create another table with duplicate information: StoreID (key), ProductID, TotalCost, QTY, AvgPrice. I would then provide a trigger so that when a new order is received, the entry for that store is updated in the new table. The cost for the update is almost nothing.
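A rough sketch of what that summary table and trigger might look like (MySQL-style syntax assumed; the table and trigger names are just illustrative, and AvgPrice is computed at read time as TotalCost / QTY rather than stored):

CREATE TABLE ProductStoreSummary (
    StoreID   INT,
    ProductID INT,
    TotalCost DECIMAL(14,2),
    QTY       INT,
    PRIMARY KEY (StoreID, ProductID));

-- Keep the running totals current as each order arrives.
CREATE TRIGGER trg_SalesHistory_AfterInsert
AFTER INSERT ON SalesHistoryTable
FOR EACH ROW
    INSERT INTO ProductStoreSummary (StoreID, ProductID, TotalCost, QTY)
    VALUES (NEW.StoreID, NEW.ProductID, NEW.Total, NEW.Quantity)
    ON DUPLICATE KEY UPDATE
        TotalCost = TotalCost + NEW.Total,
        QTY       = QTY + NEW.Quantity;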

What should be considered given the above scenario?

+1  A: 

I'd consider:

  • a data warehouse/OLAP solution
  • (as you said) run your data mining queries against a separate precomputed table/dataset
  • indexed/materialised views, which are almost the same as the previous point (see the sketch below)

There are some questions though:

  • do you expect real time data?
  • what is your write volume?
  • what DB engine?
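As a sketch of the indexed/materialised view option (SQL Server syntax assumed, since indexed views are engine-specific; in Oracle the rough equivalent is a materialized view):

-- The view must be schema-bound and include COUNT_BIG(*) to be indexable,
-- and the summed columns must be declared NOT NULL. AvgPrice is left out
-- because expressions over aggregates are not allowed in an indexed view;
-- it would be computed at query time as Total / QTY.
CREATE VIEW dbo.ProductStoreSummary
WITH SCHEMABINDING
AS
SELECT ProductID, StoreID,
       SUM(Total)    AS Total,
       SUM(Quantity) AS QTY,
       COUNT_BIG(*)  AS RowCnt
FROM dbo.SalesHistoryTable
GROUP BY ProductID, StoreID;
GO

-- The unique clustered index is what makes the view's result set persisted.
CREATE UNIQUE CLUSTERED INDEX IX_ProductStoreSummary
    ON dbo.ProductStoreSummary (ProductID, StoreID);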
gbn
+1: The data could be real-time, with the inherent latency delays of course. I suppose putting in batch jobs and making the data update once an hour or so could be an option as well, as Eric mentioned. The write volume would be on the order of >1000/day. However, I have access to data that goes back to 2006. I'm not sure yet since I have not created and imported the data, but I'm guessing there are over 1.5 million rows of information.
galford13x
+1  A: 

You may want to look into using materialized views, so that the expensive aggregation only runs periodically when the view is refreshed rather than on every query.
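For example, in Oracle a periodically refreshed materialized view could look something like this (a sketch; the name and hourly refresh interval are just illustrative):

-- Rebuild the summary once an hour; report queries read the stored
-- results instead of re-aggregating the base table every time.
CREATE MATERIALIZED VIEW ProductStoreSummary_MV
BUILD IMMEDIATE
REFRESH COMPLETE
START WITH SYSDATE NEXT SYSDATE + 1/24
AS
SELECT ProductID, StoreID,
       SUM(Total)    AS Total,
       SUM(Quantity) AS QTY,
       SUM(Total) / SUM(Quantity) AS AvgPrice
FROM SalesHistoryTable
GROUP BY ProductID, StoreID;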

ElectricDialect
+1: Thanks, I hadn't heard of materialized views. I will certainly look into them.
galford13x
A: 

"The cost for the update is almost nothing."

Except that all updates must now be serialized: every new order for the same store and product has to update the same summary row, and no matter what, the ancient law of physics still remains that no two things can be in the same place at the same time.

Erwin Smout
I think I see what you're saying, but I'm not sure how that applies. If there are 1000 sales every hour, that means 1000 inserts into the SalesHistoryTable and 1000 trigger firings that each result in two additions and a division plus a row update. That seems much cheaper than running the query 1000 times, right?
galford13x
Perhaps I should change my statement to, "The cost for the update is almost nothing compared to the query"? That might put it in better perspective.
galford13x
+2  A: 

This is normally something you would use a data warehouse for, but aside from that, using a trigger to update a second table is a perfectly viable option.

You could also have a second table that is populated by a batch job on a periodic basis (a more data-warehouse-like option). You could also use materialized views if your database supports them.
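A batch job along those lines could be as simple as periodically rebuilding a summary table (a sketch; ProductStoreProfile is a hypothetical table with the columns proposed in the question, and the job itself would be scheduled with cron, SQL Server Agent, or similar):

-- Rebuild the whole summary from scratch: one big GROUP BY per run
-- instead of one per report query.
TRUNCATE TABLE ProductStoreProfile;

INSERT INTO ProductStoreProfile (StoreID, ProductID, TotalCost, QTY, AvgPrice)
SELECT StoreID, ProductID,
       SUM(Total),
       SUM(Quantity),
       SUM(Total) / SUM(Quantity)
FROM SalesHistoryTable
GROUP BY StoreID, ProductID;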

Eric Petroelje
+1: Thanks I'll look into materialized views.
galford13x