I have a client-server application which gets all the data out of a couple of tables, recalculates something and stores it.

Example:

Each Item has a 'Bill of Materials' = the list and quantities of the other items it is made out of. Therefore the cost of an item is the sum of the costs of the items in its BOM, each multiplied by its quantity. Ultimately, some "base" items have no BOM and just have their cost set independently (i.e. raw materials).

e.g.: A's BOM says it's made out of 2xB and 3xC.

What I do now, and I don't remember why I do it like this, is I get all the items and all the BOMs out of the DB, and go through the items one at a time, calculating each item's cost recursively. Once I calculate an item, I flag it so I don't redo the cost again (which also guards against infinite recursion).

Thing is, this is kinda stupid: first, it's slow and recalculates stuff that hasn't changed, and worse, given a big enough DB, it will run out of memory.

Instead, I could recalculate items on demand: when an Item's BOM changes, I recalculate that BOM, then SELECT all the BOMs which contain the updated Item and recalculate them as well; rinse and repeat recursively until you reach the top, where no BOM in the DB depends on any of the changed items.
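
One step of that recursion is just a lookup of the parent BOMs; in rough SQL it would look something like this (the bom(parent_id, component_id, quantity) table here is only a stand-in, not my real schema):

    SELECT parent_id, quantity
      FROM bom
     WHERE component_id = 'B';   -- every BOM that directly contains the changed item
    -- ...recalculate each of those parents, then repeat the same SELECT for each of
    -- them, until nothing in the DB depends on anything that changed.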

What this means in practice: say some of the Items are raw materials, whose cost might be updated often, and some Items are "end-user" stuff, whose BOM will rarely if ever change. When the user changes the cost of one of those materials, it might mean going through thousands of Items, recalculating them. Say a SELECT of 1 Item/BOM takes 15ms (I'm on PostgreSQL); then merely SELECTing 1000 Items/BOMs will take 15 seconds, and then you have to UPDATE the recalculated cost back into the Item in the DB... oh dear, latency can turn into minutes now.

The ERP software the company I work for uses takes the first approach: batch-recalculate the entire DB at once. This literally takes hours, and the problems seem to have been building up with this approach over the 10+ years of usage. The batch recalculation is done weekly.

Now that I've actually "written this out loud", I don't think the fact that it takes a few minutes matters too much. The problem is that I don't understand databases well, and I'm worrying about concurrency: since it will take a long time to update Item A, it is likely someone will update a second Item B while Item A is being updated.

Say Item D is made out of the A and B above. User 1 updates A, so the server software starts grinding away at the DB for a couple of minutes, eventually updating D. But in the meantime, User 2 updates B, so the server will eventually update D again.

Will using PostgreSQL's transactions solve the problem? A transaction begins with the then-current state of the DB, so Transaction 1 sees D being made out of A1 and B1, and it's updating A from A1 to A2, but before it finishes and commits, Transaction 2 begins, also seeing A1 and B1. T1 recalculates and commits, D = A2 + B1. But T2 has already begun, and doesn't see the new A, A2. So it finally commits to the DB that D = A1 + B2, which is incorrect. It should be D = A2 + B2.
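
Spelled out as two interleaved sessions (item(id, cost) is just a stand-in for my real table), that's:

    -- T1: BEGIN;
    -- T1: SELECT cost FROM item WHERE id IN ('A', 'B');   -- sees A1 and B1
    -- T2: BEGIN;
    -- T2: SELECT cost FROM item WHERE id IN ('A', 'B');   -- also sees A1 and B1
    -- T1: UPDATE item SET cost = ... WHERE id = 'A';      -- A1 -> A2
    -- T1: UPDATE item SET cost = ... WHERE id = 'D';      -- D = A2 + B1
    -- T1: COMMIT;
    -- T2: UPDATE item SET cost = ... WHERE id = 'B';      -- B1 -> B2
    -- T2: UPDATE item SET cost = ... WHERE id = 'D';      -- D = A1 + B2, clobbers T1's D
    -- T2: COMMIT;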

Also, some processing will overlap, wasting server time.

If I do T1 and T2 in sequence instead of in parallel, then hooray, the answer is correct, but User 2 will have to wait longer. Also, if a group of transactions have no relation to each other (completely independent dependency trees; e.g. A=X+Y and B=N+M), then parallel computation will give the correct answer AND will be faster for the user.

Important note: even when processing in sequence, I'd still use transactions, so the rest of the software can still work with that data in parallel, except for the function that recalculates cost.

Now, this whole "process-in-sequence" thing would not be so bad if the DB latency weren't so awful. If, say, the entire data set were held in RAM, then going through 1000 objects would be a breeze. Ah, but even if I built a system to quickly move chunks of data between disk and RAM and did some caching (to replace the DB), that wouldn't do, because I'd still need transactions so that the rest of the server functionality can work in parallel (the 'important note' above). So I'd end up building another DB. It might be a bit faster, but it's a waste of time.

The whole reason I "cache" the cost of each Item is so that I don't recalculate it every time I use it: not only would that waste limited resources, the DB latency is just too big, and the concurrency issues scale even worse.

Now I no longer wonder why "they" did it in big batches... this is making my head hurt.

Q1: How do you guys solve this in an "optimum" way?

From my current understanding (that is, after facing the concurrency problem which I'd previously silently ignored), I would make that function use transactions in sequence, and the rest of the app will still be able to use the data in parallel, which I believe is best for the user. That's the goal: best for the user, but guaranteed correctness for the system.

Maybe later on I could throw hardware at it and use software black magic to reduce that latency, but I'm beginning to lie to myself now.

Also, in the past couple of months I've been completely blind to several dead-obvious things (some not related to programming), so I'm expecting that someone will point out something shamefully obvious that I managed to miss... :|

+2  A: 

I don't remember why I do it like this...

This jumps out at me, as the first thing you need to tackle!

There shouldn't be any reason you need to fetch data back to your application just to calculate the aggregate cost of each BOM. There are numerous techniques to work with "parts explosion" or hierarchical data sets in SQL.

I cover several solutions in my presentation "SQL Antipatterns Strike Back", or you can read a book like "Joe Celko's Trees and Hierarchies in SQL."

Some solutions are vendor-specific and some can be done with any plain SQL DBMS. I hadn't noticed which brand of database you're using, but Jonathan correctly points out that it's PostgreSQL.

In that case, you should read about "WITH" queries, which are new in PostgreSQL 8.4, and allow you to do some sophisticated recursive query effects.

http://www.postgresql.org/docs/current/static/queries-with.html
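
For illustration, a recursive WITH query along these lines can roll up one item's cost in a single statement (the bom(parent_id, component_id, quantity) and item(id, cost) tables here are made up - adjust the names to your actual schema):

    WITH RECURSIVE explode AS (
        -- start with the direct components of the item being costed
        SELECT component_id, quantity::numeric AS qty
          FROM bom
         WHERE parent_id = 'A'
        UNION ALL
        -- then walk down, multiplying the quantities along the way
        SELECT b.component_id, e.qty * b.quantity
          FROM explode e
          JOIN bom b ON b.parent_id = e.component_id
    )
    SELECT SUM(e.qty * i.cost) AS total_cost
      FROM explode e
      JOIN item i ON i.id = e.component_id
      -- only "base" items (those with no BOM of their own) carry an independent cost
     WHERE NOT EXISTS (SELECT 1 FROM bom b WHERE b.parent_id = e.component_id);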

I've implemented a system where BOMs were composed of hierarchies of individual resources, and I didn't have to do any of the batch processing you're describing (admittedly, there were only a few thousand resources in the DB while I worked on it).

You should learn how to use aggregate functions in SQL, like SUM() with GROUP BY (any book on SQL should cover this), and also techniques for storing hierarchical relationships between entities.
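
For example, one level of that roll-up is just an aggregate over the BOM rows (again, with the made-up table names from above):

    SELECT b.parent_id, SUM(b.quantity * i.cost) AS direct_component_cost
      FROM bom b
      JOIN item i ON i.id = b.component_id
     GROUP BY b.parent_id;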

Since you say you don't understand databases well, I recommend that you try implementing a "toy" system before you make any changes to your real system. I'm only speaking from personal experience, but I find that I can't learn a new technical skill while I'm simultaneously trying to employ that skill in a real project.

Bill Karwin

He mentions "I'm on PostgreSQL" at one point... - Jonathan Leffler

Oops! I missed that. - Bill Karwin
+1  A: 

This sounds to me like a calculation that would benefit from being a stored procedure in the database, more or less regardless of which implementation method you use. That cuts down on the traffic between client and server, which almost invariably improves the performance of a complex set of calculations like this.
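
As a rough sketch of what I mean (the item(id, cost) and bom(parent_id, component_id, quantity) tables are invented for the example; in PostgreSQL the "stored procedure" would be a function):

    CREATE OR REPLACE FUNCTION recalc_cost(p_item text) RETURNS numeric AS $$
    DECLARE
        v_cost numeric;
    BEGIN
        -- Roll up the (recursively recalculated) costs of the direct components, if any.
        SELECT SUM(b.quantity * recalc_cost(b.component_id))
          INTO v_cost
          FROM bom b
         WHERE b.parent_id = p_item;

        IF v_cost IS NULL THEN
            -- No BOM rows: a base item, whose cost is set independently.
            SELECT cost INTO v_cost FROM item WHERE id = p_item;
        ELSE
            UPDATE item SET cost = v_cost WHERE id = p_item;
        END IF;

        RETURN v_cost;
    END;
    $$ LANGUAGE plpgsql;

Like your current client-side code, this naive version recalculates a shared sub-assembly once per parent that uses it, but all the work stays inside the server, so there is no per-row network round trip.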

You say:

What I do now, and I don't remember why I do it like this, is I get all the items and all the BOMs out of the DB, and go through the items one at a time, calculating each item's cost recursively. Once I calculate an item, I flag it so I don't redo the cost again (which also guards against infinite recursion).

I'm puzzled about the 'flag it' part of this explanation - and not knowing why you do something the way you do it is bad news. You really need to understand what you are doing.

There are many ways to do BOM processing - and Bill Karwin has pointed you at some interesting info (the SQL Antipatterns link is about 250 slides!). The SQL Antipatterns section discusses 'naïve trees' (such as those outlined below). However, the solutions do not cover the case outlined below, where the same sub-tree can be used by multiple parents (because one sub-assembly can be a component of multiple products).

  • Path enumeration doesn't work: you can't use the same sub-assembly information because you build the containing product information into the path.
  • Nested sets work fine when the sub-assembly is used in one product; not when the sub-assembly is used in many products.
  • The 'closure table' solution can be adapted to cover this - it is more or less the second alternative below.

You need to consider whether it makes sense to do a bottom-up scan of the affected parts or whether you will be better off doing some sort of breadth-first or depth-first scan. One driver of this decision will be the nature of the BOM data. If you have a structure where some sub-assembly is used as a component of multiple products, do you record the parts used in the sub-assembly separately for each product, or do you record that the products use the sub-assembly?

To clarify:

  • Sub-assembly A (P001) contains 24 x 8mm nuts (P002), 24 x 8mm x 50 mm bolts (P003), 1 x baseplate (P004), 1 x coverplate (P005).
  • Product B (P006) contains 1 x Sub-assembly A and a number of other parts.
  • Product C (P007) contains 1 x Sub-assembly A and a number of other parts.

Your BOM records could look like this (naïve tree):

Part      Component     Quantity
P001      P002          24
P001      P003          24
P001      P004          1
P001      P005          1
P006      P001          1
P007      P001          1

Or they could look like this (closure table):

Part      Component     Quantity
P001      P002          24
P001      P003          24
P001      P004          1
P001      P005          1
P006      P002          24
P006      P003          24
P006      P004          1
P006      P005          1
P007      P002          24
P007      P003          24
P007      P004          1
P007      P005          1

This second case is much less desirable - it is much harder to get the values right, doubly so if, as in the case of parts like nuts or bolts, multiple sub-assemblies could use the same part, so getting the counts right in a major deliverable product (P006, P007) would be very hard. However, recalculating the cost of any part is much simpler in the second case - you simply count up the sum of the 'cost times quantity' for each component that makes up a part. If you retain the naïve tree to record the part-structure breakdown and (re)compute the closure table when the structure (not price) of some product or sub-assembly changes, then you are probably as close to nirvana as you're likely to get.
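
That recomputation can itself be a single recursive query; for instance, something along these lines (assuming made-up tables bom(part, component, quantity) for the naïve tree and bom_exploded(part, component, quantity) for the second form):

    -- Throw away the old explosion for simplicity (in practice you would scope this
    -- to the products whose structure actually changed, inside one transaction).
    DELETE FROM bom_exploded;

    INSERT INTO bom_exploded (part, component, quantity)
    WITH RECURSIVE explode AS (
        SELECT part, component, quantity::numeric AS quantity
          FROM bom
        UNION ALL
        -- walk down the tree, multiplying quantities (assumes the BOM data is acyclic)
        SELECT e.part, b.component, e.quantity * b.quantity
          FROM explode e
          JOIN bom b ON b.part = e.component
    )
    SELECT e.part, e.component, SUM(e.quantity)
      FROM explode e
     WHERE NOT EXISTS (SELECT 1 FROM bom b2 WHERE b2.part = e.component)  -- keep base parts only
     GROUP BY e.part, e.component;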

Somewhere (but on another computer than this one) I have some old code to mess around with this stuff, using fictitious assemblies. The coding was done ... mumble, mumble ... a long time ago, and uses temporary tables (and doesn't mention nested sets or path enumeration; it does compute closure tables) for a specific DBMS - it would have to be adapted to other DBMS. Ask, and I'll dig it out.

Jonathan Leffler