I have a client-server application which gets all the data out of a couple of tables, recalculates something and stores it.
Example:
Each Item has a 'Bill of Materials' (BOM) = the list of other items it is made of, and in what quantities. The cost of an item is therefore the sum, over its BOM, of each component's cost times its quantity. Ultimately, some "base" items have no BOM and just have their cost set independently (i.e. raw materials).
e.g. A's BOM says it's made of 2xB and 3xC, so cost(A) = 2*cost(B) + 3*cost(C).
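To pin the cost model down, here's a minimal sketch in plain Python (the item names and numbers are made up for illustration):

```python
# base_cost holds the independently set cost of raw materials;
# bom maps an item to the (component, quantity) pairs it is made of.
base_cost = {"B": 10.0, "C": 4.0}
bom = {"A": [("B", 2), ("C", 3)]}   # A = 2xB + 3xC

def cost(item):
    """Cost of an item: its own base cost, or sum of component costs * quantities."""
    if item in base_cost:
        return base_cost[item]
    return sum(cost(component) * qty for component, qty in bom[item])

print(cost("A"))   # 2*10.0 + 3*4.0 = 32.0
```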
What I do now, and I don't remember why I do it like this, is: I get all the items and all the BOMs out of the DB, then go through the items one at a time, calculating each cost recursively. Once I've calculated an item, I flag it so I don't redo that cost again (which also guards against infinite recursion).
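Concretely, the current code does roughly this (Python with psycopg2 purely for illustration; the table and column names, item(id, cost, is_base) and bom(parent_id, component_id, quantity), are invented stand-ins for my real schema):

```python
import psycopg2

conn = psycopg2.connect("dbname=erp")   # hypothetical connection string
cur = conn.cursor()

# Load everything up front (this is the part that won't scale memory-wise).
cur.execute("SELECT id, cost, is_base FROM item")
items = {row[0]: {"cost": row[1], "is_base": row[2]} for row in cur.fetchall()}

cur.execute("SELECT parent_id, component_id, quantity FROM bom")
bom = {}
for parent_id, component_id, qty in cur.fetchall():
    bom.setdefault(parent_id, []).append((component_id, qty))

done = set()   # the "already calculated" flag; also guards against infinite recursion

def recalc(item_id):
    if item_id in done or items[item_id]["is_base"]:
        return items[item_id]["cost"]
    done.add(item_id)   # flag before recursing, so a cyclic BOM can't loop forever
    items[item_id]["cost"] = sum(recalc(c) * q for c, q in bom.get(item_id, []))
    return items[item_id]["cost"]

for item_id in items:   # batch pass over the whole DB
    recalc(item_id)

# ...then write every cost back, one UPDATE per item (the other slow part).
for item_id, data in items.items():
    cur.execute("UPDATE item SET cost = %s WHERE id = %s", (data["cost"], item_id))
conn.commit()
```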
Thing is, this is kinda stupid: first, it's slow and it recalculates stuff that hasn't changed; worse, give it a big enough DB and it will run out of memory.
Instead, I could recalculate items on demand: when an Item's BOM changes, I recalculate that BOM, then SELECT all the BOMs which contain the updated Item and recalculate them as well; rinse and repeat recursively until I reach the top, where no BOM in the DB depends on any of the changed items.
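With the same invented schema as above, the on-demand propagation would look something like this (a sketch, assuming the BOMs form a DAG; an item reachable through several paths may get recalculated more than once):

```python
def propagate(cur, changed_ids):
    """After the costs/BOMs of `changed_ids` changed, recalculate every item above them."""
    while changed_ids:
        # Find the direct parents: every BOM that contains one of the changed items.
        cur.execute(
            "SELECT DISTINCT parent_id FROM bom WHERE component_id = ANY(%s)",
            (list(changed_ids),),
        )
        parents = {row[0] for row in cur.fetchall()}
        for parent_id in parents:
            # Recompute this parent from its components' current costs.
            cur.execute(
                """UPDATE item SET cost = (
                       SELECT COALESCE(SUM(c.cost * b.quantity), 0)
                       FROM bom b JOIN item c ON c.id = b.component_id
                       WHERE b.parent_id = %s)
                   WHERE id = %s""",
                (parent_id, parent_id),
            )
        changed_ids = parents   # rinse and repeat until we reach the top
```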
What this means in practice: say some of the Items are raw materials, whose cost might be updated often, and some Items are "end-user" stuff, whose BOM will rarely if ever change. When the user changes the cost of one of those materials, it might mean going through thousands of Items, recalculating them. Say a SELECT of 1 Item/BOM takes 15ms (I'm on PostgreSQL); then merely SELECTing 1000 Items/BOMs will take 15 seconds, and then you have to UPDATE the recalculated cost back into each Item in the DB... oh dear, latency can turn into minutes now.
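For what it's worth, I suspect most of that 15ms is per-query round-trip overhead rather than data volume: with the invented schema above, the whole set of affected ancestors can at least be fetched in a single round trip with a recursive query, though they still have to be recalculated in the right order afterwards:

```python
# One round trip instead of one SELECT per level: PostgreSQL's WITH RECURSIVE
# walks the BOM graph upward and returns every item affected by the change.
AFFECTED_SQL = """
    WITH RECURSIVE affected(id) AS (
        SELECT parent_id FROM bom WHERE component_id = %(changed)s
        UNION
        SELECT b.parent_id FROM bom b JOIN affected a ON b.component_id = a.id
    )
    SELECT id FROM affected
"""

def affected_items(cur, changed_id):
    # The returned ids still need recalculating children-before-parents.
    cur.execute(AFFECTED_SQL, {"changed": changed_id})
    return [row[0] for row in cur.fetchall()]
```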
The ERP software the company I work for uses takes the first approach: batch-recalculate the entire DB at once. This literally takes hours, and it seems the problems have been building up with this approach over its 10+ years of use. The batch recalculation is done weekly.
Now that I've actually "written this out loud", I don't think that it taking a few minutes matters too much. The problem is that I don't understand databases well, and I'm worrying about concurrency: since updating Item A will take a long time, it's likely someone will update a second Item B while Item A is still being updated.
Say Item D is made out of the A and B above. User 1 updates A, so the server software starts grinding through the DB for a couple of minutes, eventually updating D. But in the meantime, User 2 updates B, so the server will eventually update D again.
Will using PostgreSQL's transactions solve the problem? A transaction begins with the then-current state of the DB, so Transaction 1 sees D being made out of A1 and B1, and it's updating A from A1 to A2; but before it finishes and commits, Transaction 2 begins, also seeing A1 and B1, and updates B from B1 to B2. T1 recalculates and commits: D = A2 + B1. But T2 began earlier and doesn't see the new A2, so it finally commits D = A1 + B2 to the DB, which is incorrect. It should be D = A2 + B2.
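For concreteness, here is roughly how I'd wrap the recalculation in a transaction with the isolation level made explicit (Python/psycopg2 purely for illustration, reusing the propagate() sketch from above); whether SERIALIZABLE actually closes this D = A1 + B2 hole, at the price of retrying aborted transactions, is part of what I'm asking:

```python
import psycopg2
from psycopg2 import errors                      # psycopg2 >= 2.8 assumed
from psycopg2.extensions import ISOLATION_LEVEL_SERIALIZABLE

def recalc_in_transaction(conn, changed_id):
    # If I understand the docs correctly, SERIALIZABLE makes one of two
    # conflicting recalculations fail with a serialization error instead of
    # silently committing a stale result, so the loser just retries.
    conn.set_session(isolation_level=ISOLATION_LEVEL_SERIALIZABLE)
    while True:
        try:
            with conn:                            # commits on success, rolls back on error
                with conn.cursor() as cur:
                    propagate(cur, {changed_id})  # the propagation sketch from above
            return
        except errors.SerializationFailure:
            continue                              # lost the race: redo the whole thing
```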
Also, some processing will overlap, wasting server time.
If I do T1 and T2 in sequence instead of in parallel, then hooray, the answer is correct, but User 2 has to wait longer. Also, if a group of transactions has no relation to each other (completely independent dependency trees, e.g. A=X+Y and B=N+M), then parallel computation gives the correct answer AND is faster for the user.
Important note: even when processing in sequence, I'd still use transactions, so the rest of the software can still work with that data in parallel, except for the function that recalculates cost.
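One way I can imagine getting that "in sequence, but only for this one function" behaviour is a PostgreSQL advisory lock taken inside the recalculation's transaction (a sketch; the key 42 is arbitrary, and propagate() is the earlier sketch):

```python
COST_RECALC_LOCK = 42   # arbitrary application-chosen key for this one function

def recalc_serialized(conn, changed_id):
    with conn:
        with conn.cursor() as cur:
            # Only one cost recalculation can hold this lock at a time; everything
            # else in the app touches the same tables normally and is not blocked.
            # The lock is released automatically when the transaction ends.
            cur.execute("SELECT pg_advisory_xact_lock(%s)", (COST_RECALC_LOCK,))
            propagate(cur, {changed_id})
```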
Now, this whole "process in sequence" thing would not be so bad if the DB latency weren't so "awful". If, say, all the data were held in RAM, then going through 1000 objects would be a breeze. Ah, but even if I built a system to quickly move chunks of data between disk and RAM and do some caching (to replace the DB), that won't do, because I'd still need transactions so that the rest of the server functionality can work in parallel (see the 'important note' above). So I'd end up building another DB. It might be a bit faster, but it's stupid / a waste of time.
The whole reason I "cache" the cost of each Item is so that I don't recalculate it every time I use it: not only would that waste limited resources, the DB latency is just too big, and the concurrency issues scale even worse.
Now I no longer wonder why "they" did it in big batches... this is making my head hurt.
Q1: How do you guys solve this in an "optimum" way?
From my current understanding (that is, after facing the concurrency problem I'd silently ignored before), I would make that function run its transactions in sequence, while the rest of the app can still use the data in parallel, which I believe is best for the user. That's the goal: best for the user, but guaranteed correctness for the system.
Maybe later on I could throw hardware at it and use software black magic to reduce that latency, but I'm beginning to lie to myself now.
Also, in the past couple of months I've been completely blind to several dead-obvious things (some not related to programming), so I'm expecting that someone will point out something shamefully obvious that I managed to miss... :|