I work for a fulfillment company and we have to pack and ship many orders from our warehouse to customers. To improve efficiency we would like to group identical orders and pack them in the most efficient way. By identical I mean orders with the same number of order lines, containing the same SKUs and the same order quantities.

To achieve this I was thinking about hashing each order. We can then group by hash to quickly see which orders are the same.
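For example (just a rough sketch in PostgreSQL, using made-up table and column names order_lines(order_id, sku, qty) rather than our real schema), I picture building a hash from each order's lines and grouping on it like this:

    -- Build a per-order signature from its lines (ordered by SKU so the
    -- physical line order doesn't matter), hash it, then group identical orders.
    WITH order_hashes AS (
        SELECT order_id,
               md5(string_agg(sku || ':' || qty::text, ',' ORDER BY sku, qty)) AS order_hash
        FROM   order_lines
        GROUP  BY order_id
    )
    SELECT order_hash,
           array_agg(order_id) AS identical_orders,
           count(*)            AS group_size
    FROM   order_hashes
    GROUP  BY order_hash
    HAVING count(*) > 1;   -- only hashes shared by more than one order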

We are moving from an Access database to a PostgreSQL database, and we have .NET-based systems for data loading and general order processing, so we can either do the hashing during data loading or hand this task over to the DB.

My question, firstly, is: should the hashing be managed by the DB, possibly using triggers, or should the hash be created on the fly using a view or something similar?

And secondly, would it be best to calculate a hash for each order line and then combine these into an order-level hash for grouping, or should I just use a trigger on all CRUD operations against the order lines table that recalculates a single hash for the entire order and stores the value in the orders table?
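By the trigger option I mean something along these lines (again only a rough sketch, assuming a hypothetical order_hash column on the orders table, an integer order_id key, and the same made-up line columns as above):

    -- Recalculate the stored hash for the affected order whenever its lines change.
    CREATE OR REPLACE FUNCTION refresh_order_hash() RETURNS trigger AS $$
    DECLARE
        affected_order integer;
    BEGIN
        -- OLD is the only row available on DELETE; NEW otherwise.
        IF TG_OP = 'DELETE' THEN
            affected_order := OLD.order_id;
        ELSE
            affected_order := NEW.order_id;
        END IF;

        UPDATE orders
        SET    order_hash = (SELECT md5(string_agg(sku || ':' || qty::text, ',' ORDER BY sku, qty))
                             FROM   order_lines
                             WHERE  order_id = affected_order)
        WHERE  order_id = affected_order;

        RETURN NULL;  -- return value is ignored for AFTER row-level triggers
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER order_lines_hash_refresh
    AFTER INSERT OR UPDATE OR DELETE ON order_lines
    FOR EACH ROW EXECUTE PROCEDURE refresh_order_hash();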

TIA

+1  A: 

Unless you have requirements constraining this, you can put the hash wherever you feel most comfortable. For example, it may be a lot easier to code in .NET than in SQL. This is a workable approach as long as the orders in the database are not modified directly, but only through a data access layer used by all your apps. The data access layer can then manage the hash.

Even with a hash in place, you will still have to check that the hashed orders are indeed the same. This is because it's very difficult to create a perfect hash function - one with no collisions, where every distinct object hashes to a different value - on data that can vary so much in structure.

This suggests that you will need a query (or code) that, given a set of orders, determines which of them are actually equal, grouping them into equivalence sets - e.g. for the orders mapping to the same hash code, are they really equal? If you start here, this query can also be used to find duplicate orders across the whole database. It may not be fast, in which case you can then look at improving performance by computing the hash at the time an order is inserted or updated.
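As a sketch of what that query might look like (assuming hypothetical order_lines(order_id, sku, qty) columns - your schema will differ), you can group orders by their full canonical line listing rather than by the hash, so two orders only end up together if their lines really are identical:

    -- Group orders by the complete, canonically ordered list of their lines.
    -- No hash is involved, so collisions cannot lump different orders together.
    WITH canonical AS (
        SELECT order_id,
               string_agg(sku || ':' || qty::text, ',' ORDER BY sku, qty) AS line_signature
        FROM   order_lines
        GROUP  BY order_id
    )
    SELECT line_signature,
           array_agg(order_id ORDER BY order_id) AS identical_orders,
           count(*)                              AS group_size
    FROM   canonical
    GROUP  BY line_signature
    HAVING count(*) > 1;   -- keep only genuine duplicate groups

If that proves too slow on the whole table, the same query can be restricted to the orders sharing a single hash value, which keeps each group small.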

mdma