I am trying to figure out the best way to handle inserting/updating/deleting large lists.

Specifically, my users need to select large lists of products and they will get reports on those items every night.

To oversimplify, here is the data model (a simple many-to-many relationship):

~ 5000 records total
+----------+------------+
| user_id  | user_name  |
+----------+------------+
|        1 | Ralph      |
|        2 | Bill       |
|        3 | Joe        |
|        4 | Mike       |
|        5 | Brian      |
|        6 | Jose       |
+----------+------------+

~ 6000 records total
+------------+------------+
| product_id |   product  |
+------------+------------+
|          1 | Widget A   |
|          2 | Widget B   |
|          3 | Widget C   |
|          4 | Widget D   |
|          5 | Widget E   |
|          6 | Widget F   |
+------------+------------+

As many as 30 million total
+----------+------------+
| user_id  | product_id |
+----------+------------+
|        1 |          1 |
|        1 |          4 |
|        1 |          6 |
|        2 |          2 |
|        2 |          4 |
|        2 |          5 |
+----------+------------+ 

The problem is that the products are selected in bulk, so if a user clicks select all (which they frequently do), they are selecting approximately 6000 products which equates to a large insert query.

They can also update and delete these lists based on a number of different criteria, such as which categories the products fall into, price points, etc.

Every time they want to update their list I have to retrieve the products they've selected, delete the products they have deselected, and then insert any new products.
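
For illustration, the retrieve/delete/insert cycle described above can be narrowed to touch only the rows that actually changed. This is a minimal sketch, not the asker's code: it uses Python's sqlite3 as a stand-in for MySQL, and the table name `user_products` and the helper `sync_selection` are assumptions made up for the example.

```python
import sqlite3

def sync_selection(conn, user_id, desired_ids):
    """Bring one user's rows in line with desired_ids, touching only the differences."""
    desired = set(desired_ids)
    current = {row[0] for row in conn.execute(
        "SELECT product_id FROM user_products WHERE user_id = ?", (user_id,))}
    to_delete = current - desired   # selected before, deselected now
    to_insert = desired - current   # newly selected
    with conn:  # both batches commit together or roll back together
        conn.executemany(
            "DELETE FROM user_products WHERE user_id = ? AND product_id = ?",
            [(user_id, pid) for pid in to_delete])
        conn.executemany(
            "INSERT INTO user_products (user_id, product_id) VALUES (?, ?)",
            [(user_id, pid) for pid in to_insert])
    return len(to_insert), len(to_delete)

# Hypothetical table mirroring the association table shown above.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE user_products (
    user_id INTEGER, product_id INTEGER,
    PRIMARY KEY (user_id, product_id))""")
conn.executemany("INSERT INTO user_products VALUES (?, ?)",
                 [(1, 1), (1, 4), (1, 6)])
conn.commit()

# User 1 drops product 4 and adds product 2.
added, removed = sync_selection(conn, 1, [1, 2, 6])
```

Only the differences are written, so a user who changes a handful of products out of 6000 causes a handful of writes. A "select all" on an empty list still inserts every row.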

The process seems cumbersome at best and I'd like to know if there is a better solution.

I considered storing only the products the user doesn't want, instead of the products they do want, thereby limiting the overhead of frequent large insert/update queries. This way, every user gets every available product by default.

The problem with that solution is that when new items arrive, the user may not want them in the report, so I would have to maintain a separate table that stipulates what the default items are.
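
A rough sketch of what the exclusion idea with a separate defaults table might look like, again using Python's sqlite3 in place of MySQL. The table names `default_products` and `user_exclusions` and the helper `report_products` are hypothetical. The sketch does not cover a user opting in to a product that is not a default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical: products every user gets unless they opt out.
    CREATE TABLE default_products (product_id INTEGER PRIMARY KEY);
    -- Hypothetical: one row per product a user does NOT want.
    CREATE TABLE user_exclusions (
        user_id INTEGER, product_id INTEGER,
        PRIMARY KEY (user_id, product_id));
""")
# Products 1-5 are defaults; a newly arrived product 6 is not.
conn.executemany("INSERT INTO default_products VALUES (?)",
                 [(p,) for p in range(1, 6)])
# User 1 has deselected products 2 and 4; user 2 has excluded nothing.
conn.executemany("INSERT INTO user_exclusions VALUES (?, ?)",
                 [(1, 2), (1, 4)])
conn.commit()

def report_products(conn, user_id):
    """Return the default products minus this user's exclusions."""
    rows = conn.execute(
        """SELECT d.product_id FROM default_products d
           WHERE NOT EXISTS (SELECT 1 FROM user_exclusions e
                             WHERE e.user_id = ?
                               AND e.product_id = d.product_id)
           ORDER BY d.product_id""", (user_id,))
    return [r[0] for r in rows]
```

With this layout, a user who wants nearly everything stores only a few exclusion rows, and "select all" amounts to deleting that user's exclusions.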

Many thanks to whoever can help me out.

Edit: Just for clarification, the users are not restricted to only selection criteria. They can also directly select products and groups of products. The users are unique in that they are all intimately familiar with the products (most know of almost all 6000 of the items).

A: 

Another possibility is to partition the users-products table. MySQL 5.1 added table partitioning support:

http://dev.mysql.com/doc/refman/5.1/en/partitioning.html

"Every time they want to update their list I have to retrieve the products they've selected, delete the products they have deselected, and then insert any new products."

I suspect that, over time, the actual data will end up scattered all over the storage space, because you never delete everything and then re-add it. The optimizer will then probably consider a full scan more efficient than random seeks all over the place through the indexes. I don't know this for sure, though.

Rob Olmos
+1  A: 

You might want to try storing the selection criteria instead of the products themselves. For instance, store "price < 10 and category = 'sports'" instead of storing the (possibly long) list of products that match those criteria. Then you can recreate the list by applying the selection criteria to the current list of products.

You'll have to figure out what syntax to use to store the criteria. Maybe SQL will work, maybe you'll want something else. Modifications can be tricky, so you'll need to enforce some simple logic to mitigate that, e.g. the criteria must be an OR of ANDs of simple field/value comparisons.

The trouble with this approach is that you need to restrict the users to certain selection criteria, which can go down quite a rathole (lots of users asking you to implement their own bespoke criteria) if you're not careful. I'm not sure I'd recommend this approach to everyone, but it is another option to consider.
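
One possible shape for the "OR of ANDs" idea, sketched in Python with sqlite3 standing in for MySQL. The criteria are kept as structured data (JSON here) and compiled to a parameterized WHERE clause, so no raw SQL is stored. The `price` and `category` columns, the whitelists, and the helper `compile_criteria` are all assumptions for the example.

```python
import json
import sqlite3

ALLOWED_FIELDS = {"price", "category", "product_id"}  # hypothetical columns
ALLOWED_OPS = {"=", "<", "<=", ">", ">="}

def compile_criteria(groups):
    """Turn an OR of ANDs of (field, op, value) triples into (where_sql, params)."""
    or_parts, params = [], []
    for group in groups:
        if not group:
            raise ValueError("empty AND group")
        and_parts = []
        for field, op, value in group:
            # Only whitelisted names reach the SQL text; values are bound.
            if field not in ALLOWED_FIELDS or op not in ALLOWED_OPS:
                raise ValueError(f"unsupported comparison: {field} {op}")
            and_parts.append(f"{field} {op} ?")
            params.append(value)
        or_parts.append("(" + " AND ".join(and_parts) + ")")
    return " OR ".join(or_parts), params

# (price < 10 AND category = 'sports') OR (product_id = 4)
criteria = [
    [("price", "<", 10), ("category", "=", "sports")],
    [("product_id", "=", 4)],
]
stored = json.dumps(criteria)  # what would be saved per user
where, params = compile_criteria(json.loads(stored))

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE products (
    product_id INTEGER PRIMARY KEY, product TEXT, price REAL, category TEXT)""")
conn.executemany("INSERT INTO products VALUES (?, ?, ?, ?)", [
    (1, "Widget A", 5.0, "sports"),
    (2, "Widget B", 15.0, "sports"),
    (3, "Widget C", 5.0, "garden"),
    (4, "Widget D", 50.0, "garden"),
])
matches = [r[0] for r in conn.execute(
    "SELECT product_id FROM products WHERE " + where +
    " ORDER BY product_id", params)]
```

Because the list is recomputed from the criteria each time, products added later that match the criteria appear in the result automatically, which may or may not be what the user wants.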

Keith Randall
That'd require dynamic SQL to apply the selection criteria, and it's risky in case products etc. get added that fall inside those limits. It also makes finding out whether a customer bought a specific product *extremely* painful...
OMG Ponies
I would consider this approach, but the users' selections aren't based only on criteria. They can also choose individual products by product ID.
Bill H
One of your criteria can be "productId = ". You'd just be betting on the user selecting only a few products that way, not all 6000.
Keith Randall
OMG, you're right in that if you need to look up which users selected a particular product then this won't work.
Keith Randall
+1 The general suggestion is good if you avoid storing SQL.
grossvogel
A: 

Could you add an extra REPORT_ON column to your association table? The rows in that table would then remain more or less static, and you'd just have to update individual rows and batches of rows when the user actively changed criteria.
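
A minimal sketch of how a flag column might behave, using Python's sqlite3 as a stand-in for MySQL. It assumes the association table holds one row per user/product pair up front (5000 users × 6000 products is the 30 million rows mentioned in the question), and the lowercase `report_on` column name and sample data are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE user_products (
    user_id INTEGER,
    product_id INTEGER,
    report_on INTEGER NOT NULL DEFAULT 0,  -- 1 = include in nightly report
    PRIMARY KEY (user_id, product_id))""")
# Rows exist for every pair; only the flag changes afterwards.
conn.executemany(
    "INSERT INTO user_products (user_id, product_id) VALUES (?, ?)",
    [(1, p) for p in range(1, 7)])

# "Select all" becomes a single UPDATE instead of ~6000 inserts.
conn.execute("UPDATE user_products SET report_on = 1 WHERE user_id = ?", (1,))

# Deselecting a batch is another single UPDATE.
deselected = [2, 5]
placeholders = ", ".join("?" for _ in deselected)
conn.execute(
    f"UPDATE user_products SET report_on = 0 "
    f"WHERE user_id = ? AND product_id IN ({placeholders})",
    [1, *deselected])
conn.commit()

report = [r[0] for r in conn.execute(
    "SELECT product_id FROM user_products "
    "WHERE user_id = ? AND report_on = 1 ORDER BY product_id", (1,))]
```

The trade-off in this sketch is that the table always sits at its maximum size, and a row has to be added for every user whenever a new product arrives.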

Brian Hooper
I'm sorry, I don't fully understand. Are you suggesting that I create a third column that designates whether to include the item or not?
Bill H
@Bill H: Yes, that is exactly what I suggest.
Brian Hooper