views:

56

answers:

3

I have a list of products {P1, P2, ...} each of which can have a list of attributes {a1, a2, ...}. What's the fastest algorithm to find all elements NOT having some attributes say {a2, a6, a10}?

If P1 = {a1, a2, a3} P2 = {a3} P3 = {a1, a4} , the algorithm should return {P2, P3}

The issue is I don't know the input list of attributes is as this is passed in by the user. The list of products and their associated attributes are stored in the database:

Product Table (Have more than 10000 rows)

ProductID int,
ProductName varchar

Attribute Table (Have about 400 rows, might grow in the future)

AttributeID int,
AttributeName  varchar

Product_Attribute_Association Table

ProductID int,
AttributeID int

The query I have is:

SELECT p.ProductID, p.ProductName
FROM Product p
WHERE p.ProductID NOT IN 
(SELECT pa.ProductID FROM Product_Attribute_Association pa
 WHERE pa.AttributeID NOT IN (1, 4, 5) -- What ever being passed in
) t

This service will get hit pretty hard and I am thinking about caching the data of the 3 tables in memory in some data structure and write an efficient algorithm to do the lookup. Can you please suggest something I should look into? Thanks

EDIT: Updating the database is not an issue. The cache will be rebuilt from the database every hour, so the time that builds the cache is less of a concern.

Memory is not a concern either.

A: 

It probably depends on how often you will be updating the database, if it's not too frequent you can:

For each attributeId, have a sorted list (or array) of productIds that have it. When a query arrives, take the products lists corresponding to that attributes, merge them, then merge that with the sorted list of productIds.

In your example, this looks like:

  • a1 -> {p1, p3}
  • a2 -> {p1}
  • a3 -> {p1, p2}
  • a4 -> {p3}
  • ...
Ssancho
Updating the database is not an issue. The cache will be updated every hour, so the time that builds the cache is less of a concern.
David
A: 

Here's naive solution:

  • For every attribute, place each of products who owns this attribute into hashtable with attribute used as a key;
  • When user's input arrives, initialize result with all existing products, then iterate over attributes and check if attribute is exists in hashtable, if it is, remove all products associated with this attribute from result;
  • What you have when iteration is finished is your result.
Victor Sorokin
A: 

You could implement the "cache" directly in your product table:

  • Create a binary field "AttributeCache", where each bit represents an attribute
  • Perform a query which bitwise evaluates the cache field

    select ProductID from Product where AttributeCache & :attributeMask = 0

To search for {a2, a6, a10} attributeMask would obviously be (filled up to 16 attributes): 0100010001000000

If your database allows to do so, you can also create an index for the AttributeCache field to avoild full tables scans.

nang