+2  A: 

Do you really need to precompute all the possible pairs? What if you were to do it lazily, i.e. on demand?

This can be represented as a 2D matrix. The rows correspond to the customers and the columns correspond to the products.

Each entry is either 0 or 1, saying whether the product corresponding to the column was bought by the customer corresponding to the row.

If you look at each column as a vector of (around 5000) 0s and 1s, then the number of times two products were bought together is just the dot product of the corresponding vectors!

Thus you can just compute those vectors first, and then compute the dot product lazily, on demand.

To compute the dot-product:

Now, a good representation of a vector with only 0s and 1s is an array of integers, which is basically a bitmap.

For 5000 entries, you will need an array of 79 64-bit integers.

So given two such arrays, you need to count the number of 1s that are common.

To count the number of bits common to two integers, you can first do a bitwise AND and then count the number of 1 bits set in the result.

For this you can either use lookup tables or one of the bit-counting tricks (not sure whether Python supports them directly) described here: http://graphics.stanford.edu/~seander/bithacks.html
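
For example, in Python you can AND two words of the bitmaps and count the 1 bits in the result with plain integer operations (Python 3.10+ also offers int.bit_count()); the values below are just an illustration:

a = 0b101101                     # one 64-bit word of product A's bitmap
b = 0b100110                     # the matching word of product B's bitmap
common = bin(a & b).count('1')   # popcount of the AND -> 2 customers in common here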

So your algorithm will be something like this:

  • Initialize an array of 79 64-bit integers for each product.

  • For each customer, look at the products bought and set that customer's bit in each corresponding product's array.

  • Now, given a query of two products for which you need to know the number of customers who bought them together, just take the dot product as described above.

This should be reasonably fast.
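
A rough sketch of this in Python (the names here, and the use of a single arbitrary-precision int per product instead of an array of 79 words, are my own simplifications):

from collections import defaultdict

# One bitmap per product; a Python int stands in for the array of 64-bit words.
product_bits = defaultdict(int)

# Hypothetical input: (customer_index, product) pairs, customer indexes 0..M-1.
purchases = [(0, 'hammer'), (0, 'nails'), (1, 'hammer'), (1, 'nails'), (2, 'screws')]

for cust, prod in purchases:
    product_bits[prod] |= 1 << cust          # set this customer's bit for the product

def bought_together(p1, p2):
    # dot product of the two bit vectors = popcount of the bitwise AND
    return bin(product_bits[p1] & product_bits[p2]).count('1')

print(bought_together('hammer', 'nails'))    # -> 2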

As a further optimization, you can possibly consider grouping the customers.

Moron
I read "groping the customers" there for a second...
Tim Pietzcker
@Tim: Hehe.....
Moron
I can't believe I didn't think to use the dot-product myself, especially coming off of another project involving cosine similarity! Thanks. Enjoy your upvote.
Neil Kodner
However, if I create a 2D matrix, won't the O(N²) cost be high because I'm comparing every item against every other item? One of the reasons I'm looking at using Jaccard/Tanimoto is that it saves me from having to compare non-related items.
Neil Kodner
@Neil: You are creating a vector for each product. Initializing a vector is O(M) (M is number of customers), but that is really fast for an in-memory bitmap, where you can zero out in blocks. Once you have the initialization done, the processing cost is O(S) where S is the number of 1s, and then O(M) for each query (given two products). Your problem is basically determining the size of set intersection, so depending on the sparsity, it might be better to represent your sets using dicts to represent the set of customers who bought a product. For ~5K customers it really might not matter.
Moron
...continuing... Dicts have an overhead of computing hash keys etc., which bitmaps don't have. So the choice really depends on your data, like sparsity etc. Of course, dicts are easier to code up, I suppose. Btw, a bitmap/vector is just another form of hashing, just like dicts are.
Moron
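
A tiny illustration of the set-based alternative mentioned above (the names are hypothetical):

buyers = {'hammer': {'c1', 'c2'}, 'nails': {'c1', 'c2', 'c4'}}   # product -> set of customer ids
together = len(buyers['hammer'] & buyers['nails'])               # size of the intersection -> 2
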
Moron, so I exaggerated a little bit for simplicity's sake - how about tens (maybe hundreds) of thousands of customers and tens of thousands of items?
Neil Kodner
@Neil. Then you could consider a combination of dicts and bitmaps (the grouping of customers I was talking about in my answer). You group customers into groups G1, G2... and then maintain a dict for each product, keyed on the group. The values stored will be the bitmaps/vectors for the groups in the dict. Using only dicts corresponds to each customer being in his own group. Using just bitmap corresponds to all customers in one group. If you group the customers based on similar tastes, you might get good results. You could even do clever stuff with overlaps between groups etc...
Moron
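
As a rough, untested sketch of that grouping idea (the group assignment and all names here are hypothetical placeholders):

from collections import defaultdict

NUM_GROUPS = 8                                  # hypothetical number of customer groups

def group_of(cust):
    return cust % NUM_GROUPS                    # placeholder; ideally cluster by similar tastes

# product -> {group id -> bitmap of that group's customers who bought it}
product_groups = defaultdict(lambda: defaultdict(int))

def record_purchase(cust, prod):
    product_groups[prod][group_of(cust)] |= 1 << (cust // NUM_GROUPS)

def bought_together(p1, p2):
    g1, g2 = product_groups[p1], product_groups[p2]
    # only groups present for both products can contribute to the count
    return sum(bin(g1[g] & g2[g]).count('1') for g in set(g1) & set(g2))
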
+2  A: 
events = """\
1-hammer 
1-screwdriver 
1-nails 
2-hammer 
2-nails 
3-screws 
3-screwdriver 
4-nails 
4-screws""".splitlines()
events = sorted(map(str.strip,e.split('-')) for e in events)

from collections import defaultdict
from itertools import groupby

# tally each occurrence of each pair of items
summary = defaultdict(int)
for val,items in groupby(events, key=lambda x:x[0]):
    items = sorted(it[1] for it in items)
    for i,item1 in enumerate(items):
        for item2 in items[i+1:]:
            summary[(item1,item2)] += 1
            summary[(item2,item1)] += 1

# now convert raw pair counts into friendlier lookup table
pairmap = defaultdict(dict)
for k,v in summary.items():
    item1, item2 = k
    pairmap[item1][item2] = v

# print the results    
for k,v in sorted(pairmap.items()):
    print k,':',v

Gives:

hammer : {'nails': 2, 'screwdriver': 1}
nails : {'screws': 1, 'hammer': 2, 'screwdriver': 1}
screwdriver : {'screws': 1, 'nails': 1, 'hammer': 1}
screws : {'nails': 1, 'screwdriver': 1}
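
Counts can then be read back with ordinary dict lookups, e.g.:

pairmap['hammer'].get('nails', 0)       # -> 2
pairmap['hammer'].get('plunger', 0)     # -> 0, never bought together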

(This addresses your initial request, grouping items by purchase event. To group by user, just change the first key of the events list from event number to user id.)

Paul McGuire
you had me at "events = sorted(map(str.strip,e.split('-')) for e in events)"
Neil Kodner
(that was a compliment)
Neil Kodner
@Neil - thanks! For Python 3, it will be `events = sorted(tuple(map(str.strip,e.split('-'))) for e in events)`.
Paul McGuire
+1  A: 

Paul's answer might be the best, but here's what I came up with over my lunch break (untested, admittedly, but still a fun exercise in thinking). I'm not sure how fast or well-optimized my algorithm is. I'd personally suggest looking at something like MongoDB, a NoSQL database, since it seems it may lend itself nicely to solving this kind of problem (what with map/reduce and all).

# assuming events is a sequence of (cust_id, item) pairs, as in the other answers...
user = {}
for (cust_id, item) in events:
    if cust_id not in user:
        user[cust_id] = set()
    user[cust_id].add(item)
# now we have a dictionary of cust_ids keyed to a set of every item
# they've ever bought (given that repeats don't matter)
# now we construct a dict of items keyed to a dictionary of other items
# which are in turn keyed to num times present
items = {}
def insertOrIter(d, k, v):
    if k in d:
        d[k] += v
    else:
        d[k] = v
for key in user:
    # keep track of items bought with each other
    itemsbyuser = []
    for item in user[key]:
        # make sure the item with dict is set up
        if not item in items:
            items[item] = {}
        # as we see each item, add to it others and others to it
        for other in itemsbyuser:
            insertOrIter(items[other], item, 1)
            insertOrIter(items[item], other, 1)
        itemsbyuser.append(item)
# now, unless i've screwed up my logic, we have a dictionary of items keyed
# to a dictionary of other items keyed to how many times they've been
# bought with the first item. *whew* 
# If you want something more (potentially) useful, we just turn that around to be a
# dictionary of items keyed to a list of tuples of (times seen, other item) and
# you're good to go.
useful = {}
for i in items:
    temp = []
    for other in items[i]:
        temp.append((items[i][other], other))
    useful[i] = sorted(temp, reverse=True)
# Now you should have a dictionary of items keyed to tuples of
# (number times bought with item, other item) sorted in descending order of
# number of times bought together
nearlymonolith
Embrace the defaultdict! Never again check on the existence of a dict key before updating its value - just access the key and let the defaultdict initialize it if it isn't there. The easiest code to maintain is the code that doesn't exist.
Paul McGuire
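
For illustration, the tallying section of the answer above could be rewritten with defaultdict along these lines (an untested sketch of mine; it assumes the user dict built earlier in that answer):

from collections import defaultdict

# items[item1][item2] defaults to 0, so no existence checks or insertOrIter helper are needed
items = defaultdict(lambda: defaultdict(int))
for bought in user.values():
    bought = list(bought)
    for i, item1 in enumerate(bought):
        for item2 in bought[i+1:]:
            items[item1][item2] += 1
            items[item2][item1] += 1
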
+1  A: 

It seems rather strange that, every time you want the stats, all the solutions above churn through the entire data set to get the counts.

I would suggest keeping the data flat and indexed, and computing results only for a specific item, one at a time. If your item count is large, this will be more efficient.

from collections import defaultdict
from itertools import groupby

class myDB:
    '''Example of "indexed" "database" of orders <-> items on order'''
    def __init__(self):
        self.id_based_index = defaultdict(set) 
        self.item_based_index = defaultdict(set)

    def add(self, order_data):
        for id, item in order_data:
            self.id_based_index[id].add(item)
            self.item_based_index[item].add(id)

    def get_compliments(self, item):
        all_items = []
        for id in self.item_based_index[item]:
            all_items.extend(self.id_based_index[id])
        gi = groupby(sorted(all_items), lambda x: x)
        return dict([(k, len(list(g))) for k, g in gi])

Example of using it:

events = """1-hammer 
    1-screwdriver 
    1-nails 
    2-hammer 
    2-nails 
    3-screws 
    3-screwdriver 
    4-nails 
    4-screws"""

db = myDB()
db.add(
    [ map(str.strip,e.split('-')) for e in events.splitlines() ]
    )
# the indexes are updated incrementally
db.add([['5','plunger'],['5','beer']])

# this scans and counts only needed items
assert db.get_compliments('NotToBeFound') == {}
assert db.get_compliments('hammer') == {'nails': 2, 'hammer': 2, 'screwdriver': 1}
# you get back the count for the requested product as well. Discard if not needed.

This is all fun, but, seriously, just go for real database storage. Because indexing is already built into any DB engine, all of the above code in SQL would just be:

select
    p_others.product_name,
    count(1) cnt
from products p
join order_product_map opm
    on p.product_id = opm.product_id
join order_product_map opm_others
    on opm.order_id = opm_others.order_id
join products p_others
    on opm_others.product_id = p_others.product_id
where p.product_name in ('hammer')
group by p_others.product_name
ddotsenko
The solution I proposed can be used in the caching layer, which will not require any DB for different queries. In fact, if updates occur you can update the structure (very trivial to do) and then do a delayed write to the DB or whatever. Besides, the OP asked for a data structure, and he got it (and not just in my answer, but the others too). Going to the database each time does not scale (especially if you have multiple joins in your queries!). Caching becomes a must. Also, I presume if the OP wanted a SQL query, he would have asked for it.
Moron
Agreed - the solutions posted weren't suggesting that this elaborate process be done every time, but rather that doing said process would get you the data you want (which could then be stored, indexed, put in a database, whatever the case may be). It's a server side operation running intermittently, and pages would just access the data assuming it were already computed. As to the SQL, that's why I suggested MongoDB - I had a similar thought that this code would be nicely accomplished natively on the DB. That's a nice bit of SQL, though (never thought I'd say that).
nearlymonolith