views: 274
answers: 4

Hi,

I think it's easier to explain my problem with an example.

I have one table with ingredients for recipes, and I have implemented a function to calculate the Tanimoto coefficient between ingredients. It's fast enough to calculate the coefficient between two ingredients (3 SQL queries needed), but it does not scale well. To calculate the coefficient between all possible ingredient combinations it needs N + N*(N-1)/2 queries, or 500,500 queries for just 1,000 ingredients. Is there a faster way to do that? Here's what I got so far:

import sqlite3 as sqlite

class Filtering():
  def __init__(self):
    self._connection = sqlite.connect('database.db')

  def n_recipes(self, ingredient_id):
    # Number of recipes that contain the given ingredient.
    cursor = self._connection.cursor()
    cursor.execute('''select count(recipe_id) from recipe_ingredient
        where ingredient_id = ?''', (ingredient_id,))
    return cursor.fetchone()[0]

  def n_recipes_intersection(self, ingredient_a, ingredient_b):
    # Number of recipes that contain both ingredients.
    cursor = self._connection.cursor()
    cursor.execute('''select count(recipe_id) from recipe_ingredient where
        ingredient_id = ? and recipe_id in (
        select recipe_id from recipe_ingredient
        where ingredient_id = ?)''', (ingredient_a, ingredient_b))
    return cursor.fetchone()[0]

  def tanimoto(self, ingredient_a, ingredient_b):
    n_a, n_b = map(self.n_recipes, (ingredient_a, ingredient_b))
    n_ab = self.n_recipes_intersection(ingredient_a, ingredient_b)
    return float(n_ab) / (n_a + n_b - n_ab)
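
For reference, the query-count arithmetic above made explicit (a small throwaway helper, not part of the class):

def query_count(n):
    # N count queries (one per ingredient) plus one intersection
    # query per unordered pair.
    return n + n * (n - 1) // 2

print(query_count(1000))  # 500500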
+2  A: 

Why aren't you simply fetching all recipes into memory and then computing Tanimoto coefficients in memory?

It's simpler and it's much, much faster.
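
A minimal sketch of that idea, assuming the same recipe_ingredient table as in the question (load_ingredient_sets and tanimoto_from_sets are made-up names):

from collections import defaultdict

def load_ingredient_sets(connection):
    # One query over the whole table: map each ingredient to the set of
    # recipes it appears in.
    sets = defaultdict(set)
    cursor = connection.cursor()
    cursor.execute('select recipe_id, ingredient_id from recipe_ingredient')
    for recipe_id, ingredient_id in cursor.fetchall():
        sets[ingredient_id].add(recipe_id)
    return sets

def tanimoto_from_sets(sets, a, b):
    # Same formula as in the question, but on in-memory sets.
    n_ab = len(sets[a] & sets[b])
    return float(n_ab) / (len(sets[a]) + len(sets[b]) - n_ab)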

S.Lott
That was my first idea, but how would you implement it? Looping through all recipes' ingredients and incrementing the counters for each ingredient and combination found? I have > 60k items in the database so even that would take some time.
jbochi
Facepalm! This approach turned out to be much faster than I imagined. It took only 4s to compute all coefficients. Thanks.
jbochi
Generally, that's my experience. People write too much SQL.
S.Lott
A: 

I think this will cut you down to 2 selects per pair for the intersection, and 4 queries per pair total. You can't get away from O(N^2) since you are trying all pairs -- N*(N-1)/2 is simply how many pairs there are.

def n_recipes_intersection(self, ingredient_a, ingredient_b):
  # Count recipes containing both ingredients with a single self-join
  # instead of a nested subquery.
  cursor = self._connection.cursor()
  cursor.execute('''
    select count(recipe_id)
      from recipe_ingredient as A
        join recipe_ingredient as B using (recipe_id)
      where A.ingredient_id = ?
        and B.ingredient_id = ?;
      ''', (ingredient_a, ingredient_b))
  return cursor.fetchone()[0]
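
For completeness, a sketch of how this might be driven over every pair -- itertools.combinations enumerates exactly the N*(N-1)/2 pairs mentioned above; the distinct-ids query is an assumption about the schema:

import itertools

def all_pairs(filtering, cursor):
    # Fetch the ingredient ids once, then try every unordered pair.
    cursor.execute('select distinct ingredient_id from recipe_ingredient')
    ids = [row[0] for row in cursor.fetchall()]
    for a, b in itertools.combinations(ids, 2):
        yield a, b, filtering.tanimoto(a, b)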
Shane Holloway
+1  A: 

If you have 1000 ingredients, 1000 queries will suffice to map each ingredient to a set of recipes in memory. If (say) an ingredient is typically part of about 100 recipes, each set will take a few KB, so the whole dictionary will take just a few MB -- absolutely no problem to hold the whole thing in memory (and still not a serious memory problem if the average number of recipes per ingredient grows by an order of magnitude).

def fetch_recipe_sets(cursor, all_ingredient_ids):
    # One query per ingredient: ingredient_id -> set of recipe_ids.
    result = dict()
    for ing_id in all_ingredient_ids:
        cursor.execute('''select recipe_id from recipe_ingredient
            where ingredient_id = ?''', (ing_id,))
        result[ing_id] = set(r[0] for r in cursor.fetchall())
    return result

After those 1000 queries, every one of the roughly 500,000 needed pairwise Tanimoto coefficients is then computed in memory -- you can precompute the squared lengths of the various sets as a further speedup and park them in another dict (for these 0/1 membership vectors the squared norm is simply the set's size), and the key "A dot product B" component for each pair is of course the length of the sets' intersection.
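
A sketch of that in-memory pairwise step (recipe_sets is the mapping built above; the names are illustrative):

import itertools

def pairwise_tanimoto(recipe_sets):
    # recipe_sets maps ingredient_id -> set of recipe_ids.
    sizes = dict((ing, len(recipes)) for ing, recipes in recipe_sets.items())
    coefficients = {}
    for a, b in itertools.combinations(recipe_sets, 2):
        n_ab = len(recipe_sets[a] & recipe_sets[b])
        denominator = sizes[a] + sizes[b] - n_ab
        if denominator:
            coefficients[(a, b)] = float(n_ab) / denominator
    return coefficients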

Alex Martelli
Thanks Alex! +1 for your good advice, but I managed to do the whole computation in memory fetching all data at once. The whole thing took less than 4s.
jbochi
+1  A: 

If anybody is interested, this is the code that I came up with after Alex's and S.Lott's suggestions. Thank you guys.

def __init__(self):
    self._connection = sqlite.connect('database.db')
    self._counts = None
    self._intersections = {}

def inc_intersections(self, ingredients):
    # For every pair of ingredients that appears together in one recipe,
    # increment that pair's co-occurrence counter.
    ingredients.sort()
    length = len(ingredients)
    for i in xrange(1, length):
        a = ingredients[i]
        for j in xrange(0, i):
            b = ingredients[j]
            if a not in self._intersections:
                self._intersections[a] = {b: 1}
            elif b not in self._intersections[a]:
                self._intersections[a][b] = 1
            else:
                self._intersections[a][b] += 1


def precompute_tanimoto(self):
    # One pass over the whole table: per-ingredient recipe counts and
    # pairwise co-occurrence counts, all kept in memory.
    counts = {}
    self._intersections = {}

    cursor = self._connection.cursor()
    cursor.execute('''select recipe_id, ingredient_id
        from recipe_ingredient
        order by recipe_id, ingredient_id''')
    rows = cursor.fetchall()            

    print len(rows)

    last_recipe = None
    for recipe, ingredient in rows:
        if recipe != last_recipe:
            if last_recipe is not None:
                self.inc_intersections(ingredients)
            last_recipe = recipe
            ingredients = [ingredient]
        else:
            ingredients.append(ingredient)

        if ingredient not in counts:
            counts[ingredient] = 1
        else:
            counts[ingredient] += 1

    self.inc_intersections(ingredients)

    self._counts = counts

def tanimoto(self, ingredient_a, ingredient_b):
    if self._counts is None:
        self.precompute_tanimoto()

    if ingredient_b > ingredient_a:
        ingredient_b, ingredient_a = ingredient_a, ingredient_b

    n_a, n_b = self._counts[ingredient_a], self._counts[ingredient_b]
    n_ab = self._intersections[ingredient_a][ingredient_b]

    print n_a, n_b, n_ab

    return float(n_ab) / (n_a + n_b - n_ab)
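
For completeness, a usage sketch (the ingredient ids are made up; the first call triggers precompute_tanimoto):

f = Filtering()
print(f.tanimoto(3, 7))  # any two ingredient ids from the database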
jbochi