ansaurus

Question

SQL to calculate the Tanimoto Coefficient of several vectors

Answer 1

+2 A:

Why aren't you simply fetching all recipes into memory and then computing Tanimoto coefficients in memory?

It's simpler and it's much, much faster.

S.Lott 2010-01-02 18:00:57

That was my first idea, but how would you implement it? Looping through all recipes' ingredients and incrementing the counters for each ingredient and combination found? I have > 60k items in the database so even that would take some time.

jbochi 2010-01-02 18:15:04

Facepalm! This approach turned out to be much faster than I imagined. It took only 4s to compute all coefficients. Thanks.

jbochi 2010-01-02 19:25:09

Generally, that's my experience. People write too much SQL.

S.Lott 2010-01-02 20:56:17

Answer 2

A:

I think this will cut you down to 2 selects per pair for the intersection, and 4 queries per pair total. You can't get away from O(N^2) since you are trying all pairs -- N*(N-1)/2 is simply how many pairs there are.

def n_recipes_intersection(self, ingredient_a, ingredient_b):
  cursor = self._cur
  cursor.execute('''
    select count(recipe_id)
      from recipe_ingredient as A 
        join recipe_ingredient as B using (recipe_id)
      where A.ingredient_id = ? 
        and B.ingredient_id = ?;
      ''', (ingredient_a, ingredient_b))
  return cursor.fetchone()[0]

Shane Holloway 2010-01-02 18:15:37

Answer 3

+1 A:

If you have 1000 ingredients, 1000 queries will suffice to map each ingredient to a set of recipes in memory. If (say) an ingredient is typically part of about 100 recipes, each set will take a few KB, so the whole dictionary will take just a few MB -- absolutely no problem to hold the whole thing in memory (and still not a serious memory problem if the average number of recipes per ingredient grows by an order of magnitude).

result = dict()
for ing_id in all_ingredient_ids:
    cursor.execute('''select recipe_id from recipe_ingredient
        where ingredient_id = ?''', (ing_id,))
    result[ing_id] = set(r[0] for r in cursor.fetchall())
return result

After those 1000 queries, every one of the needed 500,000 computations of pairwise Tanimoto coefficients is then obviously done in-memory -- you can precompute the squares of the lengths of the various sets as a further speedup (and park them in another dict), and the key "A dotproduct B" component for each pair is of course the length of the sets' intersection.

Alex Martelli 2010-01-02 18:26:16

Thanks Alex! +1 for your good advice, but I managed to do the whole computation in memory fetching all data at once. The whole thing took less than 4s.

jbochi 2010-01-02 19:32:58

Answer 4

+1 A:

If anybody is interested, this is the code that I came up with after Alex's and S.Lotts's suggestions. Thank you guys.

def __init__(self):
    self._connection=sqlite.connect('database.db')
    self._counts = None
    self._intersections = {}

def inc_intersections(self, ingredients):
    ingredients.sort()
    lenght = len(ingredients)
    for i in xrange(1, lenght):
        a = ingredients[i]
        for j in xrange(0, i):
            b = ingredients[j]
            if a not in self._intersections:
                self._intersections[a] = {b: 1}
            elif b not in self._intersections[a]:
                self._intersections[a][b] = 1
            else:
                self._intersections[a][b] += 1


def precompute_tanimoto(self):
    counts = {}
    self._intersections = {}

    cursor = self._connection.cursor()
    cursor.execute('''select recipe_id, ingredient_id
        from recipe_ingredient
        order by recipe_id, ingredient_id''')
    rows = cursor.fetchall()            

    print len(rows)

    last_recipe = None
    for recipe, ingredient in rows:
        if recipe != last_recipe:
            if last_recipe != None:
                self.inc_intersections(ingredients)
            last_recipe = recipe
            ingredients = [ingredient]
        else:
            ingredients.append(ingredient)

        if ingredient not in counts:
            counts[ingredient] = 1
        else:
            counts[ingredient] += 1

    self.inc_intersections(ingredients)

    self._counts = counts

def tanimoto(self, ingredient_a, ingredient_b):
    if self._counts == None:
        self.precompute_tanimoto()

    if ingredient_b > ingredient_a:
        ingredient_b, ingredient_a = ingredient_a, ingredient_b

    n_a, n_b = self._counts[ingredient_a], self._counts[ingredient_b]
    n_ab = self._intersections[ingredient_a][ingredient_b]

    print n_a, n_b, n_ab

    return float(n_ab) / (n_a + n_b - n_ab)

jbochi 2010-01-02 19:37:09

ansaurus

tags:

views:

answers:

SQL to calculate the Tanimoto Coefficient of several vectors

related questions