views:

139

answers:

1

This is going to be a "long one". I'm including as much code and explanation as possible ... I'm not above throwing out code if necessary.

I'm trying to implement a logical parser in a django query system. Where users can provide complex queries against tags that are applied to samples. This is essentially part of a scientific sample repository where users can apply defined tags (tissue-type, disease studied, etc.). They can then create persistent "baskets" of samples defined by a logical query on these tags.

#models.py

class Sample(models.Model):
    name = models.CharField(max_length = 255)


class Tag(models.Model):
    name = models.CharField(max_length = 255)
    samples = models.ManyToManyField(Sample)

A quick example:
#example data:
Sample1 has TagA, TagB, TagC
Sample2 has       TagB, TagC, TagD
Sample3 has TagA,       TagC, TagD
Sample4 has       TagB

#example query:
'TagB AND TagC AND NOT TagD'

would return Sample1. I use a crazy string-eval hack to create a set of Q() objects:

def MakeQObject(expression):
    """
    Takes an expression and uses a crazy string-eval hack to make the qobjects.
    """
    log_set = {'AND':'&','OR':'|','NOT':'~'}

    exp_f = []
    parts = expression.split()
    #if there is a ) or ( then we can't use this shortcut
    if '(' in parts or ')' in parts:
        return None

    for exp in parts:
        if exp in log_set:
            exp_f.append(log_set[exp])
        else:
            exp_f.append("Q(tags__name__iexact = '%s')" % exp)
    st = ' '.join(exp_f)
    qobj = eval(st)
    return qobj

However, this fails on anything that needs a complicated order of operations or grouping by (). Given the same example data the query: (TagA OR TagB) AND NOT TagD should return Sample1, Sample4 but does not. I've implemented a "one-at-a-time" function which can take a single Sample object and perform the query. However, in my actual database I have ~40,000 Samples and ~400 tags (about ~7 per Sample) and the iterative technique takes ~4 minutes to complete on all Samples. So I calculate the baskets nightly and then just freeze them during the day. I worry that as I start curating more baskets, samples and tags this will scale poorly.

Any suggestions?

+1  A: 

First, to improve performance it will probably help to add an index on tag name field since you are using it for queries. So, add db_index=True to your column:

class Tag(models.Model):
    name = models.CharField(max_length = 255, db_index=True)
    samples = models.ManyToManyField(Sample)

Second, for parsing the user queries I would recommend using one of several good Python-based parsers such as PyParsing or PLY. These might seem intimidating at first but really aren't that hard, especially with a simple grammar such as yours.

If those are too much for you, then try rolling your own using Fredrik's guide Simple Top-Down Parsing in Python.

Van Gale
Additional note: you need to reset your database for db_index to be seen by syncdb. If you can't reset then you need to manually add the index yourself to the database or use a migration tool such as South (http://south.aeracode.org/).
Van Gale
Thanks Van ... I think I can probably use their example for simple math expressions "http://pyparsing.wikispaces.com/file/view/fourFn.py". Just replace the `math.operations` with queryset operations and then let it take care of the nesting, etc.
JudoWill