views:

141

answers:

3

I have a set of Django models as shown in the following diagram (the names of the reverse relationships are shown in the yellow bubbles):

Django models

In each relationship, a Person may have 0 or more of the items.

Additionally, the slug field is (unfortunately) not unique; multiple Person records may have the same slug fields. Essentially these records are duplicates.

I want to obtain a list of all records that meet the following criteria: All duplicate records (that is, having the same slug) with at least one Entry OR at least one Audio OR at least one Episode OR at least one Article.

So far, I have the following query:

Person.objects.values('slug').annotate(num_records=Count('slug')).filter(num_records__gt=1)

This groups all records by slug, then adds a num_records attribute that says how many records have that slug, but the additional filtering is not performed (and I don't even know if this would work right anyway, since, given a set of duplicate records, one may have, e.g., and Entry and the other may have an Article).

In a nutshell, I want to find all duplicate records and collapse them, along with their associated models, into one record.

What's the best way to do this with Django?

A: 

Did you look into Django aggregration for 'group by' like behaviour?

http://docs.djangoproject.com/en/dev/topics/db/aggregation/

Klaas van Schelven
Yes; the line of code I gave basically groups the records, it just doesn't do the desired filtering.
mipadi
A: 

I'm not sure that chaining filters is going to get you where you want to go, since there will be Person entries that have two or more types of artifacts with their name on it. Take a look at a previous StackOverflow question and its answer, which I think will help you combine four queries into a single QuerySet in the way you want:

http://stackoverflow.com/questions/431628/how-to-combine-2-or-more-querysets-in-a-django-view

stw_dev
A: 

I would do this in several queries. The first is your list of duplicates, that you have:

dupes = [p['slug'] for p in Person.objects.values('slug').annotate(num_records=Count('slug')).filter(num_records__gt=1)]

I would then loop through these, and for each one decide on which to keep (make an arbitrary decision - pick the first one). Then, for all the other primary keys, just update all the other objects to point to the primary key you have selected:

for slug in dupes:
    pks = [p.id for p in Person.objects.filter(slug=slug)]
    for pk in pks[1:]:
        Audio.objects.filter(person=pk).update(person=pks[0])
        Author.objects.filter(person=pk).update(person=pks[0])
        Episode.objects.filter(person=pk).update(person=pks[0])
        Entry.objects.filter(person=pk).update(person=pks[0])
spookylukey
Accepting this as the answer. I solved the problem a couple days ago, but ended up doing roughly what you suggested.
mipadi