For a website implemented in Django/Python we have the following requirement:

On a view page, 15 messages are shown per web page. When two or more consecutive messages on the view come from the same source, they should be grouped together.

Maybe that is not clear, but the following example (with 5 messages per page this time) should make it clearer:

  Message1 Source1
  Message2 Source2
  Message3 Source2
  Message4 Source1
  Message5 Source3
  ...

This should be shown as:

Message1 Source1
Message2 Source2 (click here to see 1 more message from Source2)
Message4 Source1
Message5 Source3
Message6 Source2

So on each page a fixed number of items is shown, where some of them have been grouped together.

We are wondering how we can create a Django or MySQL query to fetch this data in an optimal and easy way. Note that paging is used and that the messages are sorted by time.

PS: I don't think there is a simple solution for this due to the nature of SQL, but sometimes complex problems can be solved easily.
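To make the intended grouping concrete, ignoring pagination for now, here is a plain-Python sketch using itertools.groupby; the (message, source) tuples are invented for illustration:

```python
from itertools import groupby

# Invented (message, source) pairs, already sorted by time.
messages = [
    ("Message1", "Source1"),
    ("Message2", "Source2"),
    ("Message3", "Source2"),
    ("Message4", "Source1"),
    ("Message5", "Source3"),
    ("Message6", "Source2"),
]

# Collapse each run of consecutive messages from the same source into a
# single entry plus a count of the hidden extras.
grouped = []
for source, run in groupby(messages, key=lambda m: m[1]):
    run = list(run)
    grouped.append((run[0], len(run) - 1))  # (first message, hidden count)

# grouped now holds 5 visible entries; the second hides 1 extra message.
```

The hard part, as noted above, is doing this inside SQL while still paginating on the number of *visible* rows.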

+1  A: 

I have a simple, though not perfect, template-only solution for this. In the template you can regroup the records using the regroup template tag. After regrouping you can hide successive records from the same source:

{% regroup records by source as grouped_records %}
{% for group in grouped_records %}
  {% for item in group.list %}
    <li{% if not forloop.first %} style="display:none"{% endif %}>
       {{ item.message }} {{ item.source }}
       {% if forloop.first %}
         {% ifnotequal group.list|length 1 %}
           <a href="#" onclick="...">Show more from the same source...</a>
         {% endifnotequal %}           
       {% endif %}
    </li>
  {% endfor %}
{% endfor %}

This would be perfect if it wasn't for one thing: pagination. If you mean to display 15 items per page, and on one page the first five are from one source, the next five from another, and the last five from yet another, there would be only three visible items on the page.

Guðmundur H
+3  A: 

I don't see any great way to do what you're trying to do directly. If you're willing to accept a little de-normalization, I would recommend a pre-save signal to mark messages as being at the head.

#In your model
head = models.BooleanField(default=True)

#As a signal plugin:
def check_head(sender, **kwargs):
    message = kwargs['instance']
    if getattr(message, 'no_check_head', False):
        return
    # Guard with [:1] slices: the querysets are empty for the very
    # first and very last message, and plain [0] would raise IndexError.
    previous = Message.objects.filter(time__lt=message.time).order_by('-time')[:1]
    if previous and message.source == previous[0].source:
        message.head = False
    following = Message.objects.filter(time__gt=message.time).order_by('time')[:1]
    if following and message.source == following[0].source:
        next_message = following[0]
        next_message.head = False
        next_message.no_check_head = True  # avoid re-triggering this check
        next_message.save()

Then your query becomes magically simple:

messages = Message.objects.filter(head=True).order_by('time')[0:15]

To be quite honest, the signal listener would have to be a bit more complicated than the one I wrote. There are a host of synchronization/lost-update problems inherent in my approach, the solutions to which will vary depending on your server (if it is single-process, multi-threaded, then a Python Lock object should get you by, but if it is multi-process, then you will really need to implement locking based on files or database objects). Also, you will certainly have to write a corresponding delete signal listener.

Obviously this solution involves adding some database hits, but they are on edit as opposed to on view, which might be worthwhile for you. Otherwise, perhaps consider a cruder approach: grab 30 stories, loop through them in the view, knock out the ones you won't display, and if you have 15 left, display them; otherwise repeat. Definitely an awful worst-case scenario, but perhaps not a terrible average case?
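That cruder approach can be sketched as follows; `fetch` is a hypothetical callable standing in for the ordered queryset, and the doubling batch size is my own guess at how to "repeat" without refetching one row at a time:

```python
def fetch_page(fetch, per_page=15, batch=30):
    """Over-fetch, drop consecutive repeats by source, and widen the
    window until enough visible rows remain.  fetch(n) is assumed to
    return the first n (message, source) pairs ordered by time."""
    n = batch
    while True:
        rows = fetch(n)
        heads, prev = [], object()  # sentinel: matches no source
        for msg, src in rows:
            if src != prev:
                heads.append((msg, src))
            prev = src
        # Done if we have a full page, or the data is exhausted.
        if len(heads) >= per_page or len(rows) < n:
            return heads[:per_page]
        n *= 2
```

The worst case (one long run of identical sources) still degenerates to scanning the whole table, which is exactly the awful scenario described above.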

If you had a server configuration that used a single process that's multi-threaded, a Lock or RLock should do the trick. Here's a possible implementation with a non-reentrant lock:

import thread  # Python 2; the equivalent module in Python 3 is _thread
lock = thread.allocate_lock()
def check_head(sender, **kwargs):
    # This check must come outside the safe zone;
    # otherwise, your code will screech to a halt
    message = kwargs['instance']
    if getattr(message, 'no_check_head', False):
        return
    # define safe zone; try/finally ensures the lock is released
    # even if the body raises an exception
    lock.acquire()
    try:
        # see code above
        ....
    finally:
        lock.release()

Again, a corresponding delete signal is critical as well.

EDIT: Many or most server configurations (such as Apache) will prefork, meaning there are several processes going on. The above code will be useless in that case. See this page for ideas on how to get started synchronizing with forked processes.
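For the file-based option mentioned above, one sketch uses POSIX advisory locks via the standard fcntl module; the lock-file path and the function name are my own inventions, and this only works on Unix-like systems:

```python
import fcntl

LOCK_PATH = "/tmp/check_head.lock"  # any path writable by all workers

def check_head_locked(sender, **kwargs):
    message = kwargs['instance']
    if getattr(message, 'no_check_head', False):
        return
    # flock() takes an exclusive, advisory, cross-process lock; every
    # prefork worker opening the same path contends on the same lock.
    with open(LOCK_PATH, 'w') as lockfile:
        fcntl.flock(lockfile, fcntl.LOCK_EX)
        try:
            pass  # ... body of check_head from the answer above ...
        finally:
            fcntl.flock(lockfile, fcntl.LOCK_UN)
```

A database-level alternative would be row locking inside a transaction, which pushes the synchronization problem onto MySQL instead of the filesystem.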

David Berger
I think this is the general approach you'd have to take, though you could replace the database hits by storing last_source in memcached; there'd still be a potential race condition under heavy concurrency, but if it didn't have to be 100% perfect...
Carl Meyer
As you said, not the perfect option, but I still think it is the best. Thanks
Michael
Also...I guess I've always liked using signals...but overriding the save method could do the same thing. Some stylistic purists would argue that that's "the right thing".
David Berger