views:

447

answers:

7

What would be the best way to design a threaded commenting system so that it doesn't hammer the database?

A: 

What I normally do in this case is to have a single thread that is responsible for putting the data into the database, and have all auxiliary threads report to that thread, which then queues up the data, and writes it either serially, or in batches (depending on the requirements, and how much database activity I'm willing to put up with).

DannySmurf
+1  A: 

I'm guessing your question is about arranging the system so you don't have to work as:

  1. Select all the top level comments
  2. Select all comments whose parents were found in the step prior
  3. Select all comments whose parents were found in the step prior
  4. ... repeat until no comments found

I would suggest desiging the db table with a thread key which would be string of all the parents of that post. You'd have to limit your discussion to a certain depth, but your sql statements would be straight selects and order by the thread key, giving you back threaded comments. Less taxing on your DB and Webserver.

A thread key would be something like it's current post id joined onto it's parent's thread key with a delimiter.

How does that sound?

aryeh
A: 

I'm guessing you have something resembling a "comments" table, with a foreign key to itself, pointing to the parent comment of each row. This makes the threaded comments into a tree structure with the thread starter as the tree root.

So we can rephrase the question as "What is the best way to select a tree structure from a database?". Well I won't assume to know the best way, but my first inclination (probably wrong) is to use a stored procedure to walk the tree, and compile a list of rows to return. It still takes multiple select statements to get all the children, but it's only one database round trip.

Aryeh's method with the accumulated parent list is probably better :)

Christian Oudard
+1  A: 

This website lists some common techniques: http://dev.mysql.com/tech-resources/articles/hierarchical-data.html

I'd do the "nested set" model, but have multiple roots (e.g. each "topic" is a new tree). It's very fast, simple to query, but complicated to maintain...

Matt Rogish
+1  A: 

SELECT ... START WITH ... CONNECT BY

Oracle has an extension to SELECT that allows easy tree-based retrieval.

This query will traverse a table where the nesting relationship is stored in parent and child columns.

select * from my_table
    start with parent = :TOP_ARTICLE
    connect by prior child = parent;

http://www.adp-gmbh.ch/ora/sql/connect_by.html

Mark Harrison
+2  A: 

Modified pre-order tree traversal (or what Matt refers to as "nested set") is the way to go.

If you happen to be working in Django, there's a third-party app, django-mptt, that makes implementing MPTT in your models a one-liner.

Carl Meyer
A: 

I have to second Carl Meyer's suggestions to use the link text technique. I'm working on a system like this right now but with some further optimizations for a forums.

In forum systems that support replies you will frequently be doing inserts into the middle of the tree which yields poor performance. To reduce the pain I'm working on allowing gaps in the number line. This works like pre-allocating memory in an array list. The cost for shifting the tree to the right is the same for 1 and for 100. So on successive replies (which are more likely) I can update fewer tree nodes and they will be much faster.

The downsize is the counting of descendant nodes (number of replies below this post) by comparing the current nodes left and right values will break. This information can be cached in the data structure to make that fast. So on insert I will have to update all ancestor nodes with new counts. Still far fewer nodes will be updated and the result will be much faster average case insert times.

Multiple trees are being stored in the same table. Each tree has a unique tree id, one per topic. Smaller trees update much faster.

Hope That Helps

Gareth Farrington