views:

558

answers:

4

I am developing a web application that can support threaded comments. I need the ability to rearrange the comments based on the number of votes received. (Identical to how threaded comments work in reddit)

I would love to hear the inputs from the SO community on how to do it.

How should I design the comments table? Here is the structure I am using now:

Comment
    id
    parent_post
    parent_comment
    author
    points

What changes should be done to this structure?

How should I get the details from this table to display them in the correct manner? (Implementation in any language is welcome. I just want to know how to do it in the best possible manner)

What are the stuff I need to take care while implementing this feature so that there is less load on the CPU/Database?

Thanks in advance.

+2  A: 

Storing trees in a database is a subject which has many different solutions. It depends on if you want to retrieve a subhierarchy as well (so all children of item X) or if you just want to grab the entire set of hierarchies and build the tree in an O(n) way in memory using a dictionary.

Your table has the advantage that you can fetch all comments on a post in 1 go, by filtering on the parentpost. As you've defined the comment's parent in the textbook/naive way, you have to build the tree in memory (see below). If you want to obtain the tree from the DB, you need a different way to store a tree: See my description of a pre-calc based approach here: http://www.llblgen.com/tinyforum/GotoMessage.aspx?MessageID=17746&ThreadID=3208 or by using balanced trees described by CELKO here:

or yet another approach: http://www.sqlteam.com/article/more-trees-hierarchies-in-sql

If you fetch everything in a hierarchy in memory and build the tree there, it can be more efficient due to the fact that the query is pretty simple: select .. from Comment where ParentPost = @id ORDER BY ParentComment ASC

After that query, you build the tree in memory with just 1 dictionary which keeps track of the tuple CommentID - Comment. You now walk through the resultset and build the tree on the fly: every comment you run into, you can lookup its parentcomment in the dictionary and then store the comment currently processed also in that dictionary.

Frans Bouma
+1  A: 

Your current design is basically fine for small hierarchies (less than thousand items)

If you want to fetch on a certian level or depth, add a 'level' item to your structure and compute it as part of the save

If performance is an issue use a decent cache

TFD
+1  A: 

I'd add the following new fields to the above tabel:

  • thread_id: identifier for all comments attached to a specific object

  • date: the comment date (allows fetching the comments in order)

  • rank: the comment rank (allows fetching the comment order by ranking)

Using these fields you'll be able to:

  1. fetch all comments in a thread in a single op
  2. order comments in a thread either by date or rank

Unfortunately if you want to preserve your queries DB close to SQL standard you'll have to recreate the tree in memory. Some DBs are offering special queries for hierarchical data (f.e. Oracle)

./alex

alexpopescu
Alex, thanks for the answer, but I didn't understand some of your points. I think thread_id is the same as post_id, date can be replaced by an auto-incrementing id, rank = points. That is what I have done in my design. Can you please clarify any differences b/w my design and the proposed design?
Niyaz
@Niyaz: I guess you might need to edit your question as I don't see a post_id (and indeed I misunderstood points). thread_id: unique id for all comments in a thread (attached to a piece of conent). autoincrement can provide the order but is not equivalent to a date (check almost all forums).
alexpopescu
I was confused to. If "parent_comment" points to the parent's comment id, I'd name it "parent_comment_id" to clear the air. I'm not sure what "parent_post" means and why it is different from "parent_comment".
Cory R. King
Parent_post refers to the original post (not the any of the comments).I thought it would allow me to get all the comments related to a post with a single query rather than recursing through the parent-child comment structure.
Niyaz
@Niyaz: as far as I get it, parent_post is equivalent to thread_id. So, it looks like the only remaining things to be added are: date, user_details (if needed).
alexpopescu
+1  A: 

Couple things to also consider...

1) When you say "sort like reddit" based on rank or date, do you mean the top-level or the whole thing?

2) When you delete a node, what happens to the branches? Do you re-parent them? In my implementation, I'm thinking that the editors will decide--either hide the node and display it as "comment hidden" along with the visible children, hide the comment and it's children, or nuke the whole tree. Re-parenting should be easy (just set the chidren's parent to the deleted's parent), but it anything involving the whole tree seems to be tricky to implement in the database.

I've been looking at the ltree module for PostgreSQL. It should make database operations involving parts of the tree a bit faster. It basically lets you set up a field in the table that looks like:

ltreetest=# select path from test where path <@ 'Top.Science';
                path                
------------------------------------
 Top.Science
 Top.Science.Astronomy
 Top.Science.Astronomy.Astrophysics
 Top.Science.Astronomy.Cosmology

However, it doesn't ensure any kind of referential integrity on its own. In other words, you can have a records for "Top.Science.Astronomy" without having a record for "Top.Science" or "Top". But what it does let you do is stuff like:

-- hide the children of Top.Science
UPDATE test SET hide_me=true WHERE path @> 'Top.Science';

or

-- nuke the cosmology branch
DELETE FROM test WHERE path @> 'Top.Science.Cosmology';

If combined with the traditional "comment_id"/"parent_id" approach using stored procedures, I'm thinking you can get the best of both worlds. You can quickly traverse the comment tree in the database using your "path" and still ensure referential integrity via "comment_id"/"parent_id". I'm envisioning something like:

CREATE TABLE comments (
comment_id SERIAL PRIMARY KEY,
parent_comment_id int REFERENCES comments(comment_id) ON UPDATE CASCADE ON DELETE CASCADE,
thread_id int NOT NULL  REFERENCES threads(thread_id) ON UPDATE CASCADE ON DELETE CASCADE,
path ltree NOT NULL,
comment_body text NOT NULL,
hide boolean not null default false
);

The path string for a comment look like be

<thread_id>.<parent_id_#1>.<parent_id_#2>.<parent_id_#3>.<my_comment_id>

Thus a root comment of thread "102" with a comment_id of "1" would have a path of:

102.1

And a child whose comment_id is "3" would be:

102.1.3

A some children of "3" having id's of "31" and "54" would be:

102.1.3.31
102.1.3.54

To hide the node "3" and its kids, you'd issue this:

UPDATE comments SET hide=true WHERE path @> '102.1.3';

I dunno though--it might add needless overhead. Plus I don't know how well maintained ltree is.

Cory R. King