views:

132

answers:

3

What is the most efficient way to determine how many comments a particular blog post has? We want to store the data for a new web app. We have a list of permalink URLs as well as the RSS feeds.

+3  A: 

If you control the blog, a query like "SELECT COUNT(commentid) FROM comments WHERE postID = 2" is probably the best approach. If you only have the URL but it is still your blog and database, you need a subquery such as "WHERE postID = (SELECT whatever FROM posts WHERE permalink = url)", or whatever your way of joining comments to posts from a URL is.
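As a rough sketch of the query above, here is how it might look from Python with the standard sqlite3 module. The table and column names (posts, comments, postID, commentid, permalink) are assumptions for illustration, as are the example.com URLs; adapt them to your real schema and database driver.

```python
import sqlite3

def count_comments_by_permalink(conn, permalink):
    """Count comments for the post with this permalink (hypothetical schema)."""
    row = conn.execute(
        "SELECT COUNT(c.commentid) "
        "FROM comments AS c "
        "JOIN posts AS p ON c.postID = p.postID "
        "WHERE p.permalink = ?",
        (permalink,),  # parameterized to avoid SQL injection
    ).fetchone()
    return row[0]

# Small in-memory database so the sketch can be run as-is.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (postID INTEGER PRIMARY KEY, permalink TEXT UNIQUE);
CREATE TABLE comments (
    commentid INTEGER PRIMARY KEY,
    postID INTEGER REFERENCES posts(postID),
    body TEXT
);
INSERT INTO posts VALUES (1, 'http://example.com/a'), (2, 'http://example.com/b');
INSERT INTO comments (postID, body) VALUES (2, 'first'), (2, 'second'), (1, 'other');
""")
```

The join does the same job as the subquery; an index on the permalink column (here implied by UNIQUE) keeps the lookup cheap.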

If it's a remote blog, the problem is that each blog has different HTML. Essentially, you'll need to build a parser that reads the HTML and looks for repeating elements like "div class=comment". That will mostly be manual labour for each different blog.
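A minimal sketch of such a parser, using only Python's standard html.parser module. The tag name and class name are per-blog assumptions that you would have to find by inspecting each site's markup; "div" and "comment" are just the example from above.

```python
from html.parser import HTMLParser

class CommentCounter(HTMLParser):
    """Counts start tags of a given name that carry a given CSS class."""

    def __init__(self, tag="div", css_class="comment"):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # The class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1

def count_comment_blocks(html, tag="div", css_class="comment"):
    parser = CommentCounter(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.count
```

Matching on the exact class token (rather than a substring) avoids counting wrappers such as a "comments-header" element.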

Some blogs may have better ways, like a comment count somewhere in the HTML or some interface, but I'm not aware of any standardized way.

EDIT: If you have a Comment-RSS feed, you may have luck using a mechanism that counts XML nodes, like XPath's count() function.
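A small sketch of the node-counting idea in Python. The standard xml.etree.ElementTree module supports only a subset of XPath and has no count() function, so this swaps in len() over findall() instead. It assumes an RSS 2.0 layout (rss/channel/item); Atom feeds use namespaced entry elements and would need a different path.

```python
import xml.etree.ElementTree as ET

def count_feed_items(feed_xml):
    """Return the number of <item> elements in an RSS 2.0 document string."""
    root = ET.fromstring(feed_xml)  # root is the <rss> element
    return len(root.findall("./channel/item"))
```

Bear in mind that many feeds only include the most recent entries, so on busy posts the result may be a lower bound rather than the full count.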

Michael Stum
+1  A: 

If I understand correctly, you want a heuristic to estimate the number of comments in an HTML page which is known to be a blog post, yes?

Very often, a specific blog will have some features which make it easy to work out. If you look at mine over at http://kstruct.com/ you'll see that all the pages with comments say 'X Responses', so if you were able to do some work on a per blog basis, it's probably not really difficult.
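For a per-blog rule like that, a regular expression may be all you need. This is only a sketch: the 'X Responses' wording comes from the example above, while the singular form and the case-insensitive match are assumptions about how a given theme might phrase it.

```python
import re

# Matches e.g. "12 Responses" or "1 Response" (singular form is assumed).
RESPONSES_RE = re.compile(r"(\d+)\s+Responses?", re.IGNORECASE)

def extract_response_count(html):
    """Return the first 'N Responses' number found in the page, or None."""
    match = RESPONSES_RE.search(html)
    return int(match.group(1)) if match else None
```

Returning None when nothing matches lets the caller distinguish "no counter found" from a genuine count of zero.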

If you needed something generic, I guess there are a few common features that comments have that you might be able to detect. For one, any links in them are quite likely to have rel="nofollow" attributes, so seeing that within a block might imply that it's a comment.

The main thing to look for would be changes in the structure of posts from the same site. For example, there's a very good chance that each comment will have its own anchor so people can link directly to it, so you could compare the numbers of <a name="XXX"> tags across pages on the same site to get an idea of the relative numbers of comments.
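The two signals mentioned here (named anchors and rel="nofollow" links) can be tallied with the standard html.parser module. This is a heuristic sketch, not a reliable counter: the numbers only mean something relative to other pages on the same site, for example a page known to have no comments.

```python
from html.parser import HTMLParser

class AnchorStats(HTMLParser):
    """Tallies <a name=...> anchors and <a rel="nofollow"> links."""

    def __init__(self):
        super().__init__()
        self.named_anchors = 0
        self.nofollow_links = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        if attrs.get("name"):
            self.named_anchors += 1
        # rel may list several space-separated values.
        if "nofollow" in (attrs.get("rel") or "").lower().split():
            self.nofollow_links += 1

def anchor_stats(html):
    """Return (named_anchor_count, nofollow_link_count) for a page."""
    parser = AnchorStats()
    parser.feed(html)
    parser.close()
    return parser.named_anchors, parser.nofollow_links
```

Subtracting the named-anchor count of a comment-free page from that of the page under test gives a rough estimate of how many anchors the comments contributed.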

As Michael Stum pointed out, if the pages have a Comment-RSS feed, your life is made a lot easier because you can get the comment data in a structured format.

All in all, though, I think it's going to be quite a challenging problem to solve in general.

Matt Sheppard
A: 

Blogs almost always have an RSS feed for comments. If you have that, you can count the comments it lists directly, since the feeds nearly always follow a standard format. Even if the blog is your own, if you are already generating an RSS feed, don't bother making another call to your DB. You already did that to generate the feed, so it makes sense to just traverse the XML nodes. That way you avoid additional overhead (depending on how often you want to get this information).

hal10001