views:

559

answers:

10

I have a database table with hundreds of thousands of forum posts, and I would like to find out what hour-long period contains the most number of posts.

I could crawl forward one minute at a time, keeping an array of timestamps and keeping track of what hour had the most in it, but I feel like there is a much better way to do this. I will be running this operation on a year of posts so checking every minute in a year seems pretty awful.

Ideally there would be a way to do this inside a single database query.

A: 

SELECT  DATEPART(hour, PostDateTime) AS HourOfDay,
        COUNT(*) AS ForumPosts
FROM    Posts
GROUP BY DATEPART(hour, PostDateTime)

ShaneD
+1  A: 

This results in an O(n) database query, and an O(n) greatest time search, for a total complexity of O(2n) (which, of course, is still O(n)):

Use a count distinct command in SQL which will 'bin' items for you in minute increments.

So you'd run the count query on this table:

time
1
2      
4
3
3
2
4
1
3
2

And it would return:

0 1
1 1
2 3
3 3
4 2

By counting each item.

I suspect you can do the same thing with your table, and bin them by the minute, then run an algorithm on that.

SELECT customer_name, COUNT(DISTINCT city) as "Distinct Cities"
FROM customers
GROUP BY customer_name;

From this tutorial on count: http://www.techonthenet.com/sql/count.php (near the end).

Here is a similar page from MySQL's manual: http://dev.mysql.com/doc/refman/5.1/en/counting-rows.html

So if you have a table with a timedate in it (to the minute, allowing binning to happen by minutes):

datetime (yyyymmddhhmm)
200901121435
200901121538
200901121435
200901121538
200901121435
200901121538
200901121538
200901121435
200901121435
200901121538
200901121435
200901121435

Then the SQL

SELECT datetime, COUNT(DISTINCT datetime) as "Date Time"
FROM post
GROUP BY datetime;

should return

200901121435 7
200901121538 5

You will still need to post process this, but the hard work of grouping and counting is done, and will only result in just over 500k rows per year (60 minutes, 24 hours, 365 days)

The post processing would be:

Start at time T = first post time.
Set greatestTime = T
Sum all counts between T and T+one hour --> currentHourCount and greatestHourCount
While records exist past T+one hour
   Increment T by one minute.
   While the first element is prior to time T, subtract it
   while the last element is before time T+ one hour, add it
   If currentHourCount > greatestHourCount then
      greatestHourCount = currentHourCount
      greatestTime = T
end while
Adam Davis
Thanks. If the "single query" solution above ends up being too hard on the database server I will probably end up using this method, as its a great way to split the work between the database server and the code server.
OverloadUT
@recursive: ayup.
Adam Davis
A: 

If mysql:

select substr( timestamp, 1, 16 ) as hour, count(*) as count from forum_posts group by hour order by count desc limit 1;

edit: not sure if original question means any possible 60-minute period

Chris J
Yeah I should have made that more clear. The problem is easy to solve if I only want to consider each "clock hour" but I want to consider any 60-minute period.
OverloadUT
A: 

If using MySQL:

SELECT DATE(postDate), HOUR(postDate), COUNT(*) AS n
FROM posts
GROUP BY DATE(postDate), HOUR(postDate)
ORDER BY n DESC
LIMIT 1
Alnitak
+4  A: 

Binning will work if you want to look at intervals such as 10:00 - 11:00. However if you had a sudden flurry of interest from 10:30 - 11:30 then it will be split across two bins, and hence may be hidden by an smaller number of hits that happened to fit entirely within a single clock hour.

The only way to avoid this problem is to generate a list sorted by time and step through it. Something like this:

max = 0; maxTime = 0
for each $item in the list:
   push $item onto queue
   while head of queue is more than an hour before $item
      drop queue head.
   if queue.count > max then max = queue.count; maxTime = $item.time

That way you only need to hold a 1 hour window in memory rather than the whole list.

Paul Johnson
Yeah, this method is the best method I could come up with on my own. I'll do it if I have to, but I was hoping there might be a way to do it without having to step through hundreds of thousands of items.
OverloadUT
Amazing how many people ignore the fact that grouping by the hour doesn't do it!
Loren Pechtel
Only a few hundred thousand? Perl!
Paul Johnson
+5  A: 

Given a table filled with every minute in the year you are interested in Minutes and a table Posts with a Time column:

select top 1 minutes.time, count (posts.time)
from Minutes
   left join posts on posts.time >= minutes.time AND posts.time < dateadd(hour, 1, Minutes.Time)
group by minutes.time
order by count (posts.time) desc

To solve generating the minutes table, you can use a function like ufn_GenerateIntegers. Then the function becomes

select top 5 minutes.time, count (posts.time)
from (select dateadd(minute, IntValue, '2008-01-01') as Time from ufn_GenerateIntegers(525600)) Minutes
   left join posts on posts.time >= minutes.time AND posts.time < dateadd(hour, 1, Minutes.Time)
group by minutes.time
order by count(posts.time) desc

I just did a test run with about 5000 random posts and it took 16 seconds on my machine. So, not trivial, but not rediculous for the occasional one-off query. Fortunately, this is a data-point you can calculate one a day or even once a month and cache if you want to display the value frequently.

Take a look at lassevk's improvement.

Eclipse
Ah ha! This is the kind of thing I was looking for! I'll have to test to see how long running this query will take as my database server has a lot less resources than my php server, but it's definitely more in the direction I was hoping to go.
OverloadUT
Yeah, this one could take a long time - I haven't tried it out at all. But I have done similar things with day-long resolutions.
Eclipse
I'm pretty sure I read a Daily WTF about this design just a few days ago...
rmeador
Its a WTF if you're putting this into a frequently run query in code, but not if you're just curious once in a while.
Eclipse
Yeah, this would only need to be run once a year to calculate the previous year's stats, and once a month for each month's stats and once a week for each week's stats, etc etc.
OverloadUT
+2  A: 

Treat the timestamp of every post as the start of such an hour, and count all other posts that fall within that hour, including the post that started it. Sort the resulting hours in descending order by the number of posts in each of them.

Having done that, you'll find the topmost single "hour" that has the most posts in it, but this period of time might not be exactly one hour long, it might be shorter (but never longer).

To get a "prettier" period, you can calculate how long it really is, divide by two, and adjust the start of the period back by that amount and the end forward, this will "center" the posts inside the hour. This adjustment is guaranteed to not include any new posts, so the count is still valid. If posts are close enough to suddenly be included in the period after you have expanded it to one hour, then an earlier point would've had "the most posts" in it instead of the one you picked.

If this is an SQL question, you can reuse the SQL that Josh posted here, just replace the Minutes table with another link to your posts table.


Another method you can use is to use a sliding window.

First sort all the posts according to the timestamp. Keep track of posts using a list, a linked list could be used for this.

Now, for each post, add it to the end of the list. Then, for each post from the start of the list, if that post is more than one hour before the post you just added, remove it from the list.

After doing that 2-step operation for a single new post in the list, check if the number of posts in the list is more than a previous maximum, and if it is, either make a copy of the list or at least store the post you just added.

After you're finished, you've got the "copy of the list" with the most posts in an hour, or you got the post that is the end of a 1-hour window that contains the most posts.

Pseudo-code:

initialize posts-window-list to empty list
for each post in sorted-posts-list:
    add post to end of posts-window-list
    for each other-post from start of posts-window-list:
        if other-post is more than one hour older than post, remove it
        otherwise, end this inner loop
    if number of posts in list is more than previous maximum:
        make copy of list, this is the new maximum
Lasse V. Karlsen
+2  A: 

This worked on a small test MS-SQL database.

SELECT TOP 1 id, date_entered,
  (SELECT COUNT(*)
   FROM   dbo.notes AS n2
   WHERE n2.date_entered >= n.date_entered 
   AND n2.date_entered < Dateadd(hh, 1, n.date_entered)) AS num
FROM  dbo.notes n
ORDER BY num DESC

This is not very efficient, checks based on an hour from each post.

For MYSQL 

SELECT ID,f.Date, (SELECT COUNT(*)
FROM Forum AS f2
WHERE f2.Date >= f.Date AND f2.Date < Date_ADD(f.Date, INTERVAL 1 HOUR)) As num
FROM Forum AS f
ORDER BY num
LIMIT 0,1
Cadoo
+1  A: 

Here's a slight variation on the other Josh's implementation this forgoes the immediate table and uses a self join on itself looking for any posts within an hour of that one post.

select top 1 posts.DateCreated, count (posts.datecreated),
min(minutes.DateCreated) as MinPostDate,
max(minutes.datecreated) as MaxPostDate
from posts Minutes   
left join posts on posts.datecreated >= minutes.DateCreated 
AND posts.datecreated < dateadd(hour, 1, Minutes.DateCreated)
group by posts.DateCreated
order by count(posts.datecreated) desc

From a performance perspective on a table with only 6 rows his method which used the function to generate the intermiadte table took 16 seconds vs this one which was subsecond.

I'm not positive if it would be possible using this to miss a valid timeframe since the timespan is based off of the offset of each post.

JoshBerke
+1  A: 

This will do it.

SELECT DateOfEvent HourBegin, DATEADD(hh, 1, DateOfEvent)) HourEnd, COUNT(*) AS NumEventsPerHour FROM tEvents AS A JOIN tEvents AS B ON A.DateOfEvent >= B.DateOfEvents AND DATEADD(hh, 1, A.DateOfEvent) <= B.DateOfEvent GROUP BY A.DateOfEvent