views:

380

answers:

3

I have a Python application. It has an SQLite database, full of data about things that happen, retrieved by a Web scraper from the Web. This data includes time-date groups, as Unix timestamps, in a column reserved for them. I want to retrieve the names of organisations that did things and count how often they did them, but to do this for each week (i.e. 604,800 seconds) I have data for.

Pseudocode:

for each 604800-second increment in time:
 select count(time), org from table group by org

Essentially what I'm trying to do is iterate through the database like a list sorted on the time column, with a step value of 604800. The aim is to analyse how the distribution of different organisations in the total changed over time.

If at all possible, I'd like to avoid pulling all the rows from the db and processing them in Python as this seems a) inefficient and b) probably pointless given that the data is in a database.

+1  A: 

Create a table listing all weeks since the epoch, and JOIN it to your table of events.

CREATE TABLE Weeks (
  week INTEGER PRIMARY KEY
);

INSERT INTO Weeks (week) VALUES (200919); -- e.g. this week

SELECT w.week, e.org, COUNT(*)
FROM Events e JOIN Weeks w ON (w.week = strftime('%Y%W', e.time))
GROUP BY w.week, e.org;

There are only 52-53 weeks per year. Even if you populate the Weeks table for 100 years, that's still a small table.

Bill Karwin
+1  A: 

To do this in a set-based manner (which is what SQL is good at) you will need a set-based representation of your time increments. That can be a temporary table, a permanent table, or a derived table (i.e. subquery). I'm not too familiar with SQLite and it's been awhile since I've worked with UNIX. Timestamps in UNIX are just # seconds since some set date/time? Using a standard Calendar table (which is useful to have in a database)...

SELECT
     C1.start_time,
     C2.end_time,
     T.org,
     COUNT(time)
FROM
     Calendar C1
INNER JOIN Calendar C2 ON
     C2.start_time = DATEADD(dy, 6, C1.start_time)
INNER JOIN My_Table T ON
     T.time BETWEEN C1.start_time AND C2.end_time  -- You'll need to convert to timestamp here
WHERE
     DATEPART(dw, C1.start_time) = 1 AND    -- Basically, only get dates that are a Sunday or whatever other day starts your intervals
     C1.start_time BETWEEN @start_range_date AND @end_range_date  -- Period for which you're running the report
GROUP BY
     C1.start_time,
     C2.end_time,
     T.org

The Calendar table can take whatever form you want, so you could use UNIX timestamps in it for the start_time and end_time. You just pre-populate it with all of the dates in any conceivable range that you might want to use. Even going from 1900-01-01 to 9999-12-31 won't be a terribly large table. It can come in handy for a lot of reporting type queries.

Finally, this code is T-SQL, so you'll probably need to convert the DATEPART and DATEADD to whatever the equivalent is in SQLite.

Tom H.
+1  A: 

Not being familiar with SQLite I think this approach should work for most databases, as it finds the weeknumber and subtracts the offset

SELECT org, ROUND(time/604800) - week_offset, COUNT(*)
FROM table
GROUP BY org, ROUND(time/604800) - week_offset

In Oracle I would use the following if time was a date column:

SELECT org, TO_CHAR(time, 'YYYY-IW'), COUNT(*)
FROM table
GROUP BY org, TO_CHAR(time, 'YYYY-IW')

SQLite probably has similar functionality that allows this kind of SELECT which is easier on the eye.

stili
yes, you can use the strftime function to format a date-time string (and the unixepoch modifier if that's what you start with), as well as the simpler approach based on truncating the unix-epoch number.
Alex Martelli
Thanks! this is the solution; TO_CHAR doesn't exist in sqlite, but strftime functions do and I was able to sort it this way.