I'm building a CGI script that polls a SQLite database and builds a table of statistics. The source database table is described below, as is the chunk of pertinent code. Everything works (functionally), but the CGI itself is very slow because I have multiple nested SELECT COUNT(id) calls. I figure my best shot at optimization is to ask the SO community, as my time with Google has been relatively fruitless.
The table:
CREATE TABLE messages (
    id TEXT PRIMARY KEY ON CONFLICT REPLACE,
    date TEXT,
    hour INTEGER,
    sender TEXT,
    size INTEGER,
    origin TEXT,
    destination TEXT,
    relay TEXT,
    day TEXT);
(Yes, I know the table isn't normalized, but it's populated with extracts from a mail log. I was happy enough to get the extract-and-populate step working, let alone normalize it. I don't think the table structure has much to do with my question at this point, but I could be wrong.)
Sample row:
476793200A7|Jan 29 06:04:47|6|[email protected]|4656|web02.mydomain.pvt|[email protected]|mail01.mydomain.pvt|Jan 29
And, the Python code that builds my tables:
#!/usr/bin/python
import sqlite3
import re
from datetime import date

print 'Content-type: text/html\n\n'

# Database path assumed; adjust to match your environment.
conn = sqlite3.connect('messages.db')
curs = conn.cursor()

p = re.compile(r'(\w+) (\d+)')
d_month = {'Jan':1,'Feb':2,'Mar':3,'Apr':4,'May':5,'Jun':6,
           'Jul':7,'Aug':8,'Sep':9,'Oct':10,'Nov':11,'Dec':12}
l_wkday = ['Mo','Tu','We','Th','Fr','Sa','Su']

# Build a list of [day, "day (weekday)"] pairs for the table rows.
days = []
curs.execute('SELECT DISTINCT(day) FROM messages ORDER BY day')
for day in curs.fetchall():
    m = d_month[p.match(day[0]).group(1)]
    d = p.match(day[0]).group(2)
    days.append([day[0], "%s (%s)" % (day[0], l_wkday[date.weekday(date(2010, int(m), int(d)))])])

curs.execute('SELECT DISTINCT(sender) FROM messages')
senders = curs.fetchall()
for sender in senders:
    # sqlite3 uses '?' placeholders, and parameters must be a sequence.
    curs.execute('SELECT COUNT(id) FROM messages WHERE sender=?', (sender[0],))
    print ' <div id="'+sender[0]+'">'
    print ' <h1>Stats for Sender: '+sender[0]+'</h1>'
    print ' <table><caption>Total messages in database: %d</caption>' % curs.fetchone()[0]
    print ' <tr><td> </td><th colspan=24>Hour of Day</th></tr>'
    print ' <tr><td class="left">Day</td><th>%s</th></tr>' % '</th><th>'.join(map(str, range(24)))
    for day in days:
        print ' <tr><td>%s</td>' % day[1]
        for hour in range(24):
            # One COUNT query per (sender, day, hour) cell -- this is the slow part.
            curs.execute('SELECT COUNT(id) FROM messages WHERE sender=? AND day=? AND hour=?',
                         (sender[0], day[0], hour))
            d = curs.fetchone()[0]
            print ' <td>%s</td>' % (d > 0 and str(d) or '')
        print ' </tr>'
    print ' </table></div>'
print ' </body>\n</html>\n'
I'm not sure whether there are ways to combine some of the queries, or whether I should approach the data extraction from a different angle entirely. I had also thought about building a second table with the counts in it and just updating it whenever the original table is updated. I've been staring at this for entirely too long today, so I'm going to attack it fresh tomorrow, hopefully with some insight from the experts ;)
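For the record, that second-table idea can be sketched as a summary table rebuilt after each log import, so page loads become simple lookups instead of per-cell COUNT queries. Everything here is illustrative rather than from my real setup: the `message_counts` name, the sample addresses, and the in-memory database standing in for the real file.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # stand-in for the real database file
curs = conn.cursor()
curs.execute("""CREATE TABLE messages (
    id TEXT PRIMARY KEY ON CONFLICT REPLACE,
    date TEXT, hour INTEGER, sender TEXT, size INTEGER,
    origin TEXT, destination TEXT, relay TEXT, day TEXT)""")

# A few sample rows in the same shape as the mail-log extract (addresses made up).
rows = [
    ('A1', 'Jan 29 06:04:47', 6, 'sender@example.com', 100, 'o', 'd', 'r', 'Jan 29'),
    ('A2', 'Jan 29 06:15:00', 6, 'sender@example.com', 200, 'o', 'd', 'r', 'Jan 29'),
    ('A3', 'Jan 30 09:00:00', 9, 'sender@example.com', 300, 'o', 'd', 'r', 'Jan 30'),
]
curs.executemany('INSERT INTO messages VALUES (?,?,?,?,?,?,?,?,?)', rows)

# Rebuild the summary table after each import; one GROUP BY pass
# replaces the senders * days * 24 individual COUNT queries.
curs.execute('CREATE TABLE IF NOT EXISTS message_counts '
             '(sender TEXT, day TEXT, hour INTEGER, ct INTEGER)')
curs.execute('DELETE FROM message_counts')
curs.execute('INSERT INTO message_counts '
             'SELECT sender, day, hour, COUNT(id) FROM messages '
             'GROUP BY sender, day, hour')
conn.commit()

# The CGI then does cheap lookups against the small summary table.
curs.execute('SELECT ct FROM message_counts WHERE sender=? AND day=? AND hour=?',
             ('sender@example.com', 'Jan 29', 6))
print(curs.fetchone()[0])  # prints 2
```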
Edit: Using the GROUP BY answer provided below, I was able to get all the data I needed from the database in one query. I switched to Perl, since Python's nested dicts didn't work very well for the way I needed to approach this (building a set of HTML tables in a specific way). Here's a snippet of the revised code:
use DBI;
use Date::Parse;     # str2time
use Date::Format;    # time2str

# Connection details assumed; adjust to the real database file.
my $db = DBI->connect("dbi:SQLite:dbname=messages.db", "", "", { RaiseError => 1 });

my %data;
my $rows = $db->selectall_arrayref(
    "SELECT COUNT(id),sender,day,hour FROM messages GROUP BY sender,day,hour ORDER BY sender,day,hour");
for my $row (@$rows) {
    my ($ct, $se, $dy, $hr) = @$row;
    $data{$se}{$dy}{$hr} = $ct;
}
# Hash keys come back unordered, so sort here; the SQL ORDER BY
# doesn't survive the trip through the nested hash.
for my $se (sort keys %data) {
    print "Sender: $se\n";
    for my $dy (sort keys %{$data{$se}}) {
        print "Day: ", time2str('%a', str2time("$dy 2010")), " $dy\n";
        for my $hr (sort { $a <=> $b } keys %{$data{$se}{$dy}}) {
            print "Hour: $hr = " . $data{$se}{$dy}{$hr} . "\n";
        }
    }
    print "\n";
}
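For what it's worth, the same single-GROUP-BY approach is workable in Python too; `collections.defaultdict` handles the nested-dict bookkeeping automatically. This is a sketch against an in-memory database with a couple of made-up rows, not my production schema:

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(':memory:')  # stand-in for the real database
curs = conn.cursor()
curs.execute('CREATE TABLE messages (id TEXT PRIMARY KEY, sender TEXT, day TEXT, hour INTEGER)')
curs.executemany('INSERT INTO messages VALUES (?,?,?,?)', [
    ('A1', 'sender@example.com', 'Jan 29', 6),
    ('A2', 'sender@example.com', 'Jan 29', 6),
    ('A3', 'sender@example.com', 'Jan 30', 9),
])

# data[sender][day][hour] = count, filled from a single GROUP BY query.
data = defaultdict(lambda: defaultdict(dict))
curs.execute('SELECT COUNT(id), sender, day, hour FROM messages '
             'GROUP BY sender, day, hour ORDER BY sender, day, hour')
for ct, se, dy, hr in curs.fetchall():
    data[se][dy][hr] = ct

# Walk the nested structure the same way the Perl snippet does.
for se in sorted(data):
    print('Sender: %s' % se)
    for dy in sorted(data[se]):
        print('  Day: %s' % dy)
        for hr in sorted(data[se][dy]):
            print('    Hour: %s = %s' % (hr, data[se][dy][hr]))
```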
What once executed in about 28.024s now takes 0.415s!