I'm building a CGI script that polls a SQLite database and builds a table of statistics. The source database table is described below, as is the chunk of pertinent code. Everything works (functionally), but the CGI itself is very slow as I have multiple nested SELECT COUNT(id) calls. I figure my best shot at optimization is to ask the SO community as my time with Google has been relatively fruitless.

The table:

CREATE TABLE messages (
    id TEXT PRIMARY KEY ON CONFLICT REPLACE,
    date TEXT,
    hour INTEGER,
    sender TEXT,
    size INTEGER,
    origin TEXT,
    destination TEXT,
    relay TEXT,
    day TEXT);

(Yes, I know the table isn't normalized but it's populated with extracts from a mail log... I was happy enough to get the extract & populate working, let alone normalize it. I don't think the table structure has a lot to do with my question at this point, but I could be wrong.)

Sample row:

476793200A7|Jan 29 06:04:47|6|[email protected]|4656|web02.mydomain.pvt|[email protected]|mail01.mydomain.pvt|Jan 29

And, the Python code that builds my tables:

#!/usr/bin/python
print 'Content-type: text/html\n\n'

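import re
import sqlite3
from datetime import date

# Connection setup was not shown in the original post; 'maillog.db' is a placeholder path.
conn = sqlite3.connect('maillog.db')
curs = conn.cursor()

p = re.compile(r'(\w+) (\d+)')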

d_month = {'Jan':1,'Feb':2,'Mar':3,'Apr':4,'May':5,'Jun':6,'Jul':7,'Aug':8,'Sep':9,'Oct':10,'Nov':11,'Dec':12}
l_wkday = ['Mo','Tu','We','Th','Fr','Sa','Su']

days = []
curs.execute('SELECT DISTINCT(day) FROM messages ORDER BY day')
for day in curs.fetchall():
    m = p.match(day[0]).group(1)
    m = d_month[m]
    d = p.match(day[0]).group(2)
    days.append([day[0],"%s (%s)" % (day[0],l_wkday[date.weekday(date(2010,int(m),int(d)))])])

curs.execute('SELECT DISTINCT(sender) FROM messages')
senders = curs.fetchall()
for sender in senders:
    curs.execute('SELECT COUNT(id) FROM messages WHERE sender=?', (sender[0],))
    print '  <div id="'+sender[0]+'">'
    print '   <h1>Stats for Sender: '+sender[0]+'</h1>'
    print '   <table><caption>Total messages in database: %d</caption>' % curs.fetchone()[0]
    print '    <tr><td>&nbsp;</td><th colspan=24>Hour of Day</th></tr>'
    print '    <tr><td class="left">Day</td><th>%s</th></tr>' % '</th><th>'.join(map(str,range(24)))
    for day in days:
        print '    <tr><td>%s</td>' % day[1]
        for hour in range(24):
            # Bind values with ? placeholders instead of formatting them into the SQL string.
            curs.execute('SELECT COUNT(id) FROM messages WHERE sender=? AND day=? AND hour=?',
                         (sender[0], day[0], hour))
            d = curs.fetchone()[0]
            print '    <td>%s</td>' % (d>0 and str(d) or '')
        print '    </tr>'
    print '   </table></div>'

print ' </body>\n</html>\n'

I'm not sure if there are any ways I can combine some of the queries, or approach it from a different angle to extract the data. I had also thought about building a second table with the counts in it and just updating it when the original table is updated. I've been staring at this for entirely too long today so I'm going to attack it fresh again tomorrow, hopefully with some insight from the experts ;)


Edit: Using the GROUP BY answer provided below, I was able to get the data needed from the database in one query. I switched to Perl since Python's nested dict support just didn't work very well for the way I needed to approach this (building a set of HTML tables in a specific way). Here's a snippet of the revised code:

my %data;
my $rows = $db->selectall_arrayref("SELECT COUNT(id),sender,day,hour FROM messages GROUP BY sender,day,hour ORDER BY sender,day,hour");
for my $row (@$rows) {
    my ($ct, $se, $dy, $hr) = @$row;
    $data{$se}{$dy}{$hr} = $ct;
}
for my $se (sort keys %data) {
    print "Sender: $se\n";
    for my $dy (sort keys %{$data{$se}}) {
        print "Day: ",time2str('%a',str2time("$dy 2010"))," $dy\n";
        # Sort hours numerically so they print in order.
        for my $hr (sort { $a <=> $b } keys %{$data{$se}{$dy}}) {
            print "Hour: $hr = ".$data{$se}{$dy}{$hr}."\n";
        }
    }
    print "\n";
}

What once executed in about 28.024s now takes 0.415s!

+1  A: 

For starters, create an index:

CREATE INDEX messages_sender_by_day ON messages (sender, day);

(You probably don't need to include "hour" in there.)
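To check whether the index is actually picked up, you can ask SQLite for its query plan. This is only a rough sketch in Python 3 against an empty in-memory copy of the table; the sender value is made up, and the exact wording of the plan differs between SQLite versions.

```python
import sqlite3

# In-memory stand-in for the real mail-log database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE messages (
        id TEXT PRIMARY KEY ON CONFLICT REPLACE,
        date TEXT, hour INTEGER, sender TEXT, size INTEGER,
        origin TEXT, destination TEXT, relay TEXT, day TEXT)
""")
conn.execute("CREATE INDEX messages_sender_by_day ON messages (sender, day)")

# Ask SQLite how it would run the per-cell count from the question.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT COUNT(id) FROM messages WHERE sender=? AND day=? AND hour=?",
    ("sender-a", "Jan 29", 6),  # hypothetical values
).fetchall()

# The last column of each plan row is a human-readable description.
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)
```

If the index name shows up in the printed plan, the lookup is using it rather than scanning the whole table.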

If that doesn't help or you've already tried it, then please fix up your question a bit: give us some code to generate test data and SQL for all indexes on the table.

Maintaining a count cache is fairly common, but I can't tell if that's needed here.
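If you do go the cache route, one simple variant is to rebuild a summary table after each import rather than keeping it updated row by row. A minimal sketch in Python 3, assuming a hypothetical message_counts table and a refresh_counts helper (neither is from the question), and a cut-down messages table:

```python
import sqlite3

def refresh_counts(conn):
    """Rebuild the summary table from messages in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM message_counts")
        conn.execute(
            "INSERT INTO message_counts (sender, day, hour, n) "
            "SELECT sender, day, hour, COUNT(*) FROM messages "
            "GROUP BY sender, day, hour"
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    "id TEXT PRIMARY KEY ON CONFLICT REPLACE, hour INTEGER, sender TEXT, day TEXT)"
)
# Hypothetical cache table: one row per (sender, day, hour) cell.
conn.execute(
    "CREATE TABLE message_counts ("
    "sender TEXT, day TEXT, hour INTEGER, n INTEGER, "
    "PRIMARY KEY (sender, day, hour))"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?, ?)",
    [
        ("A1", 6, "alice", "Jan 29"),
        ("A2", 6, "alice", "Jan 29"),
        ("A1", 7, "alice", "Jan 29"),  # same id: replaces the first row
    ],
)
refresh_counts(conn)

cached = conn.execute(
    "SELECT hour, n FROM message_counts WHERE sender=? AND day=? ORDER BY hour",
    ("alice", "Jan 29"),
).fetchall()
print(cached)
```

A full rebuild sidesteps the question of how replaced rows (the ON CONFLICT REPLACE on id) should adjust the counts, at the cost of redoing the GROUP BY on every import.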

Glenn Maynard
+2  A: 

First of all, you can use a GROUP BY clause:

select count(*), sender from messages group by sender;

With this you execute one query for all senders instead of one query per sender. Another possibility could be:

select count(*), sender, day, hour
    from messages group by sender, day, hour
    order by sender, day, hour;

I didn't test it, but at least now you know that the GROUP BY clause exists. This should reduce the number of queries, and I think it is the first big step toward better performance.
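On the Python side, the grouped result can be loaded into a nested dictionary in one pass. The following is just a sketch in Python 3 with a cut-down table and invented rows; counts and row_cells are names made up for the example.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id TEXT PRIMARY KEY, hour INTEGER, sender TEXT, day TEXT)"
)
conn.executemany(
    "INSERT INTO messages (id, hour, sender, day) VALUES (?, ?, ?, ?)",
    [
        ("A1", 6, "alice", "Jan 29"),
        ("A2", 6, "alice", "Jan 29"),
        ("A3", 7, "alice", "Jan 29"),
        ("B1", 6, "bob", "Jan 30"),
    ],
)

# counts[sender][day][hour] -> number of messages
counts = defaultdict(lambda: defaultdict(dict))
for n, sender, day, hour in conn.execute(
    "SELECT COUNT(*), sender, day, hour FROM messages "
    "GROUP BY sender, day, hour ORDER BY sender, day, hour"
):
    counts[sender][day][hour] = n

def row_cells(sender, day):
    """Return 24 cell strings for one table row; empty hours stay blank."""
    by_hour = counts[sender][day]
    return [str(by_hour.get(h, "")) for h in range(24)]

print(row_cells("alice", "Jan 29"))
```

The HTML loops over senders, days and hours can then read from counts instead of issuing a query per cell.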

Second, create indexes on the columns you search by; in your case, sender, day and hour.

If this isn't enough, use profiling tools to find where most of the time is spent. You should also consider using fetchmany instead of fetchall to keep memory consumption low. Remember that the sqlite module is implemented in C, so let it do as much of the work as possible.
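For the fetchmany suggestion, something along these lines should work. It is only a sketch in Python 3; iter_rows is a hypothetical helper, and the table and batch size are invented for the example.

```python
import sqlite3

def iter_rows(cursor, batch_size=500):
    """Yield rows from cursor in batches instead of loading them all at once."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        for row in batch:
            yield row

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id TEXT PRIMARY KEY, sender TEXT)")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?)",
    [("id%d" % i, "sender%d" % (i % 3)) for i in range(10)],
)

curs = conn.execute(
    "SELECT sender, COUNT(*) FROM messages GROUP BY sender ORDER BY sender"
)
# Each row is a (sender, count) pair, so the generator feeds dict() directly.
totals = dict(iter_rows(curs, batch_size=2))
print(totals)
```

With a grouped query the result set is usually small, so batching matters more when you are pulling many raw rows.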

mg
Thanks, this is exactly what I was looking for: a way to get the data needed in as few queries as possible. This is MUCH faster.
Justin