I've got a web application, one of whose functions is which is constantly appending rows to a large table. The rows look approximately like this:
id bigint not null primary key
visited datetime not null
ip_address
# Other fields
It's a tracking table, as you've probably guessed. The use of this table is completely append-only, that is, there's no modifying rows after they've been inserted.
However, our data volume has increased significantly, and it's become necessary to do stats-processing separately into other tables, rather than querying the data and computing it immediately. Basically I've written a stand-alone program which approximately does this (pseudo-code)
while (true) {
Select rows from tracking table where id > last_id
Feed rows to stats processing thread
last_id = max(id from rows)
sleep some amount of time (~30sec is what I'm currently using)
}
However, I am worried I will be missing rows. A gap in ID's could occur because at the time I select the rows from the tracking table, some of the row id's have been reserved, but data in those transactions was not yet committed, and on the next loop I've already moved on to newer ID's.
I have been wondering how to reconcile this, as missing data when computing stats isn't really an option.
Here are the various options I have been tossing around:
Re-design the table or query on something like the visited time
Don't use a relational DB, but some sort of data queueing system?
Query the table on a time lag, ie WHERE id > last_id AND visited < NOW() - some time
It's also possible there is an option I have not considered. What is the best way to query the table so I don't miss any data?