views:

34

answers:

1

I've got a web application, one of whose functions is which is constantly appending rows to a large table. The rows look approximately like this:

id bigint not null primary key
visited datetime not null
ip_address
# Other fields

It's a tracking table, as you've probably guessed. The use of this table is completely append-only, that is, there's no modifying rows after they've been inserted.

However, our data volume has increased significantly, and it's become necessary to do stats-processing separately into other tables, rather than querying the data and computing it immediately. Basically I've written a stand-alone program which approximately does this (pseudo-code)

while (true) {
    Select rows from tracking table where id > last_id
    Feed rows to stats processing thread
    last_id = max(id from rows)
    sleep some amount of time (~30sec is what I'm currently using)
}

However, I am worried I will be missing rows. A gap in ID's could occur because at the time I select the rows from the tracking table, some of the row id's have been reserved, but data in those transactions was not yet committed, and on the next loop I've already moved on to newer ID's.

I have been wondering how to reconcile this, as missing data when computing stats isn't really an option.

Here are the various options I have been tossing around:

  1. Re-design the table or query on something like the visited time

  2. Don't use a relational DB, but some sort of data queueing system?

  3. Query the table on a time lag, ie WHERE id > last_id AND visited < NOW() - some time

It's also possible there is an option I have not considered. What is the best way to query the table so I don't miss any data?

A: 

Here's one way you might do this (something like your #2):

  1. Replicate your data to a separate "warehouse" database.
  2. Perform an ETL so that your data is normalized in your warehouse more optimally for your queries/statistical analysis. For each record type, include an additional column to serve as a "status" flag indicating whether the record has been read by your analysis engine.
  3. Have your analysis update the "read" column once it has read a record.
AJ
Thanks for taking the time to answer :) I decided I'm going to implement a variation of your answer. Because having low latency between the data being stored and processed is a priority, I'm going to still read from the original table (skipping replicating and ETL), but I'll keep another table for the analysis engine to record which id's it has seen.
Crast
@Crast sounds great, good luck!
AJ