I have a process that builds a list from a database table and runs in real time. Every now and then new data gets added to the database table. Querying data from the table every now and then is cumbersome and time consuming, and this needs to be as close to real time as possible. What is the right way to approach the problem?

The process is as follows:

  • The list gets built from an SQL query that takes 2-4 seconds to execute.
  • The list is used by process A to perform some functions.
  • Data gets constantly added to the database table. We need only the new data to be appended to the list, which is used in real time by process A.

I have not tried writing any code yet; I am still not sure what kind of design it should be. Python is the only language we can use, since there are 10,000 lines of Python code already deployed as part of the system.

Can someone help me with the right approach, modules etc?

EDIT: Process A is a procedure within the program. The pseudocode I am thinking of is something like this:

def processA(lst):
    while True:
        pass  # parse file, do something with lst

def run():
    lst = generate_list_from_sql_query()  # the 2-4 second query
    processA(lst)

if __name__ == "__main__":
    run()

A: 

Wouldn't it make sense to just query the database again for all the new data that has come since you last queried? Something like key > highest_key_in_list or date > highest_date_in_list rather than loading up the whole thing again.
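
As a rough illustration of that approach, the process can remember the highest key it has seen and poll only for rows above it. The table and column names used here (records, id, payload) are made up, and sqlite3 just stands in for whatever database driver you actually use:

import sqlite3
import time

def poll_new_rows(conn, last_id, interval=1.0):
    # yield rows whose id is greater than last_id, checking every `interval` seconds
    while True:
        cur = conn.execute(
            "SELECT id, payload FROM records WHERE id > ? ORDER BY id",
            (last_id,))
        for row_id, payload in cur:
            last_id = row_id
            yield row_id, payload
        time.sleep(interval)

conn = sqlite3.connect("example.db")
items = []                            # the list process A works on
for row in poll_new_rows(conn, last_id=0):
    items.append(row)                 # only new rows ever get appended

Each poll only has to scan rows with an id above the last one seen, so with an index on that column it stays cheap no matter how large the table grows.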

Noufal Ibrahim
Noufal, I did try this as test code. But any SQL query that still needs to span half a million records is going to take time, and potentially memory. The overhead, as I see it, is in connecting to the database and then querying, especially when the database table sizes are large.
ramdaz
A: 

This feels a bit hackish, but on your initial query create a temp table to hold the results. Add a column to that table that is an incrementing number; I'll call it id. Add a trigger to the table whose data is changing that updates your temp table. Check back with the database and query only for records with an id larger than the last element in your initial pull. Since that will be a much faster query, it should get you about as close to real time as you can get. Reusing a persistent connection may help slightly too.

I'd also check that your database is indexed well for this query; 2-4 seconds seems like a long time. Maybe you can also optimize your query a bit.
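
A hedged sketch of the staging-table idea: the SQL is MySQL-flavoured (AUTO_INCREMENT, NEW.column in the trigger body), and source_table, staging, and payload are placeholder names, so adjust for your own schema and driver.

# run once at setup time, one statement per cursor.execute() call
SETUP_SQL = [
    """CREATE TABLE staging (
           id INT AUTO_INCREMENT PRIMARY KEY,
           payload TEXT
       )""",
    """CREATE TRIGGER copy_new_rows AFTER INSERT ON source_table
       FOR EACH ROW
           INSERT INTO staging (payload) VALUES (NEW.payload)""",
]

def fetch_since(conn, last_id):
    # reuse one persistent connection and pull only rows newer than last_id
    cur = conn.cursor()
    cur.execute(
        "SELECT id, payload FROM staging WHERE id > %s ORDER BY id",
        (last_id,))
    return cur.fetchall()

The caller keeps track of the largest id returned and passes it back in on the next call, so each round trip only moves the new rows.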

Myles
Thanks. I have given thought to such an idea.
ramdaz
A: 

Sorry, it is rather difficult to deduce what exactly you want to be done.

One point is not quite clear:
Is the process (python script?) to be run

  • continuously (daemon-like)? In that case, you don't need to store the dataset anywhere if it is relatively small. You can just keep it in memory.
  • periodically (cron job)? In that case you do need a way to serialize the data between invocations. You can use pickle for that (see the sketch after this list); however, I am unsure whether unpickling will take less time than retrieving the whole dataset from the database.
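
A minimal sketch of the cron-job variant, assuming a cache file path of my own choosing and a placeholder load_from_database() standing in for the slow SQL query:

import os
import pickle

CACHE = "/tmp/dataset.pickle"  # hypothetical cache location

def load_from_database():
    # placeholder for the 2-4 second SQL query that builds the full list
    return []

def load_dataset():
    # reuse the previous run's pickled data if present, otherwise hit the database
    if os.path.exists(CACHE):
        with open(CACHE, "rb") as f:
            return pickle.load(f)
    return load_from_database()

def save_dataset(dataset):
    # call this at the end of every invocation so the next run can skip the query
    with open(CACHE, "wb") as f:
        pickle.dump(dataset, f)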

The rest depends on your database schema and on the way the data is added/updated. Is it mutated at all after it is inserted? If not, you can select only the data added since the last invocation, using a timestamp field in the table (if it has one) or the identity field (NOT reliable, actually; it is sort of reliable with MySQL, but then again, not quite).

If the data might be updated, then you will have to re-read the whole dataset (unless you have a way to select only new/updated entries).

As for the 2-4 seconds, there may be many reasons why it takes that long. Are you running queries that touch unindexed fields?

shylent
A: 

If the data is being added to the table in the program, just add it to the list at the same time.

If multiple sources are adding to the table, query the table for all records whose primary key is > the last key you retrieved.
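
For the first case, a small sketch of writing to both places at once; the records table, the payload column, and the ? parameter style are assumptions (some drivers use %s instead):

def add_record(conn, items, payload):
    # insert the row and append it to the in-memory list in one step,
    # so process A sees the new entry without re-querying the database
    cur = conn.cursor()
    cur.execute("INSERT INTO records (payload) VALUES (?)", (payload,))
    conn.commit()
    items.append(payload)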

BlueRaja - Danny Pflughoeft