Hi,

I have one simple but large table.

id_tick   INTEGER      eg: 1622911
price     DOUBLE       eg: 1.31723
timestamp DATETIME     eg: '2010-04-28 09:34:23'

For one month of data, I have 2.3 million rows (150 MB).

My query aims at returning the latest price at a given time.

I first set up a SQLite table and used the query:

SELECT max(id_tick), price, timestamp 
FROM EURUSD 
WHERE timestamp <='2010-04-16 15:22:05'

It is running in 1.6s.

As I need to run this query several thousand times, 1.6s is far too long.

I then set up a MySQL table and modified the query (MAX behaves differently in MySQL than in SQLite):

SELECT id_tick, price, timestamp
FROM EURUSD
WHERE id_tick = (SELECT MAX(id_tick) 
                 FROM EURUSD WHERE timestamp <='2010-04-16 15:22:05')

The execution time is far worse: 3.6s. (I know I can avoid the subquery by using ORDER BY and LIMIT 1, but that does not improve the execution time.)

I am only using one month of data for now, but I will have to use several years at some point.

My questions are then the following:

  1. Is there a way to improve my query?
  2. Given the large dataset, should I use another database engine?
  3. Any other tips?

Thanks !

A: 

Do you have any indexed fields?

Indexing timestamp and/or id_tick could change a lot of things.

Also, why don't you use an interval for timestamp?

WHERE timestamp >= '2010-04-15 15:22:05' AND timestamp <= '2010-04-16 15:22:05'

That would ease the burden on the MAX function.
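
To check whether an index on timestamp is actually being used, one option is to ask the engine for its query plan. The sketch below does this in SQLite through Python's sqlite3 module. The index name idx_eurusd_timestamp and the three ticks are made up for illustration, and ordering by timestamp (with id_tick as a tie-breaker) assumes that id_tick and timestamp increase together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EURUSD ("
    "id_tick INTEGER PRIMARY KEY, price DOUBLE, timestamp DATETIME)"
)
# Made-up ticks, purely for illustration.
conn.executemany(
    "INSERT INTO EURUSD VALUES (?, ?, ?)",
    [
        (1, 1.3170, "2010-04-16 15:21:58"),
        (2, 1.3172, "2010-04-16 15:22:03"),
        (3, 1.3175, "2010-04-16 15:22:09"),
    ],
)
# Hypothetical index name; any name works.
conn.execute("CREATE INDEX idx_eurusd_timestamp ON EURUSD (timestamp)")
conn.commit()

query = (
    "SELECT id_tick, price, timestamp FROM EURUSD "
    "WHERE timestamp <= ? "
    "ORDER BY timestamp DESC, id_tick DESC LIMIT 1"
)
params = ("2010-04-16 15:22:05",)

# The last column of each EXPLAIN QUERY PLAN row is the readable detail.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()
plan_text = " | ".join(row[-1] for row in plan)
print(plan_text)

latest = conn.execute(query, params).fetchone()
print(latest)
```

If the plan mentions a full table scan rather than the index, the index is not helping this query. In MySQL, prefixing the query with EXPLAIN gives comparable information.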

siukurnin
Yes, timestamp and id_tick are already indexed. Adding an interval does not make a difference here, even if it is a very small interval.
Sam
+1  A: 

1) Make sure you have an index on timestamp.

2) Assuming that id_tick is both the PRIMARY KEY and the clustered index, and assuming that id_tick increases with time (since you are doing a MAX), you can try this:

SELECT id_tick, price, timestamp 
FROM EURUSD 
WHERE id_tick = (SELECT id_tick
                   FROM EURUSD WHERE timestamp <='2010-04-16 15:22:05'
                   ORDER BY id_tick DESC
                   LIMIT 1)

This should be similar to janmoesen's performance though, since there should be high page correlation between id_tick and timestamp in any event

nonnb
Also, using partitioning will help protect performance as more data is added.
OmerGertel
1) I have an index on timestamp and id_tick. 2) All assumptions are correct. Your query is executed in 6.2s.
Sam
Try dropping and recreating the indexes - something has gone horribly wrong with statistics or cached plans IMHO.
nonnb
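
For reference, a minimal sketch of rebuilding an index and refreshing the planner statistics, shown in SQLite through Python's sqlite3 module. The index name idx_eurusd_timestamp and the sample ticks are assumptions for illustration only. In MySQL, the closest equivalents would be dropping and recreating the index, plus ANALYZE TABLE or OPTIMIZE TABLE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EURUSD ("
    "id_tick INTEGER PRIMARY KEY, price DOUBLE, timestamp DATETIME)"
)
# Hypothetical index name.
conn.execute("CREATE INDEX idx_eurusd_timestamp ON EURUSD (timestamp)")
# Made-up ticks, purely for illustration.
conn.executemany(
    "INSERT INTO EURUSD VALUES (?, ?, ?)",
    [
        (1, 1.3170, "2010-04-16 15:21:58"),
        (2, 1.3172, "2010-04-16 15:22:03"),
        (3, 1.3175, "2010-04-16 15:22:09"),
    ],
)
conn.commit()

# Rebuild the index from the table contents.
conn.execute("REINDEX idx_eurusd_timestamp")
# Recompute the statistics the query planner relies on.
conn.execute("ANALYZE")
conn.commit()

# ANALYZE stores its results in the sqlite_stat1 table.
stats = conn.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
print(stats)
```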
A: 

Are you doing analysis using ALL the ticks for large intervals? I'd try to filter the data into minute/hour/day etc. graphs.

alxx
No, I am only using a very limited number of ticks. What about breaking the data down into many tables? Instead of EURUSD, I would have EURUSD_20100501, EURUSD_20100502, ... one table per day?
Sam
Splitting would help, I guess, but you will have to deal with borders.
alxx
@alxx: no, no, no, no! You need to treat this as a snowflake schema with a number of dimension tables, such as time, currency, etc., and a forex fact table.
f00
@alxx To deal with borders, I was thinking of always using the union of two tables (I will not need more). @f00 I am not familiar with snowflake schemas (and Google is not helping); would you mind elaborating a bit or posting a link to clarify your point? Thanks!
Sam
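
To make the border issue concrete, here is a rough sketch of the one-table-per-day idea with a union of two adjacent days, run in SQLite through Python's sqlite3 module. The table names follow the naming proposed above; the two ticks are made up. The lookup falls just after midnight, before the first tick of the second day, so the answer has to come from the previous day's table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for day in ("20100501", "20100502"):
    conn.execute(
        f"CREATE TABLE EURUSD_{day} ("
        "id_tick INTEGER PRIMARY KEY, price DOUBLE, timestamp DATETIME)"
    )
# Made-up ticks on either side of midnight.
conn.execute(
    "INSERT INTO EURUSD_20100501 VALUES (1, 1.3301, '2010-05-01 23:59:58')"
)
conn.execute(
    "INSERT INTO EURUSD_20100502 VALUES (2, 1.3305, '2010-05-02 00:00:07')"
)
conn.commit()

# Query the union of the two daily tables so a lookup near midnight
# can still fall back to the last tick of the previous day.
query = """
SELECT id_tick, price, timestamp FROM (
    SELECT id_tick, price, timestamp FROM EURUSD_20100501
    UNION ALL
    SELECT id_tick, price, timestamp FROM EURUSD_20100502
)
WHERE timestamp <= ?
ORDER BY id_tick DESC
LIMIT 1
"""
border = conn.execute(query, ("2010-05-02 00:00:03",)).fetchone()
print(border)
```

Querying only EURUSD_20100502 for that timestamp would return nothing, which is the border case the union is meant to cover.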
A: 

OK, I guess my index was corrupted somehow; reindexing greatly improved the performance.

The following now executes in 0.0012s (non-cached):

SELECT id_tick, price, timestamp
FROM EURUSD
WHERE timestamp <= '2010-05-11 05:30:10'
ORDER BY id_tick DESC
LIMIT 1
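
Since the query has to run several thousand times, it may be convenient to wrap it in a small parameterized helper. The sketch below uses SQLite through Python's sqlite3 module; the function name latest_tick, the index name, and the sample ticks are all hypothetical.

```python
import sqlite3


def latest_tick(conn, ts):
    """Return (id_tick, price, timestamp) of the last tick at or before ts,
    or None if there is no such tick."""
    return conn.execute(
        "SELECT id_tick, price, timestamp FROM EURUSD "
        "WHERE timestamp <= ? ORDER BY id_tick DESC LIMIT 1",
        (ts,),
    ).fetchone()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EURUSD ("
    "id_tick INTEGER PRIMARY KEY, price DOUBLE, timestamp DATETIME)"
)
conn.execute("CREATE INDEX idx_eurusd_timestamp ON EURUSD (timestamp)")
# Made-up ticks, purely for illustration.
conn.executemany(
    "INSERT INTO EURUSD VALUES (?, ?, ?)",
    [
        (1, 1.3170, "2010-05-11 05:30:01"),
        (2, 1.3172, "2010-05-11 05:30:08"),
        (3, 1.3169, "2010-05-11 05:30:15"),
    ],
)
conn.commit()

# Look up the latest tick for a batch of timestamps.
lookups = [
    "2010-05-11 05:30:00",  # before the first tick
    "2010-05-11 05:30:10",
    "2010-05-11 05:31:00",
]
results = [latest_tick(conn, ts) for ts in lookups]
for ts, row in zip(lookups, results):
    print(ts, row)
```

Binding the timestamp as a parameter keeps the SQL text identical across calls, so only the value changes from one lookup to the next.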

Thanks!

Sam