Assume a table named 'log' that contains a huge number of records.

The application usually retrieves data with a simple SQL statement:

SELECT * 
FROM log 
WHERE logLevel=2 AND (creationData BETWEEN ? AND ?)

logLevel and creationData each have an index, but the sheer number of records makes retrieval slow.

How do we fix this?

A: 

A couple of things:

Do you need all the columns? People usually do SELECT * because they are too lazy to list the 5 columns they need out of the 15 that the table has.

Get more RAM. The more RAM you have, the more data can live in cache, which is 1000 times faster than reading from disk.
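To illustrate the first point, a minimal sketch; the column names other than logLevel and creationData are hypothetical, since the question does not list the table's columns:

SELECT logId, logLevel, creationData, message  -- only the columns you actually use
FROM log
WHERE logLevel = 2 AND (creationData BETWEEN ? AND ?)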

SQLMenace
More RAM? Do you mean that I should write a caching algorithm?
卢声远 Shengyuan Lu
No, more RAM for the machine, so that subsequent requests hit the RAM instead of going to disk.
SQLMenace
More RAM is often not a choice a developer can influence, let alone make.
AllenG
+4  A: 

1: Never use Select *
2: Make sure your indexes are correct and your statistics are up-to-date
3: (Optional) If you find you're not looking at log data past a certain time (in my experience, if it happened more than a week ago, I'm probably not going to need the log for it), set up a job to archive it to some back-up and then remove the unused records; a sketch of such a job follows this list. That will keep the table size down, reducing the amount of time it takes to search the table.
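A minimal sketch of such an archival job, assuming SQL Server syntax and a hypothetical log_archive table with the same columns as log; it is one way to express the idea, not the answerer's exact job:

-- Move log rows older than a week into an archive table, then delete them.
-- log_archive is assumed to already exist with the same structure as log.
INSERT INTO log_archive
SELECT *
FROM log
WHERE creationData < DATEADD(day, -7, GETDATE());

DELETE FROM log
WHERE creationData < DATEADD(day, -7, GETDATE());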

AllenG
+1 for the index suggestion. Great point.
Robb
May I suggest a slight rewording - "Make sure you have the correct indexes, and that your *statistics* are up to date" - it's pretty rare to have your indexes out of date! :-)
Kragen
@AllenG: dumb question: can it happen that the indexes are NOT up-to-date? I always assumed that it was "automatic".
iDevlop
@Kragen: fair. Update forthcoming.
AllenG
@Patrick - Indexes can **never** be out of date; however, statistics (the data that SQL Server uses to choose the most efficient way of executing a query) can be out of date.
Kragen
@Kragen: minor quibble: it is possible to have out-of-date indexes, but only in the case where the columns you want indexed have changed and you haven't changed the columns you're actually indexing. At that point, though, it's definitely a Carbon-based error.
AllenG
@AllenG - Oh, never knew that - is that the sort of out-of-date that means SQL Server could potentially return *incorrect* results for a query?
Kragen
@Kragen: absolutely. I believe it's best summed up with "the computer did exactly what you told it. If you told it to do the wrong thing, that's your fault." :)
AllenG
+4  A: 

Look at your execution plan / "EXPLAIN PLAN" result. If you are retrieving large amounts of data then there is very little that you can do to improve performance. You could try changing your SELECT statement to only include the columns you are interested in, but that won't change the number of logical reads you are doing, so I suspect it will have only a negligible effect on performance.
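For reference, a minimal sketch of getting at the plan, assuming Oracle (the bind variables :start_date and :end_date are placeholders); on SQL Server you would instead view the graphical execution plan or use SET SHOWPLAN_ALL:

-- Explain the query, then display the resulting plan.
EXPLAIN PLAN FOR
SELECT logLevel, creationData
FROM log
WHERE logLevel = 2 AND creationData BETWEEN :start_date AND :end_date;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);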

If you are only retrieving small numbers of records then an index on LogLevel and an index on CreationDate should do the trick.

UPDATE: SQL Server is mostly geared around querying small subsets of massive databases (e.g. returning a single customer record out of a database of millions). It's not really geared up for returning truly large data sets. If the amount of data that you are returning is genuinely large then there is only a certain amount that you will be able to do, and so I'd have to ask:

What is it that you are actually trying to achieve?

  • If you are displaying log messages to a user, then they are only going to be interested in a small subset at a time, so you might also want to look into efficient methods of paging SQL data (a paging sketch follows this list) - if you are only returning, say, 500 or so records at a time it should still be very fast.

  • If you are trying to do some sort of statistical analysis then you might want to replicate your data into a data store more suited to statistical analysis. (Not sure what however, that isn't my area of expertise)
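A minimal paging sketch, assuming SQL Server 2008+ and paging ordered by creationData; the @PageNumber, @PageSize, @start and @end variables are illustrative, not from the original answer:

-- Return one page of log rows at a time instead of the whole result set.
DECLARE @PageNumber INT = 1;
DECLARE @PageSize   INT = 500;
DECLARE @start DATETIME = '2010-07-01';
DECLARE @end   DATETIME = '2010-07-31';

WITH numbered AS (
    SELECT logLevel, creationData,
           ROW_NUMBER() OVER (ORDER BY creationData) AS rn
    FROM log
    WHERE logLevel = 2 AND creationData BETWEEN @start AND @end
)
SELECT logLevel, creationData
FROM numbered
WHERE rn BETWEEN (@PageNumber - 1) * @PageSize + 1 AND @PageNumber * @PageSize;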

Kragen
+1 for asking what the problem behind the question is: "what are you actually trying to achieve?"
Kris C
+1  A: 

Do you need all columns? The first step should be to select only those you actually need to retrieve.

Another aspect is what you do with the data after it arrives in your application (populate a data set, read it sequentially, something else?).

There can be some potential for improvement on the side of the processing application.

You should ask yourself these questions:

Do you need to hold all the returned data in memory at once? How much memory do you allocate per row on the retrieving side? How much memory do you need at once? Can you reuse some memory?

Marek
+2  A: 

Depending on what kind of SQL database you're using, you might look into Horizontal Partitioning. Oftentimes, this can be done entirely on the database side of things, so you won't need to change your code.
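As a rough illustration only, range partitioning on the date column can look like the sketch below, assuming MySQL 5.1+ syntax and that creationData is part of any primary or unique key on the table; other databases (SQL Server, Oracle) have their own partitioning DDL:

-- Split the log table into date-based partitions so that queries filtering on
-- creationData only touch the relevant partitions (partition pruning).
ALTER TABLE log
PARTITION BY RANGE (TO_DAYS(creationData)) (
    PARTITION p2010_h1 VALUES LESS THAN (TO_DAYS('2010-07-01')),
    PARTITION p2010_h2 VALUES LESS THAN (TO_DAYS('2011-01-01')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
);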

Keith
'Sharding' could be fine!
卢声远 Shengyuan Lu
A: 

For me there are two things that you can do:

  1. Partition the table horizontally based on the date column

  2. Use the concept of pre-aggregation.

Pre-aggregation: with pre-aggregation you would have a "logs" table, a "logs_temp" table, a "logs_summary" table and a "logs_archive" table. The structure of the logs and logs_temp tables is identical. The application flow works this way: all logs are written to the logs table, then every hour a cron job runs that does the following things (a sketch of this job follows the steps):

a. Copy the data from the logs table to the "logs_temp" table and empty the logs table. This can be done using the Shadow Table trick.

b. Aggregate the logs for that particular hour from the logs_temp table

c. Save the aggregated results in the summary table

d. Copy the records from the logs_temp table to the logs_archive table and then empty the logs_temp table.
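A minimal sketch of the hourly job; MySQL is assumed (the answer describes a WordPress site), the summary column names are hypothetical, and the atomic RENAME TABLE swap is one common way to implement the Shadow Table trick from step a:

-- a. Swap in an empty shadow table; RENAME TABLE is atomic, so writers never
--    see a missing logs table. logs_temp now holds the last hour of data.
CREATE TABLE logs_shadow LIKE logs;
RENAME TABLE logs TO logs_temp, logs_shadow TO logs;

-- b + c. Aggregate that hour and save the results in the summary table
--        (hour_start, logLevel and entry_count are illustrative column names).
INSERT INTO logs_summary (hour_start, logLevel, entry_count)
SELECT DATE_FORMAT(creationData, '%Y-%m-%d %H:00:00'), logLevel, COUNT(*)
FROM logs_temp
GROUP BY DATE_FORMAT(creationData, '%Y-%m-%d %H:00:00'), logLevel;

-- d. Move the raw rows to the archive; in this rename-based variant,
--    "emptying" logs_temp simply means dropping it until the next run.
INSERT INTO logs_archive SELECT * FROM logs_temp;
DROP TABLE logs_temp;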

This way results are pre-aggregated in the summary table.

Whenever you wish to select the result, you would select it from the summary table.

This way the selects are very fast, because the number of records is far smaller once the data has been pre-aggregated per hour. You could even increase the threshold from an hour to a day. It all depends on your needs.

The inserts would be fast too, because the logs table only holds the data for the last hour, so index maintenance on insert takes far less time than it would against a very large data set, hence making the inserts fast.

You can read more about the Shadow Table trick here

I employed the pre-aggregation method on a news website built on WordPress. I had to develop a plugin for the site that would show recently popular news items (popular during the last 3 days); there are around 100K hits per day, and this pre-aggregation approach has really helped us a lot. The query time came down from more than 2 seconds to under a second. I intend to make the plugin publicly available soon.
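For flavour, a query against such a summary table might look like this sketch; the table and column names (hit_summary, post_id, hour_start, hit_count) are hypothetical, not taken from the actual plugin:

-- Most popular items over the last 3 days, read from pre-aggregated counts
-- instead of scanning the raw hit log.
SELECT post_id, SUM(hit_count) AS hits
FROM hit_summary
WHERE hour_start >= NOW() - INTERVAL 3 DAY
GROUP BY post_id
ORDER BY hits DESC
LIMIT 10;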

ovais.tariq
A: 

As per other answers, do not use 'select *' unless you really need all the fields.

logLevel and creationData have indexes

You need a single index with both columns. What order you put them in will affect performance, but assuming you have a small number of possible logLevel values (and the data is not skewed) you'll get better performance putting creationData first.

Note that, at best, an index will reduce the cost of a query to log(N), i.e. it will still get slower as the number of records increases.

C.

symcbean
A: 

I really hope that by creationData you mean creationDate.

First of all, it is not enough to have indexes on logLevel and creationData. If you have two separate indexes, Oracle will only be able to use one of them. What you need is a single index on both fields:

CREATE INDEX i_log_1 ON log (creationData, logLevel);

Note that I put creationData first. This way, if you only put that field in the WHERE clause, it will still be able to use the index. (Filtering on just the date seems a more likely scenario than filtering on just the log level.)

Then, make sure the table is populated with data (as much data as you will use in production) and refresh the statistics on the table.

If the table is large (at least a few hundred thousand rows), use the following code to refresh the statistics:

DECLARE
  l_ownname          VARCHAR2(255) := 'owner'; -- Owner (schema) of the table to analyze
  l_tabname          VARCHAR2(255) := 'log';   -- Table to analyze
  l_estimate_percent NUMBER(3)     := 5;       -- Percentage of rows to estimate (NULL means compute)
BEGIN
  dbms_stats.gather_table_stats(
    ownname          => l_ownname,
    tabname          => l_tabname,
    estimate_percent => l_estimate_percent,
    method_opt       => 'FOR ALL INDEXED COLUMNS',
    cascade          => TRUE
  );
END;

Otherwise, if the table is small, use

ANALYZE TABLE log COMPUTE STATISTICS FOR ALL INDEXED COLUMNS;

Additionally, if the table grows large, you should consider partitioning it by range on the creationDate column. See these links for the details:

Sergey Stadnik