Hello all,

is it possible to retrieve statistics about the minimum or maximum value of a numeric column in Oracle 10g? I have found the view USER_TAB_COL_STATISTICS, which has LOW_VALUE and HIGH_VALUE columns, but I am not sure whether those are the values I am looking for.

I need an efficient way to ask the DBMS for those statistics; a regular MIN(a) and MAX(a) query would be too slow on large tables.

Thanks in advance.

+3  A: 

Yes, LOW_VALUE and HIGH_VALUE will tell you the minimum and maximum values in the column but:

  • they are stored as RAW(32) columns, so the meaning will not be immediately apparent
  • they will be as of the last time statistics were gathered for the table, so may not be accurate (unless you explicitly gather stats before using them)

If you index the column, then MIN(a) and MAX(a) should be very fast, as in this example, where T1 has 50,000 rows and is indexed on OBJECT_ID:

SQL> select min(object_id) from t1;

MIN(OBJECT_ID)
--------------
           100

------------------------------------------------------------------------------------
| Id  | Operation                  | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |       |     1 |     5 |     2   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE            |       |     1 |     5 |            |          |
|   2 |   INDEX FULL SCAN (MIN/MAX)| T1_ID | 53191 |   259K|     2   (0)| 00:00:01 |
------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
          2  consistent gets
          0  physical reads
          0  redo size
        419  bytes sent via SQL*Net to client
        380  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

The result is the same if you select the MAX instead of the MIN. However, if you select the MIN and MAX in a single select statement the result is different:

SQL> select min(object_id), max(object_id) from t1;

MIN(OBJECT_ID) MAX(OBJECT_ID)
-------------- --------------
           100          72809


-------------------------------------------------------------------------------
| Id  | Operation             | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |       |     1 |     5 |    34   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE       |       |     1 |     5 |            |          |
|   2 |   INDEX FAST FULL SCAN| T1_ID | 53191 |   259K|    34   (0)| 00:00:01 |
-------------------------------------------------------------------------------


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        125  consistent gets
          0  physical reads
          0  redo size
        486  bytes sent via SQL*Net to client
        380  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

This suggests that it may be better to get them separately, though I haven't conclusively proved that.
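
One way to get both values from a single statement while keeping that speed is to wrap each aggregate in its own scalar subquery. A minimal sketch (I haven't benchmarked it; each subquery should be able to use its own INDEX FULL SCAN (MIN/MAX)):

-- Each scalar subquery gets its own execution plan,
-- so each can use the fast MIN/MAX index access path.
SELECT (SELECT MIN(object_id) FROM t1) AS min_id,
       (SELECT MAX(object_id) FROM t1) AS max_id
  FROM dual;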

Tony Andrews
Thanks Tony, that helps quite a lot :-).
Kage
@Tony, What is the explain plan if you query both the min and the max? This is a halfway test.
Theo
@Tony: Woah, OK. I have to change my implementation then. I tried to retrieve both at the same time... I'd vote you up, but I can't do that yet, so please accept my symbolic upvote :-).
Kage
@Theo, I have updated my answer. Thanks.
Tony Andrews
@kage: yes, it makes a big difference. I just found a 15M row table and tried it - 0.1 secs to select MIN or MAX separately, 8 seconds to select both at once!
Tony Andrews
Try select min(object_id) minid, NULL maxid from t1 union select NULL minid, max(object_id) maxid from t1
Theo
If you want only the MIN, an INDEX FULL SCAN (MIN/MAX) will stop after it reads the first index entry, so it is very fast. If you want only the MAX, the same operation starts at the last index entry and stops there, so it is also fast. If you ask for both MIN and MAX in one query, the optimizer falls back to an INDEX FAST FULL SCAN that reads the entire index, so it will be slower. UNIONing the two queries together as Theo suggests works around this effectively.
Jeffrey Kemp
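
A runnable form of Theo's suggestion, as a sketch (UNION ALL is substituted for plain UNION here so the database does not add a sort to eliminate duplicates):

-- Each branch can use its own INDEX FULL SCAN (MIN/MAX).
SELECT MIN(object_id) AS minid, NULL AS maxid FROM t1
UNION ALL
SELECT NULL, MAX(object_id) FROM t1;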
+1  A: 

An example with a table containing numbers from 1 up to 1234:

SQL> create table t (nr) as select level from dual connect by level <= 1234
  2  /

Table created.

SQL> select min(nr)
  2       , max(nr)
  3    from t
  4  /

   MIN(NR)    MAX(NR)
---------- ----------
         1       1234

1 row selected.

If you analyze the table, the low_value and high_value columns contain the right numbers.

SQL> exec dbms_stats.gather_table_stats(user,'t')

PL/SQL procedure successfully completed.

SQL> select low_value
  2       , high_value
  3    from user_tab_columns
  4   where table_name = 'T'
  5     and column_name = 'NR'
  6  /

LOW_VALUE                                                        HIGH_VALUE
---------------------------------------------------------------- ----------------
C102                                                             C20D23

1 row selected.

They are stored as RAW, so they cannot be read directly. The utl_raw.cast_to_number function makes them readable:

SQL> select utl_raw.cast_to_number(low_value)
  2       , utl_raw.cast_to_number(high_value)
  3    from user_tab_columns
  4   where table_name = 'T'
  5     and column_name = 'NR'
  6  /

UTL_RAW.CAST_TO_NUMBER(LOW_VALUE) UTL_RAW.CAST_TO_NUMBER(HIGH_VALUE)
--------------------------------- ----------------------------------
                                1                               1234

1 row selected.

However, be careful: the numbers may be inaccurate when updates have taken place between the time the statistics were gathered and the time the query ran.
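
As an aside, a minimal sketch of an alternative decoder using DBMS_STATS.CONVERT_RAW_VALUE (overloads for other datatypes exist, but the exact set varies by version, so check the documentation):

-- Needs SET SERVEROUTPUT ON in SQL*Plus to see the result.
DECLARE
  v_low  user_tab_columns.low_value%TYPE;
  v_high user_tab_columns.high_value%TYPE;
  n_low  NUMBER;
  n_high NUMBER;
BEGIN
  -- Read the raw boundary values recorded by dbms_stats.
  SELECT low_value, high_value
    INTO v_low, v_high
    FROM user_tab_columns
   WHERE table_name = 'T'
     AND column_name = 'NR';
  -- Decode them into NUMBERs.
  dbms_stats.convert_raw_value(v_low, n_low);
  dbms_stats.convert_raw_value(v_high, n_high);
  dbms_output.put_line(n_low || ' .. ' || n_high);
END;
/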

Regards, Rob.

Rob van Wijk
Thanks Rob, that helps too! Mmh, how would I determine which of the two methods (MIN/MAX vs. gather_table_stats) is more efficient?
Kage
I would only consider gather_table_stats if I had a situation where new data arrives only once a week and min/max values are selected several times every minute. If you have a situation where data changes more often, use indexes.
jva
Indexes need to be updated like statistics, don't they? I'm not really sure which will be the case. The database middleware I'm developing should be pretty generic, and I don't know in advance how often data will arrive. But maybe one benefit of the MIN/MAX method is that it should work on any SQL-based DBMS, whereas the gather-statistics method is Oracle-only. I am currently working with Oracle 10 below the middleware, but in the long run the system should work on any DBMS. At the moment I only need the best solution for Oracle.
Kage
@Rob van Wijk, is it possible to solve this problem with a fast-refreshable materialized view?
Theo
In a normal database production environment, you'll already have a statistics gathering job that runs on a regular basis. I'm not suggesting to gather statistics each time before your query. In short: If you need to be accurate, use Tony's solution. If not, use the data dictionary's low and high value columns.
Rob van Wijk
@Theo: Good suggestion. A fast refreshable materialized view is another option indeed. Given the amount of extra work your current transactions need to perform to refresh the MV, I'd only recommend this solution on tables that are not modified frequently. Test and measure carefully.
Rob van Wijk
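
A minimal sketch of that materialized view idea, reusing the T1/OBJECT_ID example from above (assumed names; note that Oracle fast-refreshes MIN/MAX aggregates only for insert-only workloads, so deletes and updates can force a complete refresh):

-- Log changes on the base table; INCLUDING NEW VALUES is required
-- for fast-refreshable aggregate materialized views.
CREATE MATERIALIZED VIEW LOG ON t1
  WITH ROWID (object_id) INCLUDING NEW VALUES;

-- COUNT(*) is required for fast refresh of MIN/MAX aggregates.
CREATE MATERIALIZED VIEW t1_minmax
  REFRESH FAST ON COMMIT
AS
SELECT MIN(object_id) AS min_id,
       MAX(object_id) AS max_id,
       COUNT(*)       AS cnt
  FROM t1;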
Rob, on a production database you will not gather statistics on regular basis. If you do that you risk sudden performance problems in random places.
jva
@jva: All shops I have consulted so far do... Mostly in combination with table monitoring, so only tables with more than 10% of their rows modified have their statistics re-gathered. This way statistics are always sufficiently accurate. Not gathering statistics on production runs the risk of outdated statistics.
Rob van Wijk
Rob, do you have any arguments in favor of statistics gathering in production? I remember reading T. Kyte's opinion (can't find the link now) that one should gather statistics in a test database and then load them into production only if there are no problems in test.
jva
jva, I already gave you the most important argument: keeping your statistics up to date. You do run a small risk that a certain query will perform worse, but most changed plans change for the better. In 11g, SQL Plan Management looks like an excellent tool for promoting a new plan only when it has been proven better than the previous one. I don't know a good reason why you would want to gather statistics in a test database and import them into production; the test database would then have to be an exact copy of production if you don't want strange things to happen.
Rob van Wijk
Rob, after reading about SQL Plan Management I agree with you and look forward to migration to 11g. (currently on 9i here)
jva
+1  A: 

The other answers here (using an index to satisfy MIN/MAX quickly, or examining the USER_TAB_COLUMNS statistics) are excellent.

Here's another method that might be suitable - if you're only interested in a rough estimate, you can use the SAMPLE clause (and adjust the sample size up or down depending on how accurate you need it):

SELECT max(value), min(value) FROM t SAMPLE(1);

This takes a 1% sample of the table. It will generally sample different rows each time it is run, so don't expect the results to be identical from run to run. If you want it to run quicker, use a lower sample size, e.g. SAMPLE(0.01); if you want to sample half the table, use SAMPLE(50).
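
A related variant, as a sketch (not tested here), is block sampling, which is usually cheaper because whole blocks are accepted or rejected, at the price of a potentially more skewed sample:

-- Samples roughly 1% of the table's blocks rather than its rows.
SELECT MAX(value), MIN(value) FROM t SAMPLE BLOCK (1);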

The advantage of this approach over the "gather statistics, then query user_tab_cols" approach is that statistics gathering runs queries like this anyway in order to generate the statistics, so doing it this way may mean less work overall.

Jeffrey Kemp
This sounds great too! The good thing is that I don't have to rely on the admin being smart enough to create and update the table stats regularly. Thanks Jeffrey!
Kage
Another question about this: I tried to run this on a small table and it seems to me that if the percentage is too low then I won't get any results.
Kage
Yes, that's very likely. Think of it this way: SAMPLE(50) means that a coin will be flipped for each row: if it comes up heads, the row is returned; if it comes up tails, the row will be ignored. The lower the percentage, the higher the chances are that no rows will be selected. You have to adjust the SAMPLE percentage according to the expected number of rows in the table.
Jeffrey Kemp
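
A back-of-envelope check of that coin-flip analogy: the probability that SAMPLE(p) returns no rows at all from an n-row table is roughly (1 - p/100)^n. With SAMPLE(1) on a 100-row table that is about 0.99^100 ≈ 0.37, so the query comes up empty more than a third of the time; on a 100,000-row table the same sample is empty with probability 0.99^100000, which is effectively zero.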
A: 

In order to speed up the min/max functionality substantially, prior to executing the statement, i.e. select min(value) from t;, simply execute the following query to get lightning-fast performance: truncate table t; following this the function will return lightning fast. Conglogerisation is a good technique also.

Veston Crekcoch