I have a database infrastructure where we are regularly (at least once a day) replicating the full content of tables from a source database to approximately 20 target databases. Due to the replication code in use (we have to use regular Oracle queries, with no control over or direct access to the source database), this results in 20 full-table sorts of the source table.

Is there any way to optimize for this in the query? I'm looking for something that would basically tell Oracle "I'm going to be repeatedly sorting this entire table". MySQL had an option in myisamchk where you could tell it to sort a table and keep it in sorted order, but obviously that wouldn't apply here for multiple reasons.

Currently, there are also some intermediate tables involved (sync from A to B, then from B to C). We do have control over the intermediate tables, so if there are tuning options there, that would be useful as well.

Generally, the queries are almost all of the very simplistic form:

select a, b, c, d, e, ... z from tbl1 order by a, b, c, d, e, ... z;

I'm aware of Oracle Streams, but as described above, the primary source tables are outside of our control, so we won't be able to use Streams there. (Additionally, those source tables are rebuilt completely from a snapshot daily, so Streams wouldn't really work anyway.)

A: 

Hello,

Some things that would help the sorting issue are indexes on the columns that you are sorting on (and also on the columns you join the tables on, if they're not there already). You could also create materialized views that are already sorted, and Oracle would keep the sorted results cached.
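For illustration, using the hypothetical tbl1 and columns from the question, a composite index on the leading ORDER BY columns can let Oracle return the rows via the index instead of doing a full sort (whether the optimizer actually chooses the index depends on the column list and statistics):

create index tbl1_sort_ix on tbl1 (a, b, c, d);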

Jon
You could do a clustered index on a single table, which would put it in sorted order. It has advantages and disadvantages; asktom.oracle.com has great advice about that. But I don't think that would have much effect when you're joining a bunch of tables together.
Jon
Index Organised Tables are 'ordered' (or more correctly structured) by the primary key. In an indexed cluster, Oracle Database stores together rows having the same cluster key value. They are not necessarily stored in a sorted order. That is, if clustered on dept id, all employees in DEPT 10 would be together, and all employees in DEPT 20 together, but DEPT 20 may come out before DEPT 10 in a full table scan.
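For reference, an index-organized table is declared like this (the names are hypothetical); rows are stored in primary-key order, so the primary key would have to match the sort order you need:

create table emp_iot
( emp_id   number primary key,
  emp_name varchar2(100),
  dept_id  number
) organization index;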
Gary
That might be workable on the intermediate tables, but beyond just "adding an index", we're unlikely to be able to get anything done to the source tables. (A PeopleSoft installation is the primary data source.) This is a straight 1-table-to-1-table sync, no joins involved.
Nathan Neulinger
A: 

You don't say exactly how the replication is done or the data volumes involved (or why you are sorting the data).

If the aim is to minimise the impact on the source database, your best bet may be to extract into an intermediate file and load the file into the destination databases. The sort could be done on the intermediate file (if plain text), or as part of either the export or import into the destination databases.

In the source database:
create table export_emp_info
 organization external
 ( type oracle_datapump
   default directory DATA_PUMP_DIR
   location ('emp.dmp')
) as select emp_id, emp_name, dept_id from emp order by dept_id
/


Copy the file across, then import it in the destination database:
create table import_emp_info
(EMP_ID                    NUMBER(12),
EMP_NAME                   VARCHAR2(100),
DEPT_ID                    NUMBER)
 organization external
 ( type oracle_datapump
   default directory DATA_PUMP_DIR
   location ('emp.dmp')
)
/
insert into emp_info select * from import_emp_info;

If you don't want, or can't have, the external table on the source db, you can use a straight expdp of the emp table (possibly using NETWORK_LINK if you have limited access to the source database's directory structure) and QUERY to do the ordering.
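For example, a parameter file along those lines might look like this (the file names and database link name are made up, and it's worth verifying that QUERY accepts an ORDER BY clause in your release):

DIRECTORY=DATA_PUMP_DIR
DUMPFILE=emp.dmp
TABLES=emp
NETWORK_LINK=source_db_link
QUERY=emp:"ORDER BY dept_id"

Saved as, say, exp_emp.par, it would be run with: expdp repl_user PARFILE=exp_emp.par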

Gary
Basically, I'm pulling the rows sequentially from both sides (source and destination) and doing a "merge sort"-style sync, comparing each row in turn from either source or destination to know whether I need to insert the new row from the source or remove the row from the destination. It guarantees that I can run the process with the source/destination tables in any state and end up with them identical, or the same as they started in the case of a rollback. I hadn't considered the 'external file + sort' approach; I will think about that.
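(For anyone reading along, the shape of that compare loop in PL/SQL is roughly the following; the table names are hypothetical, a single-column key is assumed, and real code would also have to compare non-key columns and handle errors.)

declare
  cursor c_src is select id from src_tbl order by id;
  cursor c_dst is select id from dst_tbl order by id;
  v_src    src_tbl.id%type;
  v_dst    dst_tbl.id%type;
  src_done boolean;
  dst_done boolean;
begin
  open c_src;
  open c_dst;
  fetch c_src into v_src;  src_done := c_src%notfound;
  fetch c_dst into v_dst;  dst_done := c_dst%notfound;
  while not (src_done and dst_done) loop
    if dst_done or (not src_done and v_src < v_dst) then
      -- key exists only in the source: copy it to the destination
      insert into dst_tbl (id) values (v_src);
      fetch c_src into v_src;  src_done := c_src%notfound;
    elsif src_done or v_dst < v_src then
      -- key exists only in the destination: remove it
      delete from dst_tbl where id = v_dst;
      fetch c_dst into v_dst;  dst_done := c_dst%notfound;
    else
      -- keys match: nothing to do for a key-only compare
      fetch c_src into v_src;  src_done := c_src%notfound;
      fetch c_dst into v_dst;  dst_done := c_dst%notfound;
    end if;
  end loop;
  close c_src;
  close c_dst;
  commit;
end;
/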
Nathan Neulinger
+2  A: 

Hi Nathan,

You could look into the multi-table INSERT feature. It should perform a single full scan of the source and insert into multiple tables. Consider (10gR2):

SQL> CREATE TABLE t1 (ID NUMBER);

Table created
SQL> CREATE TABLE t2 (ID NUMBER);

Table created

SQL> INSERT ALL
  2     INTO t1 VALUES (d_id)
  3     INTO t2 VALUES (d_id)
  4  /* your select goes here */
  5  SELECT ROWNUM d_id FROM dual d CONNECT BY LEVEL <= 5;

10 rows inserted
SQL> SELECT COUNT(*) FROM t1;

  COUNT(*)
----------
         5
SQL> SELECT COUNT(*) FROM t2;

  COUNT(*)
----------
         5

You will have to check if it works over database links.

Vincent Malgrat
+1 Good suggestion. And I've implemented production code using this construct over database links (10.2.0.4).
Rob van Wijk
Unfortunately, this does what the code was written to avoid. It works great if you have a small table, but if you have a table with 1.5M rows in it, the code above basically forces you to empty the table first, which means either a huge amount of redo or a table that isn't safe to use during the reload period and can potentially be completely unavailable. (Basically, you've avoided the sort by trading it for inserting all of the data every time.)
Nathan Neulinger
A: 

You could load the data from source table A into an intermediate table B and then do a partition exchange between B and destination table C. Exact replication, no sorting involved.
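A minimal sketch, assuming destination table C is partitioned (a single partition is enough) and B matches its column layout exactly (the names are hypothetical); the exchange is a data-dictionary operation, so the swap itself is near-instant regardless of row count:

alter table c exchange partition p_all with table b
  including indexes without validation;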

jva
A: 

This I/U/D (insert/update/delete) form of replication is what the MERGE command is there for. It's very doubtful that an expensive sort-merge join would be required; I'd expect to see hash joins instead. As long as the hash table can be held in memory, the hash join is barely more expensive than scanning the tables.
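A minimal sketch of that, reusing the emp columns from the earlier answer (the names are illustrative, and rows that exist only in the target would still need a separate DELETE or a delete flag in the source):

merge into emp_dest d
using emp_stage s
on (d.emp_id = s.emp_id)
when matched then
  update set d.emp_name = s.emp_name,
             d.dept_id  = s.dept_id
when not matched then
  insert (emp_id, emp_name, dept_id)
  values (s.emp_id, s.emp_name, s.dept_id);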

A handy optimisation is to store a hash value based on the non-key attributes, so that you can join the source and target tables on the key column(s) and compare small hash values instead of the full set of columns - change detection made easy.
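One way to do that in 10g is ORA_HASH over the concatenated non-key columns (the column names and the '|' separator are illustrative; watch out for NULLs and for the small chance of hash collisions):

alter table emp_stage add (row_hash number);

update emp_stage
   set row_hash = ora_hash(emp_name || '|' || to_char(dept_id));

select s.emp_id
  from emp_stage s
  join emp_dest  d on (d.emp_id = s.emp_id)
 where s.row_hash <> ora_hash(d.emp_name || '|' || to_char(d.dept_id));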

David Aldridge