ansaurus

Question

How can I speed up queries against huge data warehouse tables with effective-dated data?

Answer 1

A:

Instead of using the subqueries, you can try this. I don't know if Oracle will perform better with this or not, since I don't use Oracle much.

SELECT
    ST1.col1,
    ST1.col2,
    ...
FROM
    Some_Table ST1
LEFT OUTER JOIN Some_Table ST2 ON
    ST2.user_id = ST1.user_id AND
    (
        ST2.effective_date > ST1.effective_date OR
        (
            ST2.effective_date = ST1.effective_date AND
            ST2.effective_sequence > ST1.effective_sequence
        )
    )
WHERE
    ST2.user_id IS NULL

Another possible solution would be:

SELECT
    ST1.col1,
    ST1.col2,
    ...
FROM
    Some_Table ST1
WHERE
    NOT EXISTS
    (
        SELECT
        FROM
            Some_Table ST2
        WHERE
            ST2.user_id = ST1.user_id AND
            (
                ST2.effective_date > ST1.effective_date OR
                (
                    ST2.effective_date = ST1.effective_date AND
                    ST2.effective_sequence > ST1.effective_sequence
                )
            )
    )
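
To check that the two forms above pick the same rows, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for Oracle. The sample rows are made up, and this says nothing about how either form performs over a database link; it only shows that both return the latest row per user_id.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the queries above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Some_Table ("
    "user_id TEXT, effective_date TEXT, effective_sequence INTEGER, col1 TEXT)"
)
conn.executemany(
    "INSERT INTO Some_Table VALUES (?, ?, ?, ?)",
    [
        ("123", "2010-01-02", 1, "a"),
        ("123", "2010-01-03", 1, "b"),
        ("123", "2010-01-03", 2, "c"),  # latest row for user 123
        ("456", "2010-01-02", 1, "d"),
        ("456", "2010-01-02", 2, "e"),  # latest row for user 456
    ],
)

# Shared predicate: ST2 is a newer row than ST1 for the same user.
NEWER_ROW = """
    ST2.user_id = ST1.user_id AND
    (
        ST2.effective_date > ST1.effective_date OR
        (
            ST2.effective_date = ST1.effective_date AND
            ST2.effective_sequence > ST1.effective_sequence
        )
    )
"""

anti_join = f"""
    SELECT ST1.user_id, ST1.col1
    FROM Some_Table ST1
    LEFT OUTER JOIN Some_Table ST2 ON {NEWER_ROW}
    WHERE ST2.user_id IS NULL
    ORDER BY ST1.user_id
"""
not_exists = f"""
    SELECT ST1.user_id, ST1.col1
    FROM Some_Table ST1
    WHERE NOT EXISTS (SELECT * FROM Some_Table ST2 WHERE {NEWER_ROW})
    ORDER BY ST1.user_id
"""

anti_join_rows = conn.execute(anti_join).fetchall()
not_exists_rows = conn.execute(not_exists).fetchall()
print(anti_join_rows)
print(not_exists_rows)
```

Both queries keep a row only when no newer row exists for the same user, so they should agree on any data set where (user_id, effective_date, effective_sequence) is unique.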

Tom H. 2010-06-24 20:09:50

Answer 2

A:

Would it be an option to create a database for non-warehousing use that you could update on a nightly basis? If so, you could create a nightly process that moves over only the most recent records. That would get rid of the MAX logic you are running for everyday queries and significantly reduce the number of records.

It also depends on whether you can tolerate a one-day lag between the most recent data and what is available.

I'm not super familiar with Oracle so there may be a way to get improvements by making changes to your query also...

Abe Miessler 2010-06-24 20:11:49

Answer 3

+1 A:

You haven't mentioned the requirements for the freshness of the data, but one option would be to create materialized views (you'll be restricted to REFRESH COMPLETE since you can't create snapshot logs in the source system) that have data only for the current versioned row of the transaction tables. These materialized view tables will reside in your local system and additional indexing can be added to them to improve query performance.

dpbradley 2010-06-24 20:22:22

I like the idea, but the data needs to be real-time.

RenderIn 2010-06-24 20:28:33

Then consider a materialised view in the remote DB which lists the most recent row for each userid. This would reduce the volume of data moving across the remote link.

Karl 2010-06-25 08:20:52

Answer 4

A:

The performance issue is going to be the access across the link. Because part of the query runs against local tables, the whole query is executed locally: it gets no access to the remote indexes, and it pulls all the remote data back to test locally.

You could use materialized views in a local database, refreshed from the PeopleSoft database on a periodic (nightly) basis, for the historic data, and access the remote PeopleSoft database only for today's changes (adding effective_date = today to your WHERE clause), then merge the two queries.
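
A rough sketch of that merge step, using Python's sqlite3 as a stand-in for both databases. The table names nightly_current and remote_table, the fixed TODAY value, and the sample rows are all assumptions for illustration; in Oracle the snapshot would be the materialized view and the second query would go over the database link.

```python
import sqlite3

TODAY = "2010-06-24"  # fixed date so the sketch is reproducible

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical local snapshot: one current row per user as of last night.
    CREATE TABLE nightly_current (
        user_id TEXT, effective_date TEXT, effective_sequence INTEGER, value TEXT);
    -- Stand-in for the remote effective-dated table.
    CREATE TABLE remote_table (
        user_id TEXT, effective_date TEXT, effective_sequence INTEGER, value TEXT);

    INSERT INTO nightly_current VALUES ('123', '2010-06-01', 1, 'old-123');
    INSERT INTO nightly_current VALUES ('456', '2010-05-10', 2, 'old-456');

    INSERT INTO remote_table VALUES ('123', '2010-06-01', 1, 'old-123');
    INSERT INTO remote_table VALUES ('456', '2010-05-10', 2, 'old-456');
    INSERT INTO remote_table VALUES ('123', '2010-06-24', 1, 'new-a');
    INSERT INTO remote_table VALUES ('123', '2010-06-24', 2, 'new-b');
""")

# Start from the nightly snapshot (cheap, local, indexable).
current = {
    row[0]: row
    for row in conn.execute(
        "SELECT user_id, effective_date, effective_sequence, value "
        "FROM nightly_current"
    )
}

# Only today's rows are requested from the "remote" side.
todays_rows = conn.execute(
    "SELECT user_id, effective_date, effective_sequence, value "
    "FROM remote_table WHERE effective_date = ? "
    "ORDER BY user_id, effective_sequence",
    (TODAY,),
)

# Rows arrive in ascending sequence, so the last one per user wins.
for row in todays_rows:
    current[row[0]] = row

print(current)
```

The point of the split is that the expensive "latest row" search over the full history happens once a night, while the real-time part only touches rows dated today.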

Another option might be to use an INSERT INTO X SELECT FROM just for the remote data to pull it into a temporary local table or materialized view, then a second query to join that with your local data... similar to josephj1989's suggestion

Alternatively (though there may be licensing issues) try RAC Clustering your local db with the remote peoplesoft db.

Mark Baker 2010-06-24 20:57:09

Answer 5

+2 A:

Does refactoring your query something like this help at all?

SELECT *
  FROM (SELECT st.*, MAX(st.effective_date) OVER (PARTITION BY st.user_id) max_dt,
                     MAX(st.effective_sequence) OVER (PARTITION BY st.user_id, st.effective_date) max_seq
          FROM local_table lt JOIN sometable@remotedb st ON (lt.user_id = st.user_id)
         WHERE lt.user_id in ('123', '456', '789'))
 WHERE effective_date = max_dt
   AND effective_sequence = max_seq;

I agree with @Mark Baker that performance joining over DB Links really can suck and you're likely to be limited in what you can accomplish with this approach.
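
For anyone who wants to try the analytic-function shape without an Oracle instance, here is a small sketch in Python with sqlite3 (SQLite 3.25 or later is assumed, since earlier versions lack window functions). Both tables are local here and the sample rows are invented, so it only demonstrates the logic, not the behaviour over a DB link.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE local_table (user_id TEXT);
    -- Stand-in for sometable@remotedb.
    CREATE TABLE sometable (
        user_id TEXT, effective_date TEXT, effective_sequence INTEGER, value TEXT);

    INSERT INTO local_table VALUES ('123');
    INSERT INTO local_table VALUES ('456');

    INSERT INTO sometable VALUES ('123', '2010-01-02', 3, 'old');
    INSERT INTO sometable VALUES ('123', '2010-01-03', 1, 'mid');
    INSERT INTO sometable VALUES ('123', '2010-01-03', 2, 'new');
    INSERT INTO sometable VALUES ('456', '2010-01-02', 1, 'only');
    INSERT INTO sometable VALUES ('789', '2010-01-05', 1, 'not wanted');
""")

query = """
SELECT user_id, effective_date, effective_sequence, value
  FROM (SELECT st.*,
               MAX(st.effective_date) OVER (PARTITION BY st.user_id) AS max_dt,
               MAX(st.effective_sequence)
                   OVER (PARTITION BY st.user_id, st.effective_date) AS max_seq
          FROM local_table lt JOIN sometable st ON lt.user_id = st.user_id)
 WHERE effective_date = max_dt
   AND effective_sequence = max_seq
 ORDER BY user_id
"""

latest_rows = conn.execute(query).fetchall()
print(latest_rows)
```

Note that user 123 has sequence 3 on the older date; partitioning max_seq by date as well as user is what keeps that row from interfering with the result.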

DCookie 2010-06-24 21:19:07

Answer 6

+2 A:

One option is to first materialize the remote part of the query using a common table expression, so you can be sure only relevant data is fetched from the remote db. Another improvement would be to merge the two subqueries against the remote db into one analytic-function-based subquery. Such a query can also be used in your current query. I can make other suggestions only after playing with the db.

see below

with remote_query as
(
    select /*+ materialize */  st.* from sometable@remotedb st
    where st.user_id in ('123', '456', '789')
    and st.rowid in( select first_value(rowid) over (order by effective_date desc, 
                         effective_sequence desc ) from sometable@remotedb st1 
                      where st.user_id=st1.user_id)
)

select lt.*, rt.*
from local_table lt, remote_query rt
where lt.user_id = rt.user_id

josephj1989 2010-06-24 21:21:40

+1, like your use of analytic. I think you might be cheating a bit in assuming a hard coded list of user_id's can be specified in your WITH clause, though - if it were that simple, you wouldn't need the join to local_table in the first place.

DCookie 2010-06-24 22:52:41

Answer 7

A:

Can you ETL the rows with the desired user_id's into your own table, creating only the needed indexes to support your queries and perform your queries on it?

Frank Computer 2010-06-24 21:46:58

Answer 8

A:

Is the PeopleSoft table a delivered one, or is it custom? Are you sure it's a physical table, and not a poorly-written view on the PS side? If it's a delivered record you're going against (example looks much like PS_JOB or a view that references it), maybe you could indicate this. PS_JOB is a beast with tons of indexes delivered, and most sites add even more.

If you know the indexes on the table, you can use Oracle hints to specify a preferred index to use; that sometimes helps.

Have you run an explain plan to see if you can determine where the problem is? Maybe there's a Cartesian join, a full table scan, etc.?

Chip L 2010-06-24 23:49:55

Answer 9

+2 A:

One approach would be to stick PL/SQL functions around everything. As an example

create table remote (user_id number, eff_date date, eff_seq number, value varchar2(10));

create type typ_remote as object (user_id number, eff_date date, eff_seq number, value varchar2(10));
.
/

create type typ_tab_remote as table of typ_remote;
.
/

insert into remote values (1, date '2010-01-02', 1, 'a');
insert into remote values (1, date '2010-01-02', 2, 'b');
insert into remote values (1, date '2010-01-02', 3, 'c');
insert into remote values (1, date '2010-01-03', 1, 'd');
insert into remote values (1, date '2010-01-03', 2, 'e');
insert into remote values (1, date '2010-01-03', 3, 'f');

insert into remote values (2, date '2010-01-02', 1, 'a');
insert into remote values (2, date '2010-01-02', 2, 'b');
insert into remote values (2, date '2010-01-03', 1, 'd');

create function show_remote (i_user_id_1 in number, i_user_id_2 in number) return typ_tab_remote pipelined is
    CURSOR c_1 is
    SELECT user_id, eff_date, eff_seq, value
    FROM
        (select user_id, eff_date, eff_seq, value, 
                        rank() over (partition by user_id order by eff_date desc, eff_seq desc) rnk
        from remote
        where user_id in (i_user_id_1,i_user_id_2))
    WHERE rnk = 1;
begin
    for c_rec in c_1 loop
        pipe row (typ_remote(c_rec.user_id, c_rec.eff_date, c_rec.eff_seq, c_rec.value));
    end loop;
    return;
end;
/

select * from table(show_remote(1,null));

select * from table(show_remote(1,2));

Rather than having user_id's passed individually as parameters, you could load them into a local table (e.g. a global temporary table). The PL/SQL would then loop through the table, doing the remote select for each row in the local table. No single query would have both local and remote tables. Effectively you would be writing your own join code.
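
The "write your own join" idea can be sketched outside PL/SQL as well. Below is a rough Python version in which two separate sqlite3 connections play the local and remote databases. The driver table wanted_users is a made-up name, and ORDER BY ... LIMIT 1 is swapped in for the rank() filter used above, so treat it as an illustration of the loop structure rather than a translation of the function.

```python
import sqlite3

# Two separate connections play the roles of the local and remote databases.
local_db = sqlite3.connect(":memory:")
remote_db = sqlite3.connect(":memory:")

# Hypothetical local driver table holding the user ids of interest.
local_db.execute("CREATE TABLE wanted_users (user_id INTEGER)")
local_db.executemany("INSERT INTO wanted_users VALUES (?)", [(1,), (2,), (3,)])

remote_db.execute(
    "CREATE TABLE remote ("
    "user_id INTEGER, eff_date TEXT, eff_seq INTEGER, value TEXT)"
)
remote_db.executemany(
    "INSERT INTO remote VALUES (?, ?, ?, ?)",
    [
        (1, "2010-01-02", 3, "c"),
        (1, "2010-01-03", 3, "f"),
        (2, "2010-01-02", 2, "b"),
        (2, "2010-01-03", 1, "d"),
    ],
)


def latest_remote_row(user_id):
    """Fetch the current row for one user with a remote-only query."""
    return remote_db.execute(
        "SELECT user_id, eff_date, eff_seq, value FROM remote "
        "WHERE user_id = ? ORDER BY eff_date DESC, eff_seq DESC LIMIT 1",
        (user_id,),
    ).fetchone()


def show_remote():
    """Hand-written join: loop over local ids, look each one up remotely."""
    local_ids = local_db.execute(
        "SELECT user_id FROM wanted_users ORDER BY user_id"
    ).fetchall()
    for (user_id,) in local_ids:
        row = latest_remote_row(user_id)
        if row is not None:  # user 3 has no remote rows and is skipped
            yield row


results = list(show_remote())
print(results)
```

No statement here mixes local and remote tables, which is the property the answer is after. The trade-off is one round trip per user id, so it suits short id lists better than long ones.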

Gary 2010-06-25 01:45:17

+1 I like your thinking

Mark Baker 2010-06-25 08:21:41

+1, now, for something completely different...

DCookie 2010-06-25 14:52:32

Answer 10

A:

It looks to me like you are dealing with a type 2 dimension in the data warehouse. There are several ways to implement a type 2 dimension, mostly using columns like ValidFrom, ValidTo, Version, and Status. Not all of them are always present, so it would be interesting if you could post the schema for your table. Here is an example of what it may look like (John Smith moved from Indiana to Ohio on 2010-06-24):

UserKey  UserBusinessKey State    ValidFrom    ValidTo   Version  Status
7234     John_Smith_17   Indiana  2005-03-20  2010-06-23    1     expired
9116     John_Smith_17   Ohio     2010-06-24  3000-01-01    2     current

To obtain the latest version of a row, it is common to use

WHERE Status = 'current'

or

WHERE ValidTo = '3000-01-01'

Note that this one has some constant far in the future.

or

WHERE ValidTo > CURRENT_DATE

It seems that your example uses only ValidFrom (effective_date), so you are forced to find max() in order to locate the latest row. Take a look at the schema -- are there Status or ValidTo equivalents in your tables?
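
To make the difference concrete, here is a small sketch in Python with sqlite3 that loads the two example rows above and compares the simple filters with the max() lookup. The table name dim_user is an assumption; the column names and values come from the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_user ("  # hypothetical table name
    "UserKey INTEGER, UserBusinessKey TEXT, State TEXT, "
    "ValidFrom TEXT, ValidTo TEXT, Version INTEGER, Status TEXT)"
)
conn.executemany(
    "INSERT INTO dim_user VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (7234, "John_Smith_17", "Indiana", "2005-03-20", "2010-06-23", 1, "expired"),
        (9116, "John_Smith_17", "Ohio", "2010-06-24", "3000-01-01", 2, "current"),
    ],
)

# Simple filters: no self-join or aggregate needed.
by_status = conn.execute(
    "SELECT UserKey, State FROM dim_user WHERE Status = 'current'"
).fetchall()
by_valid_to = conn.execute(
    "SELECT UserKey, State FROM dim_user WHERE ValidTo = '3000-01-01'"
).fetchall()

# ValidFrom only: a correlated MAX() is needed to find the latest row.
by_max_valid_from = conn.execute(
    """
    SELECT d.UserKey, d.State
      FROM dim_user d
     WHERE d.ValidFrom = (SELECT MAX(d2.ValidFrom)
                            FROM dim_user d2
                           WHERE d2.UserBusinessKey = d.UserBusinessKey)
    """
).fetchall()

print(by_status, by_valid_to, by_max_valid_from)
```

All three return the Ohio row, but only the last one has to look at every version of the business key to do it.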

Damir Sudarevic 2010-06-25 10:45:26
