I have three tables:

unmatched_purchases table:
unmatched_purchases_id --primary key
purchases_id --foreign key to purchases table
location_id --which store
purchase_date
item_id --item purchased

purchases table:
purchases_id --primary key
location_id --which store
customer_id

credit_card_transactions:
transaction_id --primary key
trans_timestamp --timestamp of when the transaction occurred
item_id --item purchased
customer_id
location_id

All three tables are very large. The purchases table has 590,130,404 records (yes, over half a billion). Unmatched_purchases has 192,827,577 records, and credit_card_transactions has 79,965,740.

I need to find out how many purchases in the unmatched_purchases table match up with entries in the credit_card_transactions table. I need to do this for one location at a time (i.e., run the query for location_id = 123, then run it for location_id = 456). "Match up" is defined as:

1) same customer_id
2) same item_id
3) the trans_timestamp is within a certain window of the purchase_date
  (e.g., if the purchase_date is Jan 3, 2005
  and the trans_timestamp is 11:14 PM Jan 2, 2005, that's close enough)

I need the following aggregated:

1) How many unmatched purchases are there for that location?

2) How many of those unmatched purchases could have been matched with entries in credit_card_transactions for that location?

So, what is a query (or queries) to get this information that won't take forever to run?

Note: all three tables are indexed on location_id

EDIT: as it turns out, the credit_card_transactions table has been partitioned based on location_id. So that will help speed this up for me. I'm asking our DBA if the others could be partitioned as well, but the decision is out of my hands.

CLARIFICATION: I will only need to run this on a few of our many locations, not all of them separately. I need to run it on 3 locations. We have 155 location_ids in our system, but some of them are not used in this part of our system.

+1  A: 

Try this (I have no idea how fast it will be; that depends on your indices):

  Select Count(*) TotalPurchases, 
     Sum(Case When c.transaction_id Is Not Null 
          Then 1 Else 0 End) MatchablePurchases
  From unmatched_purchases u
     Join purchases p 
        On p.purchases_id = u.purchases_id -- join on the foreign key, not u's own primary key
     Left Join credit_card_transactions c
        On c.customer_id = p.customer_id
           And c.item_id = u.item_id 
           And c.trans_timestamp Between u.purchase_date - @DelayThreshold
                                     And u.purchase_date + @DelayThreshold -- symmetric window
  Where u.location_id = @Location
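
For an Oracle run, the @variables above would become bind variables and the window an INTERVAL. A minimal sketch, assuming a one-day window (adjust the INTERVAL literals to whatever "close enough" means in your case):

  -- Oracle-flavored version: :loc is a bind variable; the window is
  -- expressed with INTERVAL arithmetic instead of @DelayThreshold
  Select Count(*) TotalPurchases,
     Sum(Case When c.transaction_id Is Not Null
          Then 1 Else 0 End) MatchablePurchases
  From unmatched_purchases u
     Join purchases p
        On p.purchases_id = u.purchases_id
           And p.location_id = u.location_id
     Left Join credit_card_transactions c
        On c.customer_id = p.customer_id
           And c.item_id = u.item_id
           And c.location_id = u.location_id -- lets Oracle prune the partitioned table
           And c.trans_timestamp Between u.purchase_date - INTERVAL '1' DAY
                                     And u.purchase_date + INTERVAL '1' DAY
  Where u.location_id = :loc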
Charles Bretana
I'm not entirely sure how to ask what I'm thinking, but I'll try: will this build up the entire result set (i.e., after all the joins) and then do the aggregation, or will it look at each unmatched_purchase once, update the count and sum values, and move on? (Does my question make sense? I'm concerned about how much memory this will take up while running.)
David Oneill
This depends on the query optimizer/processor and, to some degree, on the data and the server's current statistics about it. These factors help the optimizer decide what types of joins to use and, in doing so, whether to create "aggregation buckets" to dump data into as it processes the joins, or to defer this until after the intermediate unaggregated result set has been constructed. Look at whatever Oracle tool functions as a query plan display (the Oracle equivalent of SQL Server's ShowPlan) to see what it is actually doing.
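
For reference, a minimal way to capture and display a plan in plain Oracle (Toad and SQL Developer wrap the same machinery); the statement here is just a stand-in:

  -- Capture the optimizer's plan without executing the statement
  EXPLAIN PLAN FOR
    SELECT COUNT(*) FROM unmatched_purchases WHERE location_id = :loc;

  -- Render the captured plan from PLAN_TABLE
  SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);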
Charles Bretana
Ok, I'll check that.
David Oneill
@David, did some additional research (asked the Oracle guy who sits near me) and he said Oracle and/or Toad has a tool called "Display Plan" that does this... You should check that out...
Charles Bretana
Note for posterity: make sure to add u.location_id = p.location_id and c.location_id = u.location_id to the ON clauses of the joins.
David Oneill
+1  A: 

At least, you'll need more indexes. I propose at least the following:

An index on unmatched_purchases.purchases_id, one on purchases.location_id, and a composite index on credit_card_transactions (location_id, customer_id, item_id, trans_timestamp).

Without those indexes, there is little hope IMO.
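
For concreteness, the DDL would look something like this (the index names are made up, and building indexes on tables this size is something to coordinate with your DBA):

  -- Hypothetical names; the composite index puts the equi-join columns
  -- first and trans_timestamp last so the range predicate can use it
  CREATE INDEX up_purch_ix  ON unmatched_purchases (purchases_id);
  CREATE INDEX p_loc_ix     ON purchases (location_id);
  CREATE INDEX cct_match_ix ON credit_card_transactions
    (location_id, customer_id, item_id, trans_timestamp);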

ammoQ
Will that last index help even though I'm looking for a range of values? I.e., it'll be something like "trans_timestamp between date - fuzz and date + fuzz".
David Oneill
Yes, exactly; if you have exact values for the other three columns (through the join), the index will help you with the range selection.
ammoQ
It might. This may sound insulting, and I really don't mean it to be, but doesn't a business that has processed half a billion purchases and 80 million credit card transactions have enough money to employ an Oracle DBA who knows how to index tables appropriately?!
Tony Andrews
Yes we do, and I am not that person. I am working with our DBA on setting this up, but this query is a one-off, and there are concerns more important than this one query that dictate our partitioning and indexing.
David Oneill
A: 

I suggest you query ALL locations at once. It will cost you three full scans (each table once) plus sorting. I bet this will be faster than querying locations one by one.
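
Concretely, that just means swapping the per-location filter for a GROUP BY. A sketch based on Charles Bretana's query above (same assumed one-day window):

  -- One pass over each table, aggregated per location
  Select u.location_id,
     Count(*) TotalPurchases,
     Sum(Case When c.transaction_id Is Not Null
          Then 1 Else 0 End) MatchablePurchases
  From unmatched_purchases u
     Join purchases p
        On p.purchases_id = u.purchases_id
           And p.location_id = u.location_id
     Left Join credit_card_transactions c
        On c.customer_id = p.customer_id
           And c.item_id = u.item_id
           And c.location_id = u.location_id
           And c.trans_timestamp Between u.purchase_date - INTERVAL '1' DAY
                                     And u.purchase_date + INTERVAL '1' DAY
  Group By u.location_id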

But if you don't want to guess, you at least need to examine the EXPLAIN PLAN and a 10046 trace of your query...
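
For reference, turning the 10046 trace on and off for your own session looks like this (level 8 includes wait events; the trace file lands in the server's trace directory):

  -- Extended SQL trace for the current session
  ALTER SESSION SET EVENTS '10046 trace name context forever, level 8';

  -- run the query under test here, then switch tracing off
  ALTER SESSION SET EVENTS '10046 trace name context off';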

egorius
I'll check into that, but we have a lot of locations, and location_id is indexed on all three tables. I thought that meant that something like "where location_id = 123" didn't have to do a truly full scan.
David Oneill
That's true, but it also implies accessing the tables with single-block reads, with a very high number of logical reads compared to full table scans. Reading 1,000 rows from a very large table via an index might involve several thousand logical reads, and many of those could easily be physical reads because of the number of blocks across which the rows are distributed (depending on the clustering_factor).
David Aldridge
So reading an entire table through an index is much less efficient than reading it as a full table scan.
David Aldridge
Ah, I see your confusion. I wasn't clear: I don't need to run this for all locations. I need to run it on 3 of our 70 or so locations.
David Oneill
David Oneill, in that case my suggestion is not suitable, of course. I thought you needed them all.
egorius
A: 

The query ought to be straightforward, but the tricky part is getting it to perform. I'd question why you need to run it once per location when it would probably be more efficient to run it for every location in a single query.

The join would be the big challenge, but the aggregation ought to be straightforward. I would guess that your best hope performance-wise for the join would be a hash join on the customer and item columns, with a subsequent filter operation on the date range. You might have to fiddle with putting the customer and item join in an inline view and then try to stop the date predicate from being pushed into it.
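
A rough shape of that approach, with hypothetical hints (NO_MERGE, NO_PUSH_PRED, and USE_HASH are real Oracle hints, but the optimizer is free to ignore them; a one-day window is assumed):

  -- Equi-join on customer/item inside the inline view; the date-range
  -- check stays outside, inside the CASE, so the left-join counts hold
  Select /*+ NO_PUSH_PRED(v) */
     Count(*) TotalPurchases,
     Sum(Case When v.trans_timestamp Is Not Null
              And v.trans_timestamp
                  Between v.purchase_date - INTERVAL '1' DAY
                      And v.purchase_date + INTERVAL '1' DAY
          Then 1 Else 0 End) MatchablePurchases
  From (
     Select /*+ NO_MERGE USE_HASH(c) */
        u.unmatched_purchases_id, u.purchase_date, c.trans_timestamp
     From unmatched_purchases u
        Join purchases p
           On p.purchases_id = u.purchases_id
        Left Join credit_card_transactions c
           On c.customer_id = p.customer_id
              And c.item_id = u.item_id
     Where u.location_id = :loc
       And p.location_id = :loc
  ) v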

The hash join would be much more efficient if the two equi-joined tables shared the same hash partitioning key on all the join columns, if that can be arranged.
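
For illustration only, equi-partitioning both join inputs would look like the DDL below. It conflicts with the existing location_id partitioning, so treat it as a sketch of the idea rather than a recommendation (column types are guesses):

  -- Hypothetical: hash-partition on the equi-join key so Oracle can do a
  -- full partition-wise join against a table partitioned the same way
  CREATE TABLE credit_card_transactions (
     transaction_id   NUMBER PRIMARY KEY,
     trans_timestamp  TIMESTAMP,
     item_id          NUMBER,
     customer_id      NUMBER,
     location_id      NUMBER
  )
  PARTITION BY HASH (customer_id, item_id) PARTITIONS 32;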

Whether to use the location index or not ...

Whether the index is worth using depends on the clustering factor of the location index, which you can read from the user_indexes view. Can you post the clustering factor along with the number of blocks the table contains? That gives a measure of how the values for each location are distributed throughout the table. You could also extract the execution plan for a query such as:

select some_other_column
from   my_table
where  location_id in (value1, value2, value3)

... and see if Oracle thinks the index is useful.
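
To pull those numbers, something along these lines works against the data dictionary (table names are assumed to be stored in uppercase, as usual):

  -- clustering_factor near t.blocks = rows for a key are co-located;
  -- near t.num_rows = rows are scattered across the table
  SELECT i.index_name, i.clustering_factor, t.blocks, t.num_rows
  FROM   user_indexes i
         JOIN user_tables t ON t.table_name = i.table_name
  WHERE  i.table_name IN ('UNMATCHED_PURCHASES', 'PURCHASES',
                          'CREDIT_CARD_TRANSACTIONS');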

David Aldridge
Ah: I was unclear. I only need to run it for 2 or 3 of my locations, not all of them. *edits question*
David Oneill
Ah hah ... how many locations do you have in total?
David Aldridge