Pawnshop Application (any RDBMS):

A one-to-many relationship: each customer (master) can have many transactions (detail).

customer(
id serial,
pk_name char(30), {PATERNAL-NAME MATERNAL-NAME, FIRST-NAME MIDDLE-NAME-INITIAL}
[...]
);
unique index on id;
unique cluster index on pk_name;


transaction(
fk_name char(30),
tran_type char(1), 
ticket_number serial,
[...]
);
dups cluster index on fk_name;
unique index on ticket_number; 

Several people have told me this is not the correct way to join master to detail. They said I should always join customer.id (serial) to a transaction.fk_id (integer) column instead.
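In other words, something like this (just a sketch; fk_id is a hypothetical column name):

customer(
id serial,
name char(30),
[...]
);
unique index on id;

transaction(
fk_id integer, {references customer.id}
ticket_number serial,
tran_type char(1),
[...]
);
dups cluster index on fk_id;
unique index on ticket_number;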

When a customer pawns merchandise, the clerk queries the master using wildcards on the name. The query usually returns several customers; the clerk scrolls until locating the right name, presses 'D' to switch to the detail transaction table, all of that customer's transactions are automatically queried, and then the clerk presses 'A' to add a new transaction.

The problem with joining customer.id to transaction.fk_id is that although the customer table is maintained in sorted name order, clustering the transaction table by fk_id groups the transactions by id, not in the same order as the customer names. So when the clerk scrolls through customer names in the master, the system has to jump all over the place to locate the clustered transactions belonging to each customer. Each new customer is assigned the next serial id, so new customers don't show up in alphabetical order. I experimented with id joins and confirmed the decrease in performance.

The drawback of name joins vs. id joins is that if you change a customer's name, the join with their transactions is severed, so I don't allow updating the name. (Anyway, how often does one need to change a customer's name?) The other drawback is that the name requires 30 chars where an id is an INT, so the .dat and .idx files are larger. Every morning an SQL procedure unloads customer and transaction in sorted name order, drops and re-creates the tables, reloads the unloaded data, and re-creates all indexes, which keeps performance optimized.

How can I use id joins instead of name joins and still preserve the clustered transaction order by name, when the transaction table has no name column?
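One idea I've considered (untested) is to keep the id join but have the nightly reorg unload the transactions through a join, so they land in the .dat file in customer-name order, and then index fk_id without the cluster option so the load order is preserved. A sketch, assuming a transaction.fk_id column:

 unload to "U:\UNL\TRAN_ACTIVES.UNL"
    select t.*
      from transaction t, customer c
     where t.fk_id = c.id
       and t.status = "A"
  order by c.pk_name, t.trx_date;

 {after reloading:}
 create index tran_fk_id_idx on transaction(fk_id); {plain index, not cluster}

I don't know whether that is sound, hence the question.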

The following is an example of how the data sits in customer.dat and transaction.dat when using the pk/fk name, as described in the above schema:

customer.id customer.pk_name               transaction.fk_name            transaction.ticket_number
----------- ------------------------------ ------------------------------ -------------
          2|ACEVEDO BERMUDEZ, FRANCISCO J. ACEVEDO BERMUDEZ, FRANCISCO J.|123456
                                           ACEVEDO BERMUDEZ, FRANCISCO J.|123789

          3|ANDUJAR RODRIGUEZ, WILFREDO C. ANDUJAR RODRIGUEZ, WILFREDO C.|101010
                                           ANDUJAR RODRIGUEZ, WILFREDO C.|121212

          1|CASTILLO DIAZ, FRANKLIN J.     CASTILLO DIAZ, FRANKLIN J.    |232323
                                           CASTILLO DIAZ, FRANKLIN J.    |343434

So when the clerk wildcard-queries by customer name in the master, the customer's transactions are automatically queried and quickly displayed as the clerk scrolls through the returned names, since they are in the same sorted order as the master.

Now, the following example is the same data using pk/fk id:

customer.pk_id customer.name                  transactions.fk_id transactions.ticket_#
-------------- ------------------------------ ------------------ ---------------------
             2|ACEVEDO BERMUDEZ, FRANCISCO J.                  1|232323
                                                               1|343434

             3|ANDUJAR RODRIGUEZ, WILFREDO C.                  2|123456
                                                               2|123789

             1|CASTILLO DIAZ, FRANKLIN J.                      3|101010
                                                               3|121212

OK, so now keep in mind that my PERFORM one-page screen includes all customer columns and all transaction columns, and there's a master/detail instruction: when the clerk queries by customer name, the first transaction row belonging to that customer is automatically displayed. Then the clerk presses 'D' to make transaction the active table and 'A' to add a new transaction, or the clerk may scroll through all of the customer's transactions to update one in particular or just give the customer information.

With the pk/fk name method, as the clerk scrolls through customer names to locate the desired customer, response is immediate. With the pk/fk id method, response time lags, even with supporting indexes, because the engine has to jump to different locations in the transaction table to locate the corresponding group of transactions belonging to each customer as the clerk scrolls through each customer name in the master!

So it seems that having the customer's transaction rows grouped together, in the same sorted order as the customer rows, lets the indexing locate the transactions quicker, as opposed to jumping among scattered groups of each customer's transactions. If each customer could remember their customer ID number, this issue would be academic, but in the real world we even gave each customer an ID card with their customer number on it, and most of them lost their cards!

Here's an example of the daily reorg executed every morning before pawnshop opens for business:

 {ISQL-SE: customer and transaction table reorg - run once daily, before
  start of business}

 unload to "U:\UNL\CUSTOMERS.UNL"
    select * from customer
  order by customer.pk_name; 

 unload to "U:\UNL\TRAN_ACTIVES.UNL" 
    select * from transaction where transaction.status = "A" 
  order by transaction.fk_name, transaction.trx_date; 

 unload to "U:\UNL\TRAN_INACTIVES.UNL" 
    select * from transaction
     where transaction.status != "A" 
       and transaction.trx_date >= (today - 365) 
  order by transaction.fk_name, transaction.trx_date desc; 

 unload to "U:\UNL\TRAN_HISTORIC.UNL" 
    select * from transaction 
     where transaction.status != "A" 
       and transaction.trx_date < (today - 365) 
  order by transaction.trx_date desc; 

 drop table customer;     

 drop table transaction;

 create table customer
 (
  id serial,
  pk_name char(30),
  [...]
 ) 
 in "S:\PAWNSHOP.DBS\CUSTOMER";


 create table transaction
 ( 
  fk_name char(30),
  ticket_number serial,
  tran_type char(1), 
  status char(1), 
  trx_date date, 
  [...]
 )
 in "S:\PAWNSHOP.DBS\TRANSACTION"; 

 load from "U:\UNL\CUSTOMERS.UNL"      insert into customer;     {>4800 nrows}
 load from "U:\UNL\TRAN_ACTIVES.UNL"   insert into transaction; {500:600 nrows avg.} 
 load from "U:\UNL\TRAN_INACTIVES.UNL" insert into transaction; {6500:7000 nrows avg.} 
 load from "U:\UNL\TRAN_HISTORIC.UNL"  insert into dss:historic;{>500K nrows} 

 create unique cluster index cust_pk_name_idx on customer(pk_name);
 create        cluster index tran_cust_idx    on transaction(fk_name); 

 {this groups each customer's transactions together: actives in oldest
  trx_date order first, then inactive transactions within the last year in
  most recent trx_date order. inactives older than 1 year are loaded into the
  historic table in a separate database, on a separate hard disk. historic
  table optimization is done on a weekly basis for DSS queries.}

 create unique index tran_ticket_num_idx on transaction(ticket_number); 
 create        index tran_trx_date_idx   on transaction(trx_date); 
 create        index tran_status_idx     on transaction(status); 
 [...;]

 [grant statements...;] 

 update statistics; 

If you have time, I challenge anyone to test this! The difference is more noticeable when you have a large table.

+2  A: 

They're right. Joining on a CHAR(30) text field - particularly one containing person name data - will be slow, grossly inefficient and incredibly fragile. People do change their names (marriage is the obvious example), and multiple people can have the same name.

You want to create appropriate indexes on your tables to support the order you want data to appear in, and forget clustering. Your performance optimisation procedure sounds like a disaster looking for a place to happen. Sorry, but dropping/creating the tables like that is asking for trouble.

I would start with a UNIQUE INDEX on customer.id, a UNIQUE INDEX on transaction.ticket_number, and an INDEX (for performance rather than cardinality, so enforcing uniqueness is not terribly important) on transaction (id, ticket_number DESC), and take it from there. Data is returned from the transaction table in the order it appears in the index.
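In Informix syntax that would be something along these lines (index names are made up, I'm assuming the transaction foreign-key column is called fk_id, and support for DESC index keys varies by engine version):

 create unique index cust_id_idx     on customer(id);
 create unique index tran_ticket_idx on transaction(ticket_number);
 create        index tran_cust_idx   on transaction(fk_id, ticket_number desc);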

I would only consider clustering when all other avenues of query optimisation have been exhausted.

RET
@RET - Performance was a lot slower with the id join vs. the name join. In this part of the world (Puerto Rico), women who marry don't change their names, and once a customer is entered into the system with a particular name, they will always be accessed by that same name. I don't allow dups for customer names, plus we use surnames as part of the full name, plus we can further distinguish with other customer info. If a dup happens we add a suffix, but it has never happened. As for your comments on clustering, I totally disagree - and why would I want to allow dups on ticket_number?
Frank Computer
I've clarified the reference to the duplicate index. Of course performance with id join would be slower if the tables weren't indexed to support it. I agree with the sentiments that Gary has expressed, which are more or less what I was saying.
RET
Consider a (unique, if you insist) index on pk_name as well.
Jan B. Kjeldsen
A: 

You'll hit problems with people whose long names won't fit in CHAR(30), especially if you are including a full middle name.

I think you are overly concerned with clustering transactions by name. In the scenario you described, you pick a list of customers (so I can see some requirement for making customers easily accessible by name, though an index should be sufficient). Then the transactions are accessed for a particular customer, so it shouldn't matter whether they are clustered by the customer id or customer name.
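For example, once a particular customer has been picked, the detail fetch is just a keyed lookup (a sketch, assuming an indexed fk_id column):

 select *
   from transaction
  where fk_id = 2        {the chosen customer's id}
  order by trx_date;

With an index on (fk_id, trx_date) the engine reads only that customer's rows, wherever they happen to sit in the .dat file.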

Gary
@Gary - Well, we actually don't use full middle names, just an initial followed by a period, so we really haven't had a problem fitting a full name in CHAR(30). When I experimented with id joins, naturally I created indexes to support them; after all, they are the pk and fk. Keep in mind that with pk/fk id there's no transaction name column to index on. I'm editing my question to include examples of how each table's data sits in the .dat files using pk/fk name and pk/fk id.
Frank Computer
@Frank: did you run the appropriate UPDATE STATISTICS after creating the indexes? Which server are you using? SE or IDS? IDS is fussier than SE about statistics.
Jonathan Leffler
@Jonathan - Standard Engine. I religiously run UPDATE STATISTICS [entire db] every time I drop/create/load/re-index. Processing time is instantaneous since my tables are small and indexes few. Did my data example clarify the issue I'm raising concerning name vs. id when querying by customer name?
Frank Computer
@Jonathan - No rows get deleted in this app. Transaction rows older than one year are unloaded to a separate file and loaded into another DSS db. Rows younger than a year get re-loaded back into the production db.
Frank Computer
A: 

The number of records you have in your database is trivial for any of the products you mention. A properly structured database would have no problem returning the transactions by ID.

Properly structured in this case means that the id column is a primary key in the customer table and a foreign key in the transaction table. In some products foreign keys are automatically indexed; if you're using a product where that doesn't happen, the customer_id column in the transaction table must be indexed. Do not include the name field in the transaction table.
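In Informix syntax that structure would look roughly like this (a sketch; SE may not support declarative constraints, in which case the index alone does the job):

 alter table customer    add constraint primary key (id);
 alter table transaction add constraint foreign key (customer_id)
       references customer (id);
 create index tran_customer_id_idx on transaction(customer_id);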

Assuming you're using an index, do not worry about the database "jumping all over the place". Databases are not such simple pieces of software that they operate in that fashion.

Larry Lustig
@Larry - Well, I experimented with both methods of pk/fk joins, and performance was palpably slower with pk_id/fk_id because transactions were not grouped in the same order as the customer names, which is how the clerk queries to locate the desired customer when a query returns several rows with similar last names. Imagine if there were 50K customers and 200K transactions; it would most certainly be unacceptable, even with proper indexes. Remember that indexes point to physical locations of rows in the data file, so the less the engine has to jump to retrieve and display, the better!
Frank Computer