views:

103

answers:

2

I have identified the query constructs my users normally use. Would it make sense for me to create composite indexes to support those constructs and provide FIRST_ROWS capability?

If I migrate from SE to IDS, I will lose the ability to write low-level functions with C-ISAM calls, but I will gain FIRST_ROWS along with other goodies such as SET-READS for index scans (onconfig USE_[KO]BATCHEDREAD), optimizer directives, parallel queries, etc.
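For illustration only, a couple of the IDS-only features mentioned above would look roughly like this; none of it is available in SE, and the table name is just a placeholder:

 set pdqpriority 50;   { parallel query: let this session use up to 50% of PDQ resources }

 select {+ AVOID_FULL(pawns) } *   { optimizer directive: discourage a sequential scan }
   from pawns
  where cust_name matches "LAS*SUR*F*";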


Information from Comments

Pawnshop production tables are queried by customer.name char(30) using wildcards (LAS*SUR*F* to find LASTNAME SURNAME, FIRSTNAME) or by pawns.ticket_number INT. Customer and pawns are joined by customer.name = pawns.name, not by customer.serial = pawns.fk. Pawns whose trx date is older than 1 year are moved to a historical table (>500K nrows) in a different database, on another hard disk. The index on the historical table is on trx_date, descending. This is where the ad-hoc composite query constructs come into play.
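As a sketch only, those constructs look roughly like this (column names are taken from the pawns schema shown further down; whether a composite index like this pays off is exactly what the question asks):

 create index pa_name_date_idx on pawns (cust_name, trx_date);

 { typical ad-hoc lookup: wildcarded customer name, most recent pawns first }
 select trx_num, cust_name, trx_date, status
   from pawns
  where cust_name matches "LAS*SUR*F*"
  order by cust_name, trx_date desc;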

Once a customer's pawn transaction is found, the row is updated when an interest or redemption payment is made by the customer. If a customer makes no payment within 90 days, users manually mark which pawns will be forfeited. pawns.status changes to inactive when a customer redeems a pawn or forfeits it for lack of payment. Inactive rows are moved out of the pawns table into the historical table when their trx dates are older than 1 year, so no mass updating occurs in this app. Pawnshops run the following procedure every morning before opening for business.

{ISQL 2.10.06E (SE-DOS16M protected mode) pawns table optimization - 
 once-daily, before start of business, procedure}

 unload to "U:\UNL\ACTIVES.UNL"
    select * from pawns where pawns.status = "A"
  order by pawns.cust_name, pawns.trx_date;

 unload to "U:\UNL\INACTIVE.UNL"
    select * from pawns
     where pawns.status <> "A"
       and pawns.trx_date >= (today - 365)
  order by pawns.cust_name, pawns.trx_date desc;

 unload to "U:\UNL\HISTORIC.UNL"
    select * from pawns
     where pawns.status <> "A"
       and pawns.trx_date < (today - 365)
  order by pawns.trx_date desc;

 drop table pawns;

 create table pawns
 (
     trx_num serial,
     cust_name char(30),
     status char(1),
     trx_date date,
 . . . ) in "S:\PAWNSHOP.DBS\PAWNS";

 load from "U:\UNL\ACTIVES.UNL" insert into pawns;         {500:600 nrows avg.}
 load from "U:\UNL\INACTIVE.UNL" insert into pawns;        {6500:7000 nrows avg.}
 load from "U:\UNL\HISTORIC.UNL" insert into dss:historic; {>500K nrows}

 create cluster index pa_cust_idx on pawns (cust_name);

 {this groups each customer's pawns together: actives in oldest trx_date
  order first, then inactive pawns from within the last year in most recent
  trx_date order. inactives older than 1 year are loaded into the historic
  table in a separate database, on a separate hard disk. historic table
  optimization is done on a weekly basis for DSS queries.}

 create unique index pa_trx_num_idx on pawns (trx_num);
 create index pa_trx_date_idx on pawns (trx_date);
 create index pa_status_idx on pawns (status);

 {grant statements...}

 update statistics;
+1  A: 

There isn't a simple yes/no answer - it is a balancing act, as with so many performance issues.

There are two main costs associated with indexes which must be balanced against the benefits.

  1. Indexes must be maintained as rows are added, deleted, modified in the table. The cost is not huge, but neither is it negligible.
  2. Indexes occupy disk space.

There is also a small overhead when queries are optimized simply because there are more indexes to consider.

The primary benefit of good indexes is vastly improved performance on selecting data when the index can be used to good effect.

If your tables are not very volatile and are frequently searched with criteria where the indexes can help, then it probably makes sense to create the composite indexes, assuming that disk space is not an issue.

If your tables are very volatile, or if a specific index will seldom be used (but is beneficial on those few occasions when it is used), then you should perhaps weigh the almost one-off cost of a slower query against the cost of storing and maintaining the index for those few occasions when it can be used.
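One fairly cheap way to weigh that balance is to check whether a candidate index would actually be chosen before committing to it. A minimal sketch, assuming the composite index on (cust_name, trx_date) discussed in the question; SET EXPLAIN writes the chosen query plan to the file sqexplain.out:

 set explain on;

 select * from pawns
  where cust_name matches "LAS*SUR*F*"
  order by cust_name, trx_date;

 set explain off;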

There is a quite good book on the subject of index design: Relational Database Index Design and the Optimizers by Lahdenmäki and Leach (it is also fairly expensive).


In the latest comment, Frank says:

[L]ooking for a couple of things. As its already been said, the simplest thing to do is to allow Informix to start returning rows once it has them. (Oracle does this by default.) The larger picture to what Frank is asking for is something similar to what Google has. Ok it really goes back to Alta Vista and the 90's when talking about search indexes on the web. The idea is that you can do a quick search, pick up the first n things while reporting a "number" of rows returned in the search. (As if the number reported by Google is accurate.)

This additional comment from Frank makes more sense in the context of the question for which this is a continuation.

Obviously, unless the SQL statement forces Informix to do a sort, it makes results available as soon as it has them; it always has. The FIRST_ROWS optimization hint indicates to IDS that if it has a choice of two query plans and one will let it produce the first rows more quickly than the other, then it should prefer the one that produces the first rows quickly, even if it is more expensive overall than the alternative. Even in the absence of the hint, IDS still tries to make the data available as quickly as possible - it just also tries to do it as efficiently as possible too.
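In IDS, the hint can be given for the whole session or for an individual statement; a minimal sketch (the query itself is illustrative):

 set optimization first_rows;   { session-wide: prefer plans that return the first rows quickly }

 select {+ FIRST_ROWS } trx_num, cust_name, trx_date
   from pawns
  where status = "A"
  order by cust_name, trx_date;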

When the query is prepared, you get an estimate of how many rows may be returned - you could use that as an indicator (a few, quite a lot, very many). Separately, you can quickly and independently discover the number of rows in the main table you are searching. Given this metadata, you can certainly use a technique with a scroll cursor to give you a backing store in the database that contains the primary key values of the rows you are interested in. At any time, you can load an array with the display data for a set of interesting rows for display to the user. On user request, you can arrange to display another page full of information. At some point in the proceedings, you will find that you've reached the end of the data in the scroll cursor. Clearly, if you do FETCH LAST, you force that to happen. If you just do a few more FETCH NEXTs, then you will eventually get a NOTFOUND condition.
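In I4GL (or ESQL/C) terms, that technique looks roughly like the following sketch; the cursor and variable names are invented for illustration:

 define l_trx_num integer

 declare pk_curs scroll cursor for
     select trx_num from pawns
      where cust_name matches "LAS*SUR*F*"
      order by cust_name, trx_date

 open pk_curs

 fetch next pk_curs into l_trx_num    { step forward a row at a time; NOTFOUND signals the end }
 fetch prior pk_curs into l_trx_num   { scroll backwards on user request }
 fetch last pk_curs into l_trx_num    { forces the whole result set to be materialized }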

All of this has been possible with Informix (IDS and its prior incarnations, OnLine, Turbo, SE, plus I4GL) since the late 80s. The FIRST_ROWS optimization is more recent; it is still just a hint to the optimizer, and usually makes little difference to what the optimizer does.

Jonathan Leffler
A: 

FYI, related information about Index Stride functionality: Index Stride provides the user with simultaneously refining estimates for each group in a GROUP BY query. Without this access method, the distinct confidence intervals could not be refined concurrently, and the relative rates for the groups could not be controlled on the fly. To understand the behavior of Index Stride, consider the following query:

 SELECT ONLINE college, avg(grade), count(*)
   FROM grades
  GROUP BY college

Index Stride first probes the index to find all the groups (colleges). It then processes rows from each group on a "round robin" schedule, improving each college's confidence interval and updating its running estimate. The user can manipulate the schedule at which different groups are processed by increasing, decreasing, or stopping the amount of time spent sampling each group as results arrive. The user can stop one group, increase processing time for another, and view the updates for all groups simultaneously.

"...IDS is divided into two parts: a storage manager called RSAM, and a query optimization/execution engine implemented on top of RSAM, and has a well-defined, published API...Implementing Index Stride above RSAM resulted in some compromises in performance. RSAM presents interfaces for scanning relations, and for fetching rows from indexes. The index-based interface to RSAM takes as arguments a relation R, an index I , and a predicate p, and returns rows from R that match p. To amortize I/O costs, an efficient implementation of Index Stride would fetch rows a disk block at a time: one block of rows from the first group, then one from the second, and so on. Since the RSAM interface is row-at-a-time, there is no way to explicitly request a block of rows. In principle the buffer manager could solve this problem by allocating a buffer per group: then when fetching the first row from a block it would put the block in the buffer pool, and it would be available on the next fetch. Unfortunately this buffering strategy does not correspond naturally to the replacement policies in use in a traditional DBMS. The performance of Index Stride could be tuned up by either implementing a page-based Index Stride in RSAM, or enhancing the buffer manager to recognize and optimize Index Stride-style access."

Frank Computer
@Frank: this answer should be added to your question, and then deleted.
John Saunders