I'll preface this question by stating that I'm using Oracle 10g Enterprise edition and I'm relatively new to Oracle.

I've got a table with the following schema:

ID           integer (pk)  -- unique index
PERSON_ID    integer (fk)  -- b-tree index
NAME_PART    nvarchar      -- b-tree index
NAME_PART_ID integer (fk)  -- bitmap index

The PERSON_ID is the foreign key for the unique id of a person record. The NAME_PART_ID is the foreign key of a lookup table with static values like "First Name", "Middle Name", "Last Name", etc. The point of the table is to store individual parts of people's names separately. Every person record has at least a first name. When trying to pull the data out, I first considered using joins, like so:

select
    first_name.person_id,
    first_name.name_part,
    middle_name.name_part,
    last_name.name_part
from
    NAME_PARTS first_name
left join
    NAME_PARTS middle_name
    on first_name.person_id = middle_name.person_id
left join
    NAME_PARTS last_name
    on first_name.person_id = last_name.person_id
where
    first_name.name_part_id = 1
    and middle_name.name_part_id = 2
    and last_name.name_part_id = 3;

But the table has tens of millions of records, and the bitmap index for the NAME_PART_ID column isn't being used. The explain plan indicates that the optimizer is using full table scans and hash joins to retrieve the data.

Any suggestions?

EDIT: The reason the table was designed this way was because the database is used across several different cultures, each of which has different conventions for how individuals are named (e.g. in some middle-eastern cultures, individuals usually have a first name, then their father's name, then his father's name, etc). It is difficult to create one table with multiple columns that account for all of the cultural differences.

+5  A: 

Since you don't filter your tables in any way, the optimizer is probably right: a HASH JOIN is the best way to join the unfiltered tables.

A bitmap index won't help you much in this case.

It's good for combining OR and AND predicates across multiple low-cardinality columns, not for plain filtering on a single column.

For that, a full table scan is almost always better.

Note that this is not the best design. I'd rather add NULLable columns first_name, middle_name and last_name to PERSON, with an index on each of them.

That way you have the same data as in your design, but without the extra table.

The indexes hold the name and the rowid just as well as a table does, and the join on rowid is much more efficient.
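The redesign suggested above might be sketched like this (column names and sizes are assumptions, not from the original schema):

```sql
-- Hypothetical sketch: nullable name columns directly on PERSON,
-- each with its own b-tree index, replacing the NAME_PARTS table.
ALTER TABLE person ADD (
  first_name  NVARCHAR2(100),
  middle_name NVARCHAR2(100),
  last_name   NVARCHAR2(100)
);

CREATE INDEX person_first_name_ix  ON person (first_name);
CREATE INDEX person_middle_name_ix ON person (middle_name);
CREATE INDEX person_last_name_ix   ON person (last_name);
```

Each index then contains (name, rowid) pairs, which is what the separate NAME_PARTS table was effectively storing.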

Update:

Myself being a member of a culture that uses the father's name as part of a personal name, I can say that three fields are enough for most cases.

One field for the family name, one field for a given name and one field for everything in between (without further specialization) is quite a decent way to handle names.

Just rely on your users. In the modern world, almost everyone knows how to fit their name into this schema.

For instance:

Family name:  Picasso
Given name:   Pablo
Middle name:  Diego José Francisco de Paula Juan Nepomuceno María de los Remedios Cipriano de la Santísima Trinidad Ruiz y

P. S. Did you know that close friends just called him PABLO~1 ?

Quassnoi
+1: First sentence pretty much says it all.
DCookie
Thanks for the suggestion. I thought about redesigning the table schema, and have added more details to my original question to explain why I didn't go with that option.
Kevin Babcock
I'd give you another +1 if I could for your P.S. :-D
DCookie
A: 

If you have millions of rows and an index with only 3 distinct values ("First Name", "Middle Name", "Last Name"), the index will be ignored because it is easier to scan all rows than to use it. An index needs a wide distribution of values to be of any use.

KM
-1: That's not necessarily true for bitmap indexes (which the name_part_id is). There's a pretty good white paper on the bitmap vs. btree index decision at http://www.oracle.com/technology/pub/articles/sharma_indexes.html.
DCookie
Seems true for this query, bitmap or not; there is no magic a bitmap index can do to save this query.
KM
The last sentence of your answer is untrue, which is the reason for my downvote.
DCookie
@DCookie, so how does the bitmap index save this query?
KM
He's not saying the bitmap index saves the query. He's saying you're wrong in saying that a bitmap index needs a wide distribution of values to be useful - this is, in fact, the exact opposite of the way bitmap indexes work - they're designed for columns with a small number of distinct values. As for this query, if you look at my examples above, there is at least one case where the bitmap index gets used, depending on how the query is written.
Steve Broberg
@Steve Broberg, what I'm saying is that you can't use a bitmap index the way the OP does and have it work some magic to save the query.
KM
A: 

Well, for one thing, on your second left join it looks like you're using "first_token" and "last_token" instead of "first_name" and "last_name." I'm guessing this is just a cut-and-paste error, though?

John Hyland
You are correct, it was a copy-and-paste error. The schema creator actually named it token_1, token_2, etc., but I was renaming things to make the question clearer.
Kevin Babcock
+3  A: 

+1 for Quassnoi's answer, but let me add that although it won't help in this case (because you're retrieving so many records) this table may do well stored in a hash cluster on person_id so that the records for a single person are colocated in the same block. For retrieving few records that's a much faster structure than a heap table.
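A sketch of what that could look like (cluster name, SIZE and HASHKEYS values are assumptions that would need tuning to the real row sizes and person count):

```sql
-- Hypothetical sketch: a hash cluster keyed on person_id, so all name
-- rows for one person are stored together in the same block(s).
CREATE CLUSTER person_name_cluster (person_id INTEGER)
  SIZE 512 HASHKEYS 10000000;

CREATE TABLE name_parts (
  id           INTEGER PRIMARY KEY,
  person_id    INTEGER NOT NULL,
  name_part    NVARCHAR2(100),
  name_part_id INTEGER NOT NULL
) CLUSTER person_name_cluster (person_id);
```

Looking up one person's name parts then touches a single cluster block instead of scattered heap blocks.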

David Aldridge
Clustering these records will slow down the `DML`, but will certainly speed up retrieving a person's names by `id`. Don't know if slowing down `DML` is acceptable in `@op`'s case, but this option certainly deserves consideration. +1
Quassnoi
It is indeed a slower insert (though there's that bitmap index, which makes me wonder what's going on with DML there at all), but I always like to consider that we usually modify data much less often than we read it. Overheads on DML due to indexing or clustering ought to balance out with better select performance.
David Aldridge
+3  A: 

Given that you're essentially doing a full table scan anyway (as your query is extracting all data from this table, excluding the few rows that wouldn't have name parts that were first, middle or last), you may want to consider writing the query so that it just returns the data in a slightly different format, such as:

  SELECT person_id
       , name_part_id
       , name_part
    FROM NAME_PART
   WHERE name_part_id IN (1, 2, 3)
ORDER BY person_id
       , name_part_id;

Of course, you'll end up with 3 rows instead of one for each name, but it may be trivial for your client code to roll these together. You can also roll the 3 rows up into one by using decode, group by and max:

  SELECT person_id
       , max(decode(name_part_id, 1, name_part, null)) first
       , max(decode(name_part_id, 2, name_part, null)) middle
       , max(decode(name_part_id, 3, name_part, null)) last
    FROM NAME_PART
   WHERE name_part_id IN (1, 2, 3)
GROUP BY person_id
ORDER BY person_id;

This will produce results identical to your original query. Both versions will only scan the table once (with a sort), instead of dealing with the 3-way join. If you made the table an index-organized table on the person_id index, you'd save the sort step.
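The index-organized variant mentioned here might be sketched as follows (table and constraint names are assumptions, and it presumes at most one row per (person_id, name_part_id) pair, which may not hold for cultures with several "middle" parts):

```sql
-- Hypothetical sketch: an index-organized table keyed on
-- (person_id, name_part_id) keeps rows physically sorted by person,
-- so the ORDER BY / GROUP BY sort step can be avoided.
CREATE TABLE name_parts_iot (
  person_id    INTEGER,
  name_part_id INTEGER,
  name_part    NVARCHAR2(100),
  CONSTRAINT name_parts_iot_pk PRIMARY KEY (person_id, name_part_id)
) ORGANIZATION INDEX;
```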

I ran a test with a table with 56,150 persons, and here's a rundown of the results:

Original query:

Execution Plan
----------------------------------------------------------

------------------------------------------------------------------------------
| Id  | Operation           | Name      | Rows  | Bytes |TempSpc| Cost (%CPU)|
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |           |   113K|    11M|       |  1364   (2)|
|*  1 |  HASH JOIN          |           |   113K|    11M|  2528K|  1364   (2)|
|*  2 |   TABLE ACCESS FULL | NAME_PART | 56150 |  1864K|       |   229   (3)|
|*  3 |   HASH JOIN         |           | 79792 |  5298K|  2528K|   706   (2)|
|*  4 |    TABLE ACCESS FULL| NAME_PART | 56150 |  1864K|       |   229   (3)|
|*  5 |    TABLE ACCESS FULL| NAME_PART | 56150 |  1864K|       |   229   (3)|
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("FIRST_NAME"."PERSON_ID"="LAST_NAME"."PERSON_ID")
   2 - filter("LAST_NAME"."NAME_PART_ID"=3)
   3 - access("FIRST_NAME"."PERSON_ID"="MIDDLE_NAME"."PERSON_ID")
   4 - filter("FIRST_NAME"."NAME_PART_ID"=1)
   5 - filter("MIDDLE_NAME"."NAME_PART_ID"=2)

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
       6740  consistent gets
          0  physical reads
          0  redo size
    5298174  bytes sent via SQL*Net to client
      26435  bytes received via SQL*Net from client
       3745  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
      56150  rows processed

My query #1 (3 rows/person):

Execution Plan
----------------------------------------------------------

-----------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes |TempSpc| Cost (%CPU)|
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |   168K|  5593K|       |  1776   (2)|
|   1 |  SORT ORDER BY     |           |   168K|  5593K|    14M|  1776   (2)|
|*  2 |   TABLE ACCESS FULL| NAME_PART |   168K|  5593K|       |   230   (3)|
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("NAME_PART_ID"=1 OR "NAME_PART_ID"=2 OR "NAME_PART_ID"=3)

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
       1005  consistent gets
          0  physical reads
          0  redo size
    3799794  bytes sent via SQL*Net to client
      78837  bytes received via SQL*Net from client
      11231  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
     168450  rows processed

My query #2 (1 row/person):

Execution Plan
----------------------------------------------------------

-----------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes |TempSpc| Cost (%CPU)|
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           | 56150 |  1864K|       |  1115   (3)|
|   1 |  SORT GROUP BY     |           | 56150 |  1864K|  9728K|  1115   (3)|
|*  2 |   TABLE ACCESS FULL| NAME_PART |   168K|  5593K|       |   230   (3)|
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("NAME_PART_ID"=1 OR "NAME_PART_ID"=2 OR "NAME_PART_ID"=3)

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
       1005  consistent gets
          0  physical reads
          0  redo size
    5298159  bytes sent via SQL*Net to client
      26435  bytes received via SQL*Net from client
       3745  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
      56150  rows processed

Turns out, you can squeeze it a bit faster still; I tried to avoid the sort by adding an index hint to force the use of the person_id index. I managed to knock off another 10%, but it still looks like it's sorting:

  SELECT /*+ index(name_part,NAME_PART_person_id) */ person_id
       , max(decode(name_part_id, 1, name_part)) first
       , max(decode(name_part_id, 2, name_part)) middle
       , max(decode(name_part_id, 3, name_part)) last
    FROM name_part
   WHERE name_part_id IN (1, 2, 3)
GROUP BY person_id
ORDER BY person_id;

Execution Plan
----------------------------------------------------------

-----------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name                  | Rows  | Bytes |TempSpc| Cost (%CPU)|
-----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |                       | 56150 |  1864K|       |  3385   (1)|
|   1 |  SORT GROUP BY                 |                       | 56150 |  1864K|  9728K|  3385   (1)|
|   2 |   INLIST ITERATOR              |                       |       |       |       |            |
|   3 |    TABLE ACCESS BY INDEX ROWID | NAME_PART             |   168K|  5593K|       |  2500   (1)|
|   4 |     BITMAP CONVERSION TO ROWIDS|                       |       |       |       |            |
|*  5 |      BITMAP INDEX SINGLE VALUE | NAME_PART_NAME_PART_ID|       |       |       |            |
-----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - access("NAME_PART_ID"=1 OR "NAME_PART_ID"=2 OR "NAME_PART_ID"=3)

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
        971  consistent gets
          0  physical reads
          0  redo size
    5298159  bytes sent via SQL*Net to client
      26435  bytes received via SQL*Net from client
       3745  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
      56150  rows processed

However, the plans above are all based on the assumption that you're selecting from the entire table. If you constrain the results based on person_id (e.g., person_id between 55968 and 56000), it turns out that your original query with the hash joins is the fastest (27 vs. 106 consistent gets for the constraint I specified).

On the THIRD hand, if the queries above are being used to populate a GUI that uses a cursor to scroll over the result set (such that you would only see the first N rows of the result set initially - reproduced here by adding a "rownum < 50" predicate), my versions of the query once again become fast - very fast (4 consistent gets vs. 417).
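As a sketch, the first-N behaviour can be reproduced by wrapping the rolled-up query in a ROWNUM filter (the inline-view wrapping is needed so the limit applies after grouping and ordering):

```sql
-- Hypothetical sketch: fetch only the first 49 rolled-up rows, as a
-- scrolling GUI cursor would.
SELECT *
  FROM (  SELECT person_id
               , max(decode(name_part_id, 1, name_part)) first
               , max(decode(name_part_id, 2, name_part)) middle
               , max(decode(name_part_id, 3, name_part)) last
            FROM name_part
           WHERE name_part_id IN (1, 2, 3)
        GROUP BY person_id
        ORDER BY person_id)
 WHERE ROWNUM < 50;
```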

The moral of the story is that it really depends exactly how you're accessing the data. Queries that work well on the entire result set may be worse when applied against different subsets.

Steve Broberg
+1, for the last paragraph if nothing else!
DCookie
+1 and green checked. Thanks for the excellent answer and insight!
Kevin Babcock