views:

345

answers:

3

This is more of a generic SQL problem but I'm using Firebird 2.5 if anyone knows of a Firebird/Interbase specific optimization. First, the following is a simplified example schema to illustrate the issue I'm trying to solve:

CREATE TABLE users
(
   id INTEGER PRIMARY KEY,
   name VARCHAR(16)
);

CREATE TABLE data_set
(
   id INTEGER PRIMARY KEY,
   name VARCHAR(64)
);

CREATE UNIQUE INDEX data_set_name_idx ON data_set(name);

CREATE TABLE data
(
   user_id INTEGER,
   data_set_id INTEGER,
   data BLOB,
   PRIMARY KEY(user_id, data_set_id)
);

CREATE INDEX data_user_id_idx ON data(user_id);
CREATE INDEX data_data_set_id_idx ON data(data_set_id);

The query I'm trying to run is as follows:

SELECT users.name, data_set.name, data FROM users, data_set, data
WHERE user_id=XXX AND user_id=users.id AND data_set_id=data_set.id
ORDER BY data_set.name;

With 'XXX' being filled in with the *user_id* I want. So what I'm doing is selecting all the rows from the data table that are owned by a particular user and I'm sorting the results based on the *data_set* name.

This works as it is but the problem is the data table has over a billion rows in it and the *data_set* table is not small either. The result set for a single user id may be many hundreds of millions of rows. What happens is that in order for the ORDER BY to work the database has to create a massive amount of temporary data which is incredibly slow and uses a lot of disk space. Without the ORDER BY it's fast but obviously not sorted like I need.

One solution would be to take the *data_set.name* values and just put them in a varchar column in data. Then that could be indexed and would be quick to sort. The problem with this approach is that it will have a lot of duplicate data and make the database absolutely massive.

Another solution would be something like an Indexed View or an indexed Computed Column. As far as I know neither of those is supported by Firebird.

Any other ideas?

A: 

Why not index data_set.name?

Also, I would skip primary key definition for facts table (data) and put two separate indexes for foreign keys, to speed up joins. (of course, indexing may have impact on insertions, if you have a large number of records to be inserted).

If you need to ensure unique constraint of the facts table, you could do it from the job transferring data into that table (I don't have details about that :).

Cătălin Pitiș
Sorry, I forgot to include that. Yes, data_set.name is indexed. I'll update the example schema. Indexing it does not fix the problem though.
CR
You should then check the execution plan. It might be that you have to go to database specific optimization...
Cătălin Pitiș
As far as the execution plan, the database is basically taking the entire unsorted result set and putting it in a temporary table so it can then index the ORDER BY column. Essentially the same thing as if I added a column to *data* that contained the *data_set.name*, except just on the result set. Due to the size of the result set this is very slow.I'm wondering if there is some way to create an index in *data* on the foreign key reference except using the value it references (data_set.name) instead of the integer value.
CR
+1  A: 

This is rather speculative, but I wonder if this could be restructured as:

  1. A cartesian product between users and data set, including the predicate on user.
  2. Order by data set name
  3. Joining to data

... would be more efficient, especially if you were only interested in the forst rows of the query.

In Oracle I'd think that it would not be because the nested loop join would be much less efficient than a hash join, but I'm not familiar with firebird at all I'm afraid.

David Aldridge
A: 

Try to define an index on data_set(id, name) and experiment with it - maybe in combination with other suggestions here. You can change your existing UNIQUE index to a UNIQUE CONSTRAINT if your requirements dictate it and Firebird supports unique constraints.

liggett78