I have a table whose columns are a varchar(50) and a float. I need to (very quickly) look up the float associated with a given string. Even with indexing, this is rather slow.

I know, however, that each string is associated with an integer that I know at the time of lookup: each string maps to exactly one integer, but several strings can map to the same integer. One might think of it as a tree structure.

Is there anything to be gained by adding this integer to the table, indexing on it, and using a query like:

SELECT floatval FROM mytable WHERE phrase=givenstring AND assoc=givenint
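
Concretely, the change I have in mind would look something like this (the index name is just for illustration):

ALTER TABLE mytable ADD COLUMN assoc integer;
-- populate assoc from the known string-to-integer mapping, then:
CREATE INDEX mytable_assoc_idx ON mytable (assoc);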

This is Postgres, and if you could not tell, I have very little experience with databases.

A: 

By declaring an index on (phrase, assoc, floatval) you will get a "covering index", which allows the query posted in the question to be performed without even accessing the table. Assuming that either phrase or assoc alone is highly selective (not many rows share the same value for that field), creating an index on that field alone should yield almost the same performance.
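
As a sketch, using the table and column names from the question (the index names are illustrative):

CREATE INDEX mytable_covering_idx ON mytable (phrase, assoc, floatval);

-- if, say, assoc alone is highly selective, this may do almost as well:
CREATE INDEX mytable_selective_idx ON mytable (assoc);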

Generally, you will want to limit the number of indexes to the smallest set that gets your frequent queries up to the desired performance. For each index you add to a table, you pay some disk space, but more importantly you pay the price of having the DBMS do more work on each INSERT into the table.

Jørn Schou-Rode
PostgreSQL doesn't have covering indexes, so that index would definitely be a loss.
Magnus Hagander
@Magnus: So even if an index covers all the fields needed to answer a query, PostgreSQL will have to access the actual table to retrieve the values? Do you have some reference for this? I am somewhat curious to know *why* :)
Jørn Schou-Rode
A: 

It couldn't hurt to try adding the int and making your index on the int and varchar with the float included - this would be covering and pretty efficient. I'm not sure whether Postgres has included columns; if it doesn't, simply add the float to the index itself.

There are several other techniques you could look into (I'm not familiar with all Postgres features, so I'll describe them by their SQL Server names):

Indexed views - you can effectively materialize a view which joins several tables - so you could join your varchar to your int and index the view on the int, varchar and float

Included columns - you can include columns in an index to ensure that the index is covering - i.e. have an index on the varchar which includes the float. If your index isn't covering, the query optimizer is still going to have to use the index and then do a bookmark lookup to get the remaining data.
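
A sketch of the included-columns variant in SQL Server syntax, using the names from the question (the index name is illustrative):

CREATE NONCLUSTERED INDEX ix_mytable_cover ON mytable (phrase) INCLUDE (floatval);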

Cade Roux
`PostgreSQL` does not support indexed views or included columns, but it does support function-based indexes (you don't have to materialize an expression to have it indexed).
Quassnoi
+4  A: 

Keys on VARCHAR columns can be very long, which results in fewer records per page and greater depth (more levels in the B-Tree). Longer keys also increase the cache miss ratio.

How many strings, on average, map to each integer?

If there are relatively few, you can create an index on the integer column alone and let PostgreSQL do the fine filtering on the fetched records:

CREATE INDEX ix_mytable_assoc ON mytable (assoc);

SELECT  floatval
FROM    mytable
WHERE   assoc = givenint
        AND phrase = givenstring

You can also consider creating the index on the string hashes:

CREATE INDEX ix_mytable_md5 ON mytable (DECODE(MD5(phrase), 'HEX'));

SELECT  floatval
FROM    mytable
WHERE   DECODE(MD5(phrase), 'HEX') = DECODE(MD5('givenstring'), 'HEX')
        AND phrase = givenstring -- who knows when we might get a collision?

Each hash is only 16 bytes long, so the index keys will be much shorter while still preserving the selectivity almost perfectly.

Quassnoi
Comparisons of the index keys are also a lot more expensive with varchar, since they are locale-aware. The integer index will definitely be much faster than any of the other options.
Magnus Hagander
@Magnus: the comparison should be made only `log(n)` times so I wouldn't call this "a lot" more expensive, but you're right, it does add some `CPU` cycles as well.
Quassnoi
A: 

I'd recommend simply a hash index:

create index mytable_phrase_idx on mytable using hash(phrase);

This way queries like

select floatval from mytable where phrase='foo bar';

will be very quick. Test this:

create temporary table test ( k varchar(50), v float);
insert into test (k, v) select 'foo bar number '||generate_series(1,1000000), 1;
create index test_k_idx on test using hash(k);
analyze test;
explain analyze select v from test where k='foo bar number 634652';
                                                   QUERY PLAN                                                    
-----------------------------------------------------------------------------------------------------------------
 Index Scan using test_k_idx on test  (cost=0.00..8.45 rows=1 width=8) (actual time=0.201..0.206 rows=1 loops=1)
   Index Cond: ((k)::text = 'foo bar number 634652'::text)
 Total runtime: 0.265 ms
(3 rows)
Tometzky
A: 

Short answer: yes, there will be much to gain - at least as long as you don't have many updates, and even then it's quite likely that the overhead will not be noticeable.

Magnus Hagander