views: 241

answers: 8

I have a table with about 2,000,000 rows. I need to query one of the columns to retrieve the rows where a string exists as part of the value.

When I run the query I will know the position of the string, but not beforehand, so a view which takes a substring is not an option.

As far as I can see I have three options

  1. using LIKE '% %'
  2. using INSTR
  3. using SUBSTR

I do have the option of creating a function-based index, if I am nice to the DBA.

At the moment all queries take about two seconds. Does anyone have experience of which of these options will work best, or whether there is another option? The select will be used for deletes every few seconds; it will typically select about 10 rows.

edit with some more info

The problem comes about because we are using a table to store objects with arbitrary keys and values. The objects come from outside our system, so we have limited scope to control them; the text column is something like 'key1=abc,key2=def,keyn=ghi'. I know this is horribly denormalised, but as we don't know (to some extent) what the keys will be, it is a reliable way to store and retrieve values. Retrieving a row is fairly fast when we search the whole of the column, which is indexed. But the performance is not good if we want to retrieve the rows with key2=def.

We may be able to create a table with columns for the most common keys, but I was wondering if there is a way to improve performance with the existing setup.

+1  A: 

I suggest reconsidering your logic.

Instead of looking for where a string exists, it may be faster to check that the value has a length > 0 and is not purely numeric.

You can use the TRANSLATE function in Oracle to convert all numeric characters to nulls, then check whether the result is null.
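As an illustration only — this assumes the values of interest are purely numeric, which (as the comment below notes) the question doesn't actually say — the TRANSLATE trick might look like this, with a hypothetical table and column name:

    -- Map every digit to nothing; the leading 'x'->'x' pair keeps the
    -- "to" argument non-null. If the result is NULL, the original value
    -- contained only digits.
    SELECT *
    FROM   my_table
    WHERE  TRANSLATE(text_col, 'x0123456789', 'x') IS NULL;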

John
Where did "only numeric" come from? I don't see anything in the question that would suggest that.
Dave Costa
+1  A: 

Can you provide a bit more information?

Are you querying for an arbitrary substring of a string column, or is there some syntax on the strings stored in the column that would allow for preprocessing to minimize repeated work?

Have you already done any timing tests on your three options to determine their relative performance on the data you're querying?

joel.neely
+2  A: 

You can use Tom Kyte's runstats package to compare the performance of different implementations, running each, say, 1000 times in a loop. For example, I just compared LIKE with SUBSTR, and it said that LIKE was faster, taking about 80% of the time of SUBSTR.
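A comparison along those lines might be sketched like this (it assumes runstats_pkg is installed, and the table/column names and offsets are hypothetical):

    -- Time 1000 LIKE scans against 1000 SUBSTR scans and report the difference
    exec runStats_pkg.rs_start;

    begin
      for i in 1 .. 1000 loop
        for r in (select * from main_table
                  where text_col like '%key2=def%') loop
          null;
        end loop;
      end loop;
    end;
    /

    exec runStats_pkg.rs_middle;

    begin
      for i in 1 .. 1000 loop
        for r in (select * from main_table
                  where substr(text_col, 10, 8) = 'key2=def') loop
          null;
        end loop;
      end loop;
    end;
    /

    -- Show only statistics that differ by more than 1000
    exec runStats_pkg.rs_stop(1000);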

Note that "col LIKE '%xxx%'" is different from "SUBSTR(col,5,3) = 'xxx'". The equivalent LIKE would be:

col LIKE '____xxx%'

using one '_' for each leading character to be ignored.

I think whichever way you do it, the results will be similar - it always involves a full table (or perhaps full index) scan. A function-based index would only help if you knew the offset of the substring at the time of creating the index.
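For illustration, if the offset were known when the index is created, a function-based index could cover the predicate directly (names and offsets here are hypothetical):

    -- Index the 3 characters starting at position 5
    CREATE INDEX main_table_sub_idx
        ON main_table (SUBSTR(text_col, 5, 3));

    -- The predicate must match the indexed expression exactly
    SELECT *
    FROM   main_table
    WHERE  SUBSTR(text_col, 5, 3) = 'xxx';

Since the question says the offset is only known at query time, this doesn't fit as stated; it only works if the set of offsets is fixed in advance.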

I am rather concerned when you say that "The select will be used for deletes every few seconds". This does rather suggest a design flaw somewhere, but without knowing the requirements it's hard to say.

UPDATE:

If your column values are like 'key1=abc,key2=def,keyn=ghi' then perhaps you could consider adding another table like this:

 create table key_values
    ( main_table_id references main_table
    , key_value varchar2(50)
    , primary key (main_table_id, key_value)
    );

 create index key_values_idx on key_values (key_value);

Split the key values up and store them in this table like this:

main_table_id key_value
123           key1=abc
123           key2=def
123           key3=ghi

(This could be done in an AFTER INSERT trigger on main_table for example)
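Such a trigger might be sketched as follows — it assumes the comma-separated 'key=value' format from the question, and the table and column names are hypothetical:

    CREATE OR REPLACE TRIGGER main_table_ai
    AFTER INSERT ON main_table
    FOR EACH ROW
    DECLARE
      l_rest VARCHAR2(4000) := :NEW.text_col;  -- e.g. 'key1=abc,key2=def'
      l_pos  PLS_INTEGER;
    BEGIN
      -- Walk the comma-separated list, inserting one row per key=value pair
      WHILE l_rest IS NOT NULL LOOP
        l_pos := INSTR(l_rest, ',');
        IF l_pos = 0 THEN
          INSERT INTO key_values (main_table_id, key_value)
          VALUES (:NEW.id, l_rest);
          l_rest := NULL;
        ELSE
          INSERT INTO key_values (main_table_id, key_value)
          VALUES (:NEW.id, SUBSTR(l_rest, 1, l_pos - 1));
          l_rest := SUBSTR(l_rest, l_pos + 1);
        END IF;
      END LOOP;
    END;
    /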

Then your delete could be:

delete main_table
where id in (select main_table_id from key_values
             where key_value = 'key2=def');
Tony Andrews
+2  A: 

In Oracle 10:

CREATE TABLE test (tst_test VARCHAR2(200));

CREATE INDEX ix_re_1 ON test(REGEXP_REPLACE(REGEXP_SUBSTR(tst_test, 'KEY1=[^,]*'), 'KEY1=([^,]*)', '\1'))

SELECT  *
FROM    TEST
WHERE   REGEXP_REPLACE(REGEXP_SUBSTR(TST_TEST, 'KEY1=[^,]*'), 'KEY1=([^,]*)', '\1') = 'TEST'

This will use the newly created index.

You will need as many indexes as there are KEYs in your data.

Presence of an INDEX, of course, impacts performance, but it depends very little on REGEXP being there:

SQL> CREATE INDEX ix_test ON test (tst_test)
  2  /
Index created
Executed in 0,016 seconds

SQL> INSERT
  2  INTO   test (tst_test)
  3  SELECT 'KEY1=' || level || ';KEY2=' || (level + 10000)
  4  FROM   dual
  5  CONNECT BY
  6     LEVEL <= 1000000
  7  /
1000000 rows inserted
Executed in 47,781 seconds

SQL> TRUNCATE TABLE test
  2  /
Table truncated
Executed in 2,546 seconds

SQL> DROP INDEX ix_test
  2  /
Index dropped
Executed in 0 seconds

SQL> CREATE INDEX ix_re_1 ON test(REGEXP_REPLACE(REGEXP_SUBSTR(tst_test, 'KEY1=[^,]*'), 'KEY1=([^,]*)', '\1'))
  2  /
Index created
Executed in 0,015 seconds

SQL> INSERT
  2  INTO   test (tst_test)
  3  SELECT 'KEY1=' || level || ';KEY2=' || (level + 10000)
  4  FROM   dual
  5  CONNECT BY
  6     LEVEL <= 1000000
  7  /
1000000 rows inserted
Executed in 53,375 seconds

As you can see, on my not very fast machine (Core2 4300, 1 GB RAM) you can insert 20,000 records per second into an indexed field, and this rate hardly depends on the type of INDEX being used: plain or function-based.

Quassnoi
I like this, we will test and see if it works for us. Thanks
Jeremy French
This probably works for the selects, but our DBA pointed out that it will make inserts much slower, as it has to build the index. Thanks for a very neat answer all the same.
Jeremy French
A: 

If you're always going to be looking for the same substring, then using INSTR and a function-based index makes sense to me. You could also do this if you have a small set of constant substrings you will be looking for, creating one FBI for each one.
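For a single constant substring, such an INSTR-based FBI might be sketched like this (names hypothetical; one index per substring of interest):

    -- Index the position of the fixed substring (0 when absent)
    CREATE INDEX main_table_key2_idx
        ON main_table (INSTR(text_col, 'key2=def'));

    -- The predicate must use the same expression for the index to apply
    SELECT *
    FROM   main_table
    WHERE  INSTR(text_col, 'key2=def') > 0;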

Quassnoi's REGEXP idea looks promising too. I haven't used regular expressions inside Oracle yet.

I think that Oracle Text would be another way to go. Info on that here
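A minimal Oracle Text sketch, assuming the CTXSYS.CONTEXT index type is available (names hypothetical; note that query tokens containing '=' may need escaping in the Text query syntax):

    CREATE INDEX main_table_txt_idx ON main_table (text_col)
      INDEXTYPE IS CTXSYS.CONTEXT;

    SELECT *
    FROM   main_table
    WHERE  CONTAINS(text_col, 'key2=def') > 0;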

Dave Costa
A: 

Not sure about improving the existing setup, but Lucene (a full-text search library, ported to many platforms) can really help. There's the extra burden of keeping the index synchronized with the DB, but if you have anything resembling a service layer in some programming language, this becomes an easy task.

Anton Gogolev
+1  A: 

Separate answer to comment on the table design.

Can't you at least have a KEY/VALUE structure, so that instead of storing 'key1=abc,key2=def,keyn=ghi' in a single column, you would have a child table like:

KEY     VALUE
key1    abc
key2    def
key3    ghi

Then you can create a single index on key and value and your queries are much simpler (since I take it you are actually looking for an exact match on a given key's value).
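The structure above might be sketched like this (table and column names are hypothetical):

    CREATE TABLE object_properties
      ( main_table_id NUMBER REFERENCES main_table
      , key_name      VARCHAR2(50)
      , key_value     VARCHAR2(200)
      , PRIMARY KEY (main_table_id, key_name)
      );

    -- Single index covering the exact-match lookup
    CREATE INDEX object_properties_idx
        ON object_properties (key_name, key_value);

    SELECT main_table_id
    FROM   object_properties
    WHERE  key_name = 'key2' AND key_value = 'def';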

Some people will probably comment that this is a horrible design, but I think it's better than what you have now.

Dave Costa
We may change the table so that it has columns for the most common keys. However, we are using this to store values from another system, so we don't have a definitive list of keys. We are looking into using a materialized view for this.
Jeremy French
What I'm suggesting still allows storage of arbitrary key names. "key1", "key2", etc. would be values in a column, not columns in and of themselves.
Dave Costa
A: 

Similar to Anton Gogolev's response, Oracle does incorporate a text search engine documented here

There's also extensible indexing, so you can build your own index structures, documented here

As you've agreed, this is a very poor data structure, and I think you will struggle to achieve the aim of deleting stuff every few seconds. Depending on how this data gets input, I'd look at properly structuring the data on load, at least to the extent of having rows of "parent_id", "key_name", "key_value".

Gary