ansaurus

Question

String matching in Oracle 10g where either side can have wildcards

Answer 1

A:

Oracle 10g does have a regular expressions function which might help you in this situation. http://download.oracle.com/docs/cd/B19306%5F01/appdev.102/b14251/adfns%5Fregexp.htm

Additionally you could look at a java stored procedure if you have to do this in the databae.

I'm not aware of an index that will help you in this situation though given the _ can occur anywhere including the first character.

David 2009-08-14 17:07:32

Yes, I have looked at a regex solution as well. The issue with it is the significant overhead, while still having issues with wildcards on either side..2345 would become ^(.)(2|\.)(3|\.)(4|\.)(5|\.)$not optimized for clarity

Jason Pyeron 2009-08-14 17:41:25

Answer 2

A:

Is the mask only a single character ? If so you can limit the possibilities by something like

select VALS from tmp 
where specialmatch(vals,'12945678')
and (substr(vals,1,4) = substr('12945678',1,4) 
      or substr(vals,5) = substr('12945678',5));

Then you have function-based indexes on substr(vals,1,4) and substr(vals,5). I seem to recall reading that there may be an issue with FBIs not getting the best plan for these, so an alternative SQL would be

    select VALS from tmp 
    where specialmatch(vals,'12945678')
    and substr(vals,1,4) = substr('12945678',1,4)
    union 
    select VALS from tmp 
    where specialmatch(vals,'12945678')
    substr(vals,5) = substr('12945678',5));

Gary 2009-08-15 21:50:50

I like your idea, but there can be any number of wild cards, I should have made the test data reflect it.

Jason Pyeron 2009-08-16 13:31:51

Answer 3

A:

Next suggestions. Simple option : The _ is the single character match for LIKE, so the simple solution is

SELECT * FROM tmp WHERE vals LIKE v_param OR v_param LIKE vals;

It will be a full table scan each time, but saves the switching between SQL and PL/SQL layers

Complex option Bitmap indexes on substr for each individual character. That sort of multi-index tuff is what bitmaps are good at. Bitmaps are a bugger for columns with heavy updates or tables with lots of small inserts though.

I've built a test test. Firstly I've loaded 10,000 values into TMP, pretty much randomly generated. Not sure how big your data set will be, or the proportion of entries with no wildcards, one wildcard or multiple wildcards. That would have a big effect on the results.

create table tmp ( vals varchar(8),  mask varchar(8));

insert into tmp
select new_val, translate(new_val,'0123456789','__________')
from
  (select case 
         when rn_3  is not null then translate(val,'34','__')
         when rn_5  is not null then translate(val,'2','_')
         when rn_7  is not null then translate(val,'78','__')
         when rn_11 is not null then translate(val,'12345','_____')
         else val end new_val
  from
    (select lpad(trunc(dbms_random.value(1,99999999)),8,'0') val, 
            decode(mod(rownum,3),0,1) rn_3, decode(mod(rownum,5),0,1) rn_5, 
            decode(mod(rownum,7),0,1) rn_7, decode(mod(rownum,11),0,1) rn_11
     from dual connect by level < 10000)
  )

declare
  cursor c_1 is
    select case 
         when rn_3  is not null then translate(val,'34','__')
         when rn_5  is not null then translate(val,'2','_')
         when rn_7  is not null then translate(val,'78','__')
         when rn_11 is not null then translate(val,'12345','_____')
         else val end try_val
    from
      (select lpad(trunc(dbms_random.value(1,99999999)),8,'0') val, 
              decode(mod(rownum,3),0,1) rn_3, decode(mod(rownum,5),0,1) rn_5, 
              decode(mod(rownum,7),0,1) rn_7, decode(mod(rownum,11),0,1) rn_11
       from dual connect by level < 1000);
  v_cnt number;
  v_start number;
  v_end number;
begin
 v_start := dbms_utility.get_time;
 for c_rec in c_1 loop
    select count(*) into v_cnt 
    from tmp 
    where (c_rec.try_val like vals or vals like c_rec.try_val);
 end loop;
 v_end := dbms_utility.get_time;
 dbms_output.put_line('Meth 1 :'||(v_end - v_start));
 v_start := dbms_utility.get_time;
 for c_rec in c_1 loop
    select count(*) into v_cnt from
      (select * from (select * from tmp where mask = '        ') v1 
       where vals like c_rec.try_val
       union all
       select * from (select * from tmp where mask > '        ') v2 
       where vals like maskmerge(mask,c_rec.try_val));
 end loop;
 v_end := dbms_utility.get_time;
 dbms_output.put_line('Meth 2 :'||(v_end - v_start));
end;
/

I've compared the 'double headed LIKE' against mask merge. In the test, the LIKE typically got around 200-250 (hundreths of a second), while the maskmerge took around ten times longer. As I said, it will very much depend on the data distribution.

Gary 2009-08-17 00:09:53

the _ character was chose for that reason.I am googleing about bitmap indexes, but so far I am not sure how it would apply.

Jason Pyeron 2009-08-17 09:52:14

what do you think of the solution I finally figured out?It helps to use venn diagrams on a white board...

Jason Pyeron 2009-08-17 17:02:39

the dual like pattern fails for:select * from dual where '1_' like '_1' or '_1' like '1_';

Jason Pyeron 2009-08-18 14:16:22

Answer 4

A:

I "partitioned" the table into two distinct sets, those who do not have a mask (v1) and those who do (v2)

select * from (select * from tmp where mask = '        ') v1 where vals like :srch
 union all
select * from (select * from tmp where mask > '        ') v2 where vals like maskmerge(mask,:srch);

Now, the optimizer says:

Operation                                Object Name       Rows Bytes Cost
SELECT STATEMENT Optimizer Mode=ALL_ROWS                      2          5    
  UNION-ALL
    TABLE ACCESS BY INDEX ROWID          SCHEMA.TMP           1    90    2  
      INDEX RANGE SCAN                   SCHEMA.I_TMP_MASK    1          1
    TABLE ACCESS BY INDEX ROWID          SCHEMA.TMP           1    90    3
      INDEX RANGE SCAN                   SCHEMA.I_TMP_MASK    2          1

Which is very good, even when my :srch has wild cards in it Oracle can optimize away.

In the end standard indexes on vals and mask cols was sufficient to do the trick, even without hints. Tested on 10g. Note: we use union all since, v1 and v2 are always mutually exclusive.

Fyi:

CREATE OR REPLACE FUNCTION maskmerge (A IN VARCHAR, B IN VARCHAR)
 RETURN VARCHAR deterministic parallel_enable
 IS  
 alen int;
 blen int;
 mlen int;   
 res varchar(4000);
 ca char;
 cb char;
BEGIN
 if (a is null) then
  return b;
 end if;
 if (b is null) then
  return a;
 end if;
 alen:=length(a);
 blen:=length(b);
 if (alen<blen) then
  mlen:=alen;
 else
  mlen:=blen;
 end if;
 for i in 1 .. mlen loop
  ca:=substr(a,i,1);
  cb:=substr(b,i,1);
  if (ca='_' or cb='_') then
   res:=res||'_';
  elsif (ca=' ') then
   res:=res||cb;
  elsif (cb=' ') then
   res:=res||ca;
  else
   res:=res||cb;
  end if;
 end loop;
 return res;
END;

Full test case (with typical data distribution):

-----------------------------------------------------------------

CREATE OR REPLACE FUNCTION maskmerge (A IN VARCHAR, B IN VARCHAR)
 RETURN VARCHAR deterministic parallel_enable
 IS  
 alen int;
 blen int;
 mlen int;   
 res varchar(4000);
 ca char;
 cb char;
BEGIN
 if (a is null) then
  return b;
 end if;
 if (b is null) then
  return a;
 end if;
 alen:=length(a);
 blen:=length(b);
 if (alen<blen) then
  mlen:=alen;
 else
  mlen:=blen;
 end if;
 for i in 1 .. mlen loop
  ca:=substr(a,i,1);
  cb:=substr(b,i,1);
  if (ca='_' or cb='_') then
   res:=res||'_';
  elsif (ca=' ') then
   res:=res||cb;
  elsif (cb=' ') then
   res:=res||ca;
  else
   res:=res||cb;
  end if;
 end loop;
 return res;
END;
/

create table tmp
(
id int not null primary key,
ipv6address varchar(32) not null,
ipv6addressmask varchar(32) default ('                                ') not null
);

create sequence s_tmp;

create index i_tmp_addr on tmp(ipv6address);

create index i_tmp_mask on tmp(ipv6addressmask);

create or replace trigger t_i_tmp before insert on tmp referencing new as new old as old FOR EACH ROW
DECLARE
    tmpVar tmp.id%TYPE;
begin
SELECT s_tmp.NEXTVAL INTO tmpVar FROM dual;
:new.id:=tmpVar;
end;

exec dbms_random.initialize(17809465);

insert into tmp (ipv6address) 
select decode(trunc(dbms_random.value(0,2)),0,'20010db80000000000000000',1,'00000000000000000000ffff','00000000000000000000ffff') 
||trim(to_char(dbms_random.value(0, 4294967296),'0000000x')) 
as val from dual
connect by level <= 10000;

insert into tmp
SELECT * FROM
( SELECT * FROM tmp
ORDER BY dbms_random.value )
WHERE rownum <= 200;

insert into tmp values (null,'00000000000000000000ffff12345678','                                ');

insert into tmp values (null,'00000000000000000000ffff12345678','                              _ ');

insert into tmp values (null,'00000000000000000000ffff1234567_','                              __');

--select * from tmp order by ipv6address

-- network redaction of ipv4 
update tmp set ipv6addressmask=maskmerge('                        ______  ',ipv6addressmask),ipv6address=maskmerge('                        ______  ',ipv6address) where length(ipv6address)/32*dbms_random.value<0.005;

-- host redaction of ipv4 
update tmp set ipv6addressmask=maskmerge('                              __',ipv6addressmask),ipv6address=maskmerge('                              __',ipv6address) where length(ipv6address)/32*dbms_random.value<0.005;

-- full redaction of ipv4 
update tmp set ipv6addressmask=maskmerge('                              __',ipv6addressmask),ipv6address=maskmerge('                              __',ipv6address) where ipv6addressmask='                        ______  ' and length(ipv6address)/32*dbms_random.value<0.04;

-- network report redaction of ipv4 
update tmp set ipv6addressmask=maskmerge('                        ______  ',ipv6addressmask) where length(ipv6address)/32*dbms_random.value<0.005;

-- host report redaction of ipv4 
update tmp set ipv6addressmask=maskmerge('                              __',ipv6addressmask) where length(ipv6address)/32*dbms_random.value<0.005;

-- full report redaction of ipv4 
update tmp set ipv6addressmask=maskmerge('                              __',ipv6addressmask) where ipv6addressmask='                        ______  ' and length(ipv6address)/32*dbms_random.value<0.04;

select count(*) from tmp where instr(ipv6address,'_')>0;

select count(*) from tmp where ipv6addressmask > '                                '; 

-- srch := '00000000000000000000ffff12345678';  

select * from (select * from tmp where ipv6addressmask = '                                ') v1 where ipv6address like :srch
 union all
select * from (select * from tmp where ipv6addressmask > '                                ') v2 where ipv6address like maskmerge(ipv6addressmask,:srch);

/*
Operation                                Object Name Rows Bytes Cost
---------------------------------------- ----------- ---- ----- ----
SELECT STATEMENT Optimizer Mode=ALL_ROWS              510         29                            
  UNION-ALL                                 
    TABLE ACCESS BY INDEX ROWID          TMP          500   23K   10                            
      INDEX RANGE SCAN                   I_TMP_ADDR    92          2                            
    TABLE ACCESS BY INDEX ROWID          TMP           10   490   19                            
      INDEX RANGE SCAN                   I_TMP_MASK   207          2                            

*/

SELECT * FROM tmp WHERE ipv6address LIKE :srch OR :srch LIKE ipv6address

/*

Operation                                Object Name Rows Bytes Cost
---------------------------------------- ----------- ---- ----- ----
SELECT STATEMENT Optimizer Mode=ALL_ROWS              995         22                            
  TABLE ACCESS FULL                      TMP          995   47K   22                            

*/

-----------------------------------------------------------------

drop table tmp;

drop sequence s_tmp;

drop function maskmerge;

-----------------------------------------------------------------

Jason Pyeron 2009-08-17 17:01:09

Answer 5

A:

FYI, I have found that when you need to do matches on %var% and have it work fast against lots of data the best way is to use Oracle's Oracle Text Indexes.

Tom Clark AaiCanHelp.com 2009-08-18 19:51:40

while true it does carry a significant burden of non-standard query syntax and overhead of periodic index recalculations. At this point the data is well indexed and no full table scan are occuring. see the explain plan in my chosen solution.

Jason Pyeron 2009-08-20 14:57:17

ansaurus

tags:

views:

answers:

String matching in Oracle 10g where either side can have wildcards

related questions