I need to extract some data from malformed XML stored in an Oracle database. The XPath expressions would look like this: //image/type/text(). One take at a regular expression which would work in a similar fashion would be <image>.*?<type>(.+?)<\/type> (with appropriate flags for multiline matching).

Since Oracle does not support match groups in any form for REGEXP_SUBSTR I am unsure how to extract a set (with potentially n > 1 members) of match groups from an Oracle CLOB column. Any ideas?

A: 

AFAIK you can't extract a set directly with the Oracle regex functions, but as a workaround you can iterate through the string, calling regexp_substr for each occurrence and saving each result to a collection (or whatever you need), something like this:

...
fOccurence := 1;  -- occurrence numbers start at 1
loop
  -- fetch the next whole match; null means there are no more matches
  fSubstr := regexp_substr(fSourceStr, '<image>.*?<type>(.+?)<\/type>', 1, fOccurence, 'nim');
  exit when fSubstr is null;
  fOccurence := fOccurence + 1;
  fResultStr := fResultStr || fSubstr;
end loop;
...
andr
But `regexp_substr` won't capture the match group but the whole of the regular expression ...
yawn
@yawn Pardon, what do you mean by "won't capture the match group"?
andr
@andr: I need only the (.+?) part of the string, which is the first match group.
yawn
@yawn I got it. Well, I think you can replace the 4th line in my answer with the following: `fSubstr := regexp_replace(fSourceStr, '(<image>.*?<type>)(.+?)(<\/type>)', '\2', 1, fOccurence, 'gci');` Note: the regexp changed a little.
andr
@andr: The flags are `nim` (instead of `gci`). The main problem with your solution remains that I need to match everything in order to replace everything with the captured result. This makes it quite difficult to capture multiple occurrences. I accept your answer nevertheless, using backrefs seems to be the only way.
yawn
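
One way to combine the two ideas from this thread is a two-step loop: use `regexp_substr` to pull out the n-th whole match, then apply `regexp_replace` with a backreference to that match alone, so only the match group is left. The block below is a minimal sketch, not a tested solution for the original data. The sample input and all `l_` names are hypothetical, and it assumes an Oracle version that supports non-greedy quantifiers and that each match fits in a PL/SQL `VARCHAR2`.

```sql
DECLARE
  -- Hypothetical sample input; in practice this would be read from the CLOB column.
  l_source  CLOB := '<image><name>a</name><type>png</type></image>' || CHR(10) ||
                    '<image><type>jpeg</type></image>';
  -- Pattern from the question; group 1 is the text inside <type>.
  l_pattern VARCHAR2(100) := '<image>.*?<type>(.+?)</type>';
  l_flags   VARCHAR2(3)   := 'nim';   -- flags as corrected in the comments above
  l_match   VARCHAR2(32767);
  l_occ     PLS_INTEGER   := 1;       -- occurrences are counted from 1
  TYPE t_strings IS TABLE OF VARCHAR2(4000);
  l_types   t_strings := t_strings();
BEGIN
  LOOP
    -- Step 1: fetch the n-th whole match (NULL once there are no more).
    l_match := REGEXP_SUBSTR(l_source, l_pattern, 1, l_occ, l_flags);
    EXIT WHEN l_match IS NULL;

    -- Step 2: the match is exactly one occurrence of the pattern,
    -- so replacing it with \1 leaves only the match group.
    l_types.EXTEND;
    l_types(l_types.COUNT) := REGEXP_REPLACE(l_match, l_pattern, '\1', 1, 1, l_flags);

    l_occ := l_occ + 1;
  END LOOP;

  -- Print the collected values (requires SET SERVEROUTPUT ON in SQL*Plus).
  FOR i IN 1 .. l_types.COUNT LOOP
    DBMS_OUTPUT.PUT_LINE(l_types(i));
  END LOOP;
END;
/
```

Because the replacement runs on the isolated match rather than on the whole source string, the rest of the document never has to be matched. On Oracle 11g and later, `REGEXP_SUBSTR` also accepts a sixth `subexpr` argument that returns a given group directly, which would make step 2 unnecessary.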