Is it possible to use a back reference to specify the number of replications in a regular expression?

foo= 'ADCKAL+2AG.+2AG.+2AG.+2AGGG+.G+3AGGa.'

The substrings that start with '+[0-9]' followed by '[A-z]{n}.' need to be replaced with a plain '+', where n is the digit matched earlier in the same substring. Can that n be back-referenced? For example, '\+([0-9])[A-z]{\1}.' (which doesn't work) is the pattern I want replaced with "+". The last dot can be any character and represents a quality score. After the replacement, foo should come out to ADCKAL+++G.G+.

import re
foo = 'ADCKAL+2AG.+2AG.+2AG.+2AGGG+.+G+3AGGa.'
indelpatt = re.compile(r'\+([0-9])')
# Keep collapsing indels for as long as a '+<digit>' marker remains
while indelpatt.search(foo):
    indelsize = int(indelpatt.search(foo).group(1))
    # Build a pattern for this specific indel length
    new_regex = r'\+%s[ACGTNacgtn]{%s}.' % (indelsize, indelsize)
    newpatt = re.compile(new_regex)
    foo = newpatt.sub("+", foo)

I'm probably missing an easier way to parse the string.

+1  A: 

No, you cannot use back-references as quantifiers. A workaround is to construct a regular expression that can handle each of the cases in an alternation.

import re

foo = 'ADCKAL+2AG.+2AG.+2AG.+2AGGG^+.+G+3AGGa4.'
# One alternative per indel length 1..9, e.g. \+2[ACGTNacgtn]{2}.
pattern = '|'.join(r'\+%s[ACGTNacgtn]{%s}.' % (i, i) for i in range(1, 10))
regex = re.compile(pattern)
foo = regex.sub("+", foo)
print(foo)

Result:

ADCKAL++++G^+.+G+4.
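As a small illustration of why the direct form fails, here is a sketch of what Python's re module does with a back-reference inside braces. As far as I can tell, the braces are not parsed as a quantifier at all, so '{' and '}' are matched literally. The names attempt, no_match and literal_match are only for this example.

```python
import re

# The attempted pattern: a back-reference inside the braces
attempt = re.compile(r'\+([0-9])[A-Za-z]{\1}.')

# It does not match the intended input ...
no_match = attempt.search('+2AG.')
print(no_match)

# ... but it does match a string containing literal braces around the digit
literal_match = attempt.search('+2A{2}.')
print(literal_match.group(0))
```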

Note also that your code contains an error that causes it to enter an infinite loop on the input you gave.
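A loop of the form `while indelpatt.search(foo)` can spin forever whenever a '+<digit>' marker remains that the generated pattern cannot match, for example a '+3' followed by fewer than three bases. If you prefer a loop over the alternation, one way around that is a single left-to-right scan that always moves forward. This is only a sketch: collapse_indels is a hypothetical helper name, and it assumes the same base alphabet and single-digit lengths as above.

```python
import re

def collapse_indels(s):
    """Replace each '+<n><n bases><any char>' run with a bare '+'."""
    marker = re.compile(r'\+([0-9])')
    out = []
    pos = 0
    while True:
        m = marker.search(s, pos)
        if m is None:
            # No more markers: copy the rest and stop
            out.append(s[pos:])
            break
        n = int(m.group(1))
        # Pattern for this specific length, tried at the marker position
        body = re.compile(r'\+%d[ACGTNacgtn]{%d}.' % (n, n))
        hit = body.match(s, m.start())
        if hit and n > 0:
            # Well-formed indel: keep the text before it, emit '+'
            out.append(s[pos:m.start()] + '+')
            pos = hit.end()
        else:
            # Malformed marker: copy the '+' through and keep scanning
            out.append(s[pos:m.start() + 1])
            pos = m.start() + 1
    return ''.join(out)

print(collapse_indels('ADCKAL+2AG.+2AG.+2AG.+2AGGG^+.+G+3AGGa4.'))
# ADCKAL++++G^+.+G+4.
```

Because pos increases on every iteration, the scan terminates even on input such as '+3AG', which it leaves unchanged.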

Mark Byers
Thanks! Your solution works great. I'll change my example code; I had tried to edit out some additional information. Sorry about that.
jeffhsu3