Situation:

  • text: a string
  • R: a regex that matches part of the string. This might be expensive to calculate.

I want to both delete the R-matches from the text, and see what they actually contain. Currently, I do this like:

import re
ab_re = re.compile("[ab]")
text="abcdedfe falijbijie bbbb laifsjelifjl"
ab_re.findall(text)
# ['a', 'b', 'a', 'b', 'b', 'b', 'b', 'b', 'a']
ab_re.sub('',text)
# 'cdedfe flijijie  lifsjelifjl'

This runs the regex twice, as near as I can tell. Is there a technique to do it all in one pass, perhaps using re.split? It seems that split-based solutions would also need to run the regex at least twice.

A: 

You could use split with capturing parentheses. If you do, then the text of all groups in the pattern is also returned as part of the resulting list (from the Python documentation).

So the code would be

import re
ab_re = re.compile("([ab])")
text="abcdedfe falijbijie bbbb laifsjelifjl"
matches = ab_re.split(text)
# matches = ['', 'a', '', 'b', 'cdedfe f', 'a', 'lij', 'b', 'ijie ', 'b', '', 'b', '', 'b', '', 'b', ' l', 'a', 'ifsjelifjl']

# now extract the matches
Rmatches = []
remaining = []
for i in range(1, len(matches), 2):
    Rmatches.append(matches[i])
# Rmatches = ['a', 'b', 'a', 'b', 'b', 'b', 'b', 'b', 'a']

for i in range(0, len(matches), 2):
    remaining.append(matches[i])
remainingtext = ''.join(remaining)
# remainingtext = 'cdedfe flijijie  lifsjelifjl'
Hamish Downer
All the "if text == a" code here implements the regex a second time. If the regex were as simple as [ab], then this whole question would be moot. :) Good effort though, and it spurs my thinking a bit, toward filtering solutions.
Gregg Lind
Yep, fixed it up by noticing that matches alternates between discarded text and matched text, including the empty strings, so the above solution is simpler and the regex is only run once :)
Hamish Downer
Fair enough. Slicing is simpler though :) I had the same realization about the alternating token-match thing too! Thanks for the hint.
Gregg Lind
+4  A: 
import re

r = re.compile("[ab]")
text = "abcdedfe falijbijie bbbb laifsjelifjl"

matches = []
replaced = []
pos = 0
for m in r.finditer(text):
    matches.append(m.group(0))
    replaced.append(text[pos:m.start()])
    pos = m.end()
replaced.append(text[pos:])

print(matches)
print(''.join(replaced))

Outputs:

['a', 'b', 'a', 'b', 'b', 'b', 'b', 'b', 'a']
cdedfe flijijie  lifsjelifjl
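
The loop above can be wrapped in a reusable function. This is only a sketch: the name remove_and_collect is hypothetical and not part of the re module. Because it reads m.group(0), it should not depend on whether the pattern contains capturing groups.

```python
import re

def remove_and_collect(pattern, text):
    """Hypothetical helper: scan text once with a compiled pattern.

    Returns (matches, remaining_text).
    """
    matches = []
    kept = []
    pos = 0
    for m in pattern.finditer(text):
        matches.append(m.group(0))        # the whole match, regardless of groups
        kept.append(text[pos:m.start()])  # text between the previous match and this one
        pos = m.end()
    kept.append(text[pos:])               # tail after the last match
    return matches, ''.join(kept)

ab_re = re.compile("[ab]")
text = "abcdedfe falijbijie bbbb laifsjelifjl"
matches, remaining = remove_and_collect(ab_re, text)
```

Here matches should equal ab_re.findall(text) and remaining should equal ab_re.sub('', text), with the regex run over the text only once.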
Deestan
You can use a list in place of StringIO, and join that in the end, if you want to keep it simple.
Tomalak
Tomalak: Yes, that would be simpler. A bit of test profiling shows that it is actually faster too, at least on my test input.
Deestan
Does not really surprise me. I was typing up essentially the same thing, you just happened to be faster. ;-)
Tomalak
+3  A: 

My revised answer, using re.split(), which does things in one regex pass:

import re
text="abcdedfe falijbijie bbbb laifsjelifjl"
ab_re = re.compile("([ab])")
tokens = ab_re.split(text)
non_matches = tokens[0::2]
matches = tokens[1::2]

(edit: here is a complete function version)

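def split_matches(text, compiled_re):
    '''Given a compiled re whose whole pattern is wrapped in one capturing
    group, split text into matching and non-matching sections.
    Returns two lists: matches, non_matches.
    '''
    tokens = compiled_re.split(text)
    matches = tokens[1::2]      # odd indices: captured matches
    non_matches = tokens[0::2]  # even indices: text between matches
    return matches, non_matches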

m,nm = split_matches(text,ab_re)
''.join(nm) # equivalent to ab_re.sub('',text)
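
The slicing relies on the pattern having exactly one capturing group. A small sketch of why, plus a guarded variant: split_matches_checked is a hypothetical name, and the check on the pattern's groups attribute is only a heuristic, since it cannot tell whether that one group actually wraps the whole pattern.

```python
import re

text = "abcdedfe falijbijie bbbb laifsjelifjl"

# Without a capturing group, split() drops the matched text entirely.
plain = re.compile("[ab]").split(text)

# With one group around the whole pattern, the matches land at odd indices.
captured = re.compile("([ab])").split(text)

def split_matches_checked(text, compiled_re):
    """Hypothetical guarded variant of split_matches."""
    # Assumption: exactly one capturing group, wrapping the whole pattern.
    # More groups would add extra items per match and break the slicing.
    if compiled_re.groups != 1:
        raise ValueError("pattern must contain exactly one capturing group")
    tokens = compiled_re.split(text)
    return tokens[1::2], tokens[0::2]

m, nm = split_matches_checked(text, re.compile("([ab])"))
```

Joining plain still gives the text with the matches removed, but the matches themselves are gone, which is why the parentheses matter for this approach.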
Gregg Lind
Note that the compiled re must be a 'capturing re' with parens around the whole mess, or this won't work properly.
Gregg Lind
Hmm? Works for me without parentheses.
Deestan
+4  A: 

What about this:

import re

text = "abcdedfe falijbijie bbbb laifsjelifjl"
matches = []

ab_re = re.compile( "[ab]" )

def verboseTest( m ):
    matches.append( m.group(0) )
    return ''

textWithoutMatches = ab_re.sub( verboseTest, text )

print(matches)
# ['a', 'b', 'a', 'b', 'b', 'b', 'b', 'b', 'a']
print(textWithoutMatches)
# cdedfe flijijie  lifsjelifjl

The 'repl' argument of the re.sub function can be a function, so you can report or save the matches from there; whatever the function returns is what 'sub' substitutes for that match.

The function could easily be modified to do a lot more too! Check out the re module documentation on docs.python.org for more information on what else is possible.
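
As one possible extension, the callback can be a closure so that no module-level list is needed, and it can record where each match started. This is a sketch; sub_and_collect and its repl parameter are hypothetical names chosen for illustration.

```python
import re

def sub_and_collect(compiled_re, text, repl=''):
    """Hypothetical helper: substitute and collect matches in one sub() call.

    Returns (new_text, found), where found is a list of (start, matched_text).
    """
    found = []

    def record(m):
        # Called once per match; save it, then return the replacement string.
        found.append((m.start(), m.group(0)))
        return repl

    return compiled_re.sub(record, text), found

ab_re = re.compile("[ab]")
text = "abcdedfe falijbijie bbbb laifsjelifjl"
cleaned, found = sub_and_collect(ab_re, text)
```

Each call gets its own found list, so the helper can be reused on several texts without clearing shared state.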

Jon Cage
That's a very clever solution. I didn't know you could use functions as a first argument to the sub function.
Gregg Lind
Thanks, I was rather chuffed with its simplicity when I realised you could call a function :-)
Jon Cage