ansaurus

Question

Capture the contents of a regex and delete them, efficiently.

Answer 1

A:

You could use split with capturing parentheses. If you do, the text of all groups in the pattern is also returned as part of the resulting list (from the Python docs).

So the code would be

import re
ab_re = re.compile("([ab])")
text="abcdedfe falijbijie bbbb laifsjelifjl"
matches = ab_re.split(text)
# matches = ['', 'a', '', 'b', 'cdedfe f', 'a', 'lij', 'b', 'ijie ', 'b', '', 'b', '', 'b', '', 'b', ' l', 'a', 'ifsjelifjl']

# now extract the matches
Rmatches = []
remaining = []
for i in range(1, len(matches), 2):
    Rmatches.append(matches[i])
# Rmatches = ['a', 'b', 'a', 'b', 'b', 'b', 'b', 'b', 'a']

for i in range(0, len(matches), 2):
    remaining.append(matches[i])
remainingtext = ''.join(remaining)
# remainingtext = 'cdedfe flijijie  lifsjelifjl'

Hamish Downer 2008-10-15 14:36:18

All the "if text == a" code here implements the regex a second time. If the regex were as simple as [ab], this whole question would be moot. :) Good effort though, and it spurs my thinking a bit toward filtering solutions.

Gregg Lind 2008-10-15 14:50:53

Yep, fixed it up after noticing that the matches list alternates between discarded text and matched text, including the empty strings, so the above solution is simpler and the regex is only run once :)

Hamish Downer 2008-10-15 14:54:59

Fair enough. Slicing is simpler though :) I had the same realization about the alternating token-match thing too! Thanks for the hint.

Gregg Lind 2008-10-15 15:02:43

Answer 2

+4 A:

import re

r = re.compile("[ab]")
text = "abcdedfe falijbijie bbbb laifsjelifjl"

matches = []
replaced = []
pos = 0
for m in r.finditer(text):
    matches.append(m.group(0))
    replaced.append(text[pos:m.start()])
    pos = m.end()
replaced.append(text[pos:])

print(matches)
print(''.join(replaced))

Outputs:

['a', 'b', 'a', 'b', 'b', 'b', 'b', 'b', 'a']
cdedfe flijijie  lifsjelifjl

Deestan 2008-10-15 14:41:51

You can use a list in place of StringIO, and join that in the end, if you want to keep it simple.

Tomalak 2008-10-15 14:50:32

Tomalak: Yes, that would be simpler. A bit of test profiling shows that it is actually faster too, at least on my test input.

Deestan 2008-10-15 15:08:59

Does not really surprise me. I was typing up essentially the same thing, you just happened to be faster. ;-)

Tomalak 2008-10-16 12:05:02
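
For anyone who wants to repeat the list-versus-StringIO comparison from the comments above, here is a rough sketch using timeit. The StringIO version of the answer is not shown in this thread, so collect_with_stringio is only a guess at what it looked like; the function names, the repeated test input and the repeat count are all assumptions, and the code is written for Python 3 (io.StringIO). Timings depend heavily on the input and interpreter, so no numbers are given.

```python
import re
import timeit
from io import StringIO

AB_RE = re.compile("[ab]")
# Assumed test input: the sample text repeated to get a longer string.
TEXT = "abcdedfe falijbijie bbbb laifsjelifjl" * 100

def collect_with_list(text):
    """Collect matches and rebuild the remaining text by joining a list."""
    matches, parts, pos = [], [], 0
    for m in AB_RE.finditer(text):
        matches.append(m.group(0))
        parts.append(text[pos:m.start()])
        pos = m.end()
    parts.append(text[pos:])
    return matches, "".join(parts)

def collect_with_stringio(text):
    """Same loop, but writing the remaining text into a StringIO buffer."""
    matches, buf, pos = [], StringIO(), 0
    for m in AB_RE.finditer(text):
        matches.append(m.group(0))
        buf.write(text[pos:m.start()])
        pos = m.end()
    buf.write(text[pos:])
    return matches, buf.getvalue()

if __name__ == "__main__":
    for fn in (collect_with_list, collect_with_stringio):
        seconds = timeit.timeit(lambda: fn(TEXT), number=200)
        print("%s: %.4f s" % (fn.__name__, seconds))
```

Both functions should return identical results; only the way the remaining text is accumulated differs.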

Answer 3

+3 A:

My revised answer, using re.split(), which does things in one regex pass:

import re
text="abcdedfe falijbijie bbbb laifsjelifjl"
ab_re = re.compile("([ab])")
tokens = ab_re.split(text)
non_matches = tokens[0::2]
matches = tokens[1::2]

(edit: here is a complete function version)

def split_matches(text, compiled_re):
    '''Given a compiled re with a capturing group around the
    whole pattern, split a text into matching and non-matching
    sections. Returns two lists: matches, non_matches.
    '''
    tokens = compiled_re.split(text)
    matches = tokens[1::2]
    non_matches = tokens[0::2]
    return matches,non_matches

m,nm = split_matches(text,ab_re)
''.join(nm) # equivalent to ab_re.sub('',text)

Gregg Lind 2008-10-15 15:00:40

Note that the compiled re must be a 'capturing re' with parens around the whole mess, or this won't work properly.

Gregg Lind 2008-10-15 15:38:22

Hmm? Works for me without parentheses.

Deestan 2008-10-17 07:33:23
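
To make the point about capturing parentheses concrete, here is a small sketch comparing the two patterns on the sample text from this thread. The variable names are only illustrative. Without the group, split() drops the matched text, so the [1::2] slice picks up ordinary non-matching pieces instead of matches. Joining the non-matching pieces still gives the same remaining text in both cases, which may be why the version without parentheses can appear to work.

```python
import re

text = "abcdedfe falijbijie bbbb laifsjelifjl"

capturing = re.compile("([ab])")    # group around the whole pattern
non_capturing = re.compile("[ab]")  # same pattern, no group

with_group = capturing.split(text)
without_group = non_capturing.split(text)

# With the group, the matched text sits at the odd indices.
matches = with_group[1::2]
# ['a', 'b', 'a', 'b', 'b', 'b', 'b', 'b', 'a']

# Without the group, the matches are not in the list at all, so the
# odd indices just hold every other non-matching piece.
not_matches = without_group[1::2]
# ['', 'lij', '', '', 'ifsjelifjl']

# The remaining text is the same either way.
remaining_a = ''.join(with_group[0::2])
remaining_b = ''.join(without_group)
```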

Answer 4

+4 A:

What about this:

import re

text = "abcdedfe falijbijie bbbb laifsjelifjl"
matches = []

ab_re = re.compile( "[ab]" )

def verboseTest( m ):
    matches.append( m.group(0) )
    return ''

textWithoutMatches = ab_re.sub( verboseTest, text )

print(matches)
# ['a', 'b', 'a', 'b', 'b', 'b', 'b', 'b', 'a']
print(textWithoutMatches)
# cdedfe flijijie  lifsjelifjl

The 'repl' argument of the re.sub function can be a function so you can report or save the matches from there and whatever the function returns is what 'sub' will substitute.

The function could easily be modified to do a lot more too! Check out the re module documentation on docs.python.org for more information on what else is possible.
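
As one possible variation on the same idea, the callback can be wrapped in a helper so the list of matches is local rather than module-level. This is only a sketch; remove_and_collect and record are hypothetical names, not part of the re module.

```python
import re

def remove_and_collect(compiled_re, text):
    """Delete every match from text and return (matches, remaining_text)."""
    matches = []

    def record(m):
        # Called by sub() once per match: save the matched text and
        # replace it with the empty string.
        matches.append(m.group(0))
        return ''

    remaining = compiled_re.sub(record, text)
    return matches, remaining

ab_re = re.compile("[ab]")
matches, remaining = remove_and_collect(ab_re, "abcdedfe falijbijie bbbb laifsjelifjl")
```

Because the list lives inside the helper, repeated calls do not accumulate matches from earlier texts.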

Jon Cage 2008-10-15 15:05:21

That's a very clever solution. I didn't know you could use functions as a first argument to the sub function.

Gregg Lind 2008-10-15 15:16:54

Thanks, I was rather chuffed with its simplicity when I realised you could call a function :-)

Jon Cage 2008-10-15 16:57:58
