Using Python, I need to delete all characters in a multiline string up to the first occurrence of a given pattern. In Perl this can be done with a regular expression, something like:

#remove all chars up to first occurrence of cat or dog or rat
$pattern = 'cat|dog|rat';
$pagetext =~ s/(.*?)($pattern)/$2/xms;

What's the best way to do it in Python?

+3  A: 
>>> import re
>>> s = 'hello cat!'
>>> m = re.search('cat|dog|rat', s)
>>> s[m.start():]
'cat!'

Of course you'll need to account for the case where there's no match in a real solution.

Or, more cleanly:

>>> import re
>>> s = 'hello cat!'
>>> p = 'cat|dog|rat'
>>> re.sub('.*?(?=%s)' % p, '', s, 1)
'cat!'

For multiline input, use the re.DOTALL flag with the re.sub variant so that . also matches newlines.
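
Putting the two caveats together (no match, multiline input), here is a minimal sketch. The helper names `drop_before` and `drop_before_sub` are hypothetical and only serve the illustration; the sample strings are made up as well.

```python
import re

def drop_before(text, pattern):
    """Return text from the first match of pattern on; unchanged if no match."""
    # re.search scans the whole string, newlines included, so no flag is needed here.
    match = re.search(pattern, text)
    return text[match.start():] if match else text

def drop_before_sub(text, pattern):
    """Same result via re.sub with a lookahead."""
    # DOTALL lets '.' cross newlines; count=1 stops after the first replacement.
    regex = re.compile('.*?(?=%s)' % pattern, re.DOTALL)
    return regex.sub('', text, count=1)

s = 'hello\nworld\ncat and dog'
print(drop_before(s, 'cat|dog|rat'))
print(drop_before_sub(s, 'cat|dog|rat'))
print(drop_before('no animals', 'cat|dog|rat'))
```

Compiling the pattern keeps the flag next to the pattern and avoids mixing up the positional `count` and `flags` arguments of `re.sub`.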

Max Shawabkeh
+1  A: 

Something like this should do what you want:

import re
text = '   sdfda  faf foo zing baz bar'
match = re.search('foo|bar', text)
if match:
    print(text[match.start():])  # ==> 'foo zing baz bar'
vezult
`re.search` is significantly faster than `re.sub` variant in this case http://stackoverflow.com/questions/2658101/delete-all-characters-in-a-multline-string-up-to-a-given-pattern/2661481#2661481
J.F. Sebastian
+2  A: 

Non-regex way:

>>> s='hello cat!'
>>> pat=['cat','dog','rat']
>>> for n,i in enumerate(pat):
...     m=s.find(i)
...     if m != -1: print(s[m:])
...
cat!
ghostdog74
`enumerate()` is unnecessary here http://stackoverflow.com/questions/2658101/delete-all-characters-in-a-multline-string-up-to-a-given-pattern/2661481#2661481
J.F. Sebastian
+3  A: 

You want to delete all characters preceding the first occurrence of a pattern; as an example, you give "cat|dog|rat".

Code that achieves this using re:

re.sub("(?s).*?(cat|dog|rat)", "\\1", input_text, 1)

or, if you'll be reusing this regular expression:

rex= re.compile("(?s).*?(cat|dog|rat)")
result= rex.sub("\\1", input_text, 1)

Note the non-greedy .*?, which stops at the first occurrence rather than the last. The initial (?s) lets . match newline characters too, so any lines preceding the matched word are removed as well.

Examples:

>>> input_text= "I have a dog and a cat"
>>> re.sub(".*?(cat|dog|rat)", "\\1", input_text, 1)
'dog and a cat'

>>> input_text= "I have no animals!"
>>> re.sub("(?s).*?(cat|dog|rat)", "\\1", input_text, 1)
'I have no animals!'

>>> input_text= "This is irrational"
>>> re.sub("(?s).*?(cat|dog|rat)", "\\1", input_text, 1)
'rational'

If you want to cut only at the whole words cat, dog and rat (and not at substrings of longer words), add word boundaries to the regex:

>>> re.sub(r"(?s).*?\b(cat|dog|rat)\b", "\\1", input_text, 1)
'This is irrational'
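
To make the roles of the non-greedy quantifier, the (?s) flag and the \b anchors concrete, here is a small side-by-side sketch. The sample strings and variable names are made up for the illustration.

```python
import re

text = "first line\nI have a dog\nand a cat"

# Non-greedy with (?s): cut at the FIRST animal, even across newlines.
first_animal = re.sub(r"(?s).*?(cat|dog|rat)", r"\1", text, count=1)

# Greedy with (?s): '.*' backtracks from the end, so only the LAST animal is kept.
last_animal = re.sub(r"(?s).*(cat|dog|rat)", r"\1", text, count=1)

# Non-greedy without (?s): '.' cannot cross a newline, so earlier lines stay.
same_line_only = re.sub(r".*?(cat|dog|rat)", r"\1", text, count=1)

# \b restricts the match to whole words: "concatenate" no longer counts as "cat".
words = "concatenate the dog"
substring = re.sub(r"(?s).*?(cat|dog|rat)", r"\1", words, count=1)
whole_word = re.sub(r"(?s).*?\b(cat|dog|rat)\b", r"\1", words, count=1)

print(repr(first_animal))    # 'dog\nand a cat'
print(repr(last_animal))     # 'cat'
print(repr(same_line_only))  # 'first line\ndog\nand a cat'
print(repr(substring))       # 'catenate the dog'
print(repr(whole_word))      # 'dog'
```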
ΤΖΩΤΖΙΟΥ
+1: Noticed the ungreediness and match limiting which I missed.
Max Shawabkeh
Your use of "" and r"" is inconsistent (e.g., it could be `r"\1"`)
Ian Bicking
@Ian Bicking: inconsistency is in the eye of the beholder. I almost always use r"" notation for strings that have more than one literal backslash; exceptions are unicode regular expressions containing \N{} character names.
ΤΖΩΤΖΙΟΥ
A: 

Another option is to use a lookahead, the equivalent of Perl's s/.*?(?=$pattern)//xs:

re.sub(r'(?s).*?(?=cat|dog|rat)', '', text, 1)

Non-regex way:

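indexes = [text.find(option) for option in 'cat dog rat'.split()]
indexes = [i for i in indexes if i != -1]  # keep only the options that were found
if indexes:
    text = text[min(indexes):]  # cut at the earliest occurrence of any option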

The non-regex way is about 5 times faster than the re.sub variant (for some input):

$ python -mtimeit -s'from drop_until_word import drop_re, text, options;' \
> 'drop_re(text, options)'
1000 loops, best of 3: 1.06 msec per loop

$ python -mtimeit -s'from drop_until_word import drop_search, text, options;'\
> 'drop_search(text, options)'
10000 loops, best of 3: 184 usec per loop

$ python -mtimeit -s'from drop_until_word import drop_find, text, options;' \
> 'drop_find(text, options)'
1000 loops, best of 3: 207 usec per loop

Where drop_until_word.py is:

import re

def drop_re(text, options):
    return re.sub(r'(?s).*?(?='+'|'.join(map(re.escape, options))+')', '',
                  text, 1)

def drop_re2(text, options):
    return re.sub(r'(?s).*?('+'|'.join(map(re.escape, options))+')', '\\1',
                  text, 1)

def drop_search(text, options):
    m = re.search('|'.join(map(re.escape, options)), text)
    return text[m.start():] if m else text

def drop_find(text, options):
    indexes = [i for i in (text.find(option) for option in options) if i != -1]
    return text[min(indexes):] if indexes else text

text = open('/usr/share/dict/words').read()
options = 'cat dog rat'.split()

def test():
    assert drop_find(text, options) == drop_re(text, options) \
        == drop_re2(text, options) == drop_search(text, options)

    txt = 'dog before cat'
    r = txt
    for f in [drop_find, drop_re, drop_re2, drop_search]:
        assert r == f(txt, options), f.__name__


if __name__=="__main__":
    test()
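
If /usr/share/dict/words is not available, a comparison along the same lines can be run in-process with the timeit module. This is only a sketch: the synthetic text is an assumption, the two functions are copied from the module above, and the timings will differ from the figures quoted earlier.

```python
import re
import timeit

options = 'cat dog rat'.split()
# Synthetic input (an assumption): filler lines with no match, then the target words.
text = ('lorem ipsum\n' * 2000) + 'dog before cat'

def drop_search(text, options):
    m = re.search('|'.join(map(re.escape, options)), text)
    return text[m.start():] if m else text

def drop_find(text, options):
    indexes = [i for i in (text.find(option) for option in options) if i != -1]
    return text[min(indexes):] if indexes else text

# Both variants must agree before their timings are worth comparing.
assert drop_search(text, options) == drop_find(text, options) == 'dog before cat'

runs = 100
for func in (drop_search, drop_find):
    seconds = timeit.timeit(lambda: func(text, options), number=runs)
    print('%s: %.1f usec per call' % (func.__name__, seconds / runs * 1e6))
```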
J.F. Sebastian
Thanks. The non-re version doesn't really do what I was after, though, i.e., remove all chars up to the first occurrence of either cat or dog or rat. For example, if the string was "dog before cat", the re version would correctly return "dog before cat", whereas the find version would return just "cat".
bignum
@biffabacon: good catch. I've fixed `drop_find()` for the 'dog before cat' case.
J.F. Sebastian