ansaurus

Question

Delete all characters in a multiline string up to a given pattern

Answer 1

+3 A:

>>> import re
>>> s = 'hello cat!'
>>> m = re.search('cat|dog|rat', s)
>>> s[m.start():]
'cat!'

Of course, in a real solution you'll need to handle the case where there's no match: re.search then returns None, and m.start() would raise an AttributeError.

Or, more cleanly:

>>> import re
>>> s = 'hello cat!'
>>> p = 'cat|dog|rat'
>>> re.sub('.*?(?=%s)' % p, '', s, 1)
'cat!'

For multiline input, pass the re.DOTALL flag so that . also matches newlines.
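A minimal sketch of the multiline case, assuming the same pattern as above; the three-line sample string is made up for illustration. Without re.DOTALL, . does not match a newline, so .*? cannot reach back across line breaks and only the text in front of the pattern on the matching line is removed.

```python
import re

s = 'first line\nhello cat!\nlast line'  # hypothetical sample input
p = 'cat|dog|rat'
pattern = '.*?(?=%s)' % p

# Without re.DOTALL, '.' stops at newlines, so only 'hello ' is removed.
without_flag = re.sub(pattern, '', s, count=1)

# With re.DOTALL, '.' also matches newlines, so everything before 'cat' goes.
with_flag = re.sub(pattern, '', s, count=1, flags=re.DOTALL)

print(repr(without_flag))  # 'first line\ncat!\nlast line'
print(repr(with_flag))     # 'cat!\nlast line'
```

Passing count and flags by keyword avoids confusing the two, since both are integers.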

Max Shawabkeh 2010-04-17 10:58:03

Answer 2

+1 A:

Something like this should do what you want:

import re
text = '   sdfda  faf foo zing baz bar'
match = re.search('foo|bar', text)
if match:
  print(text[match.start():]) # ==>  'foo zing baz bar'

vezult 2010-04-17 11:01:57

`re.search` is significantly faster than the `re.sub` variant in this case http://stackoverflow.com/questions/2658101/delete-all-characters-in-a-multline-string-up-to-a-given-pattern/2661481#2661481

J.F. Sebastian 2010-04-18 07:59:01

Answer 3

+2 A:

Non-regex way:

>>> s='hello cat!'
>>> pat=['cat','dog','rat']
>>> for n,i in enumerate(pat):
...     m=s.find(i)
...     if m != -1: print(s[m:])
...
cat!

ghostdog74 2010-04-17 11:29:47

`enumerate()` is unnecessary here http://stackoverflow.com/questions/2658101/delete-all-characters-in-a-multline-string-up-to-a-given-pattern/2661481#2661481

J.F. Sebastian 2010-04-18 07:36:16

Answer 4

+3 A:

You want to delete all characters preceding the first occurrence of a pattern; as an example, you give "cat|dog|rat".

Code that achieves this using re:

re.sub("(?s).*?(cat|dog|rat)", "\\1", input_text, 1)

or, if you'll be reusing this regular expression:

rex= re.compile("(?s).*?(cat|dog|rat)")
result= rex.sub("\\1", input_text, 1)

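Note the non-greedy .*?, which stops at the first occurrence of the pattern rather than the last. The initial (?s) lets . match newline characters too, so lines preceding the matched word are removed as well.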
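To make the two points above concrete, here is a small sketch comparing the variants on a two-line string; the sample text is an assumption chosen for illustration, not taken from the question.

```python
import re

input_text = "line one\nI have a dog and a cat"  # hypothetical sample input

# Non-greedy with (?s): cuts at the first animal, across the newline.
lazy = re.sub(r"(?s).*?(cat|dog|rat)", r"\1", input_text, count=1)

# Greedy with (?s): .* backtracks from the end, so it cuts at the last animal.
greedy = re.sub(r"(?s).*(cat|dog|rat)", r"\1", input_text, count=1)

# Non-greedy without (?s): '.' cannot cross the newline, so the first line stays.
no_dotall = re.sub(r".*?(cat|dog|rat)", r"\1", input_text, count=1)

print(repr(lazy))       # 'dog and a cat'
print(repr(greedy))     # 'cat'
print(repr(no_dotall))  # 'line one\ndog and a cat'
```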

Examples:

>>> input_text= "I have a dog and a cat"
>>> re.sub(".*?(cat|dog|rat)", "\\1", input_text, 1)
'dog and a cat'

>>> input_text= "I have no animals!"
>>> re.sub("(?s).*?(cat|dog|rat)", "\\1", input_text, 1)
'I have no animals!'

>>> input_text= "This is irrational"
>>> re.sub("(?s).*?(cat|dog|rat)", "\\1", input_text, 1)
'rational'

If you want to match cat, dog and rat only as whole words, add word boundaries (\b) to the regex, so that the same input is left unchanged:

>>> re.sub(r"(?s).*?\b(cat|dog|rat)\b", "\\1", input_text, 1)
'This is irrational'
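The whole-word variant could be wrapped in a small helper, sketched here under the assumption that the words arrive as a list; the name drop_until_word and its signature are hypothetical. re.escape guards against words that contain regex metacharacters.

```python
import re

def drop_until_word(text, words):
    """Drop everything before the first whole-word occurrence of any word.

    Hypothetical helper for illustration; returns text unchanged if no
    whole word matches.
    """
    alternatives = '|'.join(map(re.escape, words))
    pattern = r'(?s).*?\b(%s)\b' % alternatives
    return re.sub(pattern, r'\1', text, count=1)

words = ["cat", "dog", "rat"]
print(repr(drop_until_word("This is irrational", words)))         # 'This is irrational'
print(repr(drop_until_word("An irrational\nrat ran off", words)))  # 'rat ran off'
```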

ΤΖΩΤΖΙΟΥ 2010-04-17 12:28:26

+1: Noticed the ungreediness and match limiting which I missed.

Max Shawabkeh 2010-04-17 12:32:12

Your use of "" and r"" is inconsistent (e.g., it could be `r"\1"`)

Ian Bicking 2010-04-17 22:37:16

@Ian Bicking: inconsistency is in the eye of the beholder. I almost always use r"" notation for strings that have more than one literal backslash; exceptions are unicode regular expressions containing \N{} character names.

ΤΖΩΤΖΙΟΥ 2010-04-18 00:19:56

Answer 5

A:

Another option is to use a lookahead (in Perl notation, s/.*?(?=$pattern)//xs):

re.sub(r'(?s).*?(?=cat|dog|rat)', '', text, 1)

Non-regex way:

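indexes = [text.find(option) for option in 'cat dog rat'.split()]
indexes = [i for i in indexes if i != -1] # keep only the options that were found
if indexes:
    text = text[min(indexes):] # cut at the earliest occurrence of any option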

The non-regex way is about 5 times faster than the re.sub variant (for some input), and the re.search variant is comparably fast:

$ python -mtimeit -s'from drop_until_word import drop_re, text, options;' \
> 'drop_re(text, options)'
1000 loops, best of 3: 1.06 msec per loop

$ python -mtimeit -s'from drop_until_word import drop_search, text, options;'\
> 'drop_search(text, options)'
10000 loops, best of 3: 184 usec per loop

$ python -mtimeit -s'from drop_until_word import drop_find, text, options;' \
> 'drop_find(text, options)'
1000 loops, best of 3: 207 usec per loop

Where drop_until_word.py is:

import re

def drop_re(text, options):
    return re.sub(r'(?s).*?(?='+'|'.join(map(re.escape, options))+')', '',
                  text, 1)

def drop_re2(text, options):
    return re.sub(r'(?s).*?('+'|'.join(map(re.escape, options))+')', '\\1',
                  text, 1)

def drop_search(text, options):
    m = re.search('|'.join(map(re.escape, options)), text)
    return text[m.start():] if m else text

def drop_find(text, options):
    indexes = [i for i in (text.find(option) for option in options) if i != -1]
    return text[min(indexes):] if indexes else text

text = open('/usr/share/dict/words').read()
options = 'cat dog rat'.split()

def test():
    assert drop_find(text, options) == drop_re(text, options) \
        == drop_re2(text, options) == drop_search(text, options)

    txt = 'dog before cat'
    r = txt
    for f in [drop_find, drop_re, drop_re2, drop_search]:
        assert r == f(txt, options), f.__name__


if __name__=="__main__":
    test()

J.F. Sebastian 2010-04-18 07:34:29

Thanks. The non-re version doesn't really do what I was after, though, i.e., remove all chars up to the first occurrence of either cat or dog or rat. For example, if the string was "dog before cat", the re version would correctly return "dog before cat" whereas the find version would return just "cat".

bignum 2010-04-18 08:04:46

@biffabacon: good catch. I've fixed `drop_find()` for the 'dog before cat' case.

J.F. Sebastian 2010-04-18 08:16:44
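The 'dog before cat' case from the comments above can be reproduced with a short sketch. Both function names are hypothetical: the first mimics a loop that stops at the first option found in list order, the second follows the approach of drop_find() and cuts at the smallest index.

```python
def drop_first_option_found(text, options):
    # Stops at the first option (in list order) that occurs anywhere in text.
    for option in options:
        index = text.find(option)
        if index != -1:
            return text[index:]
    return text

def drop_earliest_position(text, options):
    # Collects the position of every option found and cuts at the smallest one.
    indexes = [i for i in (text.find(option) for option in options) if i != -1]
    return text[min(indexes):] if indexes else text

options = 'cat dog rat'.split()
print(drop_first_option_found('dog before cat', options))  # cat
print(drop_earliest_position('dog before cat', options))   # dog before cat
```

The difference only shows up when an option listed later occurs earlier in the text, which is why a test on a single matching word does not catch it.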
