views:

48

answers:

2

Here is an example of the type of text file I am trying to search (named usefile):

DOCK onomatopoeia DOCK blah blah
blah DOCK blah
DOCK
blah blah blah
onomatopoeia
blah blah blah
blah blah DOCK
DOCK blah blah
DOCK blah
onomatopoeia

I am using re.finditer to find everything between DOCK and onomatopoeia, as follows:

re.finditer(r'((dock)(.+?)(onomatopoeia))', usefile, re.I|re.DOTALL)

Obviously DOCK is a much more common word than onomatopoeia, and I only want the text between each onomatopoeia and the closest DOCK before it. The regex above starts at the first DOCK it finds and keeps going until it hits onomatopoeia, so I might get DOCK DOCK DOCK DOCK onomatopoeia when I really only wanted DOCK onomatopoeia.
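A minimal reproduction of the behaviour described above may make the problem easier to see. It assumes the sample file has already been read into a string; the names `text`, `lazy` and `lazy_matches` are illustrative and not part of the original code.

```python
import re

# Sample text from the question, assumed to be already read into a string.
text = """DOCK onomatopoeia DOCK blah blah
blah DOCK blah
DOCK
blah blah blah
onomatopoeia
blah blah blah
blah blah DOCK
DOCK blah blah
DOCK blah
onomatopoeia"""

# Same idea as the original pattern, without the extra capture groups.
lazy = re.compile(r"dock.+?onomatopoeia", re.I | re.DOTALL)

# Collapse newlines and repeated spaces so each match prints on one line.
lazy_matches = [" ".join(m.group(0).split()) for m in lazy.finditer(text)]

for found in lazy_matches:
    print(found)
# DOCK onomatopoeia
# DOCK blah blah blah DOCK blah DOCK blah blah blah onomatopoeia
# DOCK DOCK blah blah DOCK blah onomatopoeia
```

The lazy quantifier only makes the match end as early as possible. It does not move the starting point, so every match still begins at the earliest available DOCK.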

To be clear, what I want from the sample text is:
1. DOCK onomatopoeia
2. DOCK blah blah blah onomatopoeia
3. DOCK blah onomatopoeia

Is there a way to search for onomatopoeia and go backwards to the nearest DOCK before it, or a better way to solve my problem?

Thanks!

+4  A: 

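A negative lookahead assertion will do the trick. Keep the re.I and re.DOTALL flags from your original call so that the match stays case-insensitive and can span lines.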

DOCK((?!DOCK).)+?onomatopoeia
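
As a rough sketch of how this pattern could be used from Python, the snippet below runs it over the sample text from the question. The variable names (`text`, `pattern`, `matches`) and the whitespace normalisation are assumptions added for illustration, and a non-capturing group `(?:...)` is used in place of the capturing one, which should not change what is matched.

```python
import re

# Sample text from the question, assumed to be already read into a string.
text = """DOCK onomatopoeia DOCK blah blah
blah DOCK blah
DOCK
blah blah blah
onomatopoeia
blah blah blah
blah blah DOCK
DOCK blah blah
DOCK blah
onomatopoeia"""

# (?:(?!DOCK).) consumes one character at a time, but only while the
# next four characters are not another DOCK. A match therefore cannot
# contain a second DOCK, so it must start at the nearest one.
pattern = re.compile(r"DOCK(?:(?!DOCK).)+?onomatopoeia", re.I | re.DOTALL)

# Collapse newlines and repeated spaces so each match reads as one line.
matches = [" ".join(m.group(0).split()) for m in pattern.finditer(text)]

for found in matches:
    print(found)
# DOCK onomatopoeia
# DOCK blah blah blah onomatopoeia
# DOCK blah onomatopoeia
```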
Daniel Brückner
Depending on the specific use case, you may want to wrap DOCK in a pair of `\b` (word boundary) anchors to ensure that, for example, "haddock" doesn't cause incorrect behaviour.
Peter Boughton
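To illustrate the point about `\b`, here is a small sketch comparing the two patterns on a made-up sentence. The test string and the names `plain` and `bounded` are hypothetical and chosen only to show the difference.

```python
import re

# Made-up input in which "dock" also appears inside another word.
text = "DOCK fresh haddock onomatopoeia"
flags = re.I | re.DOTALL

# Without word boundaries, the "dock" inside "haddock" counts as a DOCK.
plain = re.compile(r"DOCK(?:(?!DOCK).)+?onomatopoeia", flags)

# With \b on both sides, only the standalone word DOCK counts.
bounded = re.compile(r"\bDOCK\b(?:(?!\bDOCK\b).)+?onomatopoeia", flags)

plain_matches = [m.group(0) for m in plain.finditer(text)]
bounded_matches = [m.group(0) for m in bounded.finditer(text)]

print(plain_matches)    # ['dock onomatopoeia']
print(bounded_matches)  # ['DOCK fresh haddock onomatopoeia']
```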
Great point, Peter. Thanks for the answer, Daniel!
dandyjuan
A: 

Here's an algorithmic approach:

  • Set pushing = false.
  • Break your text apart into words (e.g. spans of letters) and loop over them.
  • On hitting a DOCK while pushing == false, push it onto a stack and set pushing = true.
  • On hitting ono... while pushing == true, print whatever is on the stack plus ono..., then clear the stack and set pushing = false.
  • On any other word while pushing == true, push it.
  • On hitting a DOCK while pushing == true, clear the stack, then push the new DOCK.
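
The steps above could be sketched in Python roughly as follows. The function name `nearest_spans`, its parameters, and the choice to return a list instead of printing are assumptions made for this illustration. The two DOCK cases are merged into one branch, since both end with a stack that holds only the new DOCK.

```python
import re

def nearest_spans(text, start="DOCK", end="onomatopoeia"):
    """Return each run of words from the nearest `start` up to `end`."""
    stack = []        # words collected since the most recent start word
    pushing = False   # True while a start word is on the stack
    results = []
    # Treat every span of letters as a word, as the steps suggest.
    for word in re.findall(r"[A-Za-z]+", text):
        lowered = word.lower()
        if lowered == start.lower():
            # First DOCK or a repeated DOCK: restart from this word.
            stack = [word]
            pushing = True
        elif lowered == end.lower() and pushing:
            results.append(" ".join(stack + [word]))
            stack = []
            pushing = False
        elif pushing:
            stack.append(word)
    return results

sample = """DOCK onomatopoeia DOCK blah blah
blah DOCK blah
DOCK
blah blah blah
onomatopoeia
blah blah blah
blah blah DOCK
DOCK blah blah
DOCK blah
onomatopoeia"""

spans = nearest_spans(sample)
for span in spans:
    print(span)
# DOCK onomatopoeia
# DOCK blah blah blah onomatopoeia
# DOCK blah onomatopoeia
```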
Carl Smotricz
Thanks, but this seems quite complicated.
dandyjuan