I have a text file containing sections of text that I need to extract. It looks something like this:

ITEM A blah blah blah ITEM B bloo bloo bloo ITEM A blee blee blee ITEM B

Here is the working code I have so far:

finda = r'(Item\sA)'
findb = r'(Item\sB)'
match_a = re.finditer(finda, usefile, 2)  # the "2" is a flag to say ignore case
match_b = re.finditer(findb, usefile, 2)

I know that I can use match methods like span(), start(), and end() to find the positions of my matches. But I need to do this many times, so what I need is:

  1. start writing at ITEM A and stop writing at ITEM B.
  2. if that first iteration is less than 50 characters long then discard and move to the next one
  3. once you find a set that starts with ITEM A and ends with ITEM B and is larger than 50 characters write it to a file

Thanks a ton in advance! I have been spinning my wheels for a while.

+2  A: 

Why not just:

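import re

with open(fname, 'w') as f:
    # Lazily capture the text between each "Item A" / "Item B" pair
    for match in re.finditer(r'Item A(.+?)Item B', subject, re.I):
        s = match.group(1)
        if len(s) > 50:  # skip sections of 50 characters or fewer
            f.write(s)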

Note: passing the raw numeric value of a flag is rather opaque; use the named flags that the re module provides instead (e.g. re.I or re.IGNORECASE).
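To illustrate the note about flags, here is a minimal sketch showing that the bare 2 from the question and the named flag behave the same, and that named flags combine readably. The sample string and variable names are made up for the example, and adding re.DOTALL is an assumption for input where a section spans several lines.

```python
import re

# Hypothetical sample text, modelled on the question's example.
text = "ITEM A blah blah blah ITEM B"

# re.I is an alias of re.IGNORECASE, and its integer value is 2,
# which is why the bare 2 in the question happens to work.
assert re.I is re.IGNORECASE
assert int(re.IGNORECASE) == 2

by_number = [m.span() for m in re.finditer(r"Item\sA", text, 2)]
by_name = [m.span() for m in re.finditer(r"Item\sA", text, re.IGNORECASE)]

# Named flags can be OR-ed together. re.DOTALL lets "." match newlines,
# which may be needed if a section spans more than one line.
multi = re.compile(r"Item\sA(.+?)Item\sB", re.IGNORECASE | re.DOTALL)
```

Both lists hold the same spans; the named form simply tells the reader which behaviour was requested.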

SilentGhost
You should use a look-ahead assertion for the end delimiter to allow overlapping of start and end delimiters.
Gumbo
Thanks! Once I figured out what all this meant I got it to work.
dandyjuan
+2  A: 

This can be done in a single regex:

import re

with open("output.txt", "w") as f:
    for match in re.finditer(r"(?<=Item\sA)(?:(?!Item\sB).){50,}(?=Item\sB)", subject, re.I):
        f.write(match.group() + "\n")

This matches what is between Item A and Item B. Or did you want to match the delimiters, too?

The regex explained:

(?<=Item\sA)   # assert that we start our match right after "Item A"
(?:            # start repeated group (non-capturing)
  (?!Item\sB)  # assert that we're not running into "Item B"
  .            # then match any character
){50,}         # repeat this at least 50 times
(?=Item\sB)    # then assert that "Item B" follows next (without making it part of the match)
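The commented layout above can also be handed to Python as-is by compiling it with re.VERBOSE. What follows is only a sketch: the name BETWEEN_ITEMS and the sample subject string are invented for the example, and re.DOTALL is an added assumption so that sections containing newlines still match.

```python
import re

# Free-spacing (re.VERBOSE) form of the same pattern. Whitespace and
# "#" comments inside the pattern are ignored by the regex engine.
BETWEEN_ITEMS = re.compile(
    r"""
    (?<=Item\sA)     # start right after "Item A"
    (?:
        (?!Item\sB)  # make sure "Item B" does not start here
        .            # then consume one character
    ){50,}           # at least 50 characters in total
    (?=Item\sB)      # "Item B" must follow, but is not consumed
    """,
    re.IGNORECASE | re.VERBOSE | re.DOTALL,
)

# Hypothetical input: the first section is too short, the second is long enough.
subject = "ITEM A short ITEM B filler ITEM A " + "x" * 60 + " ITEM B"

sections = [m.group() for m in BETWEEN_ITEMS.finditer(subject)]
```

With this input, sections keeps only the long section; the short one is rejected by the {50,} quantifier rather than by a separate length check.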
Tim Pietzcker
This is great code, but it's kind of complex and hard to figure out.
vy32
@vy32: I agree, and I have provided a free-spacing version of the regex to explain it better.
Tim Pietzcker