I'm supposed to capture everything inside a tag and the lines after it, but it's supposed to stop the next time it meets a bracket. What am I doing wrong?

import re #regex

regex = re.compile(r"""
         ^                    # Must start in a newline first
         \[\b(.*)\b\]         # Get what's enclosed in brackets 
         \n                   # only capture bracket if a newline is next
         (\b(?:.|\s)*(?!\[))  # should read: anyword that doesn't precede a bracket
       """, re.MULTILINE | re.VERBOSE)

haystack = """
[tab1]
this is captured
but this is suppose to be captured too!
@[this should be taken though as this is in the content]

[tab2]
help me
write a better RE
"""
m = regex.findall(haystack)
print(m)

What I'm trying to get is:
[('tab1', 'this is captured\nbut this is suppose to be captured too!\n@[this should be taken though as this is in the content]\n', '[tab2]','help me\nwrite a better RE\n')]

edit:

regex = re.compile(r"""
             ^           # Must start in a newline first
             \[(.*?)\]   # Get what's enclosed in brackets 
             \n          # only capture bracket if a newline is next
             ([^\[]*)    # stop reading at opening bracket
        """, re.MULTILINE | re.VERBOSE)

This seems to work, but it also stops at the brackets inside the content.

+3  A: 

Python's re module doesn't support recursion, as far as I know.

EDIT: but in your case this would work:

regex = re.compile(r"""
         ^           # Must start in a newline first
         \[(.*?)\]   # Get what's enclosed in brackets 
         \n          # only capture bracket if a newline is next
         ([^\[]*)    # stop reading at opening bracket
    """, re.MULTILINE | re.VERBOSE)

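EDIT 2: Yes, it doesn't work properly, because it stops at any [ in the content. This version stops only at a tag that sits on its own line: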

import re

regex = re.compile(r"""
    (?:^|\n)\[             # tag's opening bracket  
        ([^\]\n]*)         # 1. text between brackets
    \]\n                   # tag's closing bracket
    (.*?)                  # 2. text between the tags
    (?=\n\[[^\]\n]*\]\n|$) # until tag or end of string but don't consume it
    """, re.DOTALL | re.VERBOSE)

haystack = """[tag1]
this is captured [not a tag[
but this is suppose to be captured too!
[another non-tag

[tag2]
help me
write a better RE[[[]
"""

print(regex.findall(haystack))

I do agree with viraptor though. Regexes are cool, but you can't check your file for errors with them. A hybrid, perhaps? :P

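tag_re = re.compile(r'^\[([^\]\n]*)\]$', re.MULTILINE)
tags = list(tag_re.finditer(haystack))

result = {}
# Content of each tag runs from just after its "]" to just before the next tag's "[".
for (mo1, mo2) in zip(tags[:-1], tags[1:]):
    result[mo1.group(1)] = haystack[mo1.end(1)+1:mo2.start(1)-1].strip()
# The last tag's content runs to the end of the string (skipped if no tag was found).
if tags:
    result[tags[-1].group(1)] = haystack[tags[-1].end(1)+1:].strip()

print(result)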
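
Another way to sketch the same hybrid idea (not the code above, just an alternative I'd try) is re.split with a capturing group: the tag names end up in the result list between the chunks of content. The sample text and the names split_re, parts and sections are made up for illustration, and the pattern assumes every tag sits alone on its line.

```python
import re

# Hypothetical sample: two real tags plus some bracket noise in the content.
sample = "[tag1]\nthis [not a tag[\n[another non-tag\n\n[tag2]\nhelp me\n"

# A whole-line tag; the optional \n eats the line break after the tag.
split_re = re.compile(r'^\[([^\]\n]*)\]$\n?', re.MULTILINE)

# With one capturing group, re.split returns
# [text before first tag, name1, content1, name2, content2, ...].
parts = split_re.split(sample)

# Pair every tag name with the content that follows it.
sections = dict(zip(parts[1::2], (body.strip() for body in parts[2::2])))

print(sections)
```

Anything before the first tag lands in parts[0], so a non-blank parts[0] could be treated as an error if you want the check viraptor describes.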

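EDIT 3: That's because the ^ character means negation only as the first character inside a character class, as in [^squarebrackets]. Everywhere else it means start of string (or start of line with re.MULTILINE). A character class can only negate single characters, not whole strings; the closest thing regex offers for strings is a negative lookahead, (?!...), and it isn't pretty.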
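
To make the two meanings of ^ concrete, here is a small sketch. The sample strings and variable names are invented for the demo. The last pattern shows one lookahead-based way to say "keep reading until a tag line starts"; treat it as an illustration, not as the recommended solution.

```python
import re

text = "line one\n[tag]\nline @[two]\n"

# Inside a character class, a leading ^ negates the class.
negated = re.match(r"[^\[]*", text).group()
# Outside a class, ^ is an anchor (start of each line with re.MULTILINE).
anchored = re.findall(r"^\[(\w+)\]$", text, re.MULTILINE)
# (^[\n\[]*) is an anchor followed by a class of newline and "[" characters,
# so it matches a run of newlines and brackets at the start.
attempt = re.match(r"(^[\n\[]*)", "\n[\n[abc").group(1)

# Negative lookahead, checked before every character of the content.
section_re = re.compile(r"""
    ^\[([^\]\n]*)\]\n             # 1. tag on its own line
    ((?:(?!^\[[^\]\n]*\]$).)*)    # 2. any character that does not start a tag line
    """, re.MULTILINE | re.DOTALL | re.VERBOSE)

sample = "[tab1]\na [x] b\n[not a tag\n\n[tab2]\nend\n"
sections = section_re.findall(sample)

print(sections)
```

The lookahead is retried at every position, so this is slower and harder to read than the lazy (.*?) version above.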

Ivan Baldin
Thanks for the response. I see; I've indeed tried the recursive (?R), but you're right, it doesn't really work in Python. Do you know a way for me to achieve what I'm trying to do?
cybervaldez
I'm having a problem: it seems to stop when there's a bracket inside the content as well. How do I make it stop only when it finds a [ bracket at the start of a line, as in [tab1]?
cybervaldez
Thank you, this question of mine has been very informative, as a lot of details and alternatives have appeared. I'm quite surprised at how different things have become from your first solution. I have no idea why my solution didn't work: (^[\n\[]*). Doesn't this read as "stop when there's a [ bracket after a newline"? Why doesn't it work? This is just food for thought; your answer works perfectly already.
cybervaldez
+2  A: 

Does this do what you want?

regex = re.compile(r"""
         ^                      # Must start in a newline first
         \[\b(.*)\b\]           # Get what's enclosed in brackets 
         \n                     # only capture bracket if a newline is next
         ([^\[]*)               # content: everything up to the next opening bracket
       """, re.MULTILINE | re.VERBOSE)

This gives a list of tuples (one 2-tuple per match). If you want a flattened tuple you can write:

m = sum(regex.findall(haystack), ())
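
A quick sketch of what that flattening does, on a made-up two-section string (the names sample, pairs, flat_sum and flat_chain are mine). sum builds a new tuple for every match, so for a very large number of matches itertools.chain.from_iterable is a cheaper way to get the same result.

```python
import re
from itertools import chain

regex = re.compile(r"^\[(.*?)\]\n([^\[]*)", re.MULTILINE)
sample = "[a]\none\n[b]\ntwo\n"

# findall returns one (tag, content) tuple per match.
pairs = regex.findall(sample)

# sum starts from the empty tuple and concatenates each pair onto it.
flat_sum = sum(pairs, ())

# Same result without the repeated tuple concatenation.
flat_chain = tuple(chain.from_iterable(pairs))

print(flat_sum)
```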
Laurence Gonsalves
m = sum(regex.findall(haystack), ()) Thanks for the tip!
cybervaldez
+3  A: 

First of all, why a regex if you're trying to parse? As you can see, you cannot find the source of the problem yourself, because a regex gives no feedback. Also, you don't have any recursion in that RE.

Make your life simple:

def ini_parse(src):
    in_block = None   # name of the section currently being read
    contents = {}
    for line in src.split("\n"):
        if line.startswith('[') and line.endswith(']'):
            # A whole-line [tag] starts a new section.
            in_block = line[1:-1]
            contents[in_block] = ""
        elif in_block is not None:
            contents[in_block] += line + "\n"
        elif line.strip() != "":
            # Non-blank text before the first tag is an error.
            raise Exception("content out of block")
    return contents

You get error handling with exceptions and the ability to debug execution as a bonus. Also you get a dictionary as a result and can handle duplicate sections while processing. My result:

{'tab2': 'help me\nwrite a better RE\n\n',
 'tab1': 'this is captured\nbut this is suppose to be captured too!\n@[this should be taken though as this is in the content]\n\n'}
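
As for handling duplicate sections, here is one way it could look. This is a hypothetical variant of the function above (parse_sections and its error messages are invented for the example), assuming you want a repeated tag to be an error instead of silently overwriting the earlier section.

```python
def parse_sections(src):
    """Hypothetical variant of the line-based parser that rejects duplicate tags."""
    contents = {}
    current = None  # name of the section being read, None before the first tag
    for lineno, line in enumerate(src.split("\n"), 1):
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            if current in contents:
                raise ValueError("duplicate section %r on line %d" % (current, lineno))
            contents[current] = ""
        elif current is not None:
            contents[current] += line + "\n"
        elif line.strip():
            raise ValueError("content outside of a section on line %d" % lineno)
    return contents


ok = parse_sections("[a]\none\n[b]\ntwo")

try:
    parse_sections("[a]\none\n[a]\nagain")
    duplicate_error = None
except ValueError as exc:
    duplicate_error = str(exc)

print(ok)
print(duplicate_error)
```

The line number in the message is the kind of feedback a single regex can't give you.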

RE is much overused these days...

viraptor
Yes, that is what a friend also suggested to me, but I decided to use regex since it's going to help me a lot with future regexing (pardon the word). I've just started working with regex, so if I can't make simple parsing like this work, I'll probably never learn my way around it. This is for my understanding of regex as well, and from the looks of things I really need to learn it.
cybervaldez