ansaurus

Question

My regex in python isn't recursing properly

Answer 1

+3 A:

Python's re module doesn't support recursive patterns, AFAIK.

EDIT: but in your case this would work:

regex = re.compile(r"""
         ^           # Must start in a newline first
         \[(.*?)\]   # Get what's enclosed in brackets 
         \n          # only capture bracket if a newline is next
         ([^\[]*)    # stop reading at opening bracket
    """, re.MULTILINE | re.VERBOSE)

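To see where the pattern above falls short, here is a small check. The sample string is made up for illustration and is not the asker's real input. The content group `([^\[]*)` stops at the first `[` anywhere, not only at one that starts a line, so the rest of that line is lost.

```python
import re

# The pattern from the answer above, unchanged apart from the comments.
regex = re.compile(r"""
         ^           # must be at the start of a line
         \[(.*?)\]   # tag name between brackets
         \n          # the tag must be followed by a newline
         ([^\[]*)    # content: everything up to the next opening bracket
    """, re.MULTILINE | re.VERBOSE)

# Hypothetical sample input, invented for this illustration.
sample = "[tag1]\nfirst line\nsecond [not a tag] line\n[tag2]\nmore text\n"

matches = regex.findall(sample)
print(matches)
# The text after "second " is missing from the first tuple.
```
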
EDIT 2: Yes, you're right, it doesn't work properly when the content contains a bracket. Try this instead:

import re

regex = re.compile(r"""
    (?:^|\n)\[             # tag's opening bracket  
        ([^\]\n]*)         # 1. text between brackets
    \]\n                   # tag's closing bracket
    (.*?)                  # 2. text between the tags
    (?=\n\[[^\]\n]*\]\n|$) # until tag or end of string but don't consume it
    """, re.DOTALL | re.VERBOSE)

haystack = """[tag1]
this is captured [not a tag[
but this is suppose to be captured too!
[another non-tag

[tag2]
help me
write a better RE[[[]
"""

print(regex.findall(haystack))

I do agree with viraptor, though. Regexes are cool, but you can't check your file for errors with them. A hybrid, perhaps? :P

tag_re = re.compile(r'^\[([^\]\n]*)\]$', re.MULTILINE)
tags = list(tag_re.finditer(haystack))

result = {}
for mo1, mo2 in zip(tags[:-1], tags[1:]):
    # content runs from the end of this tag line to the start of the next one
    result[mo1.group(1)] = haystack[mo1.end():mo2.start()].strip()
if tags:
    # the last tag owns everything up to the end of the string
    result[tags[-1].group(1)] = haystack[tags[-1].end():].strip()

print(result)

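EDIT 3: That's because the ^ character means negation only when it is the first character inside [^square brackets]. Everywhere else it means start of string (or start of line with re.MULTILINE), so (^[\n\[]*) reads as "at the start, match any run of newlines and [ characters". A character class can only negate single characters, not whole strings; excluding a string takes a lookahead such as (?!...), which is clumsier.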
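
A small sketch of the two meanings of ^, using a made-up sample string. The last pattern is one possible workaround for "stop before a newline followed by [", based on a negative lookahead; it is an illustration, not something taken from the answers here.

```python
import re

text = "a[b\n[c"  # made-up sample

# 1. First character inside a class: ^ negates the set.
not_bracket = re.match(r"[^\[]*", text).group()

# 2. Anywhere else: ^ is an anchor (start of each line with re.MULTILINE).
line_start_brackets = re.findall(r"^\[", text, re.MULTILINE)

# 3. ^[\n\[]* is therefore "anchor, then a run of newlines or '[' characters".
#    The sample starts with 'a', so the run is empty.
anchored_run = re.match(r"^[\n\[]*", text).group()

# 4. A negative lookahead can express "stop before a newline followed by '['".
until_tag_line = re.match(r"(?:(?!\n\[).)*", text, re.DOTALL).group()

print(repr(not_bracket), line_start_brackets, repr(anchored_run), repr(until_tag_line))
```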

Ivan Baldin 2009-06-05 09:24:39

Thanks for the response. I see; I did try the recursive (?R), but you're right, it doesn't really work in Python. Do you know a way I can achieve what I'm trying to do?

cybervaldez 2009-06-05 09:29:40

I'm having a problem: it seems to stop when there's a bracket inside the content as well. How do I make it stop only when it finds a [ bracket at the start of a line (like [tab1])?

cybervaldez 2009-06-06 11:40:19

Thank you. This question of mine has been very informative, as a lot of details and alternatives have appeared. I'm quite surprised at how different things have become from your first solution. I have no idea why my solution didn't work: (^[\n\[]*). Doesn't this read as "stop when there's a [ bracket after a newline"? Why doesn't it work? This is just food for thought; your answer already works perfectly.

cybervaldez 2009-06-07 00:41:35

Answer 2

+2 A:

Does this do what you want?

regex = re.compile(r"""
         ^                      # Must start in a newline first
         \[\b(.*)\b\]           # Get what's enclosed in brackets 
         \n                     # only capture bracket if a newline is next
         ([^\[]*)               # stop reading at opening bracket
       """, re.MULTILINE | re.VERBOSE)

This gives a list of tuples (one 2-tuple per match). If you want a flattened tuple you can write:

m = sum(regex.findall(haystack), ())
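
For reference, a small sketch of what that flattening does. The `pairs` list is a hypothetical findall() result, and the itertools and dict variants are alternatives I'm adding, not part of the original tip.

```python
from itertools import chain

# Hypothetical findall() result: one (tag, content) tuple per match.
pairs = [("tag1", "first block"), ("tag2", "second block")]

# sum() starts from the empty tuple and concatenates each 2-tuple onto it.
flat_sum = sum(pairs, ())

# Same result without building an intermediate tuple per step.
flat_chain = tuple(chain.from_iterable(pairs))

# If a tag -> content mapping is what you are after, dict() takes the pairs directly.
as_dict = dict(pairs)

print(flat_sum)
print(as_dict)
```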

Laurence Gonsalves 2009-06-05 09:32:38

m = sum(regex.findall(haystack), ()) thanks for the tip!

cybervaldez 2009-06-07 00:43:14

Answer 3

+3 A:

First of all, why use a regex if you're trying to parse? As you can see, you can't find the source of the problem yourself, because a regex gives no feedback. Also, there's no recursion in that RE.

Make your life simple:

def ini_parse(src):
   in_block = None   # name of the section currently being read
   contents = {}
   for line in src.split("\n"):
      if line.startswith('[') and line.endswith(']'):
         # a whole line of the form [name] opens a new section
         in_block = line[1:-1]
         contents[in_block] = ""
      elif in_block is not None:
         contents[in_block] += line + "\n"
      elif line.strip() != "":
         # non-blank text before the first section header
         raise Exception("content out of block")
   return contents

You get error handling with exceptions and the ability to debug execution as a bonus. Also you get a dictionary as a result and can handle duplicate sections while processing. My result:

{'tab2': 'help me\nwrite a better RE\n\n',
 'tab1': 'this is captured\nbut this is suppose to be captured too!\n@[this should be taken though as this is in the content]\n\n'}
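
A minimal sketch of how the error handling might be used. The parser is copied here so the snippet runs on its own; the duplicate-section check and the switch to ValueError are my assumptions about what "handle duplicate sections" could look like, and the sample strings are made up.

```python
def ini_parse(src):
    """Same loop as the parser above, plus a duplicate-section check (an assumption)."""
    in_block = None
    contents = {}
    for line in src.split("\n"):
        if line.startswith("[") and line.endswith("]"):
            in_block = line[1:-1]
            if in_block in contents:
                raise ValueError("duplicate section: " + in_block)
            contents[in_block] = ""
        elif in_block is not None:
            contents[in_block] += line + "\n"
        elif line.strip() != "":
            raise ValueError("content out of block")
    return contents


# Brackets inside the content are harmless: only whole-line [name] headers count.
parsed = ini_parse("[tab1]\nfirst [not a tag[ line\n[tab2]\nhelp me\n")
print(parsed)

# Malformed input is reported instead of being silently skipped.
errors = []
for bad in ("stray text\n[tab1]\n", "[tab1]\na\n[tab1]\nb\n"):
    try:
        ini_parse(bad)
    except ValueError as exc:
        errors.append(str(exc))
print(errors)
```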

RE is much overused these days...

viraptor 2009-06-06 12:15:02

Yes, that is what a friend also suggested to me, but I decided to do it with a regex since that will help me a lot with future regexing (pardon the word). I've just started working with regex, so if I can't make simple parsing like this work, I'll probably never learn my way around it. This is for my understanding of regex as well, and from the looks of things I really need to learn it.

cybervaldez 2009-06-07 00:20:11
