ansaurus

Question

Answer 1

+1 A:

I would use re.finditer and build a basic parse tree to determine the nesting level. To do it, I would use the span attribute of the regex match object:

text = '{{Hello|Hi|Hey} {world|earth} | {Goodbye|farewell} {noobs|n3wbz|n00blets}}'

import re
re_bracks = re.compile(r'{.+?}')

# subclass list for a basic tree datatype
class bracks(list):
    def __init__(self, m):
        self.m = m

# icky procedure to create the parse tree
# I hate these but don't know how else to do it
parse_tree = []
for m in re_bracks.finditer(text):
    if not this_element:
        # this first match
        parse_tree.extend(element(m))
    else:
        # ... and all the rest
        this_element = bracks(m)
        this_start, this_end = m.span()

        # if this match is nested in the old one ...
        if this_start < previous_start and this_end > previous_end:
            # nest it inside the previous one
            previous_element.extend(this_element) 
        else:
            # otherwise make it a child of the parse_tree
            parse_tree.extend(element(m))

        previous_element = this_element
        previous_start, previous_end = this_start, this_end

This would give you the nesting depth of the bracketed expressions. Add some similar logic for the pipes and you'd be well on your way to solving the problem.

twneale 2009-11-27 23:05:29

Oops, realized this regex wouldn't work.

twneale 2009-11-28 01:17:55

Answer 2

+1 A:

I'd recommend taking a look at the dada engine for inspiration.

I've done an implementation of something inspired by this in scheme and leveraged scheme's AST to express my needs.

Specifically, I'd recommend strongly against trying to use a regex as a parser in general.

Dustin 2009-11-27 23:19:26

Answer 3

+3 A:

A simple way with re.subn, which can also accept a function instead of a replacement string:

import re
from random import randint

def select(m):
    choices = m.group(1).split('|')
    return choices[randint(0, len(choices)-1)]

def spinner(s):
    r = re.compile('{([^{}]*)}')
    while True:
        s, n = r.subn(select, s)
        if n == 0: break
    return s.strip()

It simply replaces all the deepest choices it meets, then iterates until no choice remains. subn returns a tuple with the result and how many replacements were made, which is convenient to detect the end of the processing.

My version of select() can be replaced by Bobince's that uses random.choice() and is more elegant if you just want to stick to a random selector. If you want to build a choice tree, you could extend the above function, but you will need global variables to keep track of where you are, so moving the functions into a class would make sense. This is just a hint, I won't develop that idea since it was not really the orginial question.

Note finally that you should use r.subn(select, s, re.U) if you need unicode strings (s = u"{...}")

Example:

>>> s = "{{Hello|Hi|Hey} {world|earth} | {Goodbye|farewell} {noobs|n3wbz|n00blets}}"
>>> print spinner(s)
'farewell n3wbz'

Edit: Replaced sub by subn to avoid infinite loop (thanks to Bobince to point it out) and make it more efficient, and replaced {([^{}]+)} by {([^{}]*)} to extract empty curly brackets as well. That should make it more robust to ill-formatted patterns.

For people who like to put as much as possible on one line (which I personally wouldn't encourage):

def spin(s):
    while True:
        s, n = re.subn('{([^{}]*)}',
                       lambda m: random.choice(m.group(1).split("|")),
                       s)
        if n == 0: break
    return s.strip()

RedGlyph 2009-11-27 23:56:47

snap (almost)! the `{` test would go into an infinite loop if there was an unmatched `{` in the input, though.

bobince 2009-11-28 00:10:50

You're right, I fixed the code and made the test unnecessary.

RedGlyph 2009-11-28 09:55:00

doh, and I hadn't spotted subn, either! Well, we've ended up with an acceptable program between us :-)

bobince 2009-11-28 11:58:55

Answer 4

+3 A:

Should be fairly simple, just disallow a brace set from including another, then repeatedly call doing replacements from the inner matches outwards:

def replacebrace(match):
    return random.choice(match.group(1).split('|'))

def randomizebraces(s):
   while True:
       s1= re.sub(r'\{([^{}]*)\}', replacebrace, s)
       if s1==s:
           return s
       s= s1

>>> randomizebraces('{{Hello|Hi|Hey} {world|earth}|{Goodbye|farewell} {noobs|n3wbz|n00blets}}')
'Hey world'
>>> randomizebraces('{{Hello|Hi|Hey} {world|earth}|{Goodbye|farewell} {noobs|n3wbz|n00blets}}')
'Goodbye noobs'

bobince 2009-11-28 00:09:37

+1: nice, I didn't know random.choice. I did hesitate to use the string comparison but opted for the simple bracket search/compiled regexp to make it a bit faster (but less safe).

RedGlyph 2009-11-28 09:11:25

Answer 5

+1 A:

This regex inverter uses pyparsing to generate matching strings (with some restrictions - unlimited repetition symbols like + and * are not allowed). If you replace {}'s with ()'s to make your original string into a regex, the inverter generates this list:

Helloworld
Helloearth
Hiworld
Hiearth
Heyworld
Heyearth
Goodbyenoobs
Goodbyen3wbz
Goodbyen00blets
farewellnoobs
farewelln3wbz
farewelln00blets

(I know the spaces are collapsed out, but maybe this code will give you some ideas on how to attack this problem.)

Paul McGuire 2009-11-28 04:15:33

ansaurus

tags:

views:

answers:

How to make article spinner regex?

related questions