Title says it all. I looked through the related questions; there were quite a few, but I don't think any of them answered this one. I am very new to regex but I'm trying to get better, so bear with me please. I am trying to match several groups in a string, but in any order. Is this something I should be using regex for? If so, how? If it matters, I plan to use these in IronPython.

EDIT: Someone asked me to be more specific, so here:

I want to use re.match with a regex like:

\[image\s*(?(@alt:(?<alt>.*?);).*(@title:(?<title>.*?);))*.*\](?<arg>.*?)\[\/image\]

But it will only match the named groups when they are in the right order, and separated with a space. I would like to be able to match the named groups in any order, as long as they appear where they do now in the regex.

A typical string that will be applied to this might look like:

[image @alt:alien; @title:reddit alien;]http://www.reddit.com/alien.png[/image]

But I should have no problem matching:

[image @title:reddit alien; @alt:alien;]http://www.reddit.com/alien.png[/image]

So the 'attributes' (things that come between '@' and ';' in the first 'tag') should be matched in any order, as long as they both appear.

+1  A: 

The answer to the question in your title is "no": to match N groups "in any order", the regex would need an "or" (the | feature in the regex pattern) among all N! possible permutations of the groups, where N! (N factorial) is the product of all integers from 1 to N. That number grows extremely fast (for N equal to just 6 it's already 720, for 7 it's over 5000, and so on at a dizzying pace), so this approach is totally impractical for any N which isn't really tiny.
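To make the growth concrete, here is a minimal sketch that builds the "or" of all orderings mechanically. The helper name any_order_pattern and the `.*?` separator are assumptions for illustration only, and the sub-patterns are left unnamed to sidestep duplicate group names.

```python
import re
from itertools import permutations

def any_order_pattern(parts, sep=r'.*?'):
    """Join every ordering of the sub-patterns in `parts` with `|`.

    Returns the combined pattern and the number of alternatives (N!).
    """
    alternatives = [sep.join(order) for order in permutations(parts)]
    return '(?:' + '|'.join(alternatives) + ')', len(alternatives)

# Two groups: only two orderings, so the pattern stays readable.
pattern, count = any_order_pattern([r'a.*?a', r'b.*?b'])
```

With two parts the result has two alternatives; with seven it already has 5040, which is why this only makes sense for very small N.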

The solutions may be many, depending on what you want the groups to be separated with. Let's say, for example, that you don't care (if you DO care, edit your question with better specs).

In this case, if overlapping matches are impossible or are OK with you, make N separate regular expressions, one per group -- say these N compiled RE objects are in a list named grps, then

mos = [g.search(thestring) for g in grps]

is the list of match objects for the groups (None for a group which doesn't match). With the mos list you can do all sorts of checks and/or further manipulations, for example all(mos) is True if and only if all the groups matched, in which case [m.group() for m in mos] is the list of substrings that have been matched, and so on, and so forth.
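As a rough sketch of how that could look for the image tag in the question: the two attribute patterns and the helper name match_all below are assumptions made for illustration, not something taken from the question.

```python
import re

# Hypothetical per-attribute patterns, one compiled RE per group.
grps = [
    re.compile(r'@alt:(?P<alt>.*?);'),
    re.compile(r'@title:(?P<title>.*?);'),
]

def match_all(thestring):
    """Return a dict of all named groups, or None if any group is missing."""
    mos = [g.search(thestring) for g in grps]
    if not all(mos):
        return None
    result = {}
    for m in mos:
        result.update(m.groupdict())
    return result
```

Because each pattern searches the whole string independently, the order in which the attributes appear does not matter.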

If you need non-overlapping matches, it's a bit more complicated: you may extract the boundaries of all possible matches for each group, then see if there's a way to pick from these N lists a set of N intervals, one per list, so that no two of them intersect. This is a somewhat subtle algorithm (if you want reasonable speed for a large N, of course), so I think it's worth a separate question, and in any case it's not worth discussing right here when the very issue of whether it's needed or not depends on so incredibly many factors that you have not specified.
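For small N only, the interval selection can be sketched with plain backtracking. This is a brute-force illustration, not the fast algorithm alluded to above; the names candidate_spans and pick_disjoint are hypothetical, and re.finditer only reports non-overlapping leftmost matches per pattern, so it yields a subset of "all possible matches".

```python
import re

def candidate_spans(pattern, text):
    """(start, end) spans of the matches re.finditer reports for one group."""
    return [m.span() for m in re.finditer(pattern, text)]

def pick_disjoint(span_lists, chosen=()):
    """Pick one span per list so that no two picked spans intersect.

    Returns the picked spans in list order, or None if no choice works.
    Worst-case cost is exponential in the number of lists.
    """
    if len(chosen) == len(span_lists):
        return list(chosen)
    for start, end in span_lists[len(chosen)]:
        # Half-open spans are disjoint when one ends before the other starts.
        if all(end <= s or start >= e for s, e in chosen):
            found = pick_disjoint(span_lists, chosen + ((start, end),))
            if found is not None:
                return found
    return None
```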

So, please edit your question with more precise specifications, first, and then things can perhaps be clarified to provide you with the code and/or algorithms you need.

Edit: I see the OP has now clarified the issue, at least to the extent of providing an example, although, confusingly, he offers a RE pattern example and a string example that should not match, regardless of ordering (the RE specifies the presence of a substring @title which the example string does not have; puzzling!).

Anyway, if the number of groups in the example (two which appear to be interchangeable, one which appears to have to occur in a specific spot) is representative of the OP's actual problems, then the total number of permutations of interest is just two, so joining the "just two" permutations with a vertical bar | would of course be quite feasible. Is that the case in the OP's real problems, though...?

Edit: if the number of permutations of interest is tiny, here's an example of one way to avoid the problem of repeated group names in the pattern. The syntax requires Python 2.7 or better, but that's just for the final "dict comprehension"; the same functionality is available in many previous versions of Python, just with the less elegant dict(('a', ... syntax ;-)

>>> import re
>>> r = re.compile(r'(?P<a1>a.*?a).*?(?P<b1>b.*?b)|(?P<b2>b.*?b).*?(?P<a2>a.*?a)')
>>> m = r.search('zzzakkkavvvbxxxbnnn')
>>> g = m.groupdict()
>>> d = {'a':(g.get('a1') or g.get('a2')), 'b':(g.get('b1') or g.get('b2'))}
>>> d
{'a': 'akkka', 'b': 'bxxxb'}
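Applied to the image tag from the question, the same trick might look roughly like the sketch below. The suffixed group names (alt1, alt2, title1, title2), the helper parse_image, and the exact whitespace handling are assumptions for illustration; the alternation is wrapped in non-grouping parentheses so the rest of the pattern is shared by both orderings.

```python
import re

IMAGE_RE = re.compile(
    r'\[image\s+'
    r'(?:'
    r'@alt:(?P<alt1>.*?);\s*@title:(?P<title1>.*?);'    # alt, then title
    r'|'
    r'@title:(?P<title2>.*?);\s*@alt:(?P<alt2>.*?);'    # title, then alt
    r')'
    r'\s*\](?P<arg>.*?)\[/image\]'
)

def parse_image(text):
    """Return alt, title and arg for the first image tag, or None."""
    m = IMAGE_RE.search(text)
    if m is None:
        return None
    g = m.groupdict()
    # Only one alternative matched, so only one of each name pair is set.
    return {
        'alt': g['alt1'] if g['alt1'] is not None else g['alt2'],
        'title': g['title1'] if g['title1'] is not None else g['title2'],
        'arg': g['arg'],
    }
```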
Alex Martelli
Edited again. So sorry for the confusion.
cory
@cory, given your latest edit the general approach of "a then b OR b then a" and the above trick to avoid duplicated names should work (use non-grouping parentheses `(?:...)` around the whole vbar-separated list of group permutations, of course), as long as you have no more than, say, three groups of interest (six permutations). Beyond that, the suggestion (implicit in a comment that was since deleted) to take an XML or HTML parser (which has similar requirements for identifying a tag's attributes), and hack it to use your peculiar required syntax, appears more alluring to me ;-).
Alex Martelli
A: 

This is very similar to one of the key problems with using regular expressions to parse HTML - there is no requirement that attributes always be specified in the same order, and many tags have surprising attributes (like <br clear="all">). So it seems you are working with a very similar markup syntax.

Pyparsing addresses this problem in an indirect way - instead of trying to parse all the different permutations, parse the general "@attrname:attribute value;" syntax, and keep track of the attribute keys and values in an attribute mapping data structure. The mapping makes it easy to get the "title" attribute, regardless of whether it came first or last in the image tag. This behavior is built into the pyparsing API methods makeHTMLTags and makeXMLTags.

Of course, this markup is not XML, but a similar approach gives results that are pretty easy to work with:

text = """[image @alt:alien; @title:reddit alien;]http://www.reddit.com/alien1.png[/image]

But I should have no problem matching:

[image @title:reddit alien; @alt:alien;]http://www.reddit.com/alien2.png[/image]
"""

from pyparsing import Suppress, Group, Word, alphas, SkipTo, Dict, ZeroOrMore

LBRACK,RBRACK,COLON,SEMI,AT = map(Suppress,"[]:;@")
tagAttribute = Group(AT + Word(alphas) + COLON + SkipTo(SEMI) + SEMI)
imageTag = LBRACK + "image" + Dict(ZeroOrMore(tagAttribute)) + RBRACK
imageLink = imageTag + SkipTo("[/image]")("text")

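# print(...) with a single argument behaves the same on Python 2 and 3
for taginfo in imageLink.searchString(text):
    print(taginfo.alt)     # attribute values are reachable by name via Dict
    print(taginfo.title)
    print(taginfo.text)    # everything up to the closing [/image]
    print("")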

Prints:

alien
reddit alien
http://www.reddit.com/alien1.png

alien
reddit alien
http://www.reddit.com/alien2.png
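If pulling in pyparsing is not an option, the same "parse the general attribute syntax into a mapping" idea can be sketched with the standard re module in two passes. This is not pyparsing's method, just a stand-in under stated assumptions: the names TAG_RE, ATTR_RE and parse_images are made up for illustration, and attribute values are assumed never to contain `;` or `]`.

```python
import re

# Pass 1: find each tag and capture its raw attribute text and its body.
TAG_RE = re.compile(r'\[image\s*(?P<attrs>[^\]]*)\](?P<text>.*?)\[/image\]', re.S)
# Pass 2: pull every "@name:value;" pair out of the attribute text.
ATTR_RE = re.compile(r'@(?P<name>[A-Za-z]+):(?P<value>[^;]*);')

def parse_images(text):
    """Return a list of (attribute dict, body text) pairs, one per image tag."""
    results = []
    for tag in TAG_RE.finditer(text):
        attrs = dict(
            (m.group('name'), m.group('value'))
            for m in ATTR_RE.finditer(tag.group('attrs'))
        )
        results.append((attrs, tag.group('text')))
    return results
```

Since the attributes land in a dict, their order in the tag is irrelevant, and adding a third or fourth attribute needs no change to the patterns.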
Paul McGuire