ansaurus

Question

Python: defining a union of regular expressions

Answer 1

+2 A:

list_regexs = [re.compile(x) for x in list_patterns]

Ignacio Vazquez-Abrams 2010-07-18 01:56:38

Thank you, but I see that this will be list of compiled regular expressions, I am looking for one compiled regular expression for all elements in list

learner 2010-07-18 02:02:38

@learner: This is important information that you should consider adding to the question. Preferably *before* answers come in.

Ignacio Vazquez-Abrams 2010-07-18 02:03:26

It is worth noting that the last 100 regexps that module `re` compiles are memoized (and then oddly, rather than keeping an MRU or LRU cache on regexp number 101, the first 100 are thrown away (see your local `re.py`, offer void in Manitoba)).

msw 2010-07-18 02:06:35

@msw: nice information, thanks for sharing :)@Ignacio: I apologize for that, will keep in mind

learner 2010-07-18 08:33:40

Answer 2

+1 A:

You want a pattern that matches any item in the list? Wouldn't that just be:

': error:|: warning:|cc1plus:|undefine reference to'?

Or, in Python code:

re.compile("|".join(list_patterns))

darkporter 2010-07-18 01:59:26

For simple strings this would probably work (although make for a hideous NFA). In the general case with arbitrary regexps (e.g. `['f|g', 'f*g']` I'm pretty sure it would break.

msw 2010-07-18 02:14:33

@msw: Assuming well-balanced regexes, you should be able to do something like this: `re.compile("|".join(['(?:%s)' % p for p in list_patterns]))`

Laurence Gonsalves 2010-07-18 02:21:06

Answer 3

+3 A:

There are a couple of ways of doing this. The simplest is:

list_patterns = [': error:', ': warning:', 'cc1plus:', 'undefine reference to']
string = 'there is an : error: and a cc1plus: in this string'
print re.findall('|'.join(list_patterns), string)

Output:

[': error:', 'cc1plus:']

which is fine as long as concatenating your search patterns doesn't break the regex (eg if one of them contains a regex special character like an opening parenthesis). You can handle that this way:

list_patterns = [': error:', ': warning:', 'cc1plus:', 'undefine reference to']
string = 'there is an : error: and a cc1plus: in this string'
pattern = "|".join(re.escape(p) for p in list_patterns)
print re.findall(pattern, string)

Output is the same. But what this does is pass each pattern through re.escape() to escape any regex special characters.

Now which one you use depends on your list of patterns. Are they regular expressions and can thus be assumed to be valid? If so, the first would probably be appropriate. If they are strings, use the second method.

For the first, it gets more complicated however because by concatenating several regular expressions you may change the grouping and have other unintended side effects.

cletus 2010-07-18 02:18:14

This assumes that list_patterns has literal strings to match, not regexes. (That may or may not be what the OP wanted... the question is kind of unclear on that.)

Laurence Gonsalves 2010-07-18 02:22:11

`str.join` takes an iterable as input, not a list. So `"|".join(re.escape(p) for p in list_patterns)` is enough.

KennyTM 2010-07-18 07:22:37

@Laurence I've further qualified that statement. @KennyTM I've updated accordingly.

cletus 2010-07-19 00:22:11

@cletusThank you for wonderful exaplanation! that helps

learner 2010-07-19 21:16:49

Answer 4

A:

How about

ptrn = re.compile('|'.join(re.escape(e) for e in list_patterns))

Note the use of re.escape() to avoid unintended consequences by presence of characters like ()[]|.+* etc in some of the strings. Assuming you want that, otherwise skip the escape().

It also depends how do you intend to 'consume' that expression - is it only for search of a match or would you like to collect the matching groups back?

Nas Banov 2010-07-18 02:40:50

Answer 5

A:

a regular expression that matches every element in the list

I see you've already got several answers based on the assumption that by "matches every element in the list" you actually meant "matches any element in the list" (the answers in questions are based on the | "or" operator of regular expressions).

If you actually do want a RE to match every element of the list (as opposed to any single such element), then you might want to match them either in the same order as the list gives them (easy), or, in any order whatever (hard).

For in-order matching, '.*?'.join(list_patterns) should serve you well (if the items are indeed to be taken as RE patterns - if they're to be taken as literal strings instead, '.*?'.join(re.escape(p) for p list_patterns)).

For any-order matching, regular expressions, per se, offer no direct support. You could take all permutations of the list (e.g. with itertools.permutations), join up each of them with '.*?', and join the whole with | -- but that can produce a terribly long RE pattern as a result, because the number of permutations of N items is N! ("N factorial" -- for example, for N equal 4, the permutations are 4 * 3 * 2 * 1 == 24). Performance may therefore easily suffer unless the number of items in the list is known to be very, very small.

For a more general solution to the "match every item in arbitrary order" problem (if that's what you need), one with a performance and memory footprint that's still acceptable for decently large lengths of list, you need to give up on the target of making it all work with a single RE object, and inject some logic in the mix -- for example, make a list of RE objects with relist=[re.compile(p) for p in list_patterns], and check "they all match string s, in any order" with all(r.search(s) for r in relist) or the like.

Of course, if you need to make this latest approach work in a "duck-typing compatible way" with actual RE objects, that's not hard, e.g., if all you need is a search method which returns a boolean result (since returning a "match object" would make no sense)...:

class relike(object):
    def __init__(self, list_patterns):
        self.relist = [re.compile(p) for p in list_patterns]
    def search(self, s):
        return all(r.search(s) for r in relist)

Alex Martelli 2010-07-18 03:33:54

Answer 6

A:

Cletus gave a very good answer. If however one of the strings to match could be a substring of another, then you would to reverse sort the strings first so that shortest matches do not obscure longer ones.

If, as Alex has noted, the original poster wanted what he actually asked for, then a more tractable solution than using permutations might be to:

Remove any duplicates in list_patterns. (It could be better off starting with a set then turning it into a reverse-sorted list without duplicates).
re.escape() the items of the list.
Surround each item in individually a group (... ).
'|'.join() all the groups.
Find the set of the indices of all groups that matched, and compare its length with len(list_patterns).

If there is at least one match for every entry in your original list of strings, then the length of the set should match.

The code would be something like:

import re

def usedgroupindex(indexabledata):
    for i,datum in enumerate(indexabledata):
        if datum: return i
    # return None

def findallstrings(list_patterns, string):
    lp = sorted(set(list_patterns), reverse=True)
    pattern = "|".join("(%s)" % re.escape(p) for p in lp)
    # for m in re.findall(pattern, string): print (m, usedgroupindex(m))
    return ( len(set(usedgroupindex(m) for m in re.findall(pattern, string)))
             == len(lp) )

list_patterns = [': error:', ': warning:', 'cc1plus:', 'undefine reference to']
string = ' XZX '.join(list_patterns)

print ( findallstrings(list_patterns, string) )

Paddy3118 2010-07-18 06:35:07

ansaurus

tags:

views:

answers:

Python: defining a union of regular expressions

related questions