ansaurus

Question

regex in python, can this be improved upon?

Answer 1

The solution

You have two options:

Use a non-capturing group: (?:@|#)\w+
Or, even better, a character class: [@#]\w+

References

regular-expressions.info/Character Class and Groups

Understanding `findall`

The problem you were having comes from how findall returns matches: its return value depends on how many capturing groups are present in the pattern.

Let's take a closer look at this pattern (annotated to show the groups):

((@|#)\w+)
|\___/   |
|group 2 |     # Read about groups to understand
\________/     # how they're defined and numbered/named
 group 1

Capturing groups let us save what each subpattern matched within the overall pattern.

import re

p = re.compile(r'((@|#)\w+)')
m = p.match('@tweet')
print(m.group(1))  # outer group: the sigil plus the word
# @tweet
print(m.group(2))  # inner group: just the sigil
# @
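
To make the numbering concrete, here is a small Python 3 sketch (the variable names are only illustrative). It shows that group 0 is always the whole match, and that the remaining groups are numbered by the position of their opening parenthesis:

```python
import re

pattern = re.compile(r'((@|#)\w+)')
m = pattern.match('@tweet')

whole = m.group(0)        # the entire match, independent of any groups
outer = m.group(1)        # first opening parenthesis: sigil plus word
inner = m.group(2)        # second opening parenthesis: just the sigil
all_groups = m.groups()   # every capturing group, in order, as a tuple

print(whole, outer, inner, all_groups)
# @tweet @tweet @ ('@tweet', '@')
```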

Now let's take a look at the Python documentation for the re module:

findall: Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.

This explains why you're getting the following:

str = 'lala @tweet boo #this &that @foo#bar'

print(re.findall(r'((@|#)\w+)', str))
# [('@tweet', '@'), ('#this', '#'), ('@foo', '@'), ('#bar', '#')]

As specified, since the pattern has more than one group, findall returns a list of tuples, one for each match. Each tuple holds what was captured by each group for that match.
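
If the capturing groups have to stay in the pattern for some other reason, one possible workaround is to pick the wanted element out of each tuple. This is a minimal sketch, with `text`, `pairs` and `full_matches` as illustrative names:

```python
import re

text = 'lala @tweet boo #this &that @foo#bar'

# Two capturing groups, so findall yields (group 1, group 2) tuples.
pairs = re.findall(r'((@|#)\w+)', text)

# Keep only group 1, which spans the whole tag.
full_matches = [full for full, _sigil in pairs]

print(full_matches)
# ['@tweet', '#this', '@foo', '#bar']
```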

The documentation also explains why you're getting the following:

print(re.findall(r'(@|#)\w+', str))
# ['@', '#', '@', '#']

Now the pattern only has one group, and findall returns a list of matches for that group.

In contrast, the patterns given above as solutions don't have any capturing groups, which is why they behave the way you expected:

print(re.findall(r'(?:@|#)\w+', str))
# ['@tweet', '#this', '@foo', '#bar']

print(re.findall(r'[@#]\w+', str))
# ['@tweet', '#this', '@foo', '#bar']
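
An alternative that does not depend on the number of groups is re.finditer, which yields match objects. Calling group(0) on a match object always returns the whole match. The sketch below also compiles the pattern once so it can be reused. The names `TAG` and `find_tags` are made up for this illustration:

```python
import re

# Compiled once at import time and reused on every call.
TAG = re.compile(r'[@#]\w+')

def find_tags(text):
    """Return every @mention or #hashtag in text, in order of appearance."""
    # group(0) is the full match, whatever groups the pattern contains.
    return [m.group(0) for m in TAG.finditer(text)]

print(find_tags('lala @tweet boo #this &that @foo#bar'))
# ['@tweet', '#this', '@foo', '#bar']
print(find_tags('no tags here'))
# []
```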

Attachments

Snippet with output on ideone.com

polygenelubricants 2010-06-02 19:33:42

Thanks! Is this something that can be compiled and then reused in a pattern?

tipu 2010-06-02 19:36:26

@tipu - 'anything' can be compiled/reused, don't forget to accept the answer

KevinDTimm 2010-06-02 19:38:48

@tipu: I've updated the answer with more Python-specific information.

polygenelubricants 2010-06-03 08:56:18
