The solution
You have two options:
- Use a non-capturing group:
(?:@|#)\w+
- Or even better, a character class:
[@#]\w+
Understanding findall
The problem you were having comes from how findall returns matches: the result depends on how many capturing groups the pattern contains.
Let's take a closer look at this pattern (annotated to show the groups):
((@|#)\w+)
|\___/   |
|group 2 |       # Read about groups to understand
\________/       # how they're defined and numbered/named
 group 1
Capturing groups let us save what each subpattern matched within the overall pattern.
import re

p = re.compile(r'((@|#)\w+)')
m = p.match('@tweet')
print(m.group(1))
# @tweet
print(m.group(2))
# @
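As a small sketch of how the groups are counted (the variable names here are only for illustration), the compiled pattern reports its number of capturing groups, and the match object exposes the whole match separately from the captured groups:

```python
import re

p = re.compile(r'((@|#)\w+)')
m = p.match('@tweet')

print(p.groups)    # number of capturing groups in the pattern: 2
print(m.group(0))  # the whole match, independent of any groups: @tweet
print(m.groups())  # every captured group as a tuple: ('@tweet', '@')

# A non-capturing group (?:...) is not counted as a group.
no_groups = re.compile(r'(?:@|#)\w+')
print(no_groups.groups)  # 0
```

Group 0 always refers to the entire match, which is why it is not affected by how the pattern is parenthesized.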
Now let's take a look at the Python documentation for the re module:
findall
: Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
This explains why you're getting the following:
str = 'lala @tweet boo #this &that @foo#bar'
print(re.findall(r'((@|#)\w+)', str))
# [('@tweet', '@'), ('#this', '#'), ('@foo', '@'), ('#bar', '#')]
As specified, since the pattern has more than one group, findall returns a list of tuples, one for each match. Each tuple gives you what was captured by each group for the given match.
The documentation also explains why you're getting the following:
print(re.findall(r'(@|#)\w+', str))
# ['@', '#', '@', '#']
Now the pattern has only one group, and findall returns a list of matches for that group.
In contrast, the patterns given above as solutions don't have any capturing groups, which is why they behave the way you expected:
print(re.findall(r'(?:@|#)\w+', str))
# ['@tweet', '#this', '@foo', '#bar']
print(re.findall(r'[@#]\w+', str))
# ['@tweet', '#this', '@foo', '#bar']
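If you would rather keep the capturing groups (say, because you need them elsewhere), one possible alternative is re.finditer, which yields match objects instead of strings or tuples. The following is only a sketch of that approach; the names text and whole are chosen here for illustration:

```python
import re

text = 'lala @tweet boo #this &that @foo#bar'

# finditer yields match objects, so group(0) is always the whole match,
# no matter how many capturing groups the pattern contains.
whole = [m.group(0) for m in re.finditer(r'((@|#)\w+)', text)]
print(whole)
# ['@tweet', '#this', '@foo', '#bar']
```

For this particular problem the character class is still the simpler choice; finditer is mostly useful when you need both the whole match and the individual groups.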