Hi SO,

I can't work out a regular expression that extracts the right tokens from my string: the ones that begin with a hyphen and are not preceded by a word character. Padding the beginning of the string with a space gives the correct output, but that seems less than optimal:

>>> import re
>>> s = '-edge_0triggered a-b | -level_Sensitive c-d | a-b-c'
>>> re.findall(r'\W(-[\w_]+)',' '+s)
['-edge_0triggered', '-level_Sensitive'] # correct output

Here are some of the regular expressions I've tried. Does anyone have a regex suggestion that generates the correct output without changing the original string?

>>> re.findall(r'(-[\w_]+)',s)
['-edge_0triggered', '-b', '-level_Sensitive', '-d', '-b', '-c']
>>> re.findall(r'\W(-[\w_]+)',s)
['-level_Sensitive']

Thanks -- DW

+1  A: 

Change the start of the pattern to accept either the beginning-of-string anchor or a non-word character, instead of only a non-word character:

>>> re.findall(r'(?:^|\W)(-[\w_]+)', s)
['-edge_0triggered', '-level_Sensitive']

The ?: at the start of the group makes it non-capturing, so re.findall returns only the text matched by the second, capturing group.
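To see what the ?: changes, here is a small sketch that runs the same pattern with and without it on the string from the question. The variable names with_capture and without_capture are only for illustration, and \w+ is used in place of [\w_]+ (the two behave the same here).

```python
import re

s = '-edge_0triggered a-b | -level_Sensitive c-d | a-b-c'

# Two capturing groups: findall returns one tuple per match,
# with one item per group (prefix, token).
with_capture = re.findall(r'(^|\W)(-\w+)', s)

# Prefix group made non-capturing with ?: so only the token
# group is left, and findall returns plain strings.
without_capture = re.findall(r'(?:^|\W)(-\w+)', s)

print(with_capture)
print(without_capture)
```

Without the ?: you would have to pick the second item out of each tuple yourself.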

Mark Rushakoff
Brilliant, thanks Mark. You get the check, though I'm going to use Ignacio's solution because it's shorter.
dlw
@dlw: you seem to be confused about what the check means. It doesn't mean “this answer was the fastest correct one”, it means “that's the answer that I'm gonna use”. You should check Ignacio's answer.
ΤΖΩΤΖΙΟΥ
Sorry Mark, check goes to Ignacio
dlw
+1  A: 
r'(?:^|\W)(-\w+)'

\w already includes the underscore.
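A quick way to check this, as a sketch using the string from the question (the names long_form and short_form are just for illustration):

```python
import re

s = '-edge_0triggered a-b | -level_Sensitive c-d | a-b-c'

# \w on its own matches an underscore.
underscore_is_word_char = re.fullmatch(r'\w', '_') is not None

# So [\w_]+ and \w+ find the same tokens.
long_form = re.findall(r'(?:^|\W)(-[\w_]+)', s)
short_form = re.findall(r'(?:^|\W)(-\w+)', s)

print(underscore_is_word_char)
print(long_form == short_form)
```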

Ignacio Vazquez-Abrams
A: 

You could use a negative-lookbehind:

re.findall(r'(?<!\w)(-\w+)', s)

The (?<!\w) part means "match only if not preceded by a word character". A lookbehind is zero-width, so it checks the preceding character without consuming it.
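Because the lookbehind consumes nothing, the capturing group is optional here: the whole match is already just the token. A minimal sketch on the string from the question (the names tokens and starts are only for illustration):

```python
import re

s = '-edge_0triggered a-b | -level_Sensitive c-d | a-b-c'

# No group needed: findall returns the whole match when the
# pattern has no capturing groups.
tokens = re.findall(r'(?<!\w)-\w+', s)

# Each match starts at the hyphen itself, not at the character
# before it, since the lookbehind is zero-width.
starts = [m.start() for m in re.finditer(r'(?<!\w)-\w+', s)]

print(tokens)
print([s[i] for i in starts])
```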

Alex Martelli