Hello everyone,

My strings are made of three underscore-separated elements:

  • first (letters and digits)
  • middle (letters, digits and underscores)
  • last (letters and digits)

The last element is optional.

Note : I need to access my groups by their names, not their indices.

Examples :

String : abc_def
first : abc
middle : def
last : None

String : abc_def_xyz
first : abc
middle : def
last : xyz

String : abc_def_ghi_jkl_xyz
first : abc
middle : def_ghi_jkl
last : xyz

I can't find the right regex...

I have two ideas so far :

Optional group

(?P<first>[a-z]+)_(?P<middle>\w+)(_(?P<last>[a-z]+))?

But the middle group matches until the end of the string :

String : abc_def_ghi_jkl_xyz
first : abc
middle : def_ghi_jkl_xyz
last : None
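
The greedy behaviour described above can be reproduced with a short Python check. This is only a minimal sketch; the variable names are illustrative and not part of the original question.

```python
import re

# First attempt from the question: greedy middle group, no anchors.
greedy = re.compile(r"(?P<first>[a-z]+)_(?P<middle>\w+)(_(?P<last>[a-z]+))?")

m = greedy.match("abc_def_ghi_jkl_xyz")

# \w also matches "_", so the greedy \w+ consumes the rest of the string
# and the optional last group is left with nothing to match.
print(m.group("first"), m.group("middle"), m.group("last"))
```

Because the trailing group is optional, the engine has no reason to backtrack and hand anything back from the middle group.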

Using the '|'

(?P<first>[a-z]+)_(?P<middle>\w+)_(?P<last>[a-z]+)|(?P<first>[a-z]+)_(?P<middle>\w+)

This expression is invalid : the first and middle groups are declared twice. I thought I could write an expression reusing the matched groups from the first part of the expression :

(?P<first>[a-z]+)_(?P<middle>\w+)_(?P<last>[a-z]+)|(?P=first)_(?P=middle)

The expression is valid, however strings with just a first and a middle like abc_def are not matched.
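
Both observations can be checked with Python's re module. The sketch below assumes Python's behaviour, where a backreference to a group that did not take part in the match always fails; the variable names are made up for illustration.

```python
import re

# Declaring the same group name twice is rejected at compile time.
dup = (r"(?P<first>[a-z]+)_(?P<middle>\w+)_(?P<last>[a-z]+)"
       r"|(?P<first>[a-z]+)_(?P<middle>\w+)")
try:
    re.compile(dup)
    dup_ok = True
except re.error as exc:
    dup_ok = False
    print("rejected:", exc)

# The backreference version compiles, but in the second alternative the
# groups "first" and "middle" are unset, so (?P=first) can never match.
backref = re.compile(
    r"(?P<first>[a-z]+)_(?P<middle>\w+)_(?P<last>[a-z]+)|(?P=first)_(?P=middle)"
)
three = backref.match("abc_def_xyz")
two = backref.match("abc_def")
print(three.groupdict())
print(two)
```

A backreference repeats the text a group has already captured; it does not re-run the group's sub-pattern, which is why it cannot stand in for a second declaration.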

Note

These strings are actually parts of a path I need to match. It could be paths like :

  • /my/path/to/abc_def
  • /my/path/to/abc_def/
  • /my/path/to/abc_def/some/other/stuf
  • /my/path/to/abc_def/some/other/stuf/
  • /my/path/to/abc_def_ghi_jkl_xyz
  • /my/path/to/abc_def_ghi_jkl_xyz/
  • /my/path/to/abc_def_ghi_jkl_xyz/some/other/stuf
  • /my/path/to/abc_def_ghi_jkl_xyz/some/other/stuf/
  • ...

Any idea to solve my problem solely with regular expressions ? Post-processing the matched groups is not an option.

Thank you very much !

+3  A: 

Change the middle group to be non-greedy, and add beginning and end-of-string anchors:

^(?P<first>[a-z]+)_(?P<middle>\w+?)(_(?P<last>[a-z]+))?$

By default, the \w+ will match as much as possible, which eats the rest of the string. Adding the ? tells it to match as little as possible.

Thanks to Tim Pietzcker for pointing out the anchor requirements.
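
To see why the anchors matter, here is a small Python comparison of the lazy pattern with and without them. It is a sketch only; the names core, loose and strict are chosen for illustration.

```python
import re

core = r"(?P<first>[a-z]+)_(?P<middle>\w+?)(_(?P<last>[a-z]+))?"
unanchored = re.compile(core)
anchored = re.compile("^" + core + "$")

s = "abc_def_ghi_jkl_xyz"

# Without anchors the lazy group stops after one character and the
# optional group is simply skipped.
loose = unanchored.match(s)
print(loose.groupdict())

# With "$" the engine has to keep extending the middle group until the
# remainder of the string fits the optional last group.
strict = anchored.match(s)
print(strict.groupdict())
```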

Graeme Perrow
This won't work without anchors. In `abc_def_ghi_jkl_xyz`, `first` will match `abc`, `middle` will match `d`, and `last` will be empty.
Tim Pietzcker
@Tim: You're right. I've updated my answer. Thanks
Graeme Perrow
+1  A: 

Use

^(?P<first>[a-z]+)_(?P<middle>\w+?)(_(?P<last>[a-z]+))?$

^ and $ anchor the regex at start and end of the string.

Making \w+ lazy (written \w+?) lets it match as little as possible (but at least one character).

EDIT:

For your changed requirements that now include paths before and after this string, this works:

^(.*?/)(?P<first>[a-z]+)_(?P<middle>\w+?)(_(?P<last>[a-z]+))?(/.*)?$

Code sample (Python 3.1):

import re
paths = ["/my/path/to/abc_def",
         "/my/path/to/abc_def/",
         "/my/path/to/abc_def/some/other/stuf",
         "/my/path/to/abc_def/some/other/stuf/",
         "/my/path/to/abc_def_ghi_jkl_xyz",
         "/my/path/to/abc_def_ghi_jkl_xyz/",
         "/my/path/to/abc_def_ghi_jkl_xyz/some/other/stuf",
         "/my/path/to/abc_def_ghi_jkl_xyz/some/other/stuf/"]

regex = re.compile(r"^(.*?/)(?P<first>[a-z]+)_(?P<middle>\w+?)(_(?P<last>[a-z]+))?(/.*)?$")

for path in paths:
    match = regex.match(path)
    print ("{}:\nBefore: {}\nFirst: {}\nMiddle: {}\nLast: {}\nAfter: {}\n".format(
           path, match.group(1), match.group("first"), match.group("middle"),
           match.group("last"), match.group(6)))

Output:

/my/path/to/abc_def:
Before: /my/path/to/
First: abc
Middle: def
Last: None
After: None

/my/path/to/abc_def/:
Before: /my/path/to/
First: abc
Middle: def
Last: None
After: /

/my/path/to/abc_def/some/other/stuf:
Before: /my/path/to/
First: abc
Middle: def
Last: None
After: /some/other/stuf

/my/path/to/abc_def/some/other/stuf/:
Before: /my/path/to/
First: abc
Middle: def
Last: None
After: /some/other/stuf/

/my/path/to/abc_def_ghi_jkl_xyz:
Before: /my/path/to/
First: abc
Middle: def_ghi_jkl
Last: xyz
After: None

/my/path/to/abc_def_ghi_jkl_xyz/:
Before: /my/path/to/
First: abc
Middle: def_ghi_jkl
Last: xyz
After: /

/my/path/to/abc_def_ghi_jkl_xyz/some/other/stuf:
Before: /my/path/to/
First: abc
Middle: def_ghi_jkl
Last: xyz
After: /some/other/stuf

/my/path/to/abc_def_ghi_jkl_xyz/some/other/stuf/:
Before: /my/path/to/
First: abc
Middle: def_ghi_jkl
Last: xyz
After: /some/other/stuf/
Tim Pietzcker
A: 

Try this regular expression:

^(?P<first>[a-z]+)_(?P<middle>[a-z]+(?:_[a-z]+)*?)(?:_(?P<last>[a-z]+))?$

Here’s a test case:

import re

strings = ['abc_def', 'abc_def_xyz', 'abc_def_ghi_jkl_xyz']
pattern = '^(?P<first>[a-z]+)_(?P<middle>[a-z]+(?:_[a-z]+)*?)(?:_(?P<last>[a-z]+))?$'
for string in strings:
    m = re.match(pattern, string)
    # groupdict() maps each group name to its captured text (None if unmatched)
    print(m.groupdict())

The output (with Python 3.7+, where dictionaries keep insertion order) is:

{'first': 'abc', 'middle': 'def', 'last': None}
{'first': 'abc', 'middle': 'def', 'last': 'xyz'}
{'first': 'abc', 'middle': 'def_ghi_jkl', 'last': 'xyz'}
Gumbo
I should have added that my strings are actually part of bigger strings, so I can't really use the $ anchor. I'm updating the main question right now.
Charles
@Charles: Well, you could use `/` and `(?:/|$)` instead of `^` and `$`.
Gumbo
A: 

No need to be that complicated.

>>> s = "abc_def_ghi_jkl_xyz"
>>> splitted = s.split("_")
>>> first = splitted[0]
>>> last = splitted[-1]
>>> middle = '_'.join(splitted[1:-1])
>>> print(middle)
def_ghi_jkl
ghostdog74
My problem is a small part of a bigger, heavily regex'ed problem. Thanks for your answer anyway.
Charles
A: 

Thanks for your help everyone ! The two keys to my problem were :

  • adding an anchor at the end of my pattern
  • making the middle group non-greedy

So :

/start/of/the/path/(?P<a>[a-z]+)_(?P<b>\w+?)(_(?P<c>[a-z]+))?(/|$)

That way all the following strings are matched :

/jobs/ads/abc_J123/previs/m_name
/jobs/ads/abc_J123/previs/m_name/
/jobs/ads/abc_J123/previs/m_name/some_stuff
/jobs/ads/abc_J123/previs/m_name/some_stuff/
/jobs/ads/abc_J123/previs/m_name/some_stuff/other_stuff
/jobs/ads/abc_J123/previs/m_name/some_stuff/other_stuff/
/jobs/ads/abc_J123/previs/m_name_stage
/jobs/ads/abc_J123/previs/m_name_stage/
/jobs/ads/abc_J123/previs/m_name_stage/some_stuff
/jobs/ads/abc_J123/previs/m_name_stage/some_stuff/
/jobs/ads/abc_J123/previs/m_name_stage/some_stuff/other_stuff
/jobs/ads/abc_J123/previs/m_name_stage/some_stuff/other_stuff/
/jobs/ads/abc_J123/previs/m_long_name_stage
/jobs/ads/abc_J123/previs/m_long_name_stage/
/jobs/ads/abc_J123/previs/m_long_name_stage/some_stuff
/jobs/ads/abc_J123/previs/m_long_name_stage/some_stuff/
/jobs/ads/abc_J123/previs/m_long_name_stage/some_stuff/other_stuff
/jobs/ads/abc_J123/previs/m_long_name_stage/some_stuff/other_stuff/
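
A quick way to check the pattern against a few of these paths in Python. This is a sketch under one assumption: the placeholder prefix /start/of/the/path/ is replaced here by /jobs/ads/abc_J123/previs/, the directory shared by the sample paths. PREFIX, samples and results are illustrative names.

```python
import re

# Assumption: the literal prefix is the directory shared by the sample paths.
PREFIX = "/jobs/ads/abc_J123/previs/"
pattern = re.compile(
    re.escape(PREFIX) + r"(?P<a>[a-z]+)_(?P<b>\w+?)(_(?P<c>[a-z]+))?(/|$)"
)

samples = [
    "/jobs/ads/abc_J123/previs/m_name",
    "/jobs/ads/abc_J123/previs/m_name_stage/some_stuff/",
    "/jobs/ads/abc_J123/previs/m_long_name_stage/some_stuff/other_stuff",
]

# (/|$) plays the role of the end anchor: the element must be followed
# by a slash or by the end of the string.
results = [pattern.match(p).groupdict() for p in samples]
for path, groups in zip(samples, results):
    print(path, groups)
```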

Thank you very much for your help !

Charles