Hi, I wonder if it's possible to write a regex for the following data pattern:

'152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'

string = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'

I am using this regular expression (with Python's re module) to extract the names:

re.findall(r'(\d+): (.+), (.+), (.+), (.+).', string, re.M | re.S)

Result:

[('152', 'Ashkenazi A', 'Benlifer A', 'Korenblit J', 'Silberstein SD')]

However, this no longer works when the line contains fewer or more than four names, because the regex expects exactly four of them:

(.+), (.+), (.+), (.+).

I can't find a way to generalize this pattern.

+1  A: 

This should do the trick if you only want the stuff after the numbers:

re.findall(r'\d+: (.+)(?:, .+)*\.', input, re.M | re.S)

And if you want everything:

re.findall(r'(\d+): (.+)(?:, .+)*\.', input, re.M | re.S)

And if you want to get them separated out into a list of matches, a nested regex will do it:

re.findall(r'[^,]+,|[^,]+$', re.findall(r'\d+: (.+)(?:, .+)*\.', input, re.M | re.S)[0],re.M|re.S)
JGB146
You should test this: it doesn't work.
Ned Batchelder
Odd. The same regex is working for me. That said, after looking back at his input the final `.` should probably be a literal `\.`
JGB146
Ah, with another look I see what you mean (I think). I've edited so that the extraneous other junk isn't included (unless he wants it).
JGB146
It works, but I still need to split the names with a ".split(',')".
Gianluca Bargelli
Another option added: this one returns the individual matches.
JGB146
My result is ['Ashkenazi A,', ' Benlifer A,', ' Korenblit J,', ''], one name is missing from the list.
Gianluca Bargelli
Ok, I'm off to dinner. I'll perfect it to match exactly when I get home. I think I can get rid of the blank match too.
JGB146
90% sure that it will work if you change $ to [^,]+$
JGB146
There we go. As I thought, the latest edit is returning everything exactly as desired.
JGB146
It works correctly, thanks!
Gianluca Bargelli
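As a minimal sketch of the two-step idea discussed in this thread: the nested pattern above appears to return each name with its comma and leading space still attached, so the variant below changes the inner pattern to skip them. The variable names `s`, `body` and `names` are illustrative only, and the input is assumed to be a single well-formed line.

```python
import re

s = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'

# Step 1: capture everything between "NNN: " and the final period.
body = re.findall(r'\d+: (.+)\.', s, re.S)[0]

# Step 2: each name starts at a character that is neither a comma nor
# whitespace and runs up to (but not including) the next comma.
names = re.findall(r'[^,\s][^,]*', body)
print(names)
```

This still uses two expressions; it only removes the need for a later strip or split.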
+6  A: 

A regular expression probably isn't the best way to solve this. You could use split():

>>> s = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'
>>> s.split(": ")
['152', 'Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.']
>>> s.split(": ")[1].split(", ")
['Ashkenazi A', 'Benlifer A', 'Korenblit J', 'Silberstein SD.']
Greg Hewgill
I am considering marking this one as the solution to my problem, but I'll wait to see whether someone else can provide a pure regex solution to my question. Just curious :)
Gianluca Bargelli
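The split() approach above leaves the trailing period on the last name and keeps the number as a string. A possible way to wrap it up, assuming every line is well-formed ("number, colon, space, comma-separated names, period"), is sketched below. The helper name `parse_line` is hypothetical.

```python
def parse_line(line):
    """Split 'NNN: name, name, ... name.' into (number, [names])."""
    number, _, rest = line.partition(': ')
    # Drop surrounding whitespace and the period that closes the list.
    rest = rest.strip().rstrip('.')
    # Split on commas and ignore empty fragments.
    names = [name.strip() for name in rest.split(',') if name.strip()]
    return int(number), names


print(parse_line('152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'))
print(parse_line('7: Hattingh CJR.'))
```

A line without a leading number would make `int()` raise a `ValueError`, which may or may not be the behaviour you want.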
A: 

If you mean that there may be more (or fewer) names, you could try something like this: (\d+): (.+)*? The asterisk (*) means zero or more occurrences of (.+).

Ventus
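One caveat with repeating a group, shown here as a small sketch: in Python's re module a capturing group that is repeated keeps only the text of its last repetition, so a quantifier alone does not hand back every name. The pattern below is an illustrative one written for this example, not the one suggested above.

```python
import re

s = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'

# Group 2 sits inside a repeated non-capturing group, so it is
# overwritten on every repetition and ends up holding the last name.
m = re.match(r'(\d+): (?:([^,.]+)(?:, |\.))*', s)
print(m.group(1))
print(m.group(2))
```

This is why the other answers either split the string or run a second findall over the captured text.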
A: 

I can get close, but further processing may be necessary. It is probably better to do manual string splitting, especially if the data is reliably well-formatted.

Code

import re
string1 = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'
string2 = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD, Hattingh CJR.'
for i in [string1, string2]:
    print(re.findall(r'(\d+):|(?:[.,\s?])?(.*?)(?:[.,])', i))

Output

[('152', ''), ('', 'Ashkenazi A'), ('', 'Benlifer A'), ('', 'Korenblit J'), ('', 'Silberstein SD')]
[('152', ''), ('', 'Ashkenazi A'), ('', 'Benlifer A'), ('', 'Korenblit J'), ('', 'Silberstein SD'), ('', 'Hattingh CJR')]
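The further processing mentioned above could look roughly like this. It is only a sketch: it assumes exactly one number per line and relies on each tuple having one empty slot, as in the output shown. The names `matches`, `number` and `names` are illustrative.

```python
import re

s = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'
matches = re.findall(r'(\d+):|(?:[.,\s?])?(.*?)(?:[.,])', s)

# Each tuple is (number, name) with one of the two slots empty.
number = next(num for num, _ in matches if num)
names = [name for _, name in matches if name]
print(number, names)
```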

Edit: using 2 expressions

If you are willing to use two regular expressions, it can be done fairly painlessly:

import re
string1 = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'
string2 = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD, Hattingh CJR.'
for i in [string1, string2]:
    print(re.findall(r'^(\d+):', i))
    print(re.findall(r'(?:[:,] )(\S+ [A-Z]+)(?=[\.,])', i))

produces

['152']
['Ashkenazi A', 'Benlifer A', 'Korenblit J', 'Silberstein SD']
['152']
['Ashkenazi A', 'Benlifer A', 'Korenblit J', 'Silberstein SD', 'Hattingh CJR']
cjrh
Well, you got close indeed :) I can barely read that regular expression!
Gianluca Bargelli
Nice solution! :) It is similar to @JGB146's, as it requires more than one regex. Thanks!
Gianluca Bargelli
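For the "pure regex" curiosity raised in this thread, one possible single-pattern sketch uses an alternation with named groups and walks the matches with `re.finditer`. The group names `number` and `name` and the helper `parse` are made up for this example, and the pattern assumes names contain no commas, periods or colons.

```python
import re

# Either the leading number (only at the start of the line), or a name:
# a run that starts at a non-separator character and stops before the
# next comma or period.
PATTERN = re.compile(r'^(?P<number>\d+):|(?P<name>[^,.:\s][^,.]*)')


def parse(line):
    number, names = None, []
    for m in PATTERN.finditer(line):
        if m.group('number') is not None:
            number = m.group('number')
        else:
            names.append(m.group('name'))
    return number, names


print(parse('152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'))
print(parse('152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD, Hattingh CJR.'))
```

It is one expression, though the loop still has to sort matches into the number and the names, so the split() answer remains the simpler option for well-formatted data.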