views:

515

answers:

9

Hi, I got a string of such format:

"Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"

so basicly it's list of actor's names (optionally followed by their role in parenthesis). The role itself can contain comma (actor's name can not, I strongly hope so).

My goal is to split this string into a list of pairs - (actor name, actor role).

One obvious solution would be to go through each character, check for occurances of '(', ')' and ',' and split it whenever a comma outside occures. But this seems a bit heavy...

I was thinking about spliting it using a regexp: first split the string by parenthesis:

import re
x = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
s = re.split(r'[()]', x) 
# ['Wilbur Smith ', 'Billy, son of John', ', Eddie Murphy ', 'John', ', Elvis Presley, Jane Doe ', 'Jane Doe', '']

The odd elements here are actor names, even are the roles. Then I could split the names by commas and somehow extract the name-role pairs. But this seems even worse then my 1st approach.

Are there any easier / nicer ways to do this, either with a single regexp or a nice piece of code?

+4  A: 

I think the best way to approach this would be to use python's built-in csv module.

Because the csv module only allows a one character quotechar, you would need to do a replace on your inputs to convert () to something like | or ". Then make sure you are using an appropriate dialect and off you go.

Wogan
A: 

I certainly agree with @Wogan above, that using the CSV moudle is a good approach. Having said that if you still want to try a regex solution give this a try, but you will have to adapt it to Python dialect

string.split(/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/)

HTH

Anand
+6  A: 

One way to do it is to use findall with a regex that greedily matches things that can go between separators. eg:

>>> s = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
>>> r = re.compile(r'(?:[^,(]|\([^)]*\))+')
>>> r.findall(s)
['Wilbur Smith (Billy, son of John)', ' Eddie Murphy (John)', ' Elvis Presley', ' Jane Doe (Jane Doe)']

The regex above matches one or more:

  • non-comma, non-open-paren characters
  • strings that start with an open paren, contain 0 or more non-close-parens, and then a close paren

One quirk about this approach is that adjacent separators are treated as a single separator. That is, you won't see an empty string. That may be a bug or a feature depending on your use-case.

Also note that regexes are not suitable for cases where nesting is a possibility. So for example, this would split incorrectly:

"Wilbur Smith (son of John (Johnny, son of James), aka Billy), Eddie Murphy (John)"

If you need to deal with nesting your best bet would be to partition the string into parens, commas, and everthing else (essentially tokenizing it -- this part could still be done with regexes) and then walk through those tokens reassembling the fields, keeping track of your nesting level as you go (this keeping track of the nesting level is what regexes are incapable of doing on their own).

Laurence Gonsalves
You can split into fields right away by matching records instead of separators: [(m.group("name"), m.group("role")) for m in re.findall("(?P<name>.+?)( ?\((?P<role>[^\)]+)\)(,\s*|$))", x)]
Alex Lebedev
+1 for the token solution if he needs it. Pop on and off the stack as you walk up and down... a classic way to do it.
Paul McMillan
every time I see the regular expression that's useful, like this one, I start to wonder - should they be human-readable? Or it's just me... who doesn't see it from first glance?
kender
What's with the drive-by downvotes? This answer has been downvoted twice with no explanation. If you're going to downvote, say why.
Laurence Gonsalves
A: 

My answer will not use regex.

I think simple character scanner with state "in_actor_name" should work. Remember then state "in_actor_name" is terminated either by ')' or by comma in this state.

My try:

s = 'Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)'

in_actor_name = 1
role = ''
name = ''
for c in s:
    if c == ')' or (c == ',' and in_actor_name):
     in_actor_name = 1
     name = name.strip()
     if name:
      print "%s: %s" % (name, role)
     name = ''
     role = ''
    elif c == '(':
     in_actor_name = 0
    else:
     if in_actor_name:
      name += c
     else:
      role += c
if name:
    print "%s: %s" % (name, role)

Output:

Wilbur Smith: Billy, son of John
Eddie Murphy: John
Elvis Presley: 
Jane Doe: Jane Doe
Michał Niklas
A: 

split by ")"

>>> s="Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
>>> s.split(")")
['Wilbur Smith (Billy, son of John', ', Eddie Murphy (John', ', Elvis Presley, Jane Doe (Jane Doe', '']
>>> for i in s.split(")"):
...   print i.split("(")
...
['Wilbur Smith ', 'Billy, son of John']
[', Eddie Murphy ', 'John']
[', Elvis Presley, Jane Doe ', 'Jane Doe']
['']

you can do further checking to get those names that doesn't come with ().

ghostdog74
+2  A: 
s = re.split(r',\s*(?=[^)]*(?:\(|$))', x)

The lookahead matches everything up to the next open-parenthesis or to the end of the string, iff there's no close-parenthesis in between. That ensures that the comma is not inside a set of parentheses.

Alan Moore
+2  A: 

An attempt on human-readable regex:

import re

regex = re.compile(r"""
    # name starts and ends on word boundary
    # no '(' or commas in the name
    (?P<name>\b[^(,]+\b)
    \s*
    # everything inside parentheses is a role
    (?:\(
      (?P<role>[^)]+)
    \))? # role is optional
    """, re.VERBOSE)

s = ("Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley,"
     "Jane Doe (Jane Doe)")
print re.findall(regex, s)

Output:

[('Wilbur Smith', 'Billy, son of John'), ('Eddie Murphy', 'John'), 
 ('Elvis Presley', ''), ('Jane Doe', 'Jane Doe')]
J.F. Sebastian
Human readable regex - isn't that an oxymoron?
Amarghosh
A: 

Here's a general technique I've used in the past for such cases:

Use the sub function of the re module with a function as replacement argument. The function keeps track of opening and closing parens, brackets and braces, as well as single and double quotes, and performs a replacement only outside of such bracketed and quoted substrings. You can then replace the non-bracketed/quoted commas with another character which you're sure doesn't appear in the string (I use the ASCII/Unicode group-separator: chr(29) code), then do a simple string.split on that character. Here's the code:

import re
def srchrepl(srch, repl, string):
    """Replace non-bracketed/quoted occurrences of srch with repl in string"""

    resrchrepl = re.compile(r"""(?P<lbrkt>[([{])|(?P<quote>['"])|(?P<sep>["""
                            + srch + """])|(?P<rbrkt>[)\]}])""")
    return resrchrepl.sub(_subfact(repl), string)

def _subfact(repl):
    """Replacement function factory for regex sub method in srchrepl."""
    level = 0
    qtflags = 0
    def subf(mo):
        nonlocal level, qtflags
        sepfound = mo.group('sep')
        if  sepfound:
            if level == 0 and qtflags == 0:
                return repl
            else:
                return mo.group(0)
        elif mo.group('lbrkt'):
            level += 1
            return mo.group(0)
        elif mo.group('quote') == "'":
            qtflags ^= 1            # toggle bit 1
            return "'"
        elif mo.group('quote') == '"':
            qtflags ^= 2            # toggle bit 2
            return '"'
        elif mo.group('rbrkt'):
            level -= 1
            return mo.group(0)
    return subf

If you don't have nonlocal in your version of Python, just change it to global and define level and qtflags at the module level.

Here's how it's used:

>>> GRPSEP = chr(29)
>>> string = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
>>> lst = srchrepl(',', GRPSEP, string).split(GRPSEP)
>>> lst
['Wilbur Smith (Billy, son of John)', ' Eddie Murphy (John)', ' Elvis Presley', ' Jane Doe (Jane Doe)']
Don O'Donnell
A: 

None of the answers above are correct if there are any errors or noise in your data.

It's easy to come up with a good solution if you know the data is right every time. But what happens if there are formatting errors? What do you want to have happen?

Suppose there are nesting parentheses? Suppose there are unmatched parentheses? Suppose the string ends with or begins with a comma, or has two in a row?

All of the above solutions will produce more or less garbage and not report it to you.

Were it up to me, I'd start with a pretty strict restriction on what "correct" data was - no nesting parentheses, no unmatched parentheses, and no empty segments before, between or after comments - validate as I went, and then raise an exception if I wasn't able to validate.

Tom Swirly
We have to assume the question contains all the information we need to answer it. Thus we assume the input has already been validated and that the format has been described completely (e.g., no nested parentheses). If any of those assumptions turn out to be wrong, one hopes the OP will learn to ask better questions in the future. ;)
Alan Moore