I have text that looks like:

My name is (Richard) and I cannot do [whatever (Jack) can't do] and (Robert) is the same way [unlike (Betty)] thanks (Jill)

The goal is to use a regular expression to find all parenthesized names that occur anywhere in the text but NOT in between any brackets.

So in the text above, the result I am looking for is:

  • Richard
  • Robert
  • Jill
A: 

It's not really the best job for a single regexp. Have you considered, for example, making a copy of the string and then deleting everything in between the square brackets instead? It would then be fairly straightforward to extract things from inside the parentheses. Alternatively, you could write a very basic parser that tokenises the line (into normal text, square bracket text, and parenthesised text, I imagine) and then parses the tree that produces; it'd be more work initially but would make life much simpler if you later want to make the behaviour any more complicated.

Having said that, /(?:(?:^|\])[^\[]*)\((.*?)\)/ does the trick for your test case (but it will almost certainly have some weird behaviour if your [ and ] aren't matched properly, and I'm not convinced it's that efficient).

A quick (PHP) test case:

preg_match_all('/(?:(?:^|\])[^\[]*)\((.*?)\)/', "My name is ... (Jill)", $m);

print(implode(", ", $m[1]));

Outputs:

Richard, Robert, Jill
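
The "very basic parser" alternative mentioned above could look something like the following Python sketch. It is only an illustration under stated assumptions: the function name `names_outside_brackets` is made up, and the choice to track bracket depth with a counter (which also copes with nested square brackets) is one possible design, not something the question requires.

```python
def names_outside_brackets(text):
    """Collect parenthesized names that are not inside square brackets."""
    names = []
    depth = 0       # current nesting level of square brackets
    start = None    # index just past the "(" we are inside, if any
    for i, ch in enumerate(text):
        if ch == "[":
            depth += 1
            start = None                # abandon a "(" left open before "["
        elif ch == "]":
            depth = max(depth - 1, 0)   # tolerate a stray "]"
        elif ch == "(" and depth == 0:
            start = i + 1
        elif ch == ")" and start is not None:
            names.append(text[start:i])
            start = None
    return names
```

Because it walks the string once and keeps explicit state, extending it later (other delimiters, escapes, error reporting) should be easier than extending a single pattern.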
Chris Smith
People always forget about negative lookahead and look behind...
Paulo Santos
@Paulo Santos: I don't know if it's that people "forget" about them, or if it's just that most people have a hard time getting negative assertions to work the way they expect, and so would rather just avoid using them.
Laurence Gonsalves
@Paulo: some of us just *wish* we could forget about them. :P Lookbehinds in particular are both much trickier and much less useful than many people expect them to be.
Alan Moore
+2  A: 

You can do it in two steps:

Step 1: match all bracket contents using:

\[[^\]]*\]

and replace it with ''

Step 2: match all the remaining parenthesized names (globally) using:

\([^)]*\)
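
A minimal Python sketch of these two steps, assuming balanced, non-nested square brackets. The function name `names_two_step` is hypothetical, and a capture group has been added to the step 2 pattern so that the names come back without their parentheses.

```python
import re

def names_two_step(text):
    # Step 1: delete every [...] chunk (assumes balanced, non-nested brackets).
    stripped = re.sub(r"\[[^\]]*\]", "", text)
    # Step 2: capture the text between each remaining pair of parentheses.
    return re.findall(r"\(([^)]*)\)", stripped)
```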
codaddict
Yes, you can, but that wouldn't be that much fun would it?
stereofrog
A: 

If you are using .NET you can do something like:

"(?<!\[.*?)(?<name>\(\w+\))(?>!.*\])"
Paulo Santos
Won't this fail to pick Robert out of the example? The lookbehind will find the `[` that contains Jack, and the lookahead will find Betty's `]`. The .s would need to be replaced with `[^\]]` and `[^\[]` respectively, I guess. Some regex engines don't support non-fixed-width negative lookbehinds, either.
Chris Smith
It's a **negative** lookahead and lookbehind.
Paulo Santos
I'm aware of this. Thinking about it more, I think this will fail to pick any names at all from the input - have you actually tried it? ;) For all names except Richard, the negative lookbehind will cause the match to fail (as `\[.*?` can trivially be matched ending at the start of all the other names), and for all except Jill the negative lookahead will cause it to fail for similar reasons.
Chris Smith
@Chris is right: it doesn't work as-is, and after making the changes he suggested it will only work in .NET or JGSoft (EditPad Pro, PowerGrep, etc.), because they're the only flavors that support unbounded lookbehind. Also, you've got the negative-lookahead syntax wrong. :-/
Alan Moore
A: 

A JavaScript example:

t = "My name is (Richard) and (Tom) I cannot do [whatever (Jack) can't do] and (Robert) is the same way [unlike (Betty)] thanks (Jill)"

m = t.match(/[^()]+(?=\)[^[\]]*(\[|$))/g)
console.log(m)

An English translation for those interested:

  /        we want to find
  [^()]+   some non-parenthesis
  (?=      followed by
    \)        a parenthesis
    [^[\]]*   and some (or none) non-brackets
    (           and then
      \[|$       either a bracket or end-of-string
    )        
   )
   /        that is it.
   g        please find all occurrences.
stereofrog
A: 
>>> s="My name is (Richard) and I cannot do [whatever (Jack) can't do (Jill) can] and (Robert) is the same way [unlike (Betty)] thanks (Jill)"
>>> for item in s.split("]"):
...     st = item.split("[")[0]
...     if ")" in st:
...         for i in  st.split(")"):
...             if "(" in i:
...                 print(i.split("(")[-1])
...
Richard
Robert
Jill
ghostdog74
+1  A: 

You didn't say what language you're using, so here's some Python:

>>> import re
>>> REGEX = re.compile(r'(?:[^[(]+|\(([^)]*)\)|\[[^]]*])')
>>> s="""My name is (Richard) and I cannot do [whatever (Jack) can't do] and (Robert) is the same way [unlike (Betty)] thanks (Jill)"""
>>> list(filter(None, REGEX.findall(s)))

The output is:

['Richard', 'Robert', 'Jill']

One caveat is that this does not work with arbitrary nesting. The only nesting it's really designed to work with is one level of parens in square brackets as mentioned in the question. Arbitrary nesting can't be done with just regular expressions. (This is a consequence of the pumping lemma for regular languages.)

The regex looks for chunks of text without brackets or parens, chunks of text enclosed in parens, and chunks of text enclosed in brackets. Only text in parens (not in square brackets) is captured. Python's findall finds all matches of the regex in sequence. In some languages you may need to write a loop to repeatedly match. For non-paren matches, findall inserts an empty string in the result list, so the call to filter removes those.
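
For flavors without a findall equivalent, the "loop to repeatedly match" could be sketched as below, here still in Python using `re.finditer`. The names `TOKEN` and `names_by_scanning` are made up for illustration, and the pattern is a more heavily escaped spelling of the same three alternatives rather than a different technique.

```python
import re

# Three alternatives, tried in order at each position:
#   1. a run of text containing no "[" or "("
#   2. a parenthesized chunk (its contents are captured in group 1)
#   3. a square-bracketed chunk (consumed whole, nothing captured)
TOKEN = re.compile(r"[^\[(]+|\(([^)]*)\)|\[[^\]]*\]")

def names_by_scanning(text):
    names = []
    for match in TOKEN.finditer(text):
        # group(1) is None or empty unless a non-empty parenthesized
        # chunk matched, which mirrors the filter(None, ...) call above.
        if match.group(1):
            names.append(match.group(1))
    return names
```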

Laurence Gonsalves
A: 

So you want the regex to match the name, but not the enclosing parentheses? This should do it:

[^()]+(?=\)[^\[\]]*(?:\[[^\[\]]*\][^\[\]]*)*$)

As with the other answers, I'm making certain assumptions about your target string, like expecting parentheses and square brackets to be correctly balanced and not nested.

I say it should work because, although I've tested it, I don't know what language/tool you're using to do the regex matching with. We could provide higher-quality answers if we had that info; all regex flavors are not created equal.
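
For what it's worth, here is a quick check of that pattern with Python's `re` module. Python is only an assumption about the flavor, and the variable names are illustrative. The lookahead accepts a name only if everything from its closing parenthesis to the end of the string consists of plain text and complete [...] groups.

```python
import re

OUTSIDE = re.compile(r"[^()]+(?=\)[^\[\]]*(?:\[[^\[\]]*\][^\[\]]*)*$)")

sample = ("My name is (Richard) and I cannot do [whatever (Jack) can't do] "
          "and (Robert) is the same way [unlike (Betty)] thanks (Jill)")

names = OUTSIDE.findall(sample)
print(names)  # ['Richard', 'Robert', 'Jill']
```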

Alan Moore