I have this regex:

(?<!Sub ).*\(.*\)

And I'd like it to match this:

MsgBox ("The total run time to fix AREA and TD fields is: " & TimeElapsed & " minutes.")

But not this:

Sub ChangeAreaTD()

But somehow I still match the one that starts with Sub... does anyone have any idea why? I thought I'd be excluding "Sub " by doing

(?<!Sub )

Any help is appreciated!

Thanks.

+3  A: 

Do this:

^MsgBox .*\(.*\)

The problem is that a negative lookbehind does not guarantee the beginning of a string. It will match anywhere.

However, adding a ^ character at the beginning of the regex does guarantee the beginning of the string. Then, change Sub to MsgBox so that it only matches strings that begin with MsgBox.

SimpleCoder
The look-behind is meaningless in this case as there cannot be anything before the start of a string.
Gumbo
Ah thank you I have fixed it
SimpleCoder
This would help me if I was just trying to grab MsgBox calls, but one thing I omitted from the original question was that I am in fact trying to grab all method calls BUT no method declarations. So what I need is: <any characters that aren't "Sub ">(<any character>)
Tiago Espinha
+1  A: 

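The problem is where the match is allowed to start. (?<!Sub ).*\(.*\) is not anchored, so the engine may start at any position whose preceding 4 characters are not "Sub ". That is true right at the start of Sub ChangeAreaTD() (nothing precedes it), and also at hangeAreaTD(), where the previous 4 characters are "ub C". As the lookbehind is negated, each of these counts as a match!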

Just adding a ^ to the beginning of your regex will not help you, as a look-behind is a zero-width assertion. ^(?<!MsgBox ) would be looking for a line that did not follow a line ending in "MsgBox ". What you need to do instead is use a look-ahead: ^(?!Sub )(.*\(.*\)). This can be interpreted as "Starting at the beginning of the string, make sure it does not start with Sub. Then, capture everything in the string if it looks like a method call".
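As a rough check of the look-ahead version, here is a minimal sketch using Python's re module (an assumption; the question does not say which engine is in use). The helper name match_call and the shortened MsgBox line are made up for illustration.

```python
import re

# Pattern from the answer: anchor at the start and reject lines starting with "Sub ".
CALL_NOT_SUB = re.compile(r"^(?!Sub )(.*\(.*\))")

def match_call(line):
    """Return the captured text if the line looks like a call, else None."""
    m = CALL_NOT_SUB.match(line)
    return m.group(1) if m else None

declaration = "Sub ChangeAreaTD()"
call = 'MsgBox ("The total run time is: " & elapsed & " minutes.")'

print(match_call(declaration))  # None
print(match_call(call))         # the whole MsgBox line
```

Because the look-ahead is checked only at the start of the line, there is no later position at which the match can begin and slip past the "Sub " prefix.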

A good explanation of how regex engines parse lookaround can be found here.

DonaldRay
A: 

If you want to match just the function calls, not the declarations, then the part before the bracket should not match any characters at all, but rather identifier characters followed by optional spaces. Thus

(?<!Sub )[a-zA-Z][a-zA-Z0-9_]* *\(.*\)

The identifier may need more tokens depending on the language you're matching.
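One caveat worth checking with the identifier-based pattern: an unanchored search can still begin in the middle of a name. The sketch below uses Python's re module (an assumption about the engine) and compares the pattern as written with a variant that adds a \b word boundary before the identifier. The \b variant is my own suggestion, not part of the answer.

```python
import re

# Pattern as given above.
IDENT_CALL = re.compile(r"(?<!Sub )[a-zA-Z][a-zA-Z0-9_]* *\(.*\)")
# Hypothetical variant: \b forces the identifier to start at a word boundary.
IDENT_CALL_WB = re.compile(r"(?<!Sub )\b[a-zA-Z][a-zA-Z0-9_]* *\(.*\)")

decl = "Sub ChangeAreaTD()"
call = 'MsgBox ("done")'

print(IDENT_CALL.search(decl).group(0))     # hangeAreaTD()
print(IDENT_CALL_WB.search(decl))           # None
print(IDENT_CALL_WB.search(call).group(0))  # MsgBox ("done")
```

Without the boundary, the search skips the position right after "Sub " and starts one character later, where the look-behind sees "ub C" and passes.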

Simeon Pilgrim
+2  A: 

Your regex (?<!Sub ).*\(.*\), taken apart:

(?<!         # negative look-behind
  Sub        #   the string "Sub " must not occur before the current position
)            # end negative look-behind
.*           # anything       ~ matches up to the end of the string!
\(           # a literal "("  ~ causes the regex to backtrack to the last "("
  .*         # anything       ~ matches up to the end of the string again!
\)           # a literal ")"  ~ causes the regex to backtrack to the last ")"

So, with your test string:

Sub ChangeAreaTD()
  • The look-behind is fulfilled immediately (right at position 0).
  • The .* travels to the end of the string after that.
  • Because of this .*, the look-behind never really makes a difference.
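The points above can be reproduced with a short sketch in Python's re module (assumed here only for illustration; the asker's engine is not stated). It shows that the match begins at position 0 and covers the whole declaration.

```python
import re

# The original regex from the question, unchanged.
ORIGINAL = re.compile(r"(?<!Sub ).*\(.*\)")

text = "Sub ChangeAreaTD()"
m = ORIGINAL.search(text)

# Nothing precedes position 0, so the negative look-behind succeeds there
# and the first .* is free to swallow "Sub " itself.
print(m.start())   # 0
print(m.group(0))  # Sub ChangeAreaTD()
```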

You were probably thinking of

(?<!Sub .*)\(.*\)

but it is very unlikely that variable-length look-behind is supported by your regex engine.
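Support for this varies by engine. As one data point, Python's standard re module rejects a variable-length look-behind when the pattern is compiled. The helper name compiles below is made up for this sketch.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(r"(?<!Sub )\(.*\)"))    # True: fixed-width look-behind
print(compiles(r"(?<!Sub .*)\(.*\)"))  # False: variable-width look-behind
```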

So what I would do is this (since variable-length look-ahead is widely supported):

^(?!.*\bSub\b)[^(]+\(([^)]+)\)

which translates as:

^           # At the start of the string,
(?!         # do a negative look-ahead:
  .*        #   anything
  \b        #   a word boundary
  Sub       #   the string "Sub"
  \b        #   another word boundary
)           # end negative look-ahead. If not found,
[^(]+       # match anything except an opening paren  ~ to prevent backtracking
\(          # match a literal "("
(           # match group 1
  [^)]+     #   match anything up to a closing paren  ~ to prevent backtracking
)           # end match group 1
\)          # match a literal ")".

and then go for the contents of match group 1.
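A minimal sketch of that in Python's re module (assumed for illustration; call_arguments is a made-up helper name). It also shows one side effect of [^)]+, which needs at least one character between the parens.

```python
import re

# The look-ahead based pattern from above, unchanged.
CALL_ARGS = re.compile(r"^(?!.*\bSub\b)[^(]+\(([^)]+)\)")

def call_arguments(line):
    """Return the contents of match group 1, or None if the line does not match."""
    m = CALL_ARGS.match(line)
    return m.group(1) if m else None

print(call_arguments("Sub ChangeAreaTD()"))  # None: contains the word Sub
print(call_arguments('MsgBox ("done")'))     # "done"
print(call_arguments("DoWork()"))            # None: empty argument list
```

If calls with empty parentheses should match as well, changing [^)]+ to [^)]* inside the group would be one way to allow that.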

However, regex generally is hideously ill-suited for parsing code. This is true for HTML the same way it is true for VB code. You will get wrong matches even with the improved regex. For example here, because of the nested parens:

MsgBox ("The total run time to fix all fields (AREA, TD)  is: ...")
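To make the failure concrete, here is a small sketch in Python's re module (assumed for illustration) that runs the improved regex on that line. Group 1 stops at the first closing paren, which is the one inside the string literal.

```python
import re

CALL_ARGS = re.compile(r"^(?!.*\bSub\b)[^(]+\(([^)]+)\)")

line = 'MsgBox ("The total run time to fix all fields (AREA, TD)  is: ...")'
m = CALL_ARGS.match(line)

# [^)]+ cannot cross the ")" after TD, so the captured argument is cut short.
print(m.group(1))  # "The total run time to fix all fields (AREA, TD
```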
Tomalak
Oh I see why it fails now. Would you recommend another way for parsing code then? I went with regex because one of the things I noticed about VBA is that each definition is always only one line long. Unlike C#, where you can for instance have if (<condition>) and then the bracket on the next line, VBA always keeps everything on one line.
Tiago Espinha
@Tiago: This is not entirely true. You can insert an underscore at the end of a line and this counts as a line continuation. You can also separate two logical lines of code with a colon (`Dim x: x = 10`). For parsing languages, always use a parser. Not sure if there is a parser out there that you can use to parse VB code, but it should not be tooooo hard to write one. That's material for a separate question, though.
Tomalak
Good grief.. I didn't know about the underscore, or rather, I knew about it from VB .NET, I just didn't think it would also apply to VBA. I just checked and it does, as does the colon. I think I'll just end up replacing colons with '\r\n' and removing all instances of '_\r\n', as the underscore has to be the last character in the line for it to work. I've tried searching for a parser for VBA and there doesn't seem to be one, plus I have about three months to come up with something that creates an abstract syntax tree out of VBA code and then something else that generates C# out of that AST.
Tiago Espinha
In any case, thanks for your tips on the underscore and colon.
Tiago Espinha
@Tiago Replacing colons or underscores out of context will invalidate the program (think of colons or underscores in strings or comments). You will need a parser to get it right, everything else is bound to break at some point.
Tomalak
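The out-of-context problem is easy to reproduce. The sketch below is in Python, with a made-up helper naive_split_statements, and splits on every colon the way a blind replace would. It handles the simple case but cuts a string literal in half.

```python
def naive_split_statements(line):
    """Split a line on every colon, ignoring strings and comments (deliberately naive)."""
    return [part.strip() for part in line.split(":")]

# Works when the colon really is a statement separator.
print(naive_split_statements("Dim x: x = 10"))
# ['Dim x', 'x = 10']

# Breaks when the colon sits inside a string literal.
print(naive_split_statements('MsgBox "Total: " & x'))
# ['MsgBox "Total', '" & x']
```

Telling the two cases apart requires tracking whether the current position is inside a string or a comment, which is already the beginning of a tokenizer.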