Hi,

I want to parse VB6 code with regular expressions. However, being new to regex, I have run into a few problems with the patterns to use. Currently, I have trouble recognizing these constructs:

' Subs
' Sub Test
Private Sub Test(ByVal x as Integer)
    'Private Sub Test(ByVal y as Integer)
    Dim dummy as String
    dummy = "Private Sub Test(ByVal y as Integer)"
End Sub

I basically have two problems: How do I write a regex that matches the Sub definition and includes all the comment (and empty) lines above it? And how can I keep Sub definitions that are commented out or contained in strings from being matched? I also need to support definitions that span multiple lines, like this one:

' Subs
' Sub Test
Private Function Test2( _
   ByVal x as Integer _
) As Long
    'Private Sub Test(ByVal y as Integer)
    Dim dummy as String
    dummy = "Private Sub Test(ByVal y as Integer)"
End Function

Any hint would be greatly appreciated. The solutions I've come up with either don't work across multiple lines or capture more than one Sub definition: because of greedy matching, the match then runs to the end of the last End Sub occurrence.

My try in C# currently looks like this:

(('(?<comment>[\S \t]+[\n\r]+))*((?<accessmodifier>(Private|Public))\s+_?)(?<functiontype>(Sub|Function))\s+_?(?<name>[\S]+)\((?<parameters>[\S \t]*)\)([ \t]+As[ \t]+(?<returntype>\w+))?)|(?<endfunction>End (Sub|Function))

I'm using Multiline, Singleline, IgnoreCase, ExplicitCapture.

Thanks for your help!

+2  A: 

I suspect this will only be possible for the simplest cases. Regexes can't parse recursive structures, and languages such as VB have recursive features. See this CodingHorror blog entry for more info.

Unless you have very simple cases, I think some form of parser is going to be the way forward.

Brian Agnew
+1  A: 

You know, eventually there comes a time when regular expressions just aren't enough. Parsing at this level is one of those times.

Consider instead writing a simple real parser, maybe using recursive descent.

Charlie Martin
+1  A: 

Don't try to write one regex to do this for you (it can't by its very nature). What you need is a parser. Probably the easiest solution is to use a recursive descent parser. I don't use C#, but a quick search turned up Spart.

Chas. Owens
A: 

Given the complexity and intricacy of Visual Basic, you will probably need to parse the code with a tokenizer/parser. You can't rely on regexes for everything ;)

For what it's worth, the VB formal grammar is available here. Have fun!
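As a rough illustration of the tokenizer idea, here is a minimal sketch in Python (not C#, and nowhere near a full VB6 parser). It joins `_` continuation lines into logical lines, blanks out string literals, drops `'` comments, and only then looks for a declaration. The function names (`logical_lines`, `strip_strings_and_comments`, `declarations`) are hypothetical, and the sketch assumes that only `Private`/`Public` routines matter and that `Rem` comments can be ignored.

```python
import re

def logical_lines(source):
    """Join physical lines ending in a ' _' continuation into logical lines."""
    joined, pending = [], ""
    for line in source.splitlines():
        stripped = line.rstrip()
        if stripped.endswith(" _") or stripped == "_":
            pending += stripped[:-1]      # drop the underscore, keep reading
        else:
            joined.append(pending + line)
            pending = ""
    if pending:                           # dangling continuation at end of input
        joined.append(pending)
    return joined

def strip_strings_and_comments(line):
    """Blank out string contents and cut the line at the first ' outside a string."""
    out, in_string = [], False
    for ch in line:
        if ch == '"':
            # VB writes an embedded quote as "", which simply toggles twice here.
            in_string = not in_string
            out.append(ch)
        elif in_string:
            continue                      # discard string contents
        elif ch == "'":
            break                         # rest of the line is a comment
        else:
            out.append(ch)
    return "".join(out)

# Assumption: same simplified declaration shape as in the question.
DECL_RE = re.compile(
    r"^\s*(Private|Public)\s+(Sub|Function)\s+(\w+)\s*\(([^()]*)\)"
    r"(?:\s+As\s+(\w+))?",
    re.IGNORECASE,
)

def declarations(source):
    """Return (kind, name, return_type) for each declaration found."""
    found = []
    for line in logical_lines(source):
        m = DECL_RE.match(strip_strings_and_comments(line))
        if m:
            found.append((m.group(2), m.group(3), m.group(5)))
    return found

SAMPLE = """\
Private Sub Test(ByVal x as Integer)
    'Private Sub Test(ByVal y as Integer)
    dummy = "Private Sub Test(ByVal y as Integer)"
End Sub
Private Function Test2( _
   ByVal x as Integer _
) As Long
End Function
"""

print(declarations(SAMPLE))
```

Because comments and strings are removed before matching, the regex itself no longer has to defend against them, which is the main benefit of a separate scanning pass.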

NicDumZ
+1  A: 

Why are you parsing this code? If you're trying to create your own compiler, you'll need a lot more than regexes. If you're writing an editor with syntax highlighting and type-ahead completion, regexes can do a pretty good job on the first, but not the second.

That said, the biggest problem I see with your regex is that you're not handling line continuations properly. This: \s+_? matches one or more whitespace characters, optionally followed by an underscore. But if there is an underscore it should be followed by a newline, which you aren't matching. That's easy enough to remedy - \s+(_\s+)? - but I'm not sure you need to be that specific. I suspect this: [\s_]+ will do just as well.
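To make the difference between the three fragments concrete, here is a small Python sketch (Python's `\s` behaves like .NET's for these inputs). The sample strings and the names `PATTERNS`, `SAMPLES` and `matches` are made up for illustration.

```python
import re

# Three ways to bridge the gap between "Private" and "Sub".
PATTERNS = {
    "original": r"Private\s+_?Sub",        # underscore allowed, newline after it is not
    "strict":   r"Private\s+(_\s+)?Sub",   # underscore must be followed by whitespace
    "loose":    r"Private[\s_]+Sub",       # any mix of whitespace and underscores
}

SAMPLES = {
    "one line":          "Private Sub Test()",
    "continued":         "Private _\n    Sub Test()",
    "stray underscores": "Private __ Sub Test()",
}

def matches(pattern_name, sample_name):
    """True if the named pattern finds a match in the named sample."""
    return re.search(PATTERNS[pattern_name], SAMPLES[sample_name]) is not None

for p in PATTERNS:
    for s in SAMPLES:
        print(f"{p:8} vs {s:17}: {matches(p, s)}")
```

Only the original fragment fails on the continued declaration. The loose version also accepts the "stray underscores" sample, which is not valid VB; that is the price of the simpler pattern.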

As for avoiding apparent sub/function declarations inside comments and strings, the simplest way would be to match them only at the left margin, with maybe some tabs or spaces for indentation. It's cheating, I know, but it may be good enough for whatever you're doing. I relied heavily on that trick when I was writing a Java file navigation scheme for EditPad Pro. You can't do this kind of thing with regexes without employing lots of gimmicks and simplifying assumptions. Try this regex:

^(?>('(?<comment>.*[\n\r]+))*)
[ \t]*(?<accessmodifier>(Private|Public))
[\s_]+(?<functiontype>(Sub|Function))
[\s_]+(?<name>\S+)
[\s_]*\((?<parameters>[^()]*)\)
([\s_]+As[\s_]+(?<returntype>\w+))?
|
^[ \t]*(?<endfunction>End (Sub|Function))

Of course, you'll need to join the lines back into a single pattern first. It should be compiled with the Multiline, IgnoreCase and ExplicitCapture options, but not Singleline.
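For anyone who wants to experiment with the pattern outside .NET, here is a minimal sketch of the same idea in Python rather than C#. It is an adaptation, not a literal copy, and the differences are assumptions of this sketch: named groups use `(?P<name>...)`, helper groups are written `(?:...)` because Python has no ExplicitCapture option, the atomic group is dropped, and the comment lines are captured as one block (`comments`) because Python keeps only the last value of a repeated group. The names `DECLARATION`, `END_OF_ROUTINE`, `ROUTINE_RE` and `find_routines` are hypothetical.

```python
import re

# The pieces of the pattern, joined by string concatenation.
DECLARATION = (
    r"^(?P<comments>(?:'.*[\n\r]+)*)"             # comment lines directly above
    r"[ \t]*(?P<accessmodifier>Private|Public)"
    r"[\s_]+(?P<functiontype>Sub|Function)"
    r"[\s_]+(?P<name>\S+)"
    r"[\s_]*\((?P<parameters>[^()]*)\)"
    r"(?:[\s_]+As[\s_]+(?P<returntype>\w+))?"
)
END_OF_ROUTINE = r"^[ \t]*(?P<endfunction>End (?:Sub|Function))"

ROUTINE_RE = re.compile(
    DECLARATION + "|" + END_OF_ROUTINE,
    re.MULTILINE | re.IGNORECASE,   # deliberately no re.DOTALL (Singleline)
)

def find_routines(source):
    """Return (kind, name, return_type) per declaration, ("end", text, None) per terminator."""
    found = []
    for m in ROUTINE_RE.finditer(source):
        if m.group("endfunction"):
            found.append(("end", m.group("endfunction"), None))
        else:
            found.append(
                (m.group("functiontype"), m.group("name"), m.group("returntype"))
            )
    return found

# The two snippets from the question.
SAMPLE = """\
' Subs
' Sub Test
Private Sub Test(ByVal x as Integer)
    'Private Sub Test(ByVal y as Integer)
    Dim dummy as String
    dummy = "Private Sub Test(ByVal y as Integer)"
End Sub

' Subs
' Sub Test
Private Function Test2( _
   ByVal x as Integer _
) As Long
    'Private Sub Test(ByVal y as Integer)
    Dim dummy as String
    dummy = "Private Sub Test(ByVal y as Integer)"
End Function
"""

for entry in find_routines(SAMPLE):
    print(entry)
```

On this input the commented-out and quoted declarations are skipped only because they do not start with `Private` or `Public` after the leading indentation. That is the left-margin trick described above, not real comment or string handling, so a string literal continued onto its own line could still fool it.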

Alan Moore