Consider the following script (it's total nonsense in a pseudo-language):

if (Request.hostMatch("asfasfasf.com") && someString.existsIn(new String[] {"brr", "hrr"}))   {
    if (Request.clientIp("10.0.x.x")) {
        somevar = "1";
    }
    somevar = "2";
}
else {
    somevar = "first";
}
string foo = "foo";
// etc. etc.

How would you grab the if-block's parameters and contents from it? The if-block has the format:

if<whitespace>(<parameters>)<whitespace>{<contents>}<anything>

I tried using String.split() with the regex pattern ^if\s*\(|\)\s*\{|\}\s* but this fails miserably. The problem is that ) { also occurs in the inner if-block, and the closing } occurs in many places as well. I don't think either lazy or greedy matching works here.

So... any pointers to what I might need here in order to implement this with regex?

I also need to get the remaining string without the if-block's code (so code starting from else { ...). Using just String.split() seems to make it difficult as there is no information about the length of the parts that were parsed away.

I initially created a loop-based solution (using String.substring() heavily) for this, but it's dull. I would like to have something fancier instead. Should I go with regex, or create a custom, generic function (there are many other cases than just this) that takes the string to parse and the pattern (consider the if<whitespace>(... pattern above)?

Edit: Changed returns to variable assignments as it would not have made sense otherwise.

+1  A: 

A regular expression won't work, because a regular grammar can't match things like "any number of open parentheses followed by the same number of close parentheses". A context-free grammar is needed for that.

Unless you use a context-free grammar parser for Java or a regular expression extension that makes regular expressions no longer regular, your loop-based solution is probably the fanciest solution.
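For illustration, here is a minimal sketch of what such a loop-based scan could look like. The class and method names (IfBlockScanner, findMatching, splitIfBlock) are made up for this example and are not from the question. It assumes the input starts with the if-block and that no parenthesis or brace appears inside a string literal or comment.

```java
import java.util.Arrays;

// Hypothetical helper (not from the thread): splits the first if-block into
// its parameters, its contents, and whatever follows it.
class IfBlockScanner {

    // Returns the index of the delimiter that closes the one at openIdx,
    // or -1 if there is none. Tracking the nesting depth is the part a
    // plain regular expression cannot do.
    static int findMatching(String src, int openIdx, char open, char close) {
        if (openIdx < 0) return -1;
        int depth = 0;
        for (int i = openIdx; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == open) depth++;
            else if (c == close && --depth == 0) return i;
        }
        return -1;
    }

    // Returns {parameters, contents, remainder}.
    static String[] splitIfBlock(String src) {
        int parenOpen = src.indexOf('(');
        int parenClose = findMatching(src, parenOpen, '(', ')');
        // Search for the block's '{' only after the ')' so that braces inside
        // the parameters (e.g. an array literal) are not mistaken for it.
        int braceOpen = src.indexOf('{', parenClose);
        int braceClose = findMatching(src, braceOpen, '{', '}');
        if (parenClose < 0 || braceClose < 0) {
            throw new IllegalArgumentException("not a well-formed if-block");
        }
        return new String[] {
            src.substring(parenOpen + 1, parenClose),
            src.substring(braceOpen + 1, braceClose),
            src.substring(braceClose + 1)
        };
    }

    public static void main(String[] args) {
        String src = "if (a && f(b)) { if (c) { x = 1; } x = 2; } else { x = 3; }";
        System.out.println(Arrays.toString(splitIfBlock(src)));
    }
}
```

The third element is the remaining string (starting at the else), which also covers the "what is left after the if-block" part of the question.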

Trey
Wow, this just got way more technical than I was expecting. I clearly have things to study here. Anyway, I guess this is the resolution then, thanks for the explanation and links!
Tuukka Mustonen
For more information on this field look up formal language and automata theory.
Trey
+2  A: 

You'd be far better off using (or writing) a parser than trying to do this with Regex.

Regex is great for some things, but for complex parsing like this, it sucks. Another example where it sucks that gets asked a lot here is parsing HTML - you can do it to a limited degree, but for anything complex, a DOM parser is a much better solution.

For a [very] simple parser, what you need is a recursive function that searches for braces { and }, recursing down a level each time it comes across an opening brace, and returning back up a level when it finds a closing brace. It then needs to store the string contents between the two braces at each level.
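A rough sketch of that recursive function in Java might look like the following. The names (BraceBlocks, Block, parse) are invented for the example. It treats every brace pair as a block, so an array literal such as new String[] {...} would also show up as a child, and it assumes braces never appear inside string literals or comments.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not from the thread) of a recursive brace matcher.
class BraceBlocks {

    // One nesting level: the raw text between its braces plus the blocks inside it.
    static class Block {
        String text = "";
        final List<Block> children = new ArrayList<>();
    }

    private final String src;
    private int pos = 0;

    private BraceBlocks(String src) { this.src = src; }

    // Parses the whole input; the returned root stands for the top level.
    static Block parse(String src) {
        return new BraceBlocks(src).parseBlock(false);
    }

    // Reads until the '}' that closes the current level, or until the end
    // of the input when called for the top level (nested == false).
    private Block parseBlock(boolean nested) {
        Block block = new Block();
        int start = pos;
        while (pos < src.length()) {
            char c = src.charAt(pos);
            if (c == '{') {
                pos++;                                  // step over '{', go one level down
                block.children.add(parseBlock(true));
            } else if (c == '}') {
                if (!nested) throw new IllegalArgumentException("unexpected '}' at " + pos);
                block.text = src.substring(start, pos);
                pos++;                                  // step over '}', go one level up
                return block;
            } else {
                pos++;
            }
        }
        if (nested) throw new IllegalArgumentException("missing '}'");
        block.text = src.substring(start);
        return block;
    }

    public static void main(String[] args) {
        Block root = parse("if (a) { if (b) { x = 1; } x = 2; } else { x = 3; }");
        for (Block child : root.children) {
            System.out.println("[" + child.text + "] nested blocks: " + child.children.size());
        }
    }
}
```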

Spudley
Nice to hear such honest opinions and suggestions. My current implementation doesn't use recursion but I agree that would actually be a better solution. Thanks for your input, unfortunately I'm picking Trey's answer as accepted, sorry :)
Tuukka Mustonen
+1  A: 

As per the above, you'll need a parser. One type that's easy to implement (and fun to write!) is a recursive descent parser with backtracking. There is also a plethora of parser generators out there, though most of those have a learning curve. One Java-friendly parser generator is JavaCC.
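To give an idea of the shape of a hand-written recursive descent parser, here is a small sketch for a toy grammar modelled on the question's example. Everything in it (MiniParser, Node, the grammar itself) is an assumption made for illustration. It does not backtrack, because this grammar can always decide what to parse from the next few characters, and it does not handle else if, comments, or delimiters inside string literals.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy parser (not from the thread). Grammar:
//   statements := statement*
//   statement  := ifStmt | simple
//   ifStmt     := "if" "(" condition ")" block [ "else" block ]
//   block      := "{" statements "}"
//   simple     := any text up to ';'
class MiniParser {
    // AST node: a simple statement, or an if-statement when thenBody != null.
    static class Node {
        String text;          // statement text, or the raw if-condition
        List<Node> thenBody;  // null for simple statements
        List<Node> elseBody;  // null when there is no else branch
    }

    private final String src;
    private int pos = 0;

    MiniParser(String src) { this.src = src; }

    static List<Node> parse(String src) {
        MiniParser p = new MiniParser(src);
        List<Node> out = p.statements();
        if (p.pos < src.length()) throw p.error("unexpected '}'");
        return out;
    }

    // Stops at '}' or at end of input; the caller decides which one is legal.
    private List<Node> statements() {
        List<Node> out = new ArrayList<>();
        skipSpace();
        while (pos < src.length() && src.charAt(pos) != '}') {
            out.add(lookingAt("if") ? ifStmt() : simple());
            skipSpace();
        }
        return out;
    }

    private Node ifStmt() {
        Node n = new Node();
        pos += 2;                           // consume "if"
        expect('(');
        int start = pos;
        condition();
        n.text = src.substring(start, pos);
        expect(')');
        n.thenBody = block();
        skipSpace();
        if (lookingAt("else")) {
            pos += 4;                       // consume "else"
            n.elseBody = block();
        }
        return n;
    }

    // condition := ( non-paren char | "(" condition ")" )*
    private void condition() {
        while (pos < src.length() && src.charAt(pos) != ')') {
            if (src.charAt(pos) == '(') { pos++; condition(); expect(')'); }
            else pos++;
        }
    }

    private List<Node> block() {
        expect('{');
        List<Node> body = statements();
        expect('}');
        return body;
    }

    private Node simple() {
        int end = src.indexOf(';', pos);
        if (end < 0) throw error("missing ';'");
        Node n = new Node();
        n.text = src.substring(pos, end).trim();
        pos = end + 1;
        return n;
    }

    // True if the keyword starts at pos and is not just a prefix of an identifier.
    private boolean lookingAt(String kw) {
        int end = pos + kw.length();
        return src.startsWith(kw, pos)
            && (end >= src.length() || !Character.isJavaIdentifierPart(src.charAt(end)));
    }

    private void skipSpace() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }

    private void expect(char c) {
        skipSpace();
        if (pos >= src.length() || src.charAt(pos) != c) throw error("expected '" + c + "'");
        pos++;
    }

    private IllegalArgumentException error(String msg) {
        return new IllegalArgumentException(msg + " at index " + pos);
    }
}
```

Each grammar rule becomes one method, and nesting falls out of the methods calling each other. Calling MiniParser.parse(source) returns the top-level statements; for an if-statement node, text holds the condition and thenBody / elseBody hold the parsed branches.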

Tony Ennis
Thanks for the pointers. JavaCC's documentation looks a lot more complete than Beaver's. However, both have a learning curve as you say, so I'm probably going with a custom implementation. Reading about parsers, I found out that my current implementation is actually pretty similar :)
Tuukka Mustonen
I've never used a parser generator. On the rare occasion I have needed one, it was simple enough to code by hand.
Tony Ennis