ansaurus

Question

Answer 1

+4 A:

A regular expression probably isn't the best tool for the job (since it appears that you can have arbitrarily-nested braces). I think you might be better off writing a parser based on some grammar (that you'll have to define).

Here is an EBNF to get you started; it's incomplete because I don't know what things can be inside your block (other than more blocks):

blocks        ::= { block }
block         ::= "{", block-content, "}"
block-content ::= blocks | things-other-than-blocks

For some resources on parsing, take a look at this answer.
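
As a rough sketch of what such a parser might look like in Java: the grammar above is incomplete, so this version assumes that anything other than { and } counts as plain content, that braces cannot be escaped, and that text between top-level blocks can simply be skipped. The names BlockParser and Block are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockParser {
    /** One parsed block: its raw text (braces included) and its directly nested blocks. */
    static final class Block {
        final String text;
        final List<Block> children;

        Block(String text, List<Block> children) {
            this.text = text;
            this.children = children;
        }
    }

    private final String input;
    private int pos = 0;

    BlockParser(String input) {
        this.input = input;
    }

    /** blocks ::= { block }   (assumption: text between top-level blocks is ignored) */
    List<Block> parseBlocks() {
        List<Block> blocks = new ArrayList<>();
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (c == '{') {
                blocks.add(parseBlock());
            } else if (c == '}') {
                throw new IllegalArgumentException("Unmatched '}' at index " + pos);
            } else {
                pos++; // skip text outside any block
            }
        }
        return blocks;
    }

    /** block ::= "{", block-content, "}"   (assumption: content is text mixed with blocks) */
    private Block parseBlock() {
        int start = pos++; // consume '{'
        List<Block> children = new ArrayList<>();
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (c == '{') {
                children.add(parseBlock()); // recurse for each nested block
            } else if (c == '}') {
                pos++; // consume '}'
                return new Block(input.substring(start, pos), children);
            } else {
                pos++; // plain content character
            }
        }
        throw new IllegalArgumentException("Unclosed '{' at index " + start);
    }

    public static void main(String[] args) {
        String text = "{ a { b { c } } { d } } { e }";
        for (Block b : new BlockParser(text).parseBlocks()) {
            System.out.println(b.text + " -> " + b.children.size() + " direct child block(s)");
        }
        // { a { b { c } } { d } } -> 2 direct child block(s)
        // { e } -> 0 direct child block(s)
    }
}
```

Because each nested block is handled by a recursive call, the nesting depth is limited only by the call stack, not by the pattern.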

Vivin Paliath 2010-08-18 21:23:20

Answer 2

+1 A:

If you know beforehand the maximum level of nesting that can occur, see: Regex Recursion Without Balancing Groups (Matching Nested Constructs)

This should work for your example case.

But if Vivin's assumption is correct and you are dealing with arbitrarily deep nesting, you'll want to follow his advice and write a parser.

Or, if you're desperate, there is a solution using the .NET regex implementation (Balancing Groups), and there are also Perl regex solutions (perl solution 1, perl solution 2, perl solution 3). These can handle an unknown level of nesting but, alas, are not compatible with Java regex. :(

new Thrall 2010-08-18 22:47:10

Answer 3

+2 A:

If there can be at most one level of nesting, and the brace characters cannot be escaped, then the regex pattern for this is in fact quite simple.

Essentially the structure we have, in some abstract notation, is:

{…(?:{…}…)*…}

Here's a visual breakdown:

  ___top___
 /   nest  \
/    / \    \
{…(?:{…}…)*…}
| \______/| |
|         | |
open      | close
          |
     zero or more

This is not quite regex, of course, because:

- In "real" regex, we must escape the { and }, since they're metacharacters
- In "real" regex, we need to replace … with the actual pattern for content
  - [^{}]*+ would be a fine pattern. The […] is a character class. [^…] is a negated character class. The * is zero-or-more repetition. The + following the repetition specifier makes it a possessive quantifier.

So, a meta-regexing technique is used to programmatically transform this abstract pattern (which is readable) into a valid regex pattern (which can be ugly at times like this). Here's an example (also see on ideone.com):

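    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // The class name is arbitrary; it only makes the snippet compile on its own.
    public class BlockRegexDemo {
        public static void main(String[] args) {
            // Build the real regex from the readable abstract pattern:
            // escape each brace, then substitute the content pattern for each "…".
            Pattern block = Pattern.compile(
                "{…(?:{…}…)*…}"
                    .replaceAll("[{}]", "\\\\$0")
                    .replace("…", "[^{}]*+")
            );
            System.out.println(block.pattern());
            // \{[^{}]*+(?:\{[^{}]*+\}[^{}]*+)*[^{}]*+\}

            String text
                = "{ main1 { sub1a } { sub1b } { sub1c } }\n"
                + "{ main2\n"
                + "   { sub2a }\n"
                + "       { sub2c }\n"
                + "}"
                + "   { last one, promise }    ";

            // Print every top-level block found in the text.
            Matcher m = block.matcher(text);
            while (m.find()) {
                System.out.printf(">>> %s <<<%n", m.group());
            }
            // >>> { main1 { sub1a } { sub1b } { sub1c } } <<<
            // >>> { main2
            //    { sub2a }
            //        { sub2c }
            // } <<<
            // >>> { last one, promise } <<<
        }
    }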
As the output shows, the actual regex pattern is:

\{[^{}]*+(?:\{[^{}]*+\}[^{}]*+)*[^{}]*+\}

As a Java string literal, that is:

"\\{[^{}]*+(?:\\{[^{}]*+\\}[^{}]*+)*[^{}]*+\\}"

Variations

If the nesting level can be deeper but is still bounded by a known maximum, then regex can still be used. You can also allow the { and } to be "escaped" (i.e. used in the content part but not as block delimiters).

The final regex pattern will be quite complicated, but depending on how comfortable you are with meta-regexing (which requires you to be comfortable with regex itself), the code can be quite readable and manageable.
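
To illustrate the deeper-nesting variation, here is a sketch (not part of the original answer) that applies the same meta-regexing idea to build a pattern for any fixed maximum depth. The class and method names are hypothetical, `_` stands in for … as the content placeholder only to keep the source ASCII, and escaped braces are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedBlockRegex {
    /** Readable abstract pattern for a block containing at most maxDepth levels of nested blocks. */
    static String abstractPattern(int maxDepth) {
        String p = "{_}"; // a block with no nested blocks
        for (int i = 0; i < maxDepth; i++) {
            // content, then zero or more repetitions of (inner block, content)
            p = "{_(?:" + p + "_)*}";
        }
        return p;
    }

    /** Same two-step transformation as before: escape the braces, then fill in the content pattern. */
    static Pattern compile(int maxDepth) {
        return Pattern.compile(
            abstractPattern(maxDepth)
                .replaceAll("[{}]", "\\\\$0")
                .replace("_", "[^{}]*+")
        );
    }

    /** Collects every match of the pattern in the text, in order. */
    static List<String> findAll(Pattern pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "{ a { b { c } } } { d }";
        System.out.println(findAll(compile(2), text)); // [{ a { b { c } } }, { d }]
        System.out.println(findAll(compile(1), text)); // [{ b { c } }, { d }]
    }
}
```

Here abstractPattern(1) is `{_(?:{_}_)*}`, the same shape as the hand-written abstract pattern above. Note that a pattern built for too small a depth does not fail on deeper input; as the second call shows, it matches the inner blocks that fit instead.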

If the nesting level can be arbitrarily deep, then some flavors (e.g. .NET or Perl) can still handle it, but Java regex is not powerful enough to handle it.

polygenelubricants 2010-08-19 08:45:49

