views:

168

answers:

4

The input is a string representing a list of elements.

A list is defined as an open curly { followed by 0 or more elements separated by whitespace followed by a closed curly }.

An element is either a literal or a list of elements.

A literal is a succession of non-whitespace characters. If an element contains a curly bracket, it must be escaped with a backslash : \{ and \}. (Or you could assume curlies are not allowed inside literals, for simplicity)

Example:

"{abc { def ghi } 7 { 1 {2} {3 4} } {5 6} x\{yz \}foo }"

No curlies inside literals:

"{abc { def ghi } 7 { 1 {2} {3 4} } {5 6} xyz foo }"

(This is a simplified definition of a Tcl list.)

What I want to know is: can the input be split into the elements of the outermost loop using regex?

Expected output:

abc
{ def ghi }
7
{ 1 {2} {3 4} }
{5 6}
x{yz
}foo

The real question is: can this be done with a Regex?

I'm most interested in the .NET flavour, but will accept any answers.

I'll post my own assumption in an answer, and see if it's validated or destroyed.

+2  A: 

The traditional answer to this is a resounding "NO". As we have learned in the compilers class, a regular grammar can't be used to describe a language with a recursive definition (i.e. you can't use a finite state machine)

What's needed here is a context free parser, the implementation of which boils down to a finite state machine + a STACK.
See ANTLR, bison etc.

Cristi Diaconescu
Next time you might want to consider leaving a few minutes before posting your answer as it is difficult for other people to get any votes if there is already an upvoted post, so this may discourage many other people from posting... or even from looking at the question (if the question is already answered many people won't even see it). I do assume you are interested in receiving other opinions, otherwise you wouldn't have posted... right? PS: in .NET I believe that it is possible to do this using "regular" expressions but you're right that it's not advisable to use regex for this purpose.
Mark Byers
@mark Note taken. And yes, I'm quite interested in the answers. I remember having read somewhere about some less orthodox extensions to a regex library that allows matching parens under circumstances, but I can't remember which library or which circumstances...
Cristi Diaconescu
I can't follow any of this. People should wait before posting correct answers? Regular expressions can be used for languages requiring a DPDA?
EJP
@EJP: It's a self-answer and *strictly* speaking it isn't correct because a "regular" expression in .NET is not actually regular at all and in fact it can parse the above. But this answer is still good advice in general. There's nothing wrong with posting self answers and nothing wrong with this answer, I'm just saying if he actaully *wants different opinions* rather than just a quick rep boost (and I assume he does) then he should have waited before posting to avoid discouraging other people from answering.
Mark Byers
@mark I re-read your comment. Are you saying this CAN be done in .NET?Point me to some evidence and you'll get the accepted answer.
Cristi Diaconescu
@EJP Wasn't familiar with the DPDA acronym so I looked it up. The 1st match on Google is hillarious, in a toilet humour way :)
Cristi Diaconescu
@Mark, not just .NET but pretty much all regex-implementations of programming languages used nowadays can't be called "regular". If a regex implementation supports groups and back-references to those groups, or look around assertions, it isn't "regular" in the more strict definition of it.
Bart Kiers
@Crista Diaconescu: The best info I could find on this subject was here: http://stackoverflow.com/questions/3349999/net-regex-recursive-patterns
Mark Byers
+1  A: 

@Cristi is right about the regex: It is theoretically impossible to solve recursive expressions using a stackless, finite-state automaton. The solution, however, is simpler: you only need to keep a counter of the number of open parentheses, and make sure it doesn't drop below 0. It is more memory-saving than maintaining a stack, and you only need the count - not the contents - of the Parentheses.

Algorithm:

counter = 0                        // Number of open parens
For char c in string:              
    print c                        
    if c=='{':                     // Keep track on number of open parens
        counter++
    if c=='}':
        counter--
    if counter==1:                 // New line if we're back to the top level
        print "\n"
    elif counter<1:                // Error if the nesting is malformed
        print "ERROR: parentheses mismatch"
        break
Adam Matan
This doesn't take into account the escaped curlies...
Cristi Diaconescu
True, but the fix is quite simple.
Adam Matan
+4  A: 

Unfortunately the answer is YES for some flavor of Regex, e.g. PCRE and .NET because they support recursive pattern and stack-like operations respectively.

The grammar can be written as

ELEMENT  -> (?!\{)\S+ | LIST
LIST     -> '\{\s*' ELEMENT? ('\s+' ELEMENT)* '\s*\}' 

thus in PCRE, this can be transformed into the pattern:

   \{\s*(?0)?(?:\s+(?0))*\s*\}|(?!\{)(?:[^\s}]|\}(?![\s}]))+

#  ---------------------------                   ^^^^^^^^^
#            LIST                    Make sure the } is not closing the group

See http://www.ideone.com/SnGsU for example (I have stripped the top-level { and } for simplicity).

(Of course, don't try this at work :) )

(BTW, I don't know how to transform this PCRE into .NET flavor. If someone knows, please try http://stackoverflow.com/questions/3349999)

KennyTM
Wow! Just one question: in your grammar definition, what does the (?!\{) at the beginning of ELEMENT mean?
Cristi Diaconescu
@Cristi: It is a [negative lookahead](http://www.regular-expressions.info/lookaround.html).
KennyTM
I wish I could select two answers as 'accepted answer', as this one is pretty complete. However, Kobi's answer is a better match for what I was searching, and the regex there is IMO quite a bit more readable.
Cristi Diaconescu
+1  A: 

Well, the edit removes curly braces from tokens and takes the sting from the question, and now it is easily doable with .Net Regexes, using balancing groups. It is simply matching braces, which is a basic example.
Much like KennyTM's answer, this will only work if you remove the top level braces, or it will match the whole input.
Again, this is better used for recreational purposes:

(?:                    # try matching...
    (?:\\[{}]|[^\s{}])+\s*? # a literal (allow escaped curly braces)
    |                       # OR
    (?<Curly>{)\s*          # "{" and push to stack
    |                       # OR
    (?<-Curly>})\s*?        # "}", pop from stack and fail if the stack is empty
)+?                    # ...a few times, and stop whenever you can.
(?(Curly)(?!))         # Make sure there aren't any extra open curly braces

For much more details see this article: Regex Balancing Group in Depth

Kobi
Before the question was updated I couldn't get this to work. On the other question we *validate* from start to end (`^(?:...)+$`), so the engine must try every combination. However, when you *match* for tokens, the engine can satisfy with less, and it's difficult setting priorities.
Kobi