tags:

views:

460

answers:

4

Is there any way to convert the following BNF into a .Net regex? (I'm not stuck on the BNF, but I thought it might be the best way to explain what I was trying to do)

<field> ::= "<<" <fieldname> <options> ">>"

<options> ::= "" | "(" <option> ")"

<option> ::= "" | 
             <option> <non-paren> | 
             <option> <escaped-character>

<escaped-character> ::= "\\" | "\)"

<non-paren> ::= any character but paren

<fieldname> ::= any string that doesn't contain "(" or ">>"

I'm close, but I can't figure out how to deal with escaping "\" and ")". This captures the fieldname and option in named groups.

<<(?<fieldname>.\*?)(\((?<option>.*?)\))?>>


Edit

It turns out that I was rustier at BNFs than I thought.

What I was trying to get at is that parens are special characters. Inside the "option" section, they must be escaped by a slash. (And slashes must also be escaped).

A: 

I've been pondering an answer, and sort of hoping someone would jump me so I could stop. :)

The recursive nature of a BNF is usually a good opening indicator that if your problem maps well to a BNF, it doesn't map well to a RegExp.

I have to admit, I'm not sure I can even figure out your BNF. For instance: x ::= << Boo ( abc321 ) >>

Would suggest your 'option' pairs are c3, b2, and a1. This assumes a char is a valid 'option' - you didn't define any valid terminal value for option that wasn't the empty string. Is that really the intent?

Assuming you did not want to be recursive... Dealing with escaping and everything else... You may just be better off writing code. This looks a lot easier to walk through the string and deal with than anything else. The feel of what you want suggests you don't need any look-ahead or look-back logic.

Joe
+4  A: 

BNF is used to describe context-free languages, which regex can't normally describe. What separates context-free languages and regex is that context-free langauges can have recursion on both sides at the same time. A classic example is the balanced parenthesis problem.

paren = paren paren
      | '(' paren ')'  <-- there are characters on both sides of the recursion
      | ''

In your case, you don't use any double-sided recursion, so it reduces to a regular language.

fieldname = /(?:>?[^(>])+/    //No double >, but single ones are ok.
option = /(?:[^()\\]|\\.)*/   //No parens, unless preceeded by \

pattern = /<<(?<fieldname>   )(?:\((?<option>   )\))?>>/

Putting it together:

pattern = /<<(?<fieldname>(?:>?[^(>])+)(?:\((?<option>(?:[^()\\]|\\.)*)\))?>>/

Some border cases:

<<f>oo(bar>>)>> --> ('f>oo', 'bar>>')
<<foo(bar\))>>  --> ('foo', 'bar\)')
<<foo(bar\\)>>  --> ('foo', 'bar\\')
<<foo\(bar)>>   --> ('foo\', 'bar')


EDIT:

If you want any extra parenthesis characters (and back-slashes) to have to be escaped inside << and >>, you could do something like this:

fieldname = /(?:<?[^()\\<]|<?\\[()\\])+/
options = /(?:[^()\\]|\\[()\\])*/
pattern = /<<(?<fieldname>   )(?:\((?<option>   )\))?>>/

/<<(?<fieldname>(?:<?[^()\\]|<?\\[()\\])+)(?:\((?<option>(?:[^()\\]|\\[()\\])*)\))?>>/

updated:

<<f>oo(bar>>)>> --> ('f>oo', 'bar>>')
<<foo(bar\))>>  --> ('foo', 'bar\)')
<<foo(bar\\)>>  --> ('foo', 'bar\\')
<<foo\(bar)>>   --> doesn't match
<<foo\((bar)>>  --> ('foo\(', 'bar')
MizardX
I think your option regex is just a hair off.
Adam Tegen
To yours? I posted before! :P ... Anyway, they are quite similar, but there are only so many ways to do the same.
MizardX
It doesn't look like it will match anything with a slash paren
Adam Tegen
A: 

I think I managed to get it to work...

<<(?<fieldname>[^\(]+)(?<options>\((?<option>(\\\\|\\\)|[^\\\)])*)\))?>>

The trick that I could think of was the option portion:

option =    (\\\\|\\\)||[^\\\)]

Which translates to: Either double-slash, slash-paren, or a non-slash-paren character.

Then include it 0 or more times and slap it in a group named "option":

((?<option>(\\\\|\\\)|[^\\\)])*)

I also changed fieldname to be one or more non-open-parens:

fieldname =     [^\(]+

Putting that together, I came up with the solution.

Adam Tegen
+1  A: 

Regular expressions denote regular languages. Context-free grammars generate context-free languages. The former language set is a subset of the latter and in the general case you cannot express a context-free language as a regular expression.