ansaurus

Question

Answer 1

+1 A:

JoshD correctly points out that this grammar you've proposed (having matching pairs of brackets) cannot be parsed using a regular expression. You need to construct a custom parser with recursive descent behavior.

James Dunne 2010-10-12 03:59:57
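
A minimal sketch of what such a recursive-descent parser could look like. The question text is not shown here, so the grammar is an assumption: a group is `[`, then any mix of whitespace-separated words and nested groups, then `]`. The names `Node` and `BracketParser` are hypothetical and exist only for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node type: a bare word has Text set; a bracketed group has Text == null.
public sealed class Node
{
    public string Text;
    public List<Node> Children = new List<Node>();
}

public static class BracketParser
{
    // Parses exactly one bracketed group that must span the whole input.
    public static Node Parse(string s)
    {
        int pos = 0;
        Node root = ParseGroup(s, ref pos);
        if (pos != s.Length)
            throw new FormatException("Unexpected '" + s[pos] + "' at position " + pos);
        return root;
    }

    // group := '[' (group | word | whitespace)* ']'
    private static Node ParseGroup(string s, ref int pos)
    {
        if (pos >= s.Length || s[pos] != '[')
            throw new FormatException("Expected '[' at position " + pos);
        int open = pos++;                       // remember where this group started
        var node = new Node();
        while (pos < s.Length && s[pos] != ']')
        {
            if (s[pos] == '[')
            {
                node.Children.Add(ParseGroup(s, ref pos));   // recurse into the nested group
            }
            else if (char.IsWhiteSpace(s[pos]))
            {
                pos++;                          // skip separators
            }
            else
            {
                int start = pos;                // read a word up to the next bracket or space
                while (pos < s.Length && s[pos] != '[' && s[pos] != ']' && !char.IsWhiteSpace(s[pos]))
                    pos++;
                node.Children.Add(new Node { Text = s.Substring(start, pos - start) });
            }
        }
        if (pos >= s.Length)
            throw new FormatException("Unclosed '[' opened at position " + open);
        pos++;                                  // consume the closing ']'
        return node;
    }
}
```

Under these assumptions, `BracketParser.Parse("[[a b] [c d]]")` yields a root with two child groups, while an input such as `"[a [b]"` throws a `FormatException` that names the position of the unclosed bracket.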

It can be done with .NET, it's just very ugly. See here: http://msdn.microsoft.com/en-us/library/bs2twtah%28VS.85%29.aspx#BalancingGroupDefinitionExample

Jens 2010-10-12 06:58:42
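
For reference, a sketch of how the balancing-group technique from that link might be adapted to square brackets. The pattern follows the shape of the documented angle-bracket example with `<` and `>` swapped for `[` and `]`; the class and method names are hypothetical, and this only checks that brackets are balanced, it does not extract the nested structure.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BalancedBrackets
{
    private static readonly Regex Balanced = new Regex(
        @"^[^\[\]]*" +                      // leading text without brackets
        @"(" +
          @"((?<Open>\[)[^\[\]]*)+" +       // each '[' pushes a capture onto the Open stack
          @"((?<Close-Open>\])[^\[\]]*)+" + // each ']' pops one; fails if the stack is empty
        @")*" +
        @"(?(Open)(?!))$");                 // at the end, fail if any '[' is still unmatched

    // True when every '[' in s has a matching ']' and vice versa.
    public static bool IsBalanced(string s)
    {
        return Balanced.IsMatch(s);
    }
}
```

With this sketch, `IsBalanced("[animal [dog rufus]]")` is true, and `IsBalanced("[a [b]")` and `IsBalanced("[a]]")` are both false.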

That link looks like it directly addresses my question, but I don't have the regex skills to tailor it to my need. It sounds like a parser is the better approach anyway.

Shawn 2010-10-12 07:07:45

It's not a question of your skill with using regular expressions. This grammar simply cannot be parsed with regular expressions. Regular expressions parse regular languages. Not all languages are regular. Look it up :)

James Dunne 2010-10-12 14:07:52

@Jens: That link is misleading. You may be able to match pairs of braces, but you cannot match every opening brace with a closing brace and end up with every pair matched.

James Dunne 2010-10-12 14:13:35

@James, regular expressions in a more theoretical/mathematical sense indeed only match/parse regular languages, but most modern day regular expression implementations can match/parse more than regular languages. I'm not even talking about recursive patterns, think 'back-references': `(.).*?\1`.

Bart Kiers 2010-10-13 07:18:24

@Bart: I'm aware that regular expressions in modern implementations include far more features which enable them to parse more complicated languages than just the regular languages. However, when parsing languages like these, regular expressions are not the right tool for the job due to their inherent limitations, regardless of implementation extensions. You can also gain accurate parse error reporting with a manual or auto-generated parser implementation; that is not easily done with a regular expression, if it is possible at all.

James Dunne 2010-10-13 16:48:10

Furthermore, it is commonly found that parsing HTML with regular expressions leads to madness. There are several popular SO questions referencing this sage wisdom. This case is no different because the crux of the HTML issue is that the angle-brackets need to be matched, similar to how the square-brackets need to be matched in pairs here.

James Dunne 2010-10-13 16:50:35

@James, note that I never said it would be a good idea to use regex for tasks like validating/parsing a language like (X)HTML. I simply said that you can't just say that something can't be done using a (modern-day) regex engine because the target language/string is not "regular". For example, if you want to match a character that occurs at least 5 times in a string (once for the captured group plus four repeats), you could do that using a regex like: `(.)(?:.*?\1){4}`, which matches `cdbcbccaac` from the target string `abcdbcbccaacabdddd`. But this "language" is, AFAIK, not regular (but suitable for a (modern-day) regex, IMO).

Bart Kiers 2010-10-13 17:26:42
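
The back-reference pattern above can be tried directly in .NET. A small sketch, with a hypothetical `BackReferenceDemo` wrapper; note that the pattern needs the captured character plus four further occurrences of it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackReferenceDemo
{
    // Returns the first substring that starts with some character and
    // contains four more occurrences of that same character, or null.
    public static string FirstMatch(string input)
    {
        Match m = Regex.Match(input, @"(.)(?:.*?\1){4}");
        return m.Success ? m.Value : null;
    }
}
```

`BackReferenceDemo.FirstMatch("abcdbcbccaacabdddd")` returns `cdbcbccaac`: attempts starting at `a` and `b` fail because each occurs only four times, so the first successful start is the `c` at index 2.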

@James, ... and of course I agree with you that HTML and regex shouldn't belong in the same sentence (unless a 'not' or 'never' is present)! :)

Bart Kiers 2010-10-13 17:28:13

Answer 2

+1 A:

Do I understand you correctly, that all strings you want to parse have the form

[id1 [id2 [id3 [id4 .. value]] ... ],

i.e. all brackets are closing at the end? Your question and examples seem to point that way. If that's true, parsing it using regex is not that difficult, depending on what you actually need your parser to do.

You could, say, use

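// Requires: using System; using System.Text.RegularExpressions;
static Tuple<String, String> Parse(String s)
{
    // Group 1 is the leading identifier; group 2 is everything up to the final ']'.
    // If s does not match, both groups are empty strings.
    var match = Regex.Match(s, @"^\[(\w*) (.*)\]$", RegexOptions.None);
    return new Tuple<String, String>(match.Groups[1].ToString(), match.Groups[2].ToString());
}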

That would result in

var result = Parse("[animal [dog rufus]]");
// result = { Item1 = "animal", Item2 = "[dog rufus]" }
var inner = Parse(result.Item2);
// inner = { Item1 = "dog", Item2 = "rufus" }

You could call Parse recursively to get to the inner nesting levels.

Please ask if you have requirements I did not understand =)

Jens 2010-10-13 07:09:58
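
A sketch of how repeated application of that regex might unwrap every level, written as a loop rather than an explicit recursive call. It assumes the "all brackets close at the end" shape described in the answer; `NestedUnwrap` and `Flatten` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NestedUnwrap
{
    // Returns the ids from outermost to innermost, followed by the innermost value.
    public static List<string> Flatten(string s)
    {
        var parts = new List<string>();
        while (true)
        {
            Match m = Regex.Match(s, @"^\[(\w*) (.*)\]$");
            if (!m.Success)
            {
                parts.Add(s);               // no more brackets to peel: treat the rest as the value
                return parts;
            }
            parts.Add(m.Groups[1].Value);   // the id at this level
            s = m.Groups[2].Value;          // continue with the nested remainder
        }
    }
}
```

`Flatten("[animal [dog rufus]]")` gives `animal`, `dog`, `rufus`. An input with sibling groups such as `"[[a b] [c d]]"` does not match the pattern at all and comes back unchanged as a single element, which is the limit of this approach.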

This will work only if the second element of the tuple is always what needs to be recursed. I have not verified it myself, but I'm quite sure that this will fail to parse "[[a b] [c d]]". That depends on the OP's grammar, of course.

James Dunne 2010-10-13 17:18:39


Nested Regex Replace in C#
