I'm not really that good with regex, but I understand the basics. I'm trying to figure out how to do a conditional replace based upon a certain value in the match. For example:

Suppose I have some nested string structure that looks like this:

"[id value]" // id and value are space-delimited; id will never contain spaces

id is some string id that names the [] item, and value is another nested [id value] item. It's possible for value to be empty, but I'm not worried about that for now.

If I have something like this:

A) "[vehicle [toyota camry]]"
or
B) "[animal [dog rufus]]"

I'd like to be able to call a certain function (ToString(), for example) based on the id, and have its output used as regex.Replace is executed, starting from the innermost [] structure.

Going from example A pseudo code:

string result = "{0}";
var firstItem = GetInteriorValueIdFrom("[vehicle [toyota camry]]");
// firstItem.ToString() == "Company: Toyota, Make: Camry"

result = string.Format(result, firstItem.ToString());


var secondItem = GetSecondValueIdFrom("[vehicle [toyota camry]]");
// secondItem.ToString() == "Type: Vehicle, {0}"

// The outer item supplies the template; the inner result fills its {0}.
result = string.Format(secondItem.ToString(), result);
// result == "Type: Vehicle, Company: Toyota, Make: Camry"

This sample obviously has nothing to do with regex, but it hopefully illustrates kind of what I'm trying to do.

+1  A: 

JoshD correctly points out that this grammar you've proposed (having matching pairs of brackets) cannot be parsed using a regular expression. You need to construct a custom parser with recursive descent behavior.

James Dunne
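As a rough illustration of the recursive descent approach mentioned in this answer, here is a minimal sketch for the "[id value]" format from the question. The names Item and BracketParser are hypothetical, and the sketch assumes a value is either one nested item or plain text without brackets; the real grammar may differ.

```csharp
using System;

// Hypothetical node type: an id plus either a nested item or plain text.
public sealed class Item
{
    public string Id;
    public string Text;   // set when the value is plain text (null when there is no value)
    public Item Child;    // set when the value is a nested [id value] item
}

public static class BracketParser
{
    public static Item Parse(string input)
    {
        int pos = 0;
        Item item = ParseItem(input, ref pos);
        if (pos != input.Length)
            throw new FormatException("Unexpected text at position " + pos);
        return item;
    }

    // item := '[' id (' ' (item | text))? ']'
    private static Item ParseItem(string s, ref int pos)
    {
        Expect(s, ref pos, '[');
        int start = pos;
        while (pos < s.Length && s[pos] != ' ' && s[pos] != '[' && s[pos] != ']') pos++;
        if (pos == start) throw new FormatException("Missing id at position " + pos);
        var item = new Item { Id = s.Substring(start, pos - start) };

        if (pos < s.Length && s[pos] == ' ')
        {
            pos++; // skip the single space separating id and value
            if (pos < s.Length && s[pos] == '[')
            {
                item.Child = ParseItem(s, ref pos); // recurse into the nested item
            }
            else
            {
                start = pos;
                while (pos < s.Length && s[pos] != '[' && s[pos] != ']') pos++;
                item.Text = s.Substring(start, pos - start);
            }
        }
        Expect(s, ref pos, ']');
        return item;
    }

    private static void Expect(string s, ref int pos, char c)
    {
        if (pos >= s.Length || s[pos] != c)
            throw new FormatException("Expected '" + c + "' at position " + pos);
        pos++;
    }
}
```

Because every failure goes through a FormatException that carries a position, this style also gives the kind of parse error reporting that is hard to get from a single regular expression.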
It can be done with .NET, it's just very ugly. See here: http://msdn.microsoft.com/en-us/library/bs2twtah%28VS.85%29.aspx#BalancingGroupDefinitionExample
Jens
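For readers who want to see what the balancing-group approach from the linked page looks like for this question, here is a sketch that adapts the documented angle-bracket pattern to square brackets. The class and method names are made up for illustration, and the pattern only checks that brackets are balanced; it does not extract ids or values.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BalancedBrackets
{
    // 'Open' pushes a capture for every '['; 'Close-Open' pops one for every ']'.
    // (?(Open)(?!)) forces a failure if any '[' is still unmatched at the end.
    private static readonly Regex Balanced = new Regex(
        @"^[^\[\]]*(((?'Open'\[)[^\[\]]*)+((?'Close-Open'\])[^\[\]]*)+)*(?(Open)(?!))$");

    public static bool IsBalanced(string s)
    {
        return Balanced.IsMatch(s);
    }
}
```

This is specific to the .NET regex engine; balancing groups are not available in most other regex flavors.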
That link looks like it directly addresses my question, but I don't have the regex skills to tailor it to my need. It sounds like a parser is the better approach anyway.
Shawn
It's not a question of your skill with using regular expressions. This grammar simply cannot be parsed with regular expressions. Regular expressions parse regular languages. Not all languages are regular. Look it up :)
James Dunne
@Jens: That link is misleading. You may be able to match pairs of braces, but you cannot match every opening brace with a closing brace and end up with every pair matched.
James Dunne
@James, regular expressions in a more theoretical/mathematical sense indeed only match/parse regular languages, but most modern day regular expression implementations can match/parse more than regular languages. I'm not even talking about recursive patterns, think 'back-references': `(.).*?\1`.
Bart Kiers
@Bart: I'm aware that regular expressions in modern implementations include far more features, which enable them to parse more complicated languages than just the regular languages. However, when parsing languages like these, regular expressions are not the right tool for the job due to their inherent limitations, regardless of implementation extensions. You can also gain accurate parse error reporting with a manual or auto-generated parser implementation; that is not easily done with a regular expression, if it is possible at all.
James Dunne
Furthermore, it is commonly found that parsing HTML with regular expressions leads to madness. There are several popular SO questions referencing this sage wisdom. This case is no different, because the crux of the HTML issue is that the angle brackets need to be matched, similar to how the square brackets need to be matched in pairs here.
James Dunne
@James, note that I never said it would be a good idea to use regex for tasks like validating/parsing a language like (X)HTML. I simply said that you can't just say that something can't be done using a (modern-day) regex engine because the target language/string is not "regular". For example, if you want to match a character that occurs at least 5 times in a string (once for the captured group plus four back-references), you could do that using a regex like: `(.)(?:.*?\1){4}`, which matches `cdbcbccaac` from the target string `abcdbcbccaacabdddd`. But this "language" is, AFAIK, not regular (but suitable for a (modern-day) regex, IMO).
Bart Kiers
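The back-reference example from this comment can be tried directly in .NET. The sketch below wraps it in a hypothetical helper; note that the pattern needs one captured character plus four back-references, so the character has to occur five times in total.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackReferenceDemo
{
    // Returns the first substring that starts with some character and
    // ends at that character's fifth occurrence, or null if there is none.
    public static string FirstRepeatedRun(string input)
    {
        Match m = Regex.Match(input, @"(.)(?:.*?\1){4}");
        return m.Success ? m.Value : null;
    }
}
```

In the target string from the comment, 'a' and 'b' each occur only four times, so the first successful match starts at the first 'c'.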
@James, ... and of course I agree with you that HTML and regex shouldn't belong in the same sentence (unless a 'not' or 'never' is present)! :)
Bart Kiers
+1  A: 

Do I understand you correctly, that all strings you want to parse have the form

[id1 [id2 [id3 [id4 .. value]] ... ],

i.e. all brackets are closing at the end? Your question and examples seem to point that way. If that's true, parsing it using regex is not that difficult, depending on what you actually need your parser to do.

You could, say, use

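// Requires: using System; using System.Text.RegularExpressions;
static Tuple<String, String> Parse(String s)
{
    // Group 1 is the id (no spaces); group 2 is everything up to the final ']'.
    var match = Regex.Match(s, @"^\[(\w*) (.*)\]$", RegexOptions.None);
    if (!match.Success)
        throw new FormatException("Not of the form [id value]: " + s);
    return new Tuple<String, String>(match.Groups[1].Value, match.Groups[2].Value);
}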

That would result in

var result = Parse("[animal [dog rufus]]");
// result = { Item1 = "animal", Item2 = "[dog rufus]" }
var inner = Parse(result.Item2);
// inner = { Item1 = "dog", Item2 = "rufus" }

You could call Parse recursively to get to the inner nesting levels.
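A possible way to walk all nesting levels with the same pattern is sketched below. NestedUnwrap and Unwrap are hypothetical names, and this variant of Parse returns null instead of a tuple when the text is no longer of the form "[id value]", which is used as the stop condition. It assumes all brackets close at the end, as described above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NestedUnwrap
{
    // Same pattern as in the answer; null means "not of the form [id rest]".
    static Tuple<string, string> Parse(string s)
    {
        Match match = Regex.Match(s, @"^\[(\w*) (.*)\]$");
        return match.Success
            ? Tuple.Create(match.Groups[1].Value, match.Groups[2].Value)
            : null;
    }

    // Collects ids from outermost to innermost; the innermost value comes back via 'value'.
    public static List<string> Unwrap(string s, out string value)
    {
        var ids = new List<string>();
        Tuple<string, string> level;
        while ((level = Parse(s)) != null)
        {
            ids.Add(level.Item1);
            s = level.Item2; // descend one nesting level
        }
        value = s;
        return ids;
    }
}
```

With the ids collected in order, the caller can then decide which function to apply for each id, starting from the last (innermost) entry.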

Please ask if you have requirements I did not understand =)

Jens
This will work only if the second element of the tuple is always what needs to be recursed. I have not verified it myself, but I'm quite sure that this will fail to parse "[[a b] [c d]]". That depends on the OP's grammar, of course.
James Dunne