Hello, I am trying to write a regular expression to match and split a custom variable syntax in C#. The idea here is a custom formatting of string values very similar to the .NET String.Format/{0} style of string formatting.

For example the user would define a String format to be evaluated at runtime like so:

D:\Path\{LanguageId}\{PersonId}\

The value 'LanguageId' matches a data object field, and the field's current value replaces the placeholder.

Things get tricky when there is a need to pass arguments to the formatting field. For example:

{LanguageId:English|Spanish|French}

This would have the meaning of executing some conditional logic if the value of 'LanguageId' was equal to one of the arguments.

Lastly I would need to support map arguments like this:

{LanguageId:English=>D:\path\english.xml|Spanish=>D:\path\spanish.xml}

Here is an enumeration of all possible values:

Command no argument: do something special

{@Date}

Command single argument:

{@Date:yyyy-mm-dd}

No argument:

{LanguageId}

Single argument-list:

{LanguageId:English}

Multi Argument-list:

{LanguageId:English|Spanish}

Single Argument-map:

{LanguageId:English=>D:\path\english.xml}

Multi Argument-map:

{LanguageId:English=>D:\path\english.xml|Spanish=>D:\path\spanish.xml}

Summary: The syntax can be boiled down to a key with an optional parameter that is either a list or a map (not both).

Below is the regex I have so far, which has a few problems. Namely, it doesn't handle all whitespace correctly, and in .NET I don't get the splits I am expecting. For instance, in the first example I am returned a single match of '{LanguageId}{PersonId}' instead of two distinct matches. Also, I am sure it doesn't handle filesystem paths or delimited, quoted strings. Any help getting me over the hump, or any recommendations, would be appreciated.

    private const string RegexMatch = @"
        \{                              # opening curly brace
        [\s]*                           # whitespace before command
        @?                              # command indicator
        (.[^\}\|])+                     # string characters representing command or metadata
        (                               # begin grouping of params
        :                               # required param separator
        (                               # begin select list param type

        (                               # begin group of list param type
        .+[^\}\|]                       # string of characters for the list item
        (\|.+[^\}\|])*                  # optional multiple list items with separator
        )                               # end select list param type

        |                               # or select map param type

        (                               # begin group of map param type
        .+[^\}\|]=>.+[^\}\|]            # string of characters for map key=>value pair
        (\|.+[^\}\|]=>.+[^\}\|])*       # optional multiple param map items
        )                               # end group map param type

        )                               # end select map param type
        )                               # end grouping of params
        ?                               # allow at most 1 param group
        \s*
        \}                              # closing curly brace
        ";
+3  A: 

You may want to look into implementing this as a Finite-State Machine instead of a regex, mainly for speed purposes. http://en.wikipedia.org/wiki/Finite-state%5Fmachine

Edit: Actually, to be precise, you want to look at Deterministic Finite State machines: http://en.wikipedia.org/wiki/Deterministic%5Ffinite-state%5Fmachine
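
As a rough illustration of the state-machine idea, here is a minimal C# sketch of a two-state scanner that splits a template into literal text and `{...}` placeholders. The names `TemplateScanner`, `Scan`, and `State` are made up for this example, it assumes C# 9 or later (top-level statements), and it only separates the segments; it does not interpret the arguments inside the braces.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Usage: print each segment of the first example from the question.
foreach (var (isPlaceholder, text) in TemplateScanner.Scan(@"D:\Path\{LanguageId}\{PersonId}\"))
    Console.WriteLine((isPlaceholder ? "placeholder: " : "literal: ") + text);

// Hypothetical helper, not part of the question's code.
static class TemplateScanner
{
    enum State { Literal, InsideBraces }

    // Returns (isPlaceholder, text) pairs in input order.
    public static List<(bool IsPlaceholder, string Text)> Scan(string template)
    {
        var segments = new List<(bool, string)>();
        var buffer = new StringBuilder();
        var state = State.Literal;

        foreach (char c in template)
        {
            switch (state)
            {
                case State.Literal when c == '{':
                    // Flush pending literal text and enter the placeholder.
                    if (buffer.Length > 0) segments.Add((false, buffer.ToString()));
                    buffer.Clear();
                    state = State.InsideBraces;
                    break;
                case State.InsideBraces when c == '}':
                    // Placeholder complete; surrounding whitespace is dropped.
                    segments.Add((true, buffer.ToString().Trim()));
                    buffer.Clear();
                    state = State.Literal;
                    break;
                case State.InsideBraces when c == '{':
                    throw new FormatException("Nested '{' is not allowed.");
                default:
                    // Any other character (including a stray '}' in literal
                    // text) is accumulated as-is.
                    buffer.Append(c);
                    break;
            }
        }

        if (state == State.InsideBraces)
            throw new FormatException("Missing closing '}'.");
        if (buffer.Length > 0) segments.Add((false, buffer.ToString()));
        return segments;
    }
}
```

Each character is examined exactly once, and the two `FormatException` cases show where a hand-written scanner can report a specific problem instead of silently failing to match.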

Jaimal Chohan
Not to mention the sanity of everyone involved. Though, regexes get compiled to FSMs, so it should be possible... I wouldn't really want to read it though. The fact that there is a comment for sets of 2-3 characters on most lines above illustrates my point.
Matthew Scharley
+3  A: 

You're trying to do too much with one regex. I suggest you break the task down into steps, the first being a simple match on something that looks like a variable. That regex could be as simple as:

\{\s*([^{}]+?)\s*\}

That saves your whole variable/command string in group #1, minus the braces and surrounding whitespace. After that you can split on colons, then pipes, then "=>" sequences as appropriate. Don't compress all the complexity into one monster regex; if you ever manage to get the regex written, you'll find it impossible to maintain when your requirements change later on.
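
To make the step-by-step idea concrete, here is a minimal C# sketch of it, assuming C# 9 or later. The `Placeholder` and `PlaceholderParser` types are hypothetical names invented for this example. One caveat it handles: the split on the colon uses only the first colon, because argument values such as `D:\path\english.xml` contain colons themselves.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string template =
    @"D:\Path\{LanguageId:English=>D:\path\english.xml|Spanish=>D:\path\spanish.xml}\{ PersonId }\{@Date:yyyy-mm-dd}";

// Step 1: find everything that looks like a variable.
foreach (Match m in Regex.Matches(template, @"\{\s*([^{}]+?)\s*\}"))
{
    // Step 2: take the captured body apart with plain string operations.
    Placeholder field = PlaceholderParser.Parse(m.Groups[1].Value);
    Console.WriteLine($"{field.Name} command={field.IsCommand} map={field.IsMap} args={field.Arguments.Count}");
}

// Hypothetical result type for this sketch.
sealed class Placeholder
{
    public string Name = "";
    public bool IsCommand;   // true when the name was prefixed with '@'
    public bool IsMap;       // true when at least one argument used "=>"
    // List items are stored with an empty Value; map items carry the text after "=>".
    public List<(string Key, string Value)> Arguments = new();
}

static class PlaceholderParser
{
    public static Placeholder Parse(string body)
    {
        var result = new Placeholder();

        // Split on the FIRST colon only; later colons belong to argument values.
        int colon = body.IndexOf(':');
        string name = (colon < 0 ? body : body.Substring(0, colon)).Trim();
        result.IsCommand = name.StartsWith("@", StringComparison.Ordinal);
        result.Name = name.TrimStart('@');
        if (colon < 0) return result;

        // Then split on pipes, then on "=>" within each item.
        foreach (string item in body.Substring(colon + 1).Split('|'))
        {
            int arrow = item.IndexOf("=>", StringComparison.Ordinal);
            if (arrow < 0)
            {
                result.Arguments.Add((item.Trim(), ""));
            }
            else
            {
                result.IsMap = true;
                result.Arguments.Add((item.Substring(0, arrow).Trim(),
                                      item.Substring(arrow + 2).Trim()));
            }
        }
        return result;
    }
}
```

Keeping the regex this small means that a change to the argument syntax later only touches `Parse`, not the pattern.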

And another thing: right now, you're focused on getting the code to work when the input is correct, but what about when the users get it wrong? Wouldn't you like to give them helpful feedback? Regexes suck at that; they're strictly pass/fail. Regexes can be amazingly useful, but like any other tool, you have to learn their limitations before you can harness their full power.
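
For instance, once the text between the braces has been extracted, ordinary code can say what is wrong with it rather than just failing to match. The following is only a sketch under assumptions: `PlaceholderValidator` and its messages are invented for this example, the rules it checks (a name is required, arguments may not be empty, list and map items may not be mixed) are my reading of the question, and it assumes C# 9 or later.

```csharp
using System;
using System.Collections.Generic;

string[] samples =
{
    "LanguageId:English|Spanish",        // valid list
    "LanguageId:English=>a.xml|Spanish", // mixes map and list items
    ":English",                          // no name
    "LanguageId:English||French",        // empty argument
};

foreach (string s in samples)
{
    List<string> errors = PlaceholderValidator.Validate(s);
    string status = errors.Count == 0 ? "ok" : string.Join("; ", errors);
    Console.WriteLine("{" + s + "} -> " + status);
}

// Hypothetical validator for this sketch; input is the text between the braces.
static class PlaceholderValidator
{
    public static List<string> Validate(string body)
    {
        var errors = new List<string>();

        // Only the first colon separates the name from the arguments.
        int colon = body.IndexOf(':');
        string name = (colon < 0 ? body : body.Substring(0, colon)).Trim().TrimStart('@');
        if (name.Length == 0)
            errors.Add("missing field or command name");
        if (colon < 0) return errors;

        int listItems = 0, mapItems = 0;
        foreach (string raw in body.Substring(colon + 1).Split('|'))
        {
            string item = raw.Trim();
            if (item.Length == 0)
            {
                errors.Add("empty argument");
                continue;
            }
            if (item.Contains("=>")) mapItems++; else listItems++;
        }

        if (listItems > 0 && mapItems > 0)
            errors.Add("arguments mix list items and key=>value pairs; use one form only");
        return errors;
    }
}
```

Returning a list of messages instead of throwing on the first problem lets the caller show the user everything that needs fixing in one pass.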

Alan Moore
Hey Alan, this was exactly the regex I needed. I already had all the code in place to parse the various param permutations and handle errors as you suggested. So this was very helpful. Thanks again
Karl
+1  A: 

This should really be parsed.

For an example, I wanted to parse this using Regexp::Grammars.

Please excuse the length.

#! /opt/perl/bin/perl
use strict;
use warnings;
use 5.10.1;

use Regexp::Grammars;

my $grammar = qr{
  ^<Path>$

  <objtoken: My::Path>
    <drive=([a-zA-Z])>:\\ <[elements=PathElement]> ** (\\) \\?

  <rule: PathElement>
    (?:
      <MATCH=BlockPathElement>
    |
      <MATCH=SimplePathElement>
    )

  <token: SimplePathElement>
    (?<= \\ ) <MATCH=([^\\]+)>

  <rule: My::BlockPathElement>
    (?<=\\){ \s*
    (?|
      <MATCH=Command>
    |
      <MATCH=Variable>
    )
    \s* }

  <objrule: My::Variable>
    <name=(\w++)> <options=VariableOptionList>?

  <rule: VariableOptionList>
      :
      <[MATCH=VariableOptionItem]> ** ([|])

  <token: VariableOptionItem>
    (?:
      <MATCH=VariableOptionMap>
    |
      <MATCH=( [^{}|]+? )>
    )

  <objrule: My::VariableOptionMap>
    \s*
    <name=(\w++)> => <value=([^{}|]+?)>
    \s*

  <objrule: My::Command>
    @ <name=(\w++)>
    (?:
      : <[arg=CommandArg]> ** ([|])
    )?

  <token: CommandArg>
    <MATCH=([^{}|]+?)> \s*

}x;

Testing with:

use YAML;
while( my $line = <> ){
  chomp $line;
  local %/;

  if( $line =~ $grammar ){
    say Dump \%/;
  }else{
    die "Error: $line\n";
  }
}

With sample data:

D:\Path\{LanguageId}\{PersonId}
E:\{ LanguageId : English | Spanish | French }
F:\Some Thing\{ LanguageId : English => D:\path\english.xml | Spanish => D:\path\spanish.xml }
C:\{@command}
c:\{@command :arg}
c:\{ @command : arg1 | arg2 }

Results in:

---
'': 'D:\Path\{LanguageId}\{PersonId}'
Path: !!perl/hash:My::Path
  '': 'D:\Path\{LanguageId}\{PersonId}'
  drive: D
  elements:
    - Path
    - !!perl/hash:My::Variable
      '': LanguageId
      name: LanguageId
    - !!perl/hash:My::Variable
      '': PersonId
      name: PersonId

---
'': 'E:\{ LanguageId : English | Spanish | French }'
Path: !!perl/hash:My::Path
  '': 'E:\{ LanguageId : English | Spanish | French }'
  drive: E
  elements:
    - !!perl/hash:My::Variable
      '': 'LanguageId : English | Spanish | French'
      name: LanguageId
      options:
        - English
        - Spanish
        - French

---
'': 'F:\Some Thing\{ LanguageId : English => D:\path\english.xml | Spanish => D:\path\spanish.xml }'
Path: !!perl/hash:My::Path
  '': 'F:\Some Thing\{ LanguageId : English => D:\path\english.xml | Spanish => D:\path\spanish.xml }'
  drive: F
  elements:
    - Some Thing
    - !!perl/hash:My::Variable
      '': 'LanguageId : English => D:\path\english.xml | Spanish => D:\path\spanish.xml '
      name: LanguageId
      options:
        - !!perl/hash:My::VariableOptionMap
          '': 'English => D:\path\english.xml '
          name: English
          value: D:\path\english.xml
        - !!perl/hash:My::VariableOptionMap
          '': 'Spanish => D:\path\spanish.xml '
          name: Spanish
          value: D:\path\spanish.xml

---
'': 'C:\{@command}'
Path: !!perl/hash:My::Path
  '': 'C:\{@command}'
  drive: C
  elements:
    - !!perl/hash:My::Command
      '': '@command'
      name: command

---
'': 'c:\{@command :arg}'
Path: !!perl/hash:My::Path
  '': 'c:\{@command :arg}'
  drive: c
  elements:
    - !!perl/hash:My::Command
      '': '@command :arg'
      arg:
        - arg
      name: command

---
'': 'c:\{ @command : arg1 | arg2 }'
Path: !!perl/hash:My::Path
  '': 'c:\{ @command : arg1 | arg2 }'
  drive: c
  elements:
    - !!perl/hash:My::Command
      '': '@command : arg1 | arg2 '
      arg:
        - arg1
        - arg2
      name: command

Sample program:

my %ARGS = qw'
  LanguageId  English
  PersonId    someone
';

while( my $line = <> ){
  chomp $line;
  local %/;

  if( $line =~ $grammar ){
    say $/{Path}->fill( %ARGS );
  }else{
    say 'Error: ', $line;
  }
}

{
  package My::Path;

  sub fill{
    my($self,%args) = @_;

    my $out = $self->{drive}.':';

    for my $element ( @{ $self->{elements} } ){
      if( ref $element ){
        $out .= '\\' . $element->fill(%args);
      }else{
        $out .= "\\$element";
      }
    }

    return $out;
  }
}
{
  package My::Variable;

  sub fill{
    my($self,%args) = @_;

    my $name = $self->{name};

    if( exists $args{$name} ){
      $self->_fill( $args{$name} );
    }else{
      my $lc_name = lc $name;

      my @possible = grep {
        lc $_ eq $lc_name
      } keys %args;

      die qq'Cannot find argument for variable "$name"\n' unless @possible;
      if( @possible > 1 ){
        my $die = qq'Cannot determine which argument matches "$name" closer:\n';
        for my $possible( @possible ){
          $die .= qq'  "$possible"\n';
        }
        die $die;
      }

      # Exactly one candidate remains here, so it lives at index 0.
      $self->_fill( $args{$possible[0]} );
    }
  }
  sub _fill{
    my($self,$opt) = @_;

    # This is just an example.
    unless( exists $self->{options} ){
      return $opt;
    }

    for my $element ( @{$self->{options}} ){
      if( ref $element ){
        return '['.$element->value.']' if lc $element->name eq lc $opt;
      }elsif( lc $element eq lc $opt ){
        return $opt;
      }
    }

    my $name = $self->{name};
    my $die = qq'Invalid argument "$opt" for "$name" :\n';
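    for my $valid ( @{$self->{options}} ){
      # Map options are objects; list their key rather than a stringified ref.
      my $label = ref $valid ? $valid->name : $valid;
      $die .= qq'  "$label"\n';
    }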
    die $die;
  }
}
{
  package My::Command;

  sub fill{
    my($self,%args) = @_;

    return '['.$self->{''}.']';
  }
}
{
  package My::VariableOptionMap;

  sub name{
    my($self) = @_;
    return $self->{name};
  }

  sub value{
    my($self) = @_;
    return $self->{value};
  }
}

Output using the example data:

D:\Path\English\someone
E:\English
F:\Some Thing\[D:\path\english.xml]
C:\[@command]
c:\[@command :arg]
c:\[@command : arg1 | arg2 ]
Brad Gilbert
+1 for the great demo, but I don't think Perl 6's Grammars have been ported to .NET yet. :P
Alan Moore
Actually, it's not quite a Perl 6 Grammar. It's just loosely based on it.
Brad Gilbert
Plus, I wanted to write it, so I did.
Brad Gilbert