ansaurus

Question

Answer 1

A:

If you take a look at the ExpressionConverter in my WPF Converters library, it has basic lexing and parsing of C# expressions. No regex involved, from memory.

HTH, Kent

Kent Boogaart 2009-03-23 12:01:16

Answer 2

A:

I have done this several iterations back in my IDE, before using a proper lexer and parser.

Unfortunately I cannot find it in source control (I suspect it was before I moved from CVS to SVN).

I will try to keep an eye out for it. :)

leppie 2009-03-23 12:24:29

Answer 3

+4 A:

Unless you have a very unconventional grammar, I'd strongly recommend against rolling your own lexer/parser.

I usually find lexer/parsers for C# are really lacking. However, F# comes with fslex and fsyacc, which you can learn how to use in this tutorial. I've written several lexer/parsers in F# and used them in C#, and it's very easy to do.

I suppose it's not really a poor man's lexer/parser, seeing that you have to learn an entirely new language to get started, but it's a start.

Juliet 2009-03-23 12:30:27

Answer 4

A:

Changing my original answer.

Take a look at SharpTemplate that has parsers for different syntax types, e.g.

#foreach ($product in $Products)
  <tr><td>$product.Name</td>
  #if ($product.Stock > 0)
    <td>In stock</td>
  #else
    <td>Backordered</td>
  #end
  </tr>
#end

It uses regexes for each type of token:

public class Velocity : SharpTemplateConfig
{
    public Velocity()
    {
        AddToken(TemplateTokenType.ForEach, @"#(foreach|{foreach})\s+\(\s*(?<iterator>[a-z_][a-z0-9_]*)\s+in\s+(?<expr>.*?)\s*\)", true);
        AddToken(TemplateTokenType.EndBlock, @"#(end|{end})", true);
        AddToken(TemplateTokenType.If, @"#(if|{if})\s+\((?<expr>.*?)\s*\)", true);
        AddToken(TemplateTokenType.ElseIf, @"#(elseif|{elseif})\s+\((?<expr>.*?)\s*\)", true);
        AddToken(TemplateTokenType.Else, @"#(else|{else})", true);
        AddToken(TemplateTokenType.Expression, @"\${(?<expr>.*?)}", false);
        AddToken(TemplateTokenType.Expression, @"\$(?<expr>[a-zA-Z_][a-zA-Z0-9_\.@]*?)(?![a-zA-Z0-9_\.@])", false);
    }
}

Which is used like this:

foreach (Match match in regex.Matches(inputString))
{
    ...

    switch (tokenMatch.TokenType)
    {
        case TemplateTokenType.Expression:
            {
                currentNode.Add(new ExpressionNode(tokenMatch));
            }
            break;

        case TemplateTokenType.ForEach:
            {
                nodeStack.Push(currentNode);

                currentNode = currentNode.Add(new ForEachNode(tokenMatch));
            }
            break;
        ....
    }

    ....
}

It pushes and pops from a Stack to keep state.

Chris S 2009-03-23 12:41:23
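
To make the push/pop idea concrete, here is a minimal, self-contained sketch of the same technique. It is not SharpTemplate's code: the Node and TemplateTree types, the "block"/"expr" kinds and the simplified token regex are all made up for illustration, and #else/#elseif are simply ignored.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical node type for illustration; not SharpTemplate's own class.
public class Node
{
    public readonly string Kind;
    public readonly string Text;
    public readonly List<Node> Children = new List<Node>();

    public Node(string kind, string text) { Kind = kind; Text = text; }

    // Adds a child and returns it, so the caller can descend into it.
    public Node Add(Node child) { Children.Add(child); return child; }
}

public static class TemplateTree
{
    // One alternation; the named group that succeeded tells us the token type.
    // Assumption: block headers look like "#if (...)" or "#foreach (...)".
    private static readonly Regex Tokens = new Regex(
        @"(?<open>#(?:if|foreach)\s*\([^)\r\n]*\))" +
        @"|(?<close>#end\b)" +
        @"|(?<expr>\$[A-Za-z_][A-Za-z0-9_.]*)");

    public static Node Build(string input)
    {
        var root = new Node("root", "");
        var stack = new Stack<Node>();
        Node current = root;

        foreach (Match m in Tokens.Matches(input))
        {
            if (m.Groups["open"].Success)
            {
                // Entering a block: remember the parent, then descend.
                stack.Push(current);
                current = current.Add(new Node("block", m.Value));
            }
            else if (m.Groups["close"].Success)
            {
                if (stack.Count == 0)
                    throw new FormatException("#end without a matching block");
                // Leaving a block: go back to the parent.
                current = stack.Pop();
            }
            else
            {
                current.Add(new Node("expr", m.Value));
            }
        }

        if (stack.Count != 0)
            throw new FormatException("Unclosed block: " + current.Text);
        return root;
    }
}
```

Fed the Velocity sample from the answer, TemplateTree.Build should return a root with one #foreach block whose children are the $product.Name expression and the nested #if block. The stack is only needed because blocks nest; a flat token list would not need it.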

Answer 5

+1 A:

It is possible to use Flex and Bison for C#.

A researcher at the University of Ireland has developed a partial implementation that can be found at the following link: Flex/Bison for C#

It could definitely be considered a 'poor man's lexer' as he seems to still have some issues with his implementation, such as no preprocessor, issues with a 'dangling else' case, etc.

espais 2009-03-23 12:57:39

The page has not been updated since 2004, and the lexer itself is derived from the C# 0.28 spec. I don't think this "poor man's lexer" should be used in the real world.

Juliet 2009-03-23 14:09:32

That is a good point, however I figured that since he was trying to do something simple, this quick and dirty (and obviously unfinished) lexer would be an OK starting point.

espais 2009-03-24 15:53:40

Answer 6

+1 A:

Malcolm Crowe has a great LEX/YACC implementation for C# here. Works by creating regular expressions for the LEX...

Direct download

Kieron 2009-03-23 13:07:53

FWIW: Link is now dead.

Andrew Song 2010-05-21 20:44:14

I've updated the link with the one from the article.

Kieron 2010-05-21 21:42:15

Answer 7

+6 A:

The original version I posted here as an answer had a problem in that it only worked while there was more than one "Regex" that matched the current expression. That is, as soon as only one Regex matched, it would return a token - whereas most people want the Regex to be "greedy". This was especially the case for things such as "quoted strings".

The only solution that sits on top of Regex is to read the input line-by-line (which means you cannot have tokens that span multiple lines). I can live with this - it is, after all, a poor man's lexer! Besides, it's usually useful to get line number information out of the Lexer in any case.

So, here's a new version that addresses these issues. Credit also goes to this

using System;
using System.IO;
using System.Text.RegularExpressions;

public interface IMatcher
{
    /// <summary>
    /// Return the number of characters that this "regex" or equivalent
    /// matches.
    /// </summary>
    /// <param name="text">The text to be matched</param>
    /// <returns>The number of characters that matched</returns>
    int Match(string text);
}

class RegexMatcher : IMatcher
{
    private readonly Regex regex;
    public RegexMatcher(string regex)
    {
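        // Wrap the pattern in a non-capturing group so that "^" anchors every
        // alternative of a pattern such as "a|b", not just the first one.
        this.regex = new Regex("^(?:" + regex + ")");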
    }

    public int Match(string text)
    {
        Match m = regex.Match(text);
        if(m.Success)
            return m.Length;
        return 0;
    }

    public override string ToString()
    {
        return regex.ToString();
    }
}

public class TokenDefinition
{
    public readonly IMatcher Matcher;
    public readonly object Token;

    public TokenDefinition(string regex, object token)
    {
        this.Matcher = new RegexMatcher(regex);
        this.Token = token;
    }
}

public class Lexer : IDisposable
{
    private readonly TextReader reader;
    private readonly TokenDefinition[] tokenDefinitions;

    private string lineRemaining;
    private string tokenContents;
    private object currentToken;
    private int lineNumber = 0;
    private int position = 0;

    public Lexer(TextReader reader, TokenDefinition[] tokenDefinitions)
    {
        this.reader = reader;
        this.tokenDefinitions = tokenDefinitions;
        nextLine();
    }

    private void nextLine()
    {
        do
        {
            lineRemaining = reader.ReadLine();
            ++lineNumber;
            position = 0;
        } while(lineRemaining != null && lineRemaining.Length == 0);
    }

    public bool Next()
    {
        if(lineRemaining == null)
            return false;
        foreach(TokenDefinition def in tokenDefinitions)
        {
            int matched = def.Matcher.Match(lineRemaining);
            if(matched > 0)
            {
                position += matched;
                currentToken = def.Token;
                tokenContents = lineRemaining.Substring(0,matched);
                lineRemaining = lineRemaining.Substring(matched);
                if(lineRemaining.Length == 0)
                    nextLine();

                return true;
            }
        }
        throw new Exception(string.Format("Unable to match against any tokens at line {0} position {1} \"{2}\"",
                                          lineNumber, position, lineRemaining));
    }

    public string TokenContents
    {
        get { return tokenContents; }
    }

    public object Token
    {
        get { return currentToken; }
    }

    public int LineNumber
    {
        get { return lineNumber; }
    }

    public void Dispose()
    {
        reader.Dispose();
    }
}

Example program:

string sample = @"( one (two 456 -43.2 "" \"" quoted"" ))";

var defs = new TokenDefinition[]
{
    // Thanks to Steven Levithan for this great quoted string regex
    new TokenDefinition(@"([""'])(?:\\\1|.)*?\1", "QUOTED-STRING"),
    // Thanks to http://www.regular-expressions.info/floatingpoint.html
    new TokenDefinition(@"[-+]?\d*\.\d+([eE][-+]?\d+)?", "FLOAT"),
    new TokenDefinition(@"[-+]?\d+", "INT"),
    new TokenDefinition(@"#t", "TRUE"),
    new TokenDefinition(@"#f", "FALSE"),
    new TokenDefinition(@"[*<>\?\-+/A-Za-z->!]+", "SYMBOL"),
    new TokenDefinition(@"\.", "DOT"),
    new TokenDefinition(@"\(", "LEFT"),
    new TokenDefinition(@"\)", "RIGHT"),
    new TokenDefinition(@"\s", "SPACE")
};

TextReader r = new StringReader(sample);
Lexer l = new Lexer(r, defs);
while (l.Next())
{
    Console.WriteLine("Token: {0} Contents: {1}", l.Token, l.TokenContents);
}

Output:

Token: LEFT Contents: (
Token: SPACE Contents:
Token: SYMBOL Contents: one
Token: SPACE Contents:
Token: LEFT Contents: (
Token: SYMBOL Contents: two
Token: SPACE Contents:
Token: INT Contents: 456
Token: SPACE Contents:
Token: FLOAT Contents: -43.2
Token: SPACE Contents:
Token: QUOTED-STRING Contents: " \" quoted"
Token: SPACE Contents:
Token: RIGHT Contents: )
Token: RIGHT Contents: )

Paul Hollingsworth 2009-03-23 14:46:55
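
The lexer above returns the first definition that matches, so the order of the definitions matters (FLOAT has to be listed before INT). A possible variation, sketched below under some assumptions, picks the longest match instead and works on the whole input rather than line by line. It relies on Regex.Match(input, startat) together with the \G anchor, which pins the match to the start position without calling Substring. The LongestMatchLexer class and its Tokenize method are hypothetical names, the whole input is assumed to fit in memory, and the tuple syntax needs C# 7 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LongestMatchLexer
{
    // Hypothetical helper: returns (token name, matched text) pairs for the whole input.
    public static List<(string Token, string Text)> Tokenize(
        string input, (string Token, string Pattern)[] definitions)
    {
        // \G requires the match to begin exactly at the startat position
        // passed to Match below; the group keeps alternations anchored too.
        var compiled = new List<(string Token, Regex Regex)>();
        foreach (var def in definitions)
            compiled.Add((def.Token, new Regex(@"\G(?:" + def.Pattern + ")")));

        var result = new List<(string Token, string Text)>();
        int position = 0;
        while (position < input.Length)
        {
            string bestToken = null;
            int bestLength = 0;
            foreach (var def in compiled)
            {
                Match m = def.Regex.Match(input, position);
                // Strictly greater: on a tie the earlier definition wins.
                if (m.Success && m.Length > bestLength)
                {
                    bestToken = def.Token;
                    bestLength = m.Length;
                }
            }

            // Empty matches are rejected as well, otherwise the loop would not advance.
            if (bestToken == null)
                throw new FormatException("No token matches at position " + position);

            result.Add((bestToken, input.Substring(position, bestLength)));
            position += bestLength;
        }
        return result;
    }
}
```

With this selection rule, listing INT before FLOAT still tokenizes "-43.2" as a single FLOAT, because the longer match wins. The trade-off is that every definition is tried at every position, and line numbers would have to be tracked separately (for example by counting newlines in each matched token).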

Answer 8

+2 A:

It may be overkill, but have a look at Irony on CodePlex.

Irony is a development kit for implementing languages on .NET platform. It uses the flexibility and power of c# language and .NET Framework 3.5 to implement a completely new and streamlined technology of compiler construction. Unlike most existing yacc/lex-style solutions Irony does not employ any scanner or parser code generation from grammar specifications written in a specialized meta-language. In Irony the target language grammar is coded directly in c# using operator overloading to express grammar constructs. Irony's scanner and parser modules use the grammar encoded as c# class to control the parsing process. See the expression grammar sample for an example of grammar definition in c# class, and using it in a working parser.

Andy Dent 2009-03-23 15:03:53

Ah I see - sounds like a C# version of Boost Spirit for C++. Thanks... although as you can see from my answer, definitely overkill for what I'm looking for.

Paul Hollingsworth 2009-03-23 15:08:32

Interesting project indeed; at least you get the same IDE support as for normal C# (because the grammar becomes just ordinary C# code :) ). I guess it's a bit like how LINQ helps you stop writing real SQL.

IgorK 2010-01-22 14:08:04

Answer 9

A:

Don't. Use ANTLR.

erikkallen 2010-05-21 21:57:50

Poor man's "lexer" for C#
