ansaurus

Question

Parsing Lisp S-Expressions with known schema in C#

Answer 1

A:

Consider using Ragel. It's a state machine compiler and produces reasonably fast code.

It may not be apparent from the home page, but Ragel does have C# support. Here's a trivial example of how to use it in C#.

FrederikB 2010-06-16 08:45:13

Answer 2

A:

Look at gplex and gppg.

Alternatively, you can trivially translate the S-expressions to XML and let .NET do the rest.

leppie 2010-06-16 08:58:26
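The XML route suggested above could be sketched roughly as follows. This is only an illustration, written in Java to match the other example on this page; the class name `SexpToXml` and the element names `list` and `atom` are assumptions, not something the answer specifies.

```java
// Hypothetical helper (not from the answer): rewrites one S-expression as XML
// text so that a standard XML parser or deserialiser can take over.
final class SexpToXml {
    static String convert(String sexp) {
        StringBuilder xml = new StringBuilder();
        int depth = 0;
        int i = 0;
        while (i < sexp.length()) {
            char c = sexp.charAt(i);
            if (c == '(') {                       // a list opens an element
                xml.append("<list>");
                depth++;
                i++;
            } else if (c == ')') {                // and closes it again
                if (depth == 0) {
                    throw new IllegalArgumentException("unbalanced ')' at index " + i);
                }
                xml.append("</list>");
                depth--;
                i++;
            } else if (Character.isWhitespace(c)) {
                i++;                              // whitespace only separates atoms
            } else {
                int start = i;                    // an atom runs up to the next delimiter
                while (i < sexp.length() && !isDelimiter(sexp.charAt(i))) {
                    i++;
                }
                xml.append("<atom>").append(escape(sexp.substring(start, i))).append("</atom>");
            }
        }
        if (depth != 0) {
            throw new IllegalArgumentException("missing ')'");
        }
        return xml.toString();
    }

    private static boolean isDelimiter(char c) {
        return c == '(' || c == ')' || Character.isWhitespace(c);
    }

    // Escapes the characters that are significant in XML text content.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

For example, `SexpToXml.convert("(hear 12.3 self)")` yields `<list><atom>hear</atom><atom>12.3</atom><atom>self</atom></list>`. Input holding several top-level expressions would need a wrapping root element before it is well-formed XML.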

Answer 3

A:

Drew, perhaps you should add some context to the question, otherwise this answer will make no sense to other users, but try this:

CHARACTERS

    letter = 'A'..'Z' + 'a'..'z' .
    digit = "0123456789" .
    messageChar = '\u0020'..'\u007e' - ' ' - '(' - ')'  .

TOKENS

    double = ['-'] digit { digit } [ '.' digit { digit } ] .
    ident = letter { letter | digit | '_' } .
    message = messageChar { messageChar } CONTEXT (")") .

Oh, I have to point out that '\u0020' is the Unicode SPACE, which you then remove again with "- ' '". Also, you can use CONTEXT (')') if you don't need more than one character of lookahead.

FWIW: CONTEXT does not consume the enclosed sequence; you must still consume it in your production.
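For readers who don't know the grammar notation, the three token definitions above correspond roughly to the regular expressions below. This is a sketch in Java using `java.util.regex`; the class and method names are made up for illustration, and the CONTEXT lookahead is not modelled.

```java
import java.util.regex.Pattern;

// Hypothetical equivalents of the token definitions, for illustration only.
final class TokenPatterns {
    // double = ['-'] digit { digit } [ '.' digit { digit } ] .
    static final Pattern DOUBLE = Pattern.compile("-?[0-9]+(\\.[0-9]+)?");

    // ident = letter { letter | digit | '_' } .
    static final Pattern IDENT = Pattern.compile("[A-Za-z][A-Za-z0-9_]*");

    // message = messageChar { messageChar } .
    // messageChar is printable ASCII without the space and without parentheses,
    // so the range starts at 0x21 rather than 0x20.
    static final Pattern MESSAGE = Pattern.compile("[\\x21-\\x7e&&[^()]]+");

    static boolean isDouble(String s)  { return DOUBLE.matcher(s).matches(); }
    static boolean isIdent(String s)   { return IDENT.matcher(s).matches(); }
    static boolean isMessage(String s) { return MESSAGE.matcher(s).matches(); }
}
```

Note that a string such as `self` matches both the ident and the message pattern. That ambiguity is the reason the grammar needs extra help to tell the two apart.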

EDIT:

Ok, this seems to work. Really, I mean it this time :)

CHARACTERS
    letter = 'A'..'Z' + 'a'..'z' .
    digit = "0123456789" .
//    messageChar = '\u0020'..'\u007e' - ' ' - '(' - ')'  .

TOKENS

    double = ['-'] digit { digit } [ '.' digit { digit } ] .
    ident = letter { letter | digit | '_' } .
//    message = letter { messageChar } CONTEXT (')') .

// MessageText<out string m> = message               (. m = t.val; .)
// .

HearExpr<out HeardMessage message> = (. TimeSpan time; Angle direction = Angle.NaN; string messageText; .)
    "(hear" 
        TimeSpan<out time>
        ( "self" | AngleInDegrees<out direction> )
        // MessageText<out messageText>    // REMOVED
        { ANY } (. messageText = t.val; .) // MOD
    ')' (. message = new HeardMessage(time, direction, new Message(messageText)); .)
    .

Andre Artus 2010-06-24 05:10:28

To paraphrase Knuth: "I have only proved it correct, not tested it."

Andre Artus 2010-06-24 05:24:02

As an aside: if you have tokens that cannot be resolved with CONTEXT, you can leave out the right-hand side and handle it in code.

Andre Artus 2010-06-24 05:27:37

Pat Terry has made some mods to Coco/R that include the ability to use friendly "user names" for tokens. This is handy if you want to hand-roll some parts of the scanner.

Andre Artus 2010-06-24 05:29:11

http://www.scifac.ru.ac.za/resourcekit/

Andre Artus 2010-06-24 05:31:19

@Andre, thanks for this detailed answer. I left out the context of the problem because what I'd really like is to operate on a character stream directly, without loading it all into memory in order to parse it. That might be an artificial limitation of Coco/R, however. This answer is actually more appropriate for my other question! I'll try out what you're suggesting. It still doesn't enforce the restriction of character ranges (I presume `ANY` means what it says), but it is an adequate workaround for this case.

Drew Noakes 2010-06-24 15:56:29
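The streaming requirement described in the comment above, working on a character stream without holding all of it in memory, can be sketched without any parser generator. The class below is a hypothetical illustration in Java, not anything from Coco/R. It assumes every top-level message is a parenthesised list and that parentheses never appear inside atoms. It pulls one complete top-level expression at a time from a `Reader` by counting parentheses.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical sketch: splits a character stream into top-level S-expressions.
final class SexpStream {
    private final Reader in;

    SexpStream(Reader in) {
        this.in = in;
    }

    // Returns the text of the next complete top-level list, or null once the
    // stream is exhausted. Only the current expression is buffered.
    String next() {
        try {
            StringBuilder sb = new StringBuilder();
            int depth = 0;
            int ch;
            while ((ch = in.read()) != -1) {
                if (depth == 0) {
                    if (Character.isWhitespace(ch)) {
                        continue;                 // skip whitespace between expressions
                    }
                    if (ch != '(') {
                        throw new IllegalStateException("expected '(' but got '" + (char) ch + "'");
                    }
                }
                sb.append((char) ch);
                if (ch == '(') {
                    depth++;
                } else if (ch == ')' && --depth == 0) {
                    return sb.toString();         // matching ')' closes the expression
                }
            }
            if (depth != 0) {
                throw new IllegalStateException("stream ended inside an expression");
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Each returned string can then be handed to whatever parser is in use, so memory use is bounded by the largest single message rather than by the whole stream.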

What I was hoping for in this question was a simpler approach to parsing SExpressions than using a grammar file. Given that SExpressions are so regular in structure, and in my case I have a schema defined on top of that too, I hoped there might be a nice solution out there in the wild.

Drew Noakes 2010-06-24 15:57:47

Ok, cool. I'll move this to the other question, and then change this answer to the one I originally came up with, which processes the message stream as it comes in.

Andre Artus 2010-06-24 16:36:54

I moved it to the linked question, with some mods.

Andre Artus 2010-06-24 16:53:15

I added an example to the original question.

Drew Noakes 2010-06-24 18:02:13

Answer 4

A:

In my opinion a parser generator is unnecessary for parsing simple S-expressions consisting only of lists, numbers and symbols. A hand-written recursive descent parser is probably simpler and at least as fast. The general pattern would look like this (in Java; C# should be very similar):

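import java.io.IOException;
import java.io.PushbackReader;
import java.util.ArrayList;
import java.util.List;

// The helpers isNumber, readNumber, isSymbolStart, isSymbol, isWhiteSpace,
// skipWhiteSpace and error are not shown; error(ch) is assumed to throw.

Object readDatum(PushbackReader in) throws IOException {
    int ch = in.read();
    return readDatum(in, ch);
}
// Dispatches on the first character of the datum, which has already been read.
Object readDatum(PushbackReader in, int ch) throws IOException {
    if (ch == '(') {
        return readList(in, ch);
    } else if (isNumber(ch)) {
        return readNumber(in, ch);
    } else if (isSymbolStart(ch)) {
        return readSymbol(in, ch);
    } else {
        error(ch);
        return null; // not reached if error() throws
    }
}
// Reads list elements until the matching ')'.
List<Object> readList(PushbackReader in, int lookAhead) throws IOException {
    if (lookAhead != '(') {
        error(lookAhead);
    }
    List<Object> result = new ArrayList<>();
    while (true) {
        int ch = in.read();
        if (ch == ')') {
            break;
        } else if (isWhiteSpace(ch)) {
            skipWhiteSpace(in);
        } else {
            result.add(readDatum(in, ch));
        }
    }
    return result;
}
// Reads a symbol whose first character, ch, has already been consumed.
String readSymbol(PushbackReader in, int ch) throws IOException {
    StringBuilder result = new StringBuilder();
    result.append((char) ch);
    while (true) {
        int ch2 = in.read();
        if (isSymbol(ch2)) {
            result.append((char) ch2);
        } else if (isWhiteSpace(ch2) || ch2 == ')') {
            in.unread(ch2); // leave the delimiter for the caller
            break;
        } else if (ch2 == -1) {
            break; // end of input also ends the symbol
        } else {
            error(ch2);
        }
    }
    return result.toString();
}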

Jörn Horstmann 2010-06-24 13:21:05

Yes, I agree that this is simple and fast, but at the end of it I have a tree of strings and numbers. What I really want is a 1-D list of my own object types. I have a schema of possible S-expressions I'll see, and they should map to object types for deserialisation. I was hoping for a technique whereby I specify this mapping somehow, then pump in a character stream and suck out appropriate corresponding objects of different types.

Drew Noakes 2010-06-24 15:59:59
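One way the mapping asked for in the comment above might look is a table from the head symbol of each top-level list to a factory that builds the corresponding object. The sketch below is a guess at that shape, in Java like the answer's code. `SchemaReader`, `Hear` and the field layout are invented for illustration and are not taken from the question's actual schema.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: maps top-level S-expressions onto typed objects.
final class SchemaReader {
    // Illustrative message type for expressions such as "(hear 12.5 self hello)".
    static final class Hear {
        final double time;
        final String source;
        final String text;

        Hear(double time, String source, String text) {
            this.time = time;
            this.source = source;
            this.text = text;
        }
    }

    // The schema: head symbol of a top-level list -> factory for the object.
    private final Map<String, Function<List<Object>, Object>> schema = new HashMap<>();

    void register(String head, Function<List<Object>, Object> factory) {
        schema.put(head, factory);
    }

    // Reads one top-level expression from the stream and maps it through the
    // schema. Returns null once the stream is exhausted.
    Object readNext(Reader in) {
        try {
            int ch;
            while ((ch = in.read()) != -1) {
                if (Character.isWhitespace(ch)) {
                    continue;
                }
                if (ch != '(') {
                    throw new IllegalStateException("expected '(' but got '" + (char) ch + "'");
                }
                List<Object> list = readList(in);
                if (list.isEmpty() || !(list.get(0) instanceof String)) {
                    throw new IllegalStateException("expression has no head symbol");
                }
                String head = (String) list.get(0);
                Function<List<Object>, Object> factory = schema.get(head);
                if (factory == null) {
                    throw new IllegalStateException("no schema entry for '" + head + "'");
                }
                return factory.apply(list.subList(1, list.size()));
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the items of a list whose '(' has already been consumed, up to and
    // including the matching ')'. Atoms stay strings; nested lists become Lists.
    private static List<Object> readList(Reader in) throws IOException {
        List<Object> items = new ArrayList<>();
        StringBuilder atom = new StringBuilder();
        int ch;
        while ((ch = in.read()) != -1) {
            boolean delimiter = ch == '(' || ch == ')' || Character.isWhitespace(ch);
            if (!delimiter) {
                atom.append((char) ch);
                continue;
            }
            if (atom.length() > 0) {              // a delimiter ends the pending atom
                items.add(atom.toString());
                atom.setLength(0);
            }
            if (ch == '(') {
                items.add(readList(in));
            } else if (ch == ')') {
                return items;
            }
        }
        throw new IllegalStateException("stream ended inside a list");
    }
}
```

A caller would register one factory per message type, for instance `reader.register("hear", f -> new SchemaReader.Hear(Double.parseDouble((String) f.get(0)), (String) f.get(1), (String) f.get(2)))`, and then call `readNext` in a loop until it returns null. Type conversion and validation live in the factories, so the reader itself stays schema-agnostic.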

I added an example to the original question.

Drew Noakes 2010-06-24 16:08:20
