views:

128

answers:

4

I'm working with a service that provides data as a Lisp-like S-Expression string. This data is arriving thick and fast, and I want to churn through it as quickly as possible, ideally directly on the byte stream (it's only single-byte characters) without any backtracking. These strings can be quite lengthy and I don't want the GC churn of allocating a string for the whole message.

My current implementation uses CoCo/R with a grammar, but it has a few problems. Due to the backtracking, it assigns the whole stream to a string. It's also a bit fiddly for users of my code to change if they have to. I'd rather have a pure C# solution. CoCo/R also does not allow for the reuse of parser/scanner objects, so I have to recreate them for each message.

Conceptually the data stream can be thought of as a sequence of S-Expressions:

(item 1 apple)(item 2 banana)(item 3 chainsaw)

Parsing this sequence would create three objects. The type of each object can be determined by the first value in the list, in the above case "item". The schema/grammar of the incoming stream is well known.

Before I start coding I'd like to know if there are libraries out there that do this already. I'm sure I'm not the first person to have this problem.


EDIT

Here's a little more detail on what I want as I think the original question may have been a little vague.

Given some SExpressions, such as:

(Hear 12.3 HelloWorld)
(HJ LAJ1 -0.42)
(FRP lf (pos 2.3 1.7 0.4))

I want a list of objects equivalent to this:

{
    new HearPerceptorState(12.3, "HelloWorld"),
    new HingeJointState("LAJ1", -0.42),
    new ForceResistancePerceptorState("lf", new Polar(2.3, 1.7, 0.4))
}

The actual data set I'm working on is a list of perceptors from a robot model in the RoboCup 3D simulated soccer league. I may potentially also need to deserialise another set of related data with a more complex structure.

A: 

Consider using Ragel. It's a state machine compiler and produces reasonably fast code.

It may not be apparent from the home page, but Ragel does have C# support. Here's a trivial example of how to use it in C#

FrederikB
A: 

Look at gplex and gppg.

Alternatively, you can trivially translate the S-expressions to XML and let .NET do the rest.

leppie
A: 

Drew, perhaps you should add some context to the question, otherwise this answer will make no sense to other users, but try this:

CHARACTERS

    letter = 'A'..'Z' + 'a'..'z' .
    digit = "0123456789" .
    messageChar = '\u0020'..'\u007e' - ' ' - '(' - ')'  .

TOKENS

    double = ['-'] digit { digit } [ '.' digit { digit } ] .
    ident = letter { letter | digit | '_' } .
    message = messageChar { messageChar } CONTEXT (")") .

Oh, I have to point out that '\u0020' is the unicode SPACE, which you are subsequently removing with "- ' '". Oh, and you can use CONTEXT (')') if you don't need more than one character lookahead.

FWIW: CONTEXT does not consume the enclosed sequence, you must still consume it in your production.

EDIT:

Ok, this seems to work. Really, I mean it this time :)

CHARACTERS
    letter = 'A'..'Z' + 'a'..'z' .
    digit = "0123456789" .
//    messageChar = '\u0020'..'\u007e' - ' ' - '(' - ')'  .

TOKENS

    double = ['-'] digit { digit } [ '.' digit { digit } ] .
    ident = letter { letter | digit | '_' } .
//    message = letter { messageChar } CONTEXT (')') .

// MessageText<out string m> = message               (. m = t.val; .)
// .

HearExpr<out HeardMessage message> = (. TimeSpan time; Angle direction = Angle.NaN; string messageText; .)
    "(hear" 
        TimeSpan<out time>
        ( "self" | AngleInDegrees<out direction> )
// MessageText<out messageText>    // REMOVED    
{ ANY } (. messageText = t.val; .) // MOD
    ')' (. message = new HeardMessage(time, direction, new Message(messageText)); .)
    .
Andre Artus
To paraphrase Knuth: "I have only proved it correct, not tested it."
Andre Artus
As an aside: if you have tokens that cannot be resolved with CONTEXT, then you can leave out the right hand side and handle it in code.
Andre Artus
Pat Terry has made some mods to CoCo/R that includes the ability to use friendly "user names" for tokens. This is handy if you want to hand roll some parts of the scanner.
Andre Artus
http://www.scifac.ru.ac.za/resourcekit/
Andre Artus
@Andre, thanks for this detailed answer. I left out the context of the problem as really I'd like to operate on a character stream directly that doesn't require loading it all into memory in order to parse it. That might be an artificial limitation of CoCo/R however. This answer is actually more appropriate for my other question! I'll try out what you're suggesting. It still doesn't enforce the restriction of character ranges (I presume `ANY` means what it says) , but it is an adequate workaround for this case.
Drew Noakes
What I was hoping for in this question was a simpler approach to parsing SExpressions than using a grammar file. Given that SExpressions are so regular in structure, and in my case I have a schema defined on top of that too, I hoped there might be a nice solution out there in the wild.
Drew Noakes
Ok, cool. I'll move this to the other question, and then change this question to the answer I originally came up with --which processes the message stream as it comes in.
Andre Artus
I moved it to the linked question, with some mods.
Andre Artus
I added an example to the original question.
Drew Noakes
A: 

In my opinion a parse generator is unneccessary to parse simple S-expressions consisting only of lists, numbers and symbols. A hand-written recursive descent parser is probably simpler and at least as fast. The general pattern would look like this (in java, c# should be very similar):

Object readDatum(PushbackReader in) {
    int ch = in.read();
    return readDatum(in, ch);
}
Object readDatum(PushbackReader in, int ch) {
    if (ch == '(')) {
        return readList(in, ch);
    } else if (isNumber(ch)) {
        return readNumber(in, ch);
    } else if (isSymbolStart(ch)) {
        return readSymbol(in, ch);
    } else {
        error(ch);
    }
}
List readList(PushbackReader in, int lookAhead) {
    if (ch != '(') {
        error(ch);
    }
    List result = new List();
    while (true) {
        int ch = in.read();
        if (ch == ')') {
            break;
        } else if (isWhiteSpace(ch)) {
            skipWhiteSpace(in);
        } else {
            result.append(readDatum(in, ch);
        }
    }
    return result;
}
String readSymbol(PushbackReader in, int ch) {
    StringBuilder result = new StringBuilder();
    result.append((char)ch);
    while (true) {
       int ch2 = in.read();
       if (isSymbol(ch2)) {
           result.append((char)ch2);
       } else if (isWhiteSpace(ch2) || ch2 == ')') {
           in.unread(ch2);
           break;
       } else if (ch2 == -1) {
           break;
       } else {
           error(ch2);
       }
    }
    return result.toString();
}
Jörn Horstmann
Yes I agree that this is simple and fast, but at the end of it I have a tree of strings and numbers. What I really want is a 1-D list of my own object types. I have a schema of possible SExpressions I'll see, and they should map to object types for deserialisation. I was hoping for a technique whereby I specify this mapping somehow, then pump in a character stream and suck out appropriate corresponding objects of different types.
Drew Noakes
I added an example to the original question.
Drew Noakes