I'm using an LL(k) EBNF grammar to parse a character stream. I need three different types of tokens:

CHARACTERS

  letter = 'A'..'Z' + 'a'..'z' .
  digit = "0123456789" .
  messageChar = '\u0020'..'\u007e' - ' ' - '(' - ')' .

TOKENS

  num = ['-'] digit { digit } [ '.' digit { digit } ] .
  ident = letter { letter | digit | '_' } .
  message = messageChar { messageChar } .

The first two token declarations are fine, because they don't share any common characters.

However the third, message, is invalid because it's possible that some strings could be both num and message (such as "123"), and other strings could be both an ident and a message (such as "Hello"). Hence, the tokenizer can't differentiate correctly.
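To make the overlap concrete, here is a minimal Python sketch that translates the three token definitions into regular expressions and lists every definition that matches a whole string. The regex translations and the helper name `matching_tokens` are illustrative assumptions, not anything Coco/R generates.

```python
import re

# Approximate regex translations of the three token definitions above.
TOKEN_PATTERNS = {
    "num": r"-?[0-9]+(\.[0-9]+)?",
    "ident": r"[A-Za-z][A-Za-z0-9_]*",
    # messageChar: printable ASCII 0x20..0x7e minus space, '(' and ')'
    "message": r"[\x21-\x27\x2a-\x7e]+",
}


def matching_tokens(text):
    """Return the name of every token definition that matches the whole text."""
    return [
        name
        for name, pattern in TOKEN_PATTERNS.items()
        if re.fullmatch(pattern, text)
    ]


print(matching_tokens("123"))    # ['num', 'message']
print(matching_tokens("Hello"))  # ['ident', 'message']
```

Any string that is a valid num or ident is also a valid message, which is why a scanner that works without context cannot choose between them.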

Another example is differentiating between integers and real numbers. Unless I require all real numbers to have at least one decimal place (meaning 1 would need to be encoded as 1.0, which isn't an option for me), I can't get the grammar to distinguish between these two numeric types. I've had to treat all values as reals and do the checking after the fact. That's fine, but sub-optimal. My real problem is with the message token. I can't find a workaround for that.

So the question is, can I do this with an LL(k) EBNF grammar? I'm using CoCo/R to generate the parser and scanner.

If I can't do it with LL(k) EBNF, then what other options might I look into?

EDIT This is the output I get from CoCo/R:

Coco/R (Apr 23, 2010)
Tokens double and message cannot be distinguished
Tokens ident and message cannot be distinguished
...
9 errors detected
+1  A: 

Despite the title, this all seems to relate to the scanner, not the parser. I haven't used CoCo/R, so I can't comment on it directly, but in a typical (e.g., lex/Flex) scanner, the longest match wins, and when two rules match the same text the one listed first is chosen. Most scanners I've written include a '.' (i.e., match any character) as their last pattern, to display an error message if there's some input that doesn't match any other rule.
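As a rough illustration of that kind of rule-ordered scanner, here is a Python sketch (not lex syntax) that takes the longest match and lets the earlier rule win a tie, which is how lex and Flex resolve overlapping patterns. The rule names mirror the question's tokens; `RULES` and `next_token` are hypothetical names made up for this example.

```python
import re

# Rules in priority order. The patterns approximate the question's tokens.
RULES = [
    ("num", re.compile(r"-?[0-9]+(?:\.[0-9]+)?")),
    ("ident", re.compile(r"[A-Za-z][A-Za-z0-9_]*")),
    ("message", re.compile(r"[\x21-\x27\x2a-\x7e]+")),
    ("error", re.compile(r".")),  # catch-all, kept as the last rule
]


def next_token(text, pos):
    """Return (rule name, lexeme) for the token starting at pos."""
    best_name, best_end = None, pos
    for name, pattern in RULES:
        match = pattern.match(text, pos)
        # Only a strictly longer match replaces the current best,
        # so an earlier rule wins when two rules match the same length.
        if match and match.end() > best_end:
            best_name, best_end = name, match.end()
    return best_name, text[pos:best_end]
```

With these rules, `next_token("123", 0)` yields `("num", "123")` because num is listed before message, while `next_token("12ab", 0)` yields `("message", "12ab")` because message is the longer match.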

Jerry Coffin
In CoCo/R you specify the tokens and grammar all in one file. CoCo/R seems to be checking for this ambiguity. I've tried reordering my declarations but haven't seen any difference. I'll try a few more times.
Drew Noakes
+1  A: 

You may want to look into a PEG generator which has context sensitive tokenization.

http://en.wikipedia.org/wiki/Parsing_expression_grammar

I cannot think of a way you will get around this using COCO/R or similar, as each token needs to be unambiguous.

If messages were surrounded by quotes, or disambiguated in some other way, you would not have a problem. I really think PEG may be your answer, as it also has ordered choice (first match).
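A small Python sketch of the idea, assuming regex approximations of the three tokens: the caller says which kinds are acceptable at the current position, and the first alternative that matches is taken. `PATTERNS` and `ordered_choice` are hypothetical names, not part of any PEG library.

```python
import re

PATTERNS = {
    "num": re.compile(r"-?[0-9]+(?:\.[0-9]+)?"),
    "ident": re.compile(r"[A-Za-z][A-Za-z0-9_]*"),
    "message": re.compile(r"[\x21-\x27\x2a-\x7e]+"),
}


def ordered_choice(text, pos, kinds):
    """Try each kind in order and return (kind, lexeme, end) for the first match."""
    for kind in kinds:
        match = PATTERNS[kind].match(text, pos)
        if match:
            return kind, match.group(), match.end()
    return None  # no alternative matched at this position


# The same characters become different tokens depending on what the rule expects.
print(ordered_choice("123)", 0, ["num", "ident"]))  # ('num', '123', 3)
print(ordered_choice("123)", 0, ["message"]))       # ('message', '123', 3)
```

Because the rule that is currently being parsed decides which patterns are tried, the overlap between message and the other two tokens no longer has to be resolved globally.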

Also take a look at:

http://tinlizzie.org/ometa/

Andre Artus
Awesome. This sounds exactly like what I need. I managed to put this off until now, so your answer is timed perfectly. I was considering merging all tokens into a generic 'symbol' definition, but this sounds much better. Will let you know how I get on. Can you comment upon any potential performance impact?
Drew Noakes
You may find it slightly slower, depending on the parser generator. It should be really easy to craft something by hand if speed is a concern. If you can tell me which language/platform you intend to build against (e.g. Java/JVM, C#/.NET, C++) then I may be able to make some recommendations.
Andre Artus
@Drew: If you can put up a sanitized example of the input you want to process, then that may help too. When I design a DSL I tend to write a few samples first and work back from there (the samples also come to serve as input for some unit tests).
Andre Artus
@Andre: Actually I'm parsing someone else's format. It's a series of SExpressions. Each SExpression should be turned into a different object type. I asked a different question (http://stackoverflow.com/questions/3051254/) about parsing SExpressions explicitly, as maybe a full-blown grammar isn't necessary for such a simply structured data format. You can see examples of the data here: http://simspark.sourceforge.net/wiki/index.php/Perceptors There are several repeating patterns. For example, `(pol <d> <phi> <theta>)` should map to my `PolarCoordinate` type.
Drew Noakes
I have a character stream of SExpressions: `(...)(...)(...)...`. Ideally I'd like to process the stream directly, and spit out one object for each expression in the series.
Drew Noakes
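For a series like `(...)(...)(...)`, a hand-written reader is short. The following is a minimal Python sketch, assuming atoms are separated only by whitespace and parentheses. `read_sexprs`, `to_polar` and the `PolarCoordinate` field names are hypothetical stand-ins for illustration, not the project's real types.

```python
from collections import namedtuple

# Hypothetical stand-in for the PolarCoordinate type mentioned above.
PolarCoordinate = namedtuple("PolarCoordinate", "distance phi theta")


def read_sexprs(text):
    """Yield each top-level S-expression as nested lists of atom strings."""
    stack = []  # lists that are currently open
    atom = ""   # atom being accumulated
    for ch in text:
        if ch in "() \t\r\n":
            if atom:  # a delimiter ends the current atom
                if not stack:
                    raise ValueError("atom outside of parentheses")
                stack[-1].append(atom)
                atom = ""
            if ch == "(":
                stack.append([])
            elif ch == ")":
                if not stack:
                    raise ValueError("unbalanced ')'")
                done = stack.pop()
                if stack:
                    stack[-1].append(done)  # nested expression
                else:
                    yield done              # complete top-level expression
        else:
            atom += ch
    if stack or atom:
        raise ValueError("incomplete input")


def to_polar(expr):
    """Map a parsed (pol <d> <phi> <theta>) expression to a PolarCoordinate."""
    head, distance, phi, theta = expr
    if head != "pol":
        raise ValueError("not a pol expression")
    return PolarCoordinate(float(distance), float(phi), float(theta))
```

Each yielded list can then be dispatched on its first atom, for example `to_polar(["pol", "1.5", "20", "-3"])`.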
@Drew: What do you want to do when there is incorrect data on the stream? That is, do you need some kind of error recovery, or do you bail out?
Andre Artus
It will be a bit difficult to describe a possible solution in the comments, so I might have to either amend my original answer, or create a new one. The format looks very simple (if it's exactly like the "Perceptors" one). I would not even bother with a parser generator; it's going to take longer to sort that out than to code the solution by hand.
Andre Artus
@Drew, what language are you coding in? If it's something I know I may be able to give you code you can use.
Andre Artus
@Drew: is it possible to start reading the stream in the middle of an incomplete message e.g. "torso) (rt 0.01 0.07 0.46))". That is, is the stream character, or message, based?
Andre Artus
Hi @Andre, the stream is message based. I read a four-byte length value, then read that many one-byte ASCII characters, then the same again. So it's not possible to start reading from the middle of a stream (good question though), and if something goes wrong in the parsing then recovery would be to fast-forward the appropriate number of bytes and start the next message. Recovery might just move forward to the next top-level SExpression. The data comes over TCP from a server that doesn't have too many surprises in store so I'm not overly concerned about errors in the stream.
Drew Noakes
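The framing described here (a four-byte length, then that many ASCII bytes, repeated) can be sketched in Python as follows. The byte order of the length field is not stated in this thread, so big-endian is an assumption, and `read_exact` and `read_messages` are hypothetical helper names.

```python
import io
import struct


def read_exact(stream, count):
    """Read exactly count bytes, tolerating short reads from the stream."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError("stream ended in the middle of a message")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_messages(stream):
    """Yield each length-prefixed message from a binary stream as text."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of stream between messages
        if len(header) < 4:
            header += read_exact(stream, 4 - len(header))
        # ASSUMPTION: the length is an unsigned 32-bit big-endian integer.
        (length,) = struct.unpack(">I", header)
        yield read_exact(stream, length).decode("ascii")


data = struct.pack(">I", 5) + b"(a b)" + struct.pack(">I", 3) + b"(c)"
print(list(read_messages(io.BytesIO(data))))  # ['(a b)', '(c)']
```

Because each message arrives with its length, recovering from a parse error only requires dropping the current payload and moving on to the next one.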
Actually we're kind of diverging away from the question above into the other question I posted. If you think you need a second answer, then you might post there (http://stackoverflow.com/questions/3051254/). I'm developing this in C# and already have a parser generator that works, except for the HearPerceptor (http://simspark.sourceforge.net/wiki/index.php/Perceptors#Hear_Perceptor) which has a 20-byte payload of characters ranging [0x20; 0x7E]. I can't make a rule at the token level that covers that range without overlapping with `ident` and `num`.
Drew Noakes
The project I'm working on is open source. You can see the parser grammar here: http://code.google.com/p/tin-man/source/browse/trunk/TinMan/PerceptorParsing/perceptors.atg One of the reasons I am looking at using a grammar is because there's another SExp format I might need to parse later which is a bit more involved: http://simspark.sourceforge.net/wiki/index.php/Network_Protocol#Server.2FMonitor_Communication BTW you seem to be a parsing expert, and I really appreciate you taking the time to help me with this.
Drew Noakes
@Drew: It is good to know what your inputs are. I had devised a whole scheme to handle partial data, and now you don't need it :D. I agree that this is starting to diverge away from the question above, and I will be happy to post the answer in another question. Let me take a look at what you have, as there may be ways to sort it out now that I know what we are dealing with. As to being an expert, I'm actually more of an enthusiast. If you feel that this has run its course then perhaps it's time to close the question. I will post in either the linked Q or a new more specific one.
Andre Artus
@Drew: You seem to be working on some cool stuff. Is this an AI project?
Andre Artus
I see you have quotes around your "MessageText", which is not part of the original spec. Is this to get around some issues?
Andre Artus
Is there a reason why you are not wrapping with BufferedStream?
Andre Artus
The sun is coming up, so it is time for me to go to bed. I will continue looking at this tonight.
Andre Artus
@Drew: I posted an answer on the linked question. But come to think of it, it may apply here too.
Andre Artus
A: 

Try this:

CHARACTERS

    letter = 'A'..'Z' + 'a'..'z' .
    digit = "0123456789" .
    messageChar = '\u0020'..'\u007e' - ' ' - '(' - ')'  .

TOKENS

    double = ['-'] digit { digit } [ '.' digit { digit } ] .
    ident = letter { letter | digit | '_' } .
    message = messageChar { messageChar } CONTEXT (")") .

Note that '\u0020' is the Unicode SPACE, which you then remove again with "- ' '", so the range could simply start at '\u0021'. You can also use CONTEXT (')') if you don't need more than one character of lookahead. However, this does not work in your case, seeing as all the tokens above can appear before a ')'.

FWIW: CONTEXT does not consume the enclosed sequence, you must still consume it in your production.

EDIT:

Ok, this seems to work. Really, I mean it this time :)

CHARACTERS
    letter = 'A'..'Z' + 'a'..'z' .
    digit = "0123456789" .
//    messageChar = '\u0020'..'\u007e' - ' ' - '(' - ')'  .

TOKENS

    double = ['-'] digit { digit } [ '.' digit { digit } ] .
    ident = letter { letter | digit | '_' } .
//    message = letter { messageChar } CONTEXT (')') .

// MessageText<out string m> = message               (. m = t.val; .)
// .

HearExpr<out HeardMessage message> =
    (.
        TimeSpan time; 
        Angle direction = Angle.NaN; 
        string messageText = ""; 
    .)
    "(hear" 
    TimeSpan<out time>
        ( "self" | AngleInDegrees<out direction> )
//         MessageText<out messageText>
    {
        ANY (. messageText += t.val; .)
    }
    ')'
    (. 
        message = new HeardMessage(time, direction, new Message(messageText)); 
    .)
    .

ANY will read characters until it hits ')' or whitespace. I put it in a loop that concatenates each value, but you may not want to do that. You probably do want it in a loop, though, so that on input like "over here" it does not stop after "over" but goes on to read "here" as well. You can do a simple length check on messageText, or other validity checks such as adding t.val to a List and checking the count. Anything really. You can also test it with a RegEx to make sure it complies with whatever pattern you need to check against.
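The length and RegEx checks could look something like this Python sketch (the same check ports directly to C# with System.Text.RegularExpressions). The pattern is the original messageChar set, and the default limit of 20 comes from the 20-byte Hear payload mentioned in the comments. `is_valid_message` is a hypothetical helper name.

```python
import re

# messageChar: printable ASCII 0x20..0x7e minus space, '(' and ')'
MESSAGE_PATTERN = re.compile(r"[\x21-\x27\x2a-\x7e]+")


def is_valid_message(text, max_length=20):
    """Check the collected message text against the length limit and pattern."""
    if len(text) > max_length:
        return False
    return MESSAGE_PATTERN.fullmatch(text) is not None


print(is_valid_message("helloworld"))  # True
print(is_valid_message("over here"))   # False, contains a space
```

Running the check in the semantic action after the loop keeps the grammar itself free of the overlapping message token.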

Andre Artus