ansaurus

Question

How to deal with overlapping character groups in different tokens in an EBNF grammar?

Answer 1

+1 A:

Despite the title, this all seems to relate to the scanner, not the parser. I haven't used CoCo/R, so I can't comment on it directly, but in a typical (e.g., lex/Flex) scanner, the longest match wins and ties are broken by rule order, so when two patterns match the same text, the one listed first is chosen. Most scanners I've written include a '.' (i.e., match any character) as their last pattern, to display an error message if there's some input that doesn't match any other rule.
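
As a rough illustration of that rule-ordering behaviour, here is a minimal Python sketch of a lex-style scanner (lex and Flex prefer the longest match and use rule order only to break ties). The rule names and patterns are made up for the example and are not taken from the CoCo/R grammar in question:

```python
import re

# Hypothetical token rules, listed in priority order. The names and patterns
# are illustrative only.
RULES = [
    ("NUM",   re.compile(r"-?\d+(?:\.\d+)?")),
    ("IDENT", re.compile(r"[A-Za-z][A-Za-z0-9_]*")),
    ("WORD",  re.compile(r"[A-Za-z]+")),        # overlaps IDENT on purpose
    ("SPACE", re.compile(r"\s+")),
    ("ERROR", re.compile(r".", re.DOTALL)),     # catch-all, kept last
]

def scan(text):
    """Yield (rule_name, lexeme) pairs using lex-style rule selection."""
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # The strictly-greater test keeps the earliest rule on a tie.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        # The catch-all guarantees best_len >= 1, so the loop always advances.
        if best_name != "SPACE":
            yield best_name, text[pos:pos + best_len]
        pos += best_len
```

Here "abc" is matched by both IDENT and WORD with the same length, so IDENT wins because it is listed first, and any character that no other rule accepts falls through to ERROR.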

Jerry Coffin 2010-06-15 14:44:08

In CoCo/R you specify the tokens and grammar all in one file. CoCo/R seems to be checking for this ambiguity. I've tried reordering my declarations but haven't seen any difference. I'll try a few more times.

Drew Noakes 2010-06-15 14:59:34

Answer 2

+1 A:

You may want to look into a PEG generator, which gives you context-sensitive tokenization.

http://en.wikipedia.org/wiki/Parsing_expression_grammar

I cannot think of a way you will get around this using COCO/R or similar, as each token needs to be unambiguous.

If messages were surrounded by quotes, or some other way of disambiguating then you would not have a problem. I really think PEG may be your answer, as it also has ordered choice (first match).

Also take a look at:

http://tinlizzie.org/ometa/
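
To make the ordered-choice and context-sensitive tokenization idea concrete, here is a small hand-written Python sketch in the PEG style. It is not generated by any PEG tool, and the helper names, the patterns and the sample inputs are assumptions made for the example. The point is that the message rule is only tried at the position where a message is expected, so it never competes with the number rule:

```python
import re

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")
SELF = re.compile(r"self(?= )")
# Printable ASCII 0x20..0x7E minus space, '(' and ')'.
MESSAGE = re.compile(r"[\x21-\x27\x2a-\x7e]+")

def parse_hear(text):
    """Parse '(hear <time> self|<angle> <message>)' into a tuple."""
    pos = 0

    def skip_spaces():
        nonlocal pos
        while pos < len(text) and text[pos] == " ":
            pos += 1

    def expect(literal):
        nonlocal pos
        skip_spaces()
        if not text.startswith(literal, pos):
            raise ValueError(f"expected {literal!r} at offset {pos}")
        pos += len(literal)

    def match(pattern, what):
        nonlocal pos
        skip_spaces()
        m = pattern.match(text, pos)
        if m is None:
            raise ValueError(f"expected {what} at offset {pos}")
        pos = m.end()
        return m.group()

    expect("(hear")
    time = float(match(NUMBER, "time"))
    # Ordered choice: try the literal "self" first, then fall back to an angle.
    skip_spaces()
    m = SELF.match(text, pos)
    if m:
        pos = m.end()
        direction = None
    else:
        direction = float(match(NUMBER, "angle"))
    # Only the message rule is tried here, so text made of digits or letters
    # is not claimed by the number rule.
    message = match(MESSAGE, "message")
    expect(")")
    return time, direction, message
```

With this structure, `parse_hear("(hear 12.3 self 42)")` treats `42` as message text rather than a number, purely because of where it appears.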

Andre Artus 2010-06-21 06:22:42

Awesome. This sounds exactly like what I need. I managed to put this off until now, so your answer is timed perfectly. I was considering merging all tokens into a generic 'symbol' definition, but this sounds much better. Will let you know how I get on. Can you comment upon any potential performance impact?

Drew Noakes 2010-06-21 10:10:45

You may find it slightly slower, depending on the parser generator. It should be really easy to craft something by hand if speed is a concern. If you can tell me which language/platform you intend to build against (e.g. Java/JVM, C#/.NET, C++) then I may be able to make some recommendations.

Andre Artus 2010-06-21 20:38:14

@Drew: If you can put up a sanitized example of the input you want to process, then that may help too. When I design a DSL I tend to write a few samples first and work back from there (the samples also come to serve as input for some unit tests).

Andre Artus 2010-06-21 20:52:23

@Andre: Actually I'm parsing someone else's format. It's a series of SExpressions. Each SExpression should be turned into a different object type. I asked a different question (http://stackoverflow.com/questions/3051254/) about parsing SExpressions explicitly, as maybe a full-blown grammar isn't necessary for such a simply structured data format. You can see examples of the data here: http://simspark.sourceforge.net/wiki/index.php/Perceptors . There are several repeating patterns. For example, `(pol <d> <phi> <theta>)` should map to my `PolarCoordinate` type.

Drew Noakes 2010-06-23 16:10:41

I have a character stream of SExpressions: `(...)(...)(...)...`. Ideally I'd like to process the stream directly, and spit out one object for each expression in the series.
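
A minimal Python sketch of that idea, reading a run of top-level expressions and emitting one object per expression. The `PolarCoordinate` field names, the helper names and the sample input are assumptions for illustration; only the `(pol <d> <phi> <theta>)` shape comes from the thread:

```python
from dataclasses import dataclass

@dataclass
class PolarCoordinate:
    distance: float
    phi: float
    theta: float

def read_sexprs(text):
    """Yield each top-level S-expression in text as a nested list of atoms."""
    stack = []   # lists that are still open
    atom = []    # characters of the atom currently being read
    for ch in text:
        if ch in "() \t\r\n":
            if atom:
                if not stack:
                    raise ValueError("atom outside any expression")
                stack[-1].append("".join(atom))
                atom = []
            if ch == "(":
                stack.append([])
            elif ch == ")":
                if not stack:
                    raise ValueError("unbalanced ')'")
                done = stack.pop()
                if stack:
                    stack[-1].append(done)   # nested expression
                else:
                    yield done               # complete top-level expression
        else:
            atom.append(ch)
    if stack or atom:
        raise ValueError("unexpected end of input")

def to_object(expr):
    """Map known expression shapes to typed objects; pass the rest through."""
    if len(expr) == 4 and expr[0] == "pol":
        return PolarCoordinate(*(float(v) for v in expr[1:]))
    return expr
```

For example, `[to_object(e) for e in read_sexprs("(pol 1.5 -2.0 0.25)(time (now 3.2))")]` gives a `PolarCoordinate` followed by the untyped nested list for the second expression.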

Drew Noakes 2010-06-23 16:13:05

@Drew: What do you want to do when there is incorrect data on the stream? That is, do you need some kind of error recovery, or do you bail out?

Andre Artus 2010-06-23 20:47:42

It will be a bit difficult to describe a possible solution in the comments, so I might have to either amend my original answer or create a new one. The format looks very simple (if it's exactly like the "Perceptors" one). I would not even bother with a parser generator; it's going to take longer to sort that out than to code the solution by hand.

Andre Artus 2010-06-23 21:10:18

@Drew, what language are you coding in? If it's something I know I may be able to give you code you can use.

Andre Artus 2010-06-23 21:11:30

@Drew: is it possible to start reading the stream in the middle of an incomplete message e.g. "torso) (rt 0.01 0.07 0.46))". That is, is the stream character, or message, based?

Andre Artus 2010-06-23 23:42:34

Hi @Andre, the stream is message based. I read a four-byte length value, then read that many one-byte ASCII characters, then the same again. So it's not possible to start reading from the middle of a stream (good question though), and if something goes wrong in the parsing then recovery would be to fast-forward the appropriate number of bytes and start the next message. Recovery might just move forward to the next top-level SExpression. The data comes over TCP from a server that doesn't have too many surprises in store so I'm not overly concerned about errors in the stream.
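
A minimal Python sketch of that framing, for illustration. The thread does not state the byte order of the four-byte length, so an unsigned big-endian (network order) value is assumed here, and the function names are made up:

```python
import io
import struct

def read_exact(stream, count):
    """Read exactly count bytes, or raise EOFError if the stream ends early."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError(f"stream ended with {remaining} bytes still expected")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

def read_messages(stream):
    """Yield each length-prefixed message in the stream as a str."""
    while True:
        header = stream.read(4)
        if not header:
            return   # clean end of stream between messages
        if len(header) < 4:
            header += read_exact(stream, 4 - len(header))
        # Assumption: the length prefix is an unsigned 32-bit big-endian value.
        (length,) = struct.unpack(">I", header)
        yield read_exact(stream, length).decode("ascii")

# Usage with an in-memory stream standing in for the TCP connection.
payload = b"(a)(b)"
demo = io.BytesIO(struct.pack(">I", len(payload)) + payload)
messages = list(read_messages(demo))
```

Because each message carries its own length, a parse failure inside one message does not disturb the framing: the reader has already consumed that message's bytes and simply moves on to the next length prefix.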

Drew Noakes 2010-06-24 03:07:29

Actually we're kind of diverging away from the question above into the other question I posted. If you think you need a second answer, then you might post there (http://stackoverflow.com/questions/3051254/). I'm developing this in C# and already have a parser generator that works, except for the HearPerceptor (http://simspark.sourceforge.net/wiki/index.php/Perceptors#Hear_Perceptor) which has a 20-byte payload of characters ranging [0x20; 0x7E]. I can't make a rule at the token level that covers that range without overlapping with `ident` and `num`.

Drew Noakes 2010-06-24 03:12:30

The project I'm working on is open source. You can see the parser grammar here: http://code.google.com/p/tin-man/source/browse/trunk/TinMan/PerceptorParsing/perceptors.atg . One of the reasons I am looking at using a grammar is because there's another SExp format I might need to parse later which is a bit more involved: http://simspark.sourceforge.net/wiki/index.php/Network_Protocol#Server.2FMonitor_Communication . BTW, you seem to be a parsing expert, and I really appreciate you taking the time to help me with this.

Drew Noakes 2010-06-24 03:19:31

@Drew: It is good to know what your inputs are. I had devised a whole scheme to handle partial data, and now you don't need it :D. I agree that this is starting to diverge from the question above, and I will be happy to post the answer in another question. Let me take a look at what you have, as there may be ways to sort it out now that I know what we are dealing with. As to being an expert, I'm actually more of an enthusiast. If you feel that this has run its course then perhaps it's time to close the question. I will post in either the linked Q or a new more specific one.

Andre Artus 2010-06-24 04:05:02

@Drew: You seem to be working on some cool stuff. Is this an AI project?

Andre Artus 2010-06-24 04:06:57

I see you have quotes around your "MessageText" which is not part of the original spec, is this to get around some issues?

Andre Artus 2010-06-24 04:14:32

Is there a reason why you are not wrapping with BufferedStream?

Andre Artus 2010-06-24 04:25:30

The sun is coming up, so it is time for me to go to bed. I will continue looking at this tonight.

Andre Artus 2010-06-24 04:38:06

@Drew: I posted an answer on the linked question. But come to think of it, it may apply here too.

Andre Artus 2010-06-24 05:34:44

Answer 3

A:

Try this:

CHARACTERS

    letter = 'A'..'Z' + 'a'..'z' .
    digit = "0123456789" .
    messageChar = '\u0020'..'\u007e' - ' ' - '(' - ')'  .

TOKENS

    double = ['-'] digit { digit } [ '.' digit { digit } ] .
    ident = letter { letter | digit | '_' } .
    message = messageChar { messageChar } CONTEXT (")") .

Oh, I have to point out that '\u0020' is the unicode SPACE, which you are subsequently removing with "- ' '". Oh, and you can use CONTEXT (')') if you don't need more than one character lookahead. This does not work in your case seeing as all the tokens above can appear before a ')'.

FWIW: CONTEXT does not consume the enclosed sequence, you must still consume it in your production.
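
As an analogy only (this is a Python regular expression, not how Coco/R implements CONTEXT), a lookahead assertion behaves the same way: it requires the ')' to follow but leaves it unconsumed. The pattern and sample text are made up for the example:

```python
import re

# One or more non-space, non-parenthesis characters, accepted only when a ')'
# follows. The lookahead (?=\)) checks for the ')' without including it in
# the match.
MESSAGE_BEFORE_PAREN = re.compile(r"[^\s()]+(?=\))")

text = "hello)"
m = MESSAGE_BEFORE_PAREN.match(text)
matched = m.group()            # the text of the match itself
next_char = text[m.end()]      # the character right after the match
```

After the match, the ')' is still the next character in the input, so whatever runs next (the production, in the Coco/R case) has to consume it.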

EDIT:

Ok, this seems to work. Really, I mean it this time :)

CHARACTERS
    letter = 'A'..'Z' + 'a'..'z' .
    digit = "0123456789" .
//    messageChar = '\u0020'..'\u007e' - ' ' - '(' - ')'  .

TOKENS

    double = ['-'] digit { digit } [ '.' digit { digit } ] .
    ident = letter { letter | digit | '_' } .
//    message = letter { messageChar } CONTEXT (')') .

// MessageText<out string m> = message               (. m = t.val; .)
// .

HearExpr<out HeardMessage message> =
    (.
        TimeSpan time; 
        Angle direction = Angle.NaN; 
        string messageText = ""; 
    .)
    "(hear" 
    TimeSpan<out time>
        ( "self" | AngleInDegrees<out direction> )
//         MessageText<out messageText>
    {
        ANY (. messageText += t.val; .)
    }
    ')'
    (. 
        message = new HeardMessage(time, direction, new Message(messageText)); 
    .)
    .

ANY will read characters until it hits ')' or whitespace. I put it in a loop concatenating each value, but you may not want to do that. You may want to have it in a loop though so that it doesn't return "over" when it sees "over here", but "here". You can do a simple length check on messageText, or other validity checks such as adding t.val to a List and checking the count. Anything really. You can also do a test with a RegEx to make sure it complies with whatever pattern you need to check against.
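
A minimal Python sketch of such a length-plus-regex check, for illustration. The character range (0x20 to 0x7E without space and parentheses) comes from the thread; treating 20 as a maximum length rather than an exact length is an assumption, as is the function name:

```python
import re

# Printable ASCII 0x20..0x7E minus space (0x20), '(' (0x28) and ')' (0x29).
MESSAGE_CHARS = re.compile(r"[\x21-\x27\x2a-\x7e]+")

def is_valid_message(message_text, max_length=20):
    """Return True if message_text is non-empty, short enough and well-formed."""
    # Assumption: max_length is an upper bound, not a required exact length.
    if not 0 < len(message_text) <= max_length:
        return False
    return MESSAGE_CHARS.fullmatch(message_text) is not None
```

The same two checks translate directly to a length test and a `Regex.IsMatch` call inside the semantic action if you do the validation in the C# parser itself.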

Andre Artus 2010-06-24 16:49:08
