views:

224

answers:

3

I'm building a custom expression parser and evaluator for production environment to provide a limited DSL to the users. The parser itself as the DSL, need to be simple. The parser is going to be built in an exotic language that doesn't support dynamic expression parsing nor has any parser generator tools available.

My current decision is to go for recursive descent approach with LL(1) grammar, so that even programmers with no previous experience in evaluating expression could quickly learn how the code works.

It has to handle mixed expressions made up of several data types: decimals, percentages, strings and dates. And dates in the format of dd/mm/yyyy are easy to confuse with a string of division ops.

Is where a good solution to this problem?

My own solution that is aimed at keeping the parser simple and involves prefixing dates with a special symbol, let's say apostrophe:

<date>   ::= <apostr><digit><digit>/<digit><digit>/<digit><digit><digit><digit>

<apostr> ::= '

<digit>  ::= '0'..'9'
+1  A: 

First off, I'm a fan of LL parsers, so I approve of your approach heartily. Note that one of the newer popular parser generators (ANTLR) is LL. If you allow more look-ahead, rather that restricting yourself to LL(1), you can do pretty much anything you'd ever want to do with an LR(1) parser, but the code will be far clearer, more reliable, and easier to debug.

I don't know enough about your overall grammar to be able to tell. It is possible you might be able to design things so that the LL parser can always tell from context if it is an integer expression or a date constant. However, assuming you can't, yeah you'd need some kind of way to tell the difference. The only other thing I can think of would be to use backslash as a separator instead of slash, but that's kinda ugly.

T.E.D.
T.E.D thanks, appreciated! MS Excel, for example, in this case will infer the type from context, i.e. writting =01/01/2010 in a cell will result in 0.000497512. However setting explicitly or implicitly cell data type to date will proceduce a date. But I feel that such type inference magic will add a lot of complexity to the parser-evaluator and might confuse both users and programmers responsible for maintenance of the parser (based on the assumptions they haven't worked on parsers before).
Totophil
+1  A: 

An LL-like lexerless parser with an infinite lookahead is what you need. And, namely, it is PEG.

http://en.wikipedia.org/wiki/Parsing_expression_grammar

With an ordered choice it is quite easy to avoid this date vs. constant literals division confusion.

SK-logic
I also would have figured that an ordered choice would be the cleanest (and possibly only) solution here.
Joe Carnahan
I'm not so sure. It is quite possible his grammar would be ambiguous in some cases without adding that apostrophe at the front of dates. PEG's are powerful, but they can't handle ambiguous grammars.
T.E.D.
They can resolve ambiguities. If something looks like a date, then it's parsed first, as a date. If then it is used in a context where something else is expected, it's backtracked and reparsed the other way. E.g., 1/2/3 will be a date, but 1/2/3/4 will be 1 divided by 2 divided by 3 etc.
SK-logic
A: 

When a language is intended for human input, defining it is as much a matter of

  • adding syntax constraints to ensure unambiguous and easy parsing
  • removing/bending syntax to ensure that the language feels intuitive, "natural", to the intended human audience.

Satisfying the second requirement is much harder than the first one, and requires insight into the

  • intended use cases of the language
    Which type of keyboard/input device is available? Are there some characters among the allowed characters which are hard to produce or to see on the display?
    Which tokens / expression will be frequently used, which will be required only occasionally? Are users frequently inputting short, ad-hoc code snippets, or are the programs meant to be reused and modified over long periods
    ... etc.
  • background/culture of the intended audience
    Which common practices/idioms from other regular (and plain natural) languages can or should be reused if possible?
    Should one favor a terse but cryptic style, or a more explicit, but more verbose style?
    ... etc.

Basically, it is hard to make suggestions about a language syntax, without a good grasp of the intended usage and users.
Nevertheless, I'd like to suggest the following for the date format question:

Use an alternative format for date values altogether; one that would be "natural" enough to users but distinctive enough that a regular grammar can describe.
For example, one that uses a 3 letters abbreviation for the month (downside DSL becomes tied to English or other tongue, but also advantage, the ambiguity for humans about which is day and which is month is removed). Tentatively:

  dd-mmm-yyyy    (may seem unnatural in cultures where the prevailing date order 
                  starts with the month maybe yyyy-mmm-dd then ?)
  mmm-dd-yyyy    (better for the above mentioned cultures)
  ddmmmyyyy      (avoid the dashes, but impose leading zeros)

  MnnDnnYyyyy    (using "M", "D" and "Y" (or others) as explicit prefixes; now, 
                  this is completely culture neutral, but maybe a bit awkward...)

Anyway, just ideas... Applicability will vary with the human/cultural factors mentioned, and with the rest of the syntax. For example the above may imply that variables be explicitly marked (that's one of the reasons many languages use the $ prefix for example), to avoid possible conflict with [odd, but possible] variable identifiers.

In a nutshell the idea is to substitute the need for a special character prefix (which may then collide with the use these characters for mathematical and other expressions), by making the 12 months tag an good enough discriminator for the parser.

mjv