tags:

views:

224

answers:

2

I need a parser for a language that isn't all that complicated. The only catch is, I never want the parser to raise an error when it receives malformed input. Rather, I want it to just continue, making as much sense of the input as it can, similar to what the HTML parsers in a web browser do.

I naturally thought to use a parser generator, but I don't have that much experience with them, and all of the examples I've seen of Bison and Antlr feature brittle parsers that give up as soon as they encounter a syntax error. Is this doable with them, or should I consider hand-rolling it? It probably (I think) wouldn't be that difficult, given the language.

+1  A: 

I'm not sure you will be able to do quite that with a parser generator in the general case. At least, not in a completely automatic way. Consider the following malformed expression:

a - b + c ) * d

What should a parser even do with this? When it encounters the first illegal token, the close parenthesis, It might somehow be able to guess that the user intended to have an open parenthesis somewhere, but where? Any place it could put it would result in a different value.

Instead, It could just pretend that everything that came before never happened. It would then end up with

* d

Which is still malformed.

TokenMacGuy
You bring up a good point, but my first thought, as with your example, is to drop the close parenthesis (and any subsequent illegal tokens - i.e. loop until you find a valid one), which, while maybe not what the user intended, makes it properly formed. Alternatively, you could drop the whole statement, like dropping a malformed tag in HTML. Depending on the problem, it could be useful. I'm not familiar with parser generators, but I can picture how a parser might implement it, so I'd imagine it's feasible, though maybe not with those tools.
T.R.
+1  A: 

You need to design your grammar with checkpoints in mind. By checkpoints, I mean the semicolon for C, a line break for Python or a period for COBOL (as examples). This checkpointing is how many compilers recover so that they can report more than just the first error found.

I've not used Bison but YACC allows you to override the error handling and I would hope the equivalent GNU tool was at least as powerful as our old UNIX clunkers.

I've done this before with a config file YACC grammar. Say you have the following correctly formed segment:

item = "bread" {
    quantity = 7
    price = 1.50
    taxrate = 10
}

and for some bizarre reason, the user mis-spells "quantity", rendering it incorrect. At that point in your callbacks, you could just raise an error flag which would prevent further processing until the checkpoint was reached. You let the parser keep going (catching and ignoring further errors) and make sure your callbacks don't do anything in response to any spurious successes in the damaged syntax.

This could be by simply ignoring all further stanzas up to the closing brace or even by setting a default value for price and only ignoring up to the line break (so that you at least get a partially formed object).

However you do it, just reset the error flag when you get to the checkpoint so you can continue processing.

I'd still make sure that the user was notified, it's sometimes considered bad form to continue with data the customer didn't want :-).

paxdiablo