tags:

views:

1495

answers:

6

I'm looking for a good parser generator that I can use to read a custom text-file format in our large commercial app. Currently this particular file format is read with a handmade recursive parser but the format has grown and complexified to the point where that approach has become unmanageable.

It seems like the ultimate solution would be to build a proper grammar for this format and then use a real parser generator like yacc to read it, but I'm having trouble deciding which such generator to use or even if they're worth the trouble at all. I've looked at ANTLR and Spirit, but our project has specific constraints beyond earlier answers that make me wonder if they're as appropriate for us. In particular, I need:

  • A parser that generates C or C++ code with MSVC. ANTLR 3 doesn't support C++; it claims to generate straight C but the docs on getting it to actually work are sort of confusing.
  • Severely constrained memory usage. Memory is at a huge premium in our app and even tiny leaks are fatal. I need to be able to override the parser's memory allocator to use our custom malloc(), or at the very least I need to give it a contiguous pool from which it draws all its memory (and which I can deallocate en bloc afterwards). I can spare about 200kb for the parser executable itself, but whatever dynamic heap it allocates in parsing has to get freed afterwards.
  • Good performance. This is less critical but we ought to be able to parse 100kb of text in no more than a second on a 3ghz processor.
  • Must be GPL-free. We can't use GNU code.

I like ANTLRworks' IDE and debugging tools, but it looks like getting its C target to actually work with our app will be a huge undertaking. Before I embark on that palaver, is ANTLR the right tool for this job?

The text format in question looks something like:

attribute "FluxCapacitance"  real constant

asset DeLorean
{
    //comment foo bar baz
    model "delorean.mdl"
    animation "gullwing.anm"
    references "Marty"
    loadonce
}

template TimeMachine
{
    attribute FluxCapacitance 10      
    asset DeLorean
}
A: 

ANTLR parsers, and in fact any parser built with something LALR or the like, tend to be big. Do you have an actual grammar for this? It looks like it might be most readily parsed with a hand-written recursive-descent parser, but it's not much of a sample.

Oops, my mistake, as ANTLR apparently generates recursive-descent. Still, I've had problems with ANTLR generating big parsers.

Charlie Martin
It's only a tiny sample -- we do have a hand written parser for the current format, but it's become so complicated that maintenance and adding new features is an abject nightmare. I'm wondering whether parsing it with ANTLR would be easier, or harder still.I've built an LR grammar for our format in ANTLRworks and it seems to syntax-check okay, but the next huge step would be to actually hook up C actions and link it into our program.
Crashworks
How big is the generated parser? The ugly part is usually the parser structure itself.
Charlie Martin
You know, you just convinced me to move teh ANTLR book up my to-read stack. ANTLR v3 might solve some of the problems I was seeing in 2005-2006.
Charlie Martin
+1  A: 

We use Boost Spirit successfully in our application. The Boost license is a very liberal one, so there is no problem using it in commercial applications.

Quote from the documentation:

Spirit is an object-oriented recursive-descent parser generator framework implemented using template meta-programming techniques. Expression templates allow us to approximate the syntax of Extended Backus-Normal Form (EBNF) completely in C++. The Spirit framework enables a target grammar to be written exclusively in C++. Inline EBNF grammar specifications can mix freely with other C++ code and, thanks to the generative power of C++ templates, are immediately executable. In retrospect, conventional compiler-compilers or parser-generators have to perform an additional translation step from the source EBNF code to C or C++ code.

lothar
A: 

Then why don't you use flex/yacc? It generates C code,can be run from MSVC, was developped with efficiency in mind, can have malloc overriden (google for yymalloc), they are themselves GPL, but the resulting code (the code you use in your project) AFAIK not.

Or use a hand-made parser.

jpalecek
A: 

ANTLR 3 doesn't support C++; it claims to generate straight C but the docs on getting it to actually work are sort of confusing.

It does generate C, and furthermore, it works with Visual Studio and C++. I know this because I've done it before and submitted a patch to get it to work with stdcall.

Memory is at a huge premium in our app and even tiny leaks are fatal. I need to be able to override the parser's memory allocator to use our custom malloc(), or at the very least I need to give it a contiguous pool from which it draws all its memory (and which I can deallocate en bloc afterwards). I can spare about 200kb for the parser executable itself, but whatever dynamic heap it allocates in parsing has to get freed afterwards.

The antlr3c runtime, last time I checked does not have a memory leak, and uses the Memory pool paradigm which you describe. However, it does have one shortcoming in the API which the author refuses to change, which is that if you request the string of a node, it will create a new copy each time until you free the entire parser.

I have no comment on the ease of using a custom malloc, but it does have a macro to define what malloc function to use in the entire project.

As for the executable size, my compilation was about 100 kb in size including a small interpreter.

My suggestion to you is to keep learning ANTLR, because it still fits your requirements and you probably need to sacrifice a little more time before it will start working for you.

Unknown
A: 

Realistically, if your grammar is relatively small, and doesn't contain many ambiguities or parsing complexities then it shouldn't matter if you use a recursive descent parser or a shift-reduce parser.

I would say look at ANTLR and Spirit, also look at Flex and Bison. There are other, less known parsers like Coco/R (contains generators for a lot of languages, including C++) also.

Bison's no good; we must be GPL-free.
Crashworks
A: 

A hand-coded recursive descent parser is actually quite fast and can be very compact. The only downside is you have to be careful to code essentally LL(1) grammars.

You can hand code such parsers as plain recursive C code. If you are really tight on space, you can define a parsing virtual machine, and build a tiny C interpreter to run it. I used to build BASIC interpreters this way back the 70s.

The ideas came from a 1964 paper on metacompilers by Val Schorre, who shows how to build complete compilers in 10 pages Shorres tiny parser generator that does pretty good recursive descent parsers. A site describing this paper and showing precisely how to build such parsers can be found at http://www.bayfronttechnologies.com/metaii.html

Ira Baxter