tags:

views:

318

answers:

3

I work on an open source project focused around Biblical texts. I would like to create a standard string format to build up a search string. I would then need to parse the search string and run the search with the options given. There are a number of different options, from scope of the search, to searching multiple texts, to wildcards, etc.

I'm thinking that using something like lex/yacc to generate a parser for this format might be a good idea. I think the Xapian project uses lemony to achieve a similar goal. My question is, is using one (or more) of these tools the best way to accomplish this?

In addition to the question, I would appreciate any links to resources on these tools (and any others that might be options). The biggest problem I've run into so far is that most of the examples and tutorials are either geared towards a programming language or something simple like a calculator rather than parsing a string format.

A: 

Keep "syntax error diagnosis and message" in your mind uppermost - if the user makes a mistake, a handcrafted recursive-descent-style parser can have some idea based on what it has scanned so far, what mistake the user might have made. If you're going to use an automated tool, be sure to test how it responds to typical user typos - genius-programmers can handle cryptic messages from their compiler, while it sounds like you are targeting a much less sophisticated user who therefore needs a friendlier interface.

pngaz
+1  A: 

Tools like Lex and Yacc are suitable for your purposes. A parser for a search string is not that different from a parser for a programming language (the big difference is that a search string parser generates rules guiding the search, while the programming language parser generates a parse tree from where code is generated)

I assume your syntax will contain rules like the following:

expression : word
           | expression AND expression
           | expression OR expression
           | NOT expression
           | '(' expression ')'

all of which are easy to express in Yacc.

You can look at A Compact Guide to Lex & Yacc which I've found very useful for learning Lex and Yacc

iWerner
Yes, I need rules exactly like that. I'm still unclear how to get from something like this to, eg, populating a data structure with the results of the parse. What's the next step from here?
Matthew Talbert
I'm not asking you to give me step by step instructions, just wanting a reference to a tutorial or example that shows the next step.
Matthew Talbert
The link I provided parses the simple programming language into a tree. In one of the later chapters they use it to implement an interpreter. It should enlighten you
iWerner
Thanks for your help. I'm reading the compact guide now. It helps a lot in understanding what is going on. I'm probably going to end up using bison/flex, but I understand that shouldn't really make any difference.
Matthew Talbert
+1  A: 

If you're trying to build a parser in C++ have a look at

boost::sprit

It certainly is advanced C++, but it will build quite complex and performant parsers from C++ templates without code generation. It took me a few days to get into it, but using and modifying the samples that was straight forward. I also recommend reading the following book:

C++ Template Metaprogramming: Concepts, Tools, and Techniques from Boost and Beyond by David Abrahams and Aleksey Gurtovoy

RED SOFT ADAIR
I saw spirit. One thing I'm not clear on is whether the result would depend on any boost dll's. I'm not very interested in adding another dependency if not absolutely necessary.
Matthew Talbert
No - no DLL or library needed. As most of the boost librarys spirit is a header only library. There is nothing you'll have to link.
RED SOFT ADAIR
It appears if you use regex matching, which I would want, then you have to use the boost regex library which would require linking as it is not header only. Thanks for the info, but I think I'm going to try a different solution.
Matthew Talbert