views:

321

answers:

5

Most of the posts that I read pertaining to these utilities suggest using some other method to obtain the same effect. For example, questions mentioning these tools usually have at least one answer containing some of the following:

  • Use the boost library (insert appropriate boost library here)
  • Don't create a DSL; use (insert favorite scripting language here)
  • ANTLR is better

Assuming the developer ...

  • ... is comfortable with the C language
  • ... does know at least one scripting language (e.g., Python, Perl, etc.)
  • ... must write some parsing code in almost every project worked on

So my questions are:

  • What situations are well suited to these utilities?
  • Are there any (reasonable) situations where there is no better alternative than yacc and lex (or derivatives)?
  • How often in actual parsing problems can one expect to run into shortcomings in yacc and lex that are better addressed by more recent solutions?
  • For a developer who is not already familiar with these tools, is it worth investing the time to learn their syntax and idioms? How do these compare with other solutions?
A: 

In a previous project, I needed a way to generate queries on arbitrary data that was easy for a relatively non-technical person to use. The data was CRM-type stuff (so First Name, Last Name, Email Address, etc.) but it was meant to work against a number of different databases, all with different schemas.

So I developed a little DSL for specifying the queries (e.g. [FirstName]='Joe' AND [LastName]='Bloggs' would select everybody called "Joe Bloggs"). It had some more complicated options; for example, there was the "optedout(medium)" syntax, which would select all people who had opted out of receiving messages on a particular medium (email, sms, etc.). There was "ingroup(xyz)", which would select everybody in a particular group, and so on.

Basically, it allowed us to specify queries like "ingroup('GroupA') and not ingroup('GroupB')" which would be translated to an SQL query like this:

SELECT
    *
FROM
    Users
WHERE
    Users.UserID IN (SELECT UserID FROM GroupMemberships WHERE GroupID=2) AND
    Users.UserID NOT IN (SELECT UserID FROM GroupMemberships WHERE GroupID=3)

(As you can see, the queries aren't as efficient as possible, but that's what you get with machine generation, I guess.)
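To make the translation step concrete, here is a minimal hand-written sketch in Python. It is not the implementation described above (that one used a parser generator), and it covers only the "ingroup", "and" and "not" forms. The names `GROUP_IDS`, `tokenize` and `to_sql` are made up for illustration, and the group-name-to-ID mapping is an assumption; a real system would presumably look the IDs up in the database.

```python
import re

# Hypothetical mapping from group names to GroupID values.
GROUP_IDS = {"GroupA": 2, "GroupB": 3}

# One token per match: a keyword, a single-quoted string, or a parenthesis.
TOKEN = re.compile(r"\s*(?:(and|not|ingroup)\b|'([^']*)'|([()]))", re.IGNORECASE)


def tokenize(text):
    """Split the query into (kind, value) pairs."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        if match.group(1):
            tokens.append(("word", match.group(1).lower()))
        elif match.group(2) is not None:
            tokens.append(("string", match.group(2)))
        else:
            tokens.append(("punct", match.group(3)))
        pos = match.end()
    return tokens


def to_sql(query):
    """Translate: condition ("and" condition)*, condition = ["not"] ingroup('name')."""
    tokens = tokenize(query)
    pos = 0

    def peek(kind, value):
        return pos < len(tokens) and tokens[pos] == (kind, value)

    def expect(kind, value=None):
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        tok_kind, tok_value = tokens[pos]
        if tok_kind != kind or (value is not None and tok_value != value):
            raise ValueError(f"unexpected token {tok_value!r}")
        pos += 1
        return tok_value

    def condition():
        nonlocal pos
        negated = peek("word", "not")
        if negated:
            pos += 1
        expect("word", "ingroup")
        expect("punct", "(")
        name = expect("string")
        expect("punct", ")")
        if name not in GROUP_IDS:
            raise ValueError(f"unknown group {name!r}")
        operator = "NOT IN" if negated else "IN"
        return (f"Users.UserID {operator} "
                f"(SELECT UserID FROM GroupMemberships WHERE GroupID={GROUP_IDS[name]})")

    clauses = [condition()]
    while peek("word", "and"):
        pos += 1
        clauses.append(condition())
    if pos != len(tokens):
        raise ValueError("unexpected trailing input")
    return "SELECT * FROM Users WHERE " + " AND ".join(clauses)
```

Calling `to_sql("ingroup('GroupA') and not ingroup('GroupB')")` yields a single-line version of the query shown above. Only the integer IDs from the lookup table end up in the SQL text, never the user-supplied group name.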

I didn't use flex/bison for it, but I did use a parser generator (the name of which has escaped me at the moment...)

Dean Harding
Did you by chance use ANTLR?
kyoryu
Yeah, that might've been it...
Dean Harding
A: 

I think it's pretty good advice to eschew creating a new language just to support a domain-specific language. It's going to be a better use of your time to take an existing language and extend it with domain functionality.

If you are trying to create a new language for some other reason, perhaps for research into language design, then these tools are a bit outdated. Newer generators such as ANTLR, or even newer implementation languages like ML, make language design a much easier affair.

If there's a good reason to use these tools, it's probably because of their legacy. You might already have a skeleton of a language you need to enhance, which is already implemented in one of these tools. You might also benefit from the huge volume of tutorial material written about these old tools; the newer and slicker ways of implementing languages do not have nearly as large a corpus.

TokenMacGuy
A: 

We have a whole programming language implemented in my office, and we use lex and yacc for that. I think they're meant to be a quick and easy way to write interpreters for things. You could conceivably write almost any sort of text parser using them, but a lot of the time either A) it's quicker to write it yourself or B) you need more flexibility than they provide.

jdizzle
+1  A: 

Whether it's worth learning these tools or not will depend heavily (almost entirely) on how much parsing code you write, or how interested you are in writing more code of that general sort. I've used them quite a bit, and find them extremely useful.

The tool you use doesn't really make as much difference as many would have you believe. For about 95% of the inputs I've had to deal with, there's little enough difference between one and another that the best choice is simply the one with which I'm most familiar and comfortable.

Of course, lex and yacc produce (and demand that you write your actions in) C (or C++). If you're not comfortable with them, a tool that uses and produces a language you prefer (e.g. Python or Java) will undoubtedly be a much better choice. I, for one, would not advise trying to use a tool like this with a language with which you're unfamiliar or uncomfortable. In particular, if you write code in an action that produces a compiler error, you'll probably get considerably less help from the compiler than usual in tracking down the problem, so you really need to be familiar enough with the language to recognize the problem with only a minimal hint about where the compiler noticed something was wrong.

Jerry Coffin
+3  A: 

The reasons why lex/yacc and derivatives seem so ubiquitous today are that they have been around for much longer than other tools, that they have far more coverage in the literature and that they traditionally came with Unix operating systems. It has very little to do with how they compare to other lexer and parser generator tools.

No matter which tool you pick, there is always going to be a significant learning curve. So once you have used a given tool a few times and become relatively comfortable in its use, you are unlikely to want to incur the extra effort of learning another tool. That's only natural.

Also, in the 1970s, when lex and yacc were created, hardware limitations posed a serious challenge to parsing. The table-driven LR parsing method used by yacc was the most suitable at the time because it could be implemented with a small memory footprint: the general program logic is relatively small, and state could be kept in files on tape or disk. Code-driven parsing methods such as LL had a larger minimum memory footprint because the parser program's code itself represents the grammar, so it needs to fit entirely into RAM to execute, and it keeps its state on the stack in RAM.
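The difference between the two styles can be sketched in Python for a toy grammar, S → "(" S ")" | "x". This is only an illustration under stated assumptions: the ACTION and GOTO tables below were worked out by hand for this one grammar, not produced by yacc, and the function names are invented. In the first recognizer the grammar lives in data and the driver loop is generic; in the second the grammar is the code.

```python
END = "<end>"  # end-of-input marker; never equal to a single input character

# Table-driven (LR style): hand-built tables for S -> "(" S ")" | "x".
ACTION = {
    (0, "("): ("shift", 2), (0, "x"): ("shift", 3),
    (1, END): ("accept", None),
    (2, "("): ("shift", 2), (2, "x"): ("shift", 3),
    (3, ")"): ("reduce", 2), (3, END): ("reduce", 2),
    (4, ")"): ("shift", 5),
    (5, ")"): ("reduce", 1), (5, END): ("reduce", 1),
}
GOTO = {(0, "S"): 1, (2, "S"): 4}
RULES = {1: ("S", 3), 2: ("S", 1)}  # rule number -> (left-hand side, length)


def lr_accepts(text):
    """Generic driver: all grammar knowledge comes from the tables."""
    tokens = list(text) + [END]
    stack, pos = [0], 0
    while True:
        entry = ACTION.get((stack[-1], tokens[pos]))
        if entry is None:
            return False              # no table entry: syntax error
        kind, arg = entry
        if kind == "shift":
            stack.append(arg)
            pos += 1
        elif kind == "reduce":
            lhs, length = RULES[arg]
            del stack[-length:]       # pop one state per right-hand-side symbol
            stack.append(GOTO[(stack[-1], lhs)])
        else:
            return True               # accept


def ll_accepts(text):
    """Code-driven (recursive descent): the function body is the grammar."""
    pos = 0

    def parse_s():
        nonlocal pos
        if pos < len(text) and text[pos] == "x":
            pos += 1
            return True
        if pos < len(text) and text[pos] == "(":
            pos += 1
            if not parse_s():
                return False
            if pos < len(text) and text[pos] == ")":
                pos += 1
                return True
        return False

    return parse_s() and pos == len(text)
```

Both functions accept the same strings. In the table-driven version the state lives in an explicit stack of integers, whereas the recursive version keeps it implicitly on the call stack.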

When memory became more plentiful, a lot more research went into different parsing methods such as LL and PEG and into building tools that use those methods. This means that many of the alternative tools created after the lex/yacc family use different types of grammars. However, switching grammar types also incurs a significant learning curve. Once you are familiar with one type of grammar, for example LR or LALR grammars, you are less likely to want to switch to a tool that uses a different type of grammar, for example LL grammars.
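One commonly cited example of what changes between grammar types is left recursion. A rule such as `expr : expr '-' NUMBER | NUMBER ;` is natural in an LR or LALR grammar, but a straightforward LL or recursive-descent parser cannot use it directly and typically rewrites it as a loop. The Python sketch below assumes that rewrite; the function name and the yacc-style rule in the comment are illustrative only.

```python
DIGITS = "0123456789"


def evaluate_subtraction(text):
    """Evaluate integers separated by '-', grouping from the left."""
    chars = text.replace(" ", "")
    pos = 0

    def number():
        nonlocal pos
        start = pos
        while pos < len(chars) and chars[pos] in DIGITS:
            pos += 1
        if pos == start:
            raise ValueError(f"expected a number at offset {start}")
        return int(chars[start:pos])

    # LR style (hypothetical yacc rule):  expr : expr '-' NUMBER | NUMBER ;
    # A recursive function written directly from that rule would call itself
    # before consuming any input, so the LL-style version uses a loop instead.
    value = number()
    while pos < len(chars) and chars[pos] == "-":
        pos += 1
        value -= number()
    if pos != len(chars):
        raise ValueError(f"unexpected character at offset {pos}")
    return value
```

The loop keeps subtraction left-associative, so "10 - 4 - 3" is grouped as (10 - 4) - 3. Someone used to writing the left-recursive rule has to learn this kind of restructuring when moving to an LL tool.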

Overall, the lex/yacc family of tools is generally more rudimentary than more recent arrivals, which often have sophisticated user interfaces that graphically visualise grammars and grammar conflicts, or even resolve conflicts through automatic refactoring.

So, if you have no prior experience with any parser tools and have to learn a new tool anyway, then you should probably look at other factors such as graphical visualisation of grammars and conflicts, auto-refactoring, availability of good documentation, the languages in which the generated lexers/parsers can be output, etc. Don't pick any tool simply because "this is what everybody else seems to be using".

Here are some reasons I could think of for using lex/yacc or flex/bison:

  • the developer is already familiar with lex/yacc or flex/bison
  • the developer is most familiar and comfortable with LR/LALR grammars
  • the developer has plenty of books covering lex/yacc but no books covering others
  • the developer has a prospective job offer coming up and has been told that lex/yacc skills would increase the chances of getting hired
  • the developer could not get buy-in from project members/stakeholders for the use of other tools
  • the environment has lex/yacc installed and for some reason it is not feasible to install other tools
Jim Barker