I recently added source-file parsing to an existing tool that generates output files from complex command-line arguments.
The arguments became so complex that we started allowing them to be supplied in a file, which was parsed as if it were one very large command line, but the syntax was still awkward. So I added the ability to parse a source file using a more reasonable syntax.
I used flex 2.5.4 for Windows to generate the tokenizer for this custom source-file format, and it worked. But I hated the code: global variables, a weird naming convention, and the C++ it generated was awful. The existing code-generation backend was glued to the output of flex; I don't use yacc or bison.
I'm about to dive back into that code, and I'd like to use a better, more modern tool. Does anyone know of a tool that:
- Runs from the Windows command prompt (Visual Studio integration is fine, but I build with makefiles)
- Generates a properly encapsulated C++ tokenizer (no global variables)
- Uses regular expressions to describe the tokenizing rules (lex-compatible syntax a plus)
- Does not force me to use the C runtime (or fake it) for file reading (i.e., can parse from memory)
- Warns me when my rules force the tokenizer to backtrack (or fixes it automatically)
- Gives me full control over variable and method names (so I can conform to my existing naming convention)
- Allows me to link multiple tokenizers into a single .exe without name collisions
- Can generate a Unicode (16-bit UCS-2) tokenizer if I want it to
- Is NOT an integrated tokenizer-plus-parser generator (I want a lex replacement, not a lex+yacc replacement)
I could probably live with a tool that only generated the tokenizing tables, if that were the only thing available.
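To make the "encapsulated, parses from memory" requirement concrete, here is a rough sketch of the kind of interface I'd want the generated code to have. This is hand-written illustration only, not output from any real tool; the class name, token names, and method names (`MyTokenizer`, `NextToken`, `TokenText`) are all placeholders I'd expect to control myself:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Sketch of the desired generated-scanner shape: all state lives in the
// object (no globals), and input comes from a caller-supplied memory
// buffer rather than a CRT FILE*.
class MyTokenizer {
public:
    enum Token { End, Ident, Number, Unknown };

    // Scan an in-memory buffer; no file I/O anywhere in the tokenizer.
    MyTokenizer(const char* buf, std::size_t len)
        : cur_(buf), end_(buf + len) {}

    Token NextToken() {
        // Skip whitespace between tokens.
        while (cur_ != end_ && std::isspace(static_cast<unsigned char>(*cur_)))
            ++cur_;
        if (cur_ == end_) return End;

        text_.clear();
        unsigned char c = static_cast<unsigned char>(*cur_);
        if (std::isalpha(c) || *cur_ == '_') {
            while (cur_ != end_ &&
                   (std::isalnum(static_cast<unsigned char>(*cur_)) || *cur_ == '_'))
                text_ += *cur_++;
            return Ident;
        }
        if (std::isdigit(c)) {
            while (cur_ != end_ && std::isdigit(static_cast<unsigned char>(*cur_)))
                text_ += *cur_++;
            return Number;
        }
        text_ += *cur_++;
        return Unknown;
    }

    // Text of the most recently returned token.
    const std::string& TokenText() const { return text_; }

private:
    const char* cur_;
    const char* end_;
    std::string text_;
};
```

Two independent instances (or two differently named generated classes) could then coexist in one .exe with no shared state, which is exactly what the flex output I have now can't do.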