views: 540, answers: 3
Hello,

I am looking for a clear definition of what a "tokenizer", "parser", and "lexer" are and how they are related to each other (e.g., does a parser use a tokenizer, or vice versa)? I need to create a program that will go through .c/.h source files to extract data declarations and definitions.

I have been looking for examples and can find some info, but I am really struggling to grasp the underlying concepts like grammar rules, parse trees, and abstract syntax trees, and how they interrelate. Eventually these concepts need to be represented in an actual program, but 1) what do they look like, and 2) are there common implementations?

I have been looking at Wikipedia articles on these topics and at programs like Lex and Yacc, but having never taken a compiler class (I am an EE major), I am finding it difficult to fully understand what is going on.

Mark

+9  A: 

A tokenizer breaks a stream of text into tokens, usually by looking for whitespace (tabs, spaces, new lines).
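As a minimal sketch of that idea (assuming nothing beyond the C standard library), a whitespace-only tokenizer could look like this:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Example input; any C-like text works. */
        char input[] = "int x = 1;";

        /* Split on whitespace only. Note that this leaves "1;" glued
           together, which is exactly why real tools scan character by
           character instead of just splitting on spaces. */
        for (char *tok = strtok(input, " \t\n"); tok != NULL;
             tok = strtok(NULL, " \t\n"))
            printf("token: %s\n", tok);
        return 0;
    }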

A lexer is basically a tokenizer, but it usually attaches extra context to the tokens -- this token is a number, that token is a string literal, this other token is an equality operator.
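For instance, a hand-rolled lexer sketch might classify tokens as it scans them. The token-kind names below are made up for illustration; a generated lexer (e.g., from Lex) would distinguish many more categories:

    #include <ctype.h>
    #include <stdio.h>

    /* Hypothetical token kinds -- a real lexer has many more. */
    enum kind { WORD, NUMBER, OPERATOR, END };

    struct token { enum kind k; char text[32]; };

    /* Scan one token from *p, advancing the cursor past it.
       No bounds checking: this is a sketch, not production code. */
    static struct token next_token(const char **p)
    {
        struct token t = { END, "" };
        int i = 0;

        while (isspace((unsigned char)**p))
            (*p)++;
        if (**p == '\0')
            return t;

        if (isalpha((unsigned char)**p)) {        /* keyword or identifier */
            t.k = WORD;
            while (isalnum((unsigned char)**p))
                t.text[i++] = *(*p)++;
        } else if (isdigit((unsigned char)**p)) { /* numeric literal */
            t.k = NUMBER;
            while (isdigit((unsigned char)**p))
                t.text[i++] = *(*p)++;
        } else {                                  /* single-character operator */
            t.k = OPERATOR;
            t.text[i++] = *(*p)++;
        }
        t.text[i] = '\0';
        return t;
    }

    int main(void)
    {
        const char *src = "int x = 1;";
        const char *names[] = { "word", "number", "operator", "end" };

        for (struct token t = next_token(&src); t.k != END;
             t = next_token(&src))
            printf("%-8s '%s'\n", names[t.k], t.text);
        return 0;
    }

Unlike the whitespace splitter above, this one correctly separates '1' from ';' and reports a kind for each token.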

A parser takes the stream of tokens from the lexer and turns it into an abstract syntax tree representing (usually) the program described by the original text.
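As for what an AST looks like inside a program: it is typically a set of linked node structures. Here is a hypothetical C sketch of the tree for a declaration such as "int x = 1;" (the struct shapes and field names are invented for illustration, not taken from any real compiler):

    #include <stdio.h>

    /* Hypothetical AST node shapes. The declaration node points at its
       initializer expression; those child pointers are what make it a tree. */
    struct ast_expr {
        long int_value;               /* only integer literals in this sketch */
    };

    struct ast_decl {
        const char *type_name;        /* e.g. "int" */
        const char *identifier;       /* e.g. "x"   */
        struct ast_expr *initializer; /* child node; NULL if none */
    };

    int main(void)
    {
        /* Hand-built tree for the declaration: int x = 1; */
        struct ast_expr one = { 1 };
        struct ast_decl decl = { "int", "x", &one };

        printf("decl %s %s = %ld\n", decl.type_name, decl.identifier,
               decl.initializer->int_value);
        return 0;
    }

A parser would build such nodes as it recognises each construct, instead of hand-initialising them as main() does here.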

Last I checked, the best book on the subject was "Compilers: Principles, Techniques, and Tools" usually just known as "The Dragon Book".

Roger Lipscombe
No doubt "The Dragon Book" is a good book, but it does require the reader to have a good grounding in CS. Some books with more practical appeal would be "Writing Compilers and Interpreters" by Ronald Mak; "Modern Compiler Implementation" by Andrew Appel; "Compiler Construction" by Niklaus Wirth; "Compiling with C# and Java" and "Compilers and Compiler Generators: An Introduction with C++" by Pat Terry; and, of course, "The Definitive ANTLR Reference" by Terence Parr.
Andre Artus
Sure. Last I checked, I was doing a CS degree :-) I defer to your more-recent recommendations.
Roger Lipscombe
Just to be sure, I am not knocking your recommendation. "The Dragon Book" was my first book on compiler tech, but it was hard going compared to, say, Wirth's book, which you can grok in a few hours. Back then I had few options, as it was the only book I could get my hands on (this being 1991, before Amazon and the WWW). I had that and a collection of text files produced by Jack W. Crenshaw called "LET'S BUILD A COMPILER" (thanks Jack!). The Dragon Book is still the one to get for a more complete understanding of the principles, but most programmers just need a pragmatic introduction.
Andre Artus
+1  A: 

I would say that a lexer and a tokenizer are basically the same thing, and that they smash the text up into its component parts (the 'tokens'). The parser then interprets the tokens using a grammar.
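To make "interprets the tokens using a grammar" concrete: in a hand-written recursive-descent parser, each grammar rule maps onto a function. Below is a hypothetical sketch for a rule such as declaration := "int" identifier "=" number ";" (the token array and the expect() helper are invented for illustration, and a real parser would check token kinds rather than accept any string as the identifier and number):

    #include <stdio.h>
    #include <string.h>

    /* Token stream as a lexer might produce it for: int x = 1; */
    static const char *toks[] = { "int", "x", "=", "1", ";" };
    static int pos = 0;

    /* Consume the current token if it matches, else report failure. */
    static int expect(const char *want)
    {
        if (strcmp(toks[pos], want) == 0) { pos++; return 1; }
        return 0;
    }

    /* Grammar rule: declaration := "int" identifier "=" number ";" */
    static int parse_declaration(void)
    {
        if (!expect("int")) return 0;
        const char *name = toks[pos++];   /* identifier (unchecked here) */
        if (!expect("=")) return 0;
        const char *value = toks[pos++];  /* number (unchecked here) */
        if (!expect(";")) return 0;
        printf("declared %s with initial value %s\n", name, value);
        return 1;
    }

    int main(void)
    {
        return parse_declaration() ? 0 : 1;
    }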

I wouldn't get too hung up on precise terminological usage though - people often use 'parsing' to describe any action of interpreting a lump of text.

Will Dean
With PEG parsers the distinction between tokenizer and parser is even less clear.
Andre Artus
+1  A: 

Example:

int x = 1;

A lexer or tokeniser will split that up into tokens 'int', 'x', '=', '1', ';'.

A parser will take those tokens and use them to understand in some way:

  • we have a statement
  • it's a definition of an integer
  • the integer is called 'x'
  • 'x' should be initialised with the value 1
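Those facts then need to live in some data structure the rest of the program can work with. As a hypothetical sketch (the field names are made up), a flat record like this is often all the declaration-extraction task in the question needs, even without a full AST:

    #include <stdio.h>

    /* Hypothetical record for one extracted declaration. */
    struct extracted_decl {
        const char *type;   /* "int" */
        const char *name;   /* "x"   */
        const char *init;   /* "1", or NULL when there is no initialiser */
    };

    int main(void)
    {
        /* What a parser might record after understanding: int x = 1; */
        struct extracted_decl d = { "int", "x", "1" };
        printf("%s %s = %s\n", d.type, d.name, d.init ? d.init : "<none>");
        return 0;
    }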
anon
A lexer will note that "int", "=", and ";" are tokens with no further meaning, that "x" is an identifier name or something similar, with value "x", and that "1" is an integer or number, with value "1". A tokenizer won't necessarily do that.
David Thornley