views: 540, answers: 3
Hello,

I am looking for a clear definition of what a "tokenizer", "parser", and "lexer" are and how they are related to each other (e.g., does a parser use a tokenizer, or vice versa)? I need to create a program that will go through .c/.h source files to extract data declarations and definitions.

I have been looking for examples and can find some info, but I am really struggling to grasp the underlying concepts like grammar rules, parse trees, and abstract syntax trees, and how they interrelate. Eventually these concepts need to be represented in an actual program, but 1) what do they look like, and 2) are there common implementations?

I have been looking at Wikipedia articles on these topics and at programs like Lex and Yacc, but having never taken a compiler class (I am an EE major), I am finding it difficult to fully understand what is going on.

Mark

+9  A: 

A tokenizer breaks a stream of text into tokens, usually by looking for whitespace (tabs, spaces, new lines).
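As a minimal sketch of that idea (assuming nothing beyond the C standard library), a whitespace-only tokenizer could look like this:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Example input; any C-like text works. */
        char input[] = "int x = 1;";

        /* Split on whitespace only. Note that this leaves "1;" glued
           together, which is exactly why real tools scan character by
           character instead of just splitting on spaces. */
        for (char *tok = strtok(input, " \t\n"); tok != NULL;
             tok = strtok(NULL, " \t\n"))
            printf("token: %s\n", tok);
        return 0;
    }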

A lexer is basically a tokenizer, but it usually attaches extra context to the tokens -- this token is a number, that token is a string literal, this other token is an equality operator.
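For instance, a hand-rolled lexer sketch might classify tokens as it scans them. The token-kind names below are made up for illustration; a generated lexer (e.g., from Lex) would distinguish many more categories:

    #include <ctype.h>
    #include <stdio.h>

    /* Hypothetical token kinds -- a real lexer has many more. */
    enum kind { WORD, NUMBER, OPERATOR, END };

    struct token { enum kind k; char text[32]; };

    /* Scan one token from *p, advancing the cursor past it.
       No bounds checking: this is a sketch, not production code. */
    static struct token next_token(const char **p)
    {
        struct token t = { END, "" };
        int i = 0;

        while (isspace((unsigned char)**p))
            (*p)++;
        if (**p == '\0')
            return t;

        if (isalpha((unsigned char)**p)) {        /* keyword or identifier */
            t.k = WORD;
            while (isalnum((unsigned char)**p))
                t.text[i++] = *(*p)++;
        } else if (isdigit((unsigned char)**p)) { /* numeric literal */
            t.k = NUMBER;
            while (isdigit((unsigned char)**p))
                t.text[i++] = *(*p)++;
        } else {                                  /* single-character operator */
            t.k = OPERATOR;
            t.text[i++] = *(*p)++;
        }
        t.text[i] = '\0';
        return t;
    }

    int main(void)
    {
        const char *src = "int x = 1;";
        const char *names[] = { "word", "number", "operator", "end" };

        for (struct token t = next_token(&src); t.k != END;
             t = next_token(&src))
            printf("%-8s '%s'\n", names[t.k], t.text);
        return 0;
    }

Unlike the whitespace splitter above, this one correctly separates '1' from ';' and reports a kind for each token.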

A parser takes the stream of tokens from the lexer and turns it into an abstract syntax tree representing (usually) the program described by the original text.
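As for what an AST looks like inside a program: it is typically a set of linked node structures. Here is a hypothetical C sketch of the tree for a declaration such as "int x = 1;" (the struct shapes and field names are invented for illustration, not taken from any real compiler):

    #include <stdio.h>

    /* Hypothetical AST node shapes. The declaration node points at its
       initializer expression; those child pointers are what make it a tree. */
    struct ast_expr {
        long int_value;               /* only integer literals in this sketch */
    };

    struct ast_decl {
        const char *type_name;        /* e.g. "int" */
        const char *identifier;       /* e.g. "x"   */
        struct ast_expr *initializer; /* child node; NULL if none */
    };

    int main(void)
    {
        /* Hand-built tree for the declaration: int x = 1; */
        struct ast_expr one = { 1 };
        struct ast_decl decl = { "int", "x", &one };

        printf("decl %s %s = %ld\n", decl.type_name, decl.identifier,
               decl.initializer->int_value);
        return 0;
    }

A parser would build such nodes as it recognises each construct, instead of hand-initialising them as main() does here.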

Last I checked, the best book on the subject was "Compilers: Principles, Techniques, and Tools" usually just known as "The Dragon Book".

Roger Lipscombe
No doubt "The Dragon Book" is a good book, but it does require the reader to have a good grounding in CS. Some books with more practical appeal would be "Writing Compilers and Interpreters" by Ronald Mak; "Modern Compiler Implementation" by Andrew Appel; "Compiler Construction" by Niklaus Wirth; "Compiling with C# and Java" and "Compilers and Compiler Generators: An Introduction with C++" by Pat Terry; and, of course, "The Definitive ANTLR Reference" by Terence Parr.
Andre Artus
Sure. Last I checked, I was doing a CS degree :-) I defer to your more-recent recommendations.
Roger Lipscombe
Just to be sure, I am not knocking your recommendation. "The Dragon Book" was my first book on compiler tech, but it was hard going compared to, say, Wirth's book, which you can grok in a few hours. Back then I had few options, as it was the only book I could get my hands on (this being 1991, before Amazon and the WWW). I had that and a collection of text files produced by Jack W. Crenshaw called "LET'S BUILD A COMPILER" (thanks Jack!). The Dragon Book is still the one to get for a more complete understanding of the principles, but most programmers just need a pragmatic introduction.
Andre Artus
+1  A: 

I would say that a lexer and a tokenizer are basically the same thing, and that they smash the text up into its component parts (the 'tokens'). The parser then interprets the tokens using a grammar.
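To make "interprets the tokens using a grammar" concrete: in a hand-written recursive-descent parser, each grammar rule maps onto a function. Below is a hypothetical sketch for a rule such as declaration := "int" identifier "=" number ";" (the token array and the expect() helper are invented for illustration, and a real parser would check token kinds rather than accept any string as the identifier and number):

    #include <stdio.h>
    #include <string.h>

    /* Token stream as a lexer might produce it for: int x = 1; */
    static const char *toks[] = { "int", "x", "=", "1", ";" };
    static int pos = 0;

    /* Consume the current token if it matches, else report failure. */
    static int expect(const char *want)
    {
        if (strcmp(toks[pos], want) == 0) { pos++; return 1; }
        return 0;
    }

    /* Grammar rule: declaration := "int" identifier "=" number ";" */
    static int parse_declaration(void)
    {
        if (!expect("int")) return 0;
        const char *name = toks[pos++];   /* identifier (unchecked here) */
        if (!expect("=")) return 0;
        const char *value = toks[pos++];  /* number (unchecked here) */
        if (!expect(";")) return 0;
        printf("declared %s with initial value %s\n", name, value);
        return 1;
    }

    int main(void)
    {
        return parse_declaration() ? 0 : 1;
    }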

I wouldn't get too hung up on precise terminological usage though - people often use 'parsing' to describe any action of interpreting a lump of text.

Will Dean
With PEG parsers the distinction between tokenizer and parser is even less clear.
Andre Artus
+1  A: 

Example:

int x = 1;

A lexer or tokeniser will split that up into tokens 'int', 'x', '=', '1', ';'.

A parser will take those tokens and use them to understand in some way:

  • we have a statement
  • it's a definition of an integer
  • the integer is called 'x'
  • 'x' should be initialised with the value 1
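Those facts then need to live in some data structure the rest of the program can work with. As a hypothetical sketch (the field names are made up), a flat record like this is often all the declaration-extraction task in the question needs, even without a full AST:

    #include <stdio.h>

    /* Hypothetical record for one extracted declaration. */
    struct extracted_decl {
        const char *type;   /* "int" */
        const char *name;   /* "x"   */
        const char *init;   /* "1", or NULL when there is no initialiser */
    };

    int main(void)
    {
        /* What a parser might record after understanding: int x = 1; */
        struct extracted_decl d = { "int", "x", "1" };
        printf("%s %s = %s\n", d.type, d.name, d.init ? d.init : "<none>");
        return 0;
    }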
anon
A lexer will note that "int", "=", and ";" are tokens with no further meaning, that "x" is an identifier name or something similar, with value "x", and that "1" is an integer or number, with value "1". A tokenizer won't necessarily do that.
David Thornley