views:

253

answers:

2

Hi, I'm currently working my way through this book: http://www1.idc.ac.il/tecs/

I'm currently on a section where the exercise is to create a compiler for a very simple Java-like language. The book always states what is required but not the how (which is a good thing). I should also mention that it talks about yacc and lex, and specifically says to avoid them for the projects in the book for the sake of learning on your own.

I'm on chapter 10 and starting to write the tokenizer.

1) Can anyone give me some general advice - are regexes the best approach for tokenizing a source file?

2) I want to remove comments from source files before parsing - this isn't hard, but most compilers tell you the line an error occurs on, and if I just remove comments this will mess up the line count. Are there any simple strategies for preserving the line count while still removing junk?

Thanks in advance!

+2  A: 

The tokenizer itself is usually written using a large DFA table that describes all possible valid tokens (stuff like: a token can start with a letter, followed by other letters/numbers, followed by a non-letter; or with a number followed by other numbers and either a non-number/point, or a point followed by at least one number and then a non-number; etc.). The way I built mine was to identify all the regular expressions my tokenizer will accept, transform them into DFAs and combine them.
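
For a concrete (if simplified) idea of what "combine them" can mean in practice, here's a minimal sketch in Python: each token type gets its own regular expression, and they're joined into one master pattern with named groups. This is only an illustration of the idea - Python's re module backtracks instead of building a true DFA table, and the token names and patterns here are made up for the example:

    import re

    # One regex per token type, combined into a single master pattern with
    # named groups.  (Illustrative token set, not the book's exact grammar.)
    TOKEN_SPEC = [
        ("NUMBER",     r"\d+(?:\.\d+)?"),            # integer or decimal literal
        ("IDENTIFIER", r"[A-Za-z_][A-Za-z_0-9]*"),
        ("SYMBOL",     r"[{}()\[\].,;+\-*/&|<>=~]"),
        ("WHITESPACE", r"\s+"),
    ]
    MASTER_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

    def tokenize(source):
        pos = 0
        while pos < len(source):
            match = MASTER_RE.match(source, pos)
            if not match:
                raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
            if match.lastgroup != "WHITESPACE":       # whitespace is recognised, then dropped
                yield match.lastgroup, match.group()
            pos = match.end()

    print(list(tokenize("let x = 1.5;")))
    # [('IDENTIFIER', 'let'), ('IDENTIFIER', 'x'), ('SYMBOL', '='), ('NUMBER', '1.5'), ('SYMBOL', ';')]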

Now to "remove comments", when you're parsing a token you can have a comment token (the regex to parse a comment, too long to describe in words), and when you finish parsing this comment you just parse a new token, thus ignoring it. Alternatively you can pass it to the compiler and let it deal with it (or ignore it as it will). Either aproach will preserve meta-data like line numbers and characters-into-the-line.

edit for DFA theory:

Every regular expression can be converted (and is converted) into a DFA for performance reasons. This removes any backtracking when matching them. This link gives you an idea of how this is done: you first convert the regular expression into an NFA (a nondeterministic automaton, which would need backtracking if simulated directly), then remove the nondeterminism by building DFA states out of sets of NFA states (the subset construction), which can inflate your finite automaton.
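
If you want to see the mechanics without reading the whole theory, here's a rough sketch of the subset construction in Python. The NFA representation here (a dict keyed by (state, symbol), with None for epsilon moves) is just one possible hand-rolled encoding, not anything standard:

    from collections import deque

    def epsilon_closure(nfa, states):
        """All states reachable from `states` by epsilon (None) moves alone."""
        closure, stack = set(states), list(states)
        while stack:
            s = stack.pop()
            for t in nfa.get((s, None), set()):
                if t not in closure:
                    closure.add(t)
                    stack.append(t)
        return frozenset(closure)

    def subset_construction(nfa, start, alphabet):
        """Return (start state, transitions); each DFA state is a frozenset of NFA states."""
        dfa_start = epsilon_closure(nfa, {start})
        dfa, todo = {}, deque([dfa_start])
        while todo:
            current = todo.popleft()
            if current in dfa:
                continue
            dfa[current] = {}
            for sym in alphabet:
                moved = set()
                for s in current:
                    moved |= nfa.get((s, sym), set())
                nxt = epsilon_closure(nfa, moved)
                if nxt:
                    dfa[current][sym] = nxt
                    todo.append(nxt)
        return dfa_start, dfa

    # NFA for the regex a*b: state 0 loops on 'a', accepts via 'b' into state 1.
    nfa = {(0, "a"): {0}, (0, "b"): {1}}
    start_state, dfa = subset_construction(nfa, 0, {"a", "b"})
    print(dfa[start_state])   # {'a': frozenset({0}), 'b': frozenset({1})}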

Another way you can build your DFA is by hand, using some common sense. Take for example a finite automaton that can parse either an identifier or a number. This of course isn't enough, since you most likely want to add comments too, but it'll give you an idea of the underlying structures.

          A-Z       space
->(Start)----->(I1)------->((Identifier))
     |         | ^
     |         +-+
     |        A-Z0-9
     |
     |          space   
     +---->(N1)---+--->((Number)) <----------+
      0-9  | ^    |                          |
           | |    | .       0-9       space  |
           +-+    +--->(N2)----->(N3)--------+
           0-9                   | ^
                                 +-+
                                 0-9

Some notes on the notation used: the DFA starts at the (Start) node and moves through the arrows as input is read from your file. At any one point it can match only ONE path. Any missing paths default to an "error" node. ((Number)) and ((Identifier)) are your ending, success nodes. Once in those nodes, you return your token.

So from the start, if your token starts with a letter, it HAS to continue with a bunch of letters or numbers and end with a "space" (spaces, new lines, tabs, etc). There is no backtracking; if this fails, the tokenizing process fails and you can report an error. You should read a theory book on error recovery to continue parsing, it's a really huge topic.

If however your token starts with a number, it has to be followed by either a bunch of numbers or one decimal point. If there's no decimal point, a "space" has to follow the numbers; otherwise the decimal point has to be followed by at least one number, then a bunch of numbers, then a space. I didn't include scientific notation but it's not hard to add.
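
Here's the diagram above transcribed more or less literally into a hand-coded scanner (a Python sketch, with isalpha/isdigit/isspace standing in for the A-Z / 0-9 / space edge labels), just to show what "building the DFA by hand" can look like in code:

    # Classifies one whitespace-terminated token as an identifier or a number,
    # moving through the same states as the diagram with no backtracking.
    def classify(text):
        state = "Start"
        for ch in text + " ":                      # trailing space plays the "space" edge
            if state == "Start":
                if ch.isalpha():   state = "I1"
                elif ch.isdigit(): state = "N1"
                else:              return "error"
            elif state == "I1":                    # letters/digits loop, space accepts
                if ch.isalnum():   state = "I1"
                elif ch.isspace(): return "Identifier"
                else:              return "error"
            elif state == "N1":                    # digits loop, '.' goes to N2, space accepts
                if ch.isdigit():   state = "N1"
                elif ch == ".":    state = "N2"
                elif ch.isspace(): return "Number"
                else:              return "error"
            elif state == "N2":                    # must see at least one digit after '.'
                if ch.isdigit():   state = "N3"
                else:              return "error"
            elif state == "N3":                    # digits loop, space accepts
                if ch.isdigit():   state = "N3"
                elif ch.isspace(): return "Number"
                else:              return "error"

    print(classify("foo42"), classify("3.14"), classify("3."))   # Identifier Number error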

Now for parsing speed, this gets transformed into a DFA table, with all nodes on both the vertical and horizontal lines. Something like this:

             I1            Identifier  N1      N2      N3      Number
start        letter        nothing     number  nothing nothing nothing
I1           letter+number space       nothing nothing nothing nothing
Identifier   nothing       SUCCESS     nothing nothing nothing nothing
N1           nothing       nothing     number  dot     nothing space
N2           nothing       nothing     nothing nothing number  nothing
N3           nothing       nothing     nothing nothing number  space
Number       nothing       nothing     nothing nothing nothing SUCCESS

The way you'd run this is you store your starting state and move through the table as you read your input character by character. For example an input of "1.2" would parse as start->N1->N2->N3->Number->SUCCESS. If at any point you hit a "nothing" node, you have an error.
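
As a rough illustration, here is the same automaton as a table-driven scanner in Python, indexed state -> character class -> state (the corrected orientation mentioned in the edit below). Character classes stand in for individual characters to keep the table small:

    def char_class(ch):
        if ch.isalpha():  return "letter"
        if ch.isdigit():  return "digit"
        if ch == ".":     return "dot"
        if ch.isspace():  return "space"
        return "other"

    TABLE = {
        "Start": {"letter": "I1", "digit": "N1"},
        "I1":    {"letter": "I1", "digit": "I1", "space": "Identifier"},
        "N1":    {"digit": "N1", "dot": "N2", "space": "Number"},
        "N2":    {"digit": "N3"},
        "N3":    {"digit": "N3", "space": "Number"},
    }
    ACCEPTING = {"Identifier", "Number"}

    def run(text):
        state = "Start"
        for ch in text + " ":                      # terminating space, as in the example
            state = TABLE.get(state, {}).get(char_class(ch))
            if state is None:
                return "error"                     # a missing entry is the "nothing" node
            if state in ACCEPTING:
                return state

    print(run("1.2"))   # Number: Start -> N1 -> N2 -> N3 -> Number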

edit 2: the table should actually be node->character->node, not node->node->character, but it worked fine in this case regardless. It's been a while since I last wrote a compiler by hand.

Blindy
Unfortunately I didn't do a computer science degree, I take it a DFA table is a Deterministic Finite Automaton table? Did a bit of googling and found this: http://www.cs.odu.edu/~toida/nerzic/390teched/regular/fa/dfa-definitions.html Could you expand a bit on "The way I built mine was to identify all the regular expressions my tokenizer will accept, transform them into DFAs and combine them"? Thanks in advance!
bplus
Hi there, thank you very much for the detailed answer. Very kind of you to bother explaining this stuff.
bplus
Np, glad it helped!
Blindy
A: 

1- Yes, regexes are good for implementing the tokenizer. If you use a generated tokenizer like lex, then you describe each token as a regex. See Mark's answer.

2- The lexer is what normally tracks line/column information. As tokens are consumed by the tokenizer, you track the line/column information with the token, or keep it as current state. Therefore when a problem is found the tokenizer knows where you are. When processing comments, as new lines are consumed the tokenizer just increments the line count.
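
A small sketch of that bookkeeping (Python, the names are illustrative): the lexer keeps the current line/column as state, stamps each token with the position where it started, and advances the counters from whatever lexeme it just consumed, comments included:

    from dataclasses import dataclass

    @dataclass
    class Token:
        kind: str
        text: str
        line: int
        column: int

    class Position:
        def __init__(self):
            self.line, self.column = 1, 1

        def advance(self, lexeme):
            newlines = lexeme.count("\n")
            if newlines:
                self.line += newlines
                self.column = len(lexeme) - lexeme.rfind("\n")   # chars after the last newline
            else:
                self.column += len(lexeme)

    # inside the scanning loop: stamp the token first, then advance past its text
    # tok = Token(kind, text, pos.line, pos.column); pos.advance(text)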

In lex you can also have parsing states. Multi-line comments are often implemented using these states, thus allowing simpler regexes. Once you find the match for the start of a comment, e.g. '/*', you change into the comment state, which you can set up to be exclusive from the normal state. Therefore as you consume text looking for the end-of-comment marker '*/' you do not match normal tokens.
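
If you're not using lex, you can fake an exclusive comment state with a plain flag. A rough sketch (Python, illustrative only) that also answers the original question 2: it drops comment text but keeps the newlines, so line numbers downstream stay correct:

    def strip_block_comments(source):
        NORMAL, COMMENT = 0, 1
        state, line, i, out = NORMAL, 1, 0, []
        while i < len(source):
            if state == NORMAL:
                if source.startswith("/*", i):
                    state = COMMENT
                    i += 2
                else:
                    if source[i] == "\n":
                        line += 1
                    out.append(source[i])
                    i += 1
            else:  # COMMENT state: ignore everything except newlines and the end marker
                if source.startswith("*/", i):
                    state = NORMAL
                    i += 2
                else:
                    if source[i] == "\n":
                        line += 1
                        out.append("\n")       # keep the newline so later line counts still match
                    i += 1
        return "".join(out), line

    text, last_line = strip_block_comments("a /* one\ntwo */ b\nc")
    print(repr(text), last_line)   # 'a \n b\nc' 3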

This state-based approach is also useful for processing string literals that allow escaped end markers, e.g. "test\"more text".
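
For the string case, one way to express "an escaped quote doesn't end the literal" as a single pattern (Python sketch): inside the quotes, match either an ordinary character or a backslash followed by any character, so \" cannot terminate the literal.

    import re

    STRING_RE = re.compile(r'"(?:[^"\\\n]|\\.)*"')

    print(STRING_RE.match(r'"test\"more text"').group())   # "test\"more text"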

Simeon Pilgrim