ansaurus

Question

Answer 1

+5 A:

You are best served by making your token types closely match your grammar's terminal symbols.

Without knowing the language/grammar, I expect you would be better served by having token types for "LESS_THAN", "LESS_THAN_OR_EQUAL" and also "FLOAT", "DOUBLE", "INTEGER", etc.
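As a rough sketch of that idea (the grammar here is hypothetical, and the names `TokenType`, `OPERATORS` and `match_operator` are made up for illustration), each terminal gets its own token type, and the longest operator is tried first so that "<=" is not split into "<" and "=":

```python
from enum import Enum, auto


class TokenType(Enum):
    # One member per terminal symbol of an assumed grammar.
    LESS_THAN = auto()
    LESS_THAN_OR_EQUAL = auto()
    INTEGER = auto()
    FLOAT = auto()


# Each operator lexeme maps to its own token type.
OPERATORS = {
    "<=": TokenType.LESS_THAN_OR_EQUAL,
    "<": TokenType.LESS_THAN,
}


def match_operator(source, pos):
    """Return (token type, lexeme) for the operator starting at pos, or None."""
    # Try longer lexemes first so "<=" wins over "<".
    for lexeme in sorted(OPERATORS, key=len, reverse=True):
        if source.startswith(lexeme, pos):
            return OPERATORS[lexeme], lexeme
    return None
```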

dty 2010-06-11 12:10:12

Answer 2

A:

It depends on how you are taking in tokens. If you are doing it character by character, it might be a bit tricky, but if you are doing it word by word, e.g.

int a = a + 2.0

then the tokens would be (discarding whitespace)

int
a
=
a
+
2.0

So you wouldn't run into the situation where you interpret the . as a token; instead you take in the whole string, at which point you can determine whether it's a FLOAT or NUMBER or whatever you want.
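A minimal sketch of the word-by-word approach, assuming tokens are always separated by whitespace (the keyword set and the category names are assumptions for illustration, not part of any particular language):

```python
import re

KEYWORDS = {"int"}  # hypothetical keyword set


def classify(word):
    """Decide the token type once the whole word is available."""
    if word in KEYWORDS:
        return "KEYWORD"
    if re.fullmatch(r"[0-9]+", word):
        return "INTEGER"
    if re.fullmatch(r"[0-9]+\.[0-9]+", word):
        return "FLOAT"
    if re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", word):
        return "IDENTIFIER"
    return "OPERATOR"


def tokenize_words(line):
    # str.split() with no arguments splits on runs of whitespace and drops them.
    return [(classify(word), word) for word in line.split()]
```

This only works while every token is surrounded by whitespace; input such as `a+2.0` arrives as a single word, so a character-by-character scanner is needed for that case.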

djhworld 2010-06-11 12:14:29

Answer 3

+2 A:

I think that the answer to your question is strictly tied to the semantics of NUMBER. What should a NUMBER be? An always-positive integer, a float...

I'd suggest looking into the lex and yacc tools (or their free counterparts, flex and bison) found on Unix-like operating systems: these are powerful scanner and parser generators that take a grammar and output a compilable, readily usable program.

Alessandro Baldoni 2010-06-11 12:15:30

+1 - if you're not doing it for learning, use a tool. Also check out ANTLR for the Java world.

cyborg 2010-06-11 12:18:23

Yes, that's right. But I am writing my parser for educational purposes so as to understand the details of a parser. That's why I want to create it from scratch.

René Nyffenegger 2010-06-11 12:18:39

All those options are open source if you're really stuck and want to see something that's already been done.

cyborg 2010-06-11 12:28:41

Answer 4

+2 A:

It depends upon your target language.

The point behind a lexer is to return tokens that make it easy to write a parser for your language. Suppose your lexer returns NUMBER when it sees text that matches "[0-9]+". If it sees a non-integer number, such as "3.1415926", it will return NUMBER . NUMBER. While you could handle that in your parser, if your lexer is doing an appropriate job of skipping whitespace and comments (since they aren't relevant to your parser), the parser can no longer tell whether the three tokens were adjacent, so you could end up incorrectly parsing things like "123 /* comment */ . \n /* other comment */ 456" as floating point numbers.
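The problem can be reproduced with a small regex-based lexer. This is only a sketch: `lex_numbers`, `INTEGER_ONLY` and `WITH_FRACTION` are names invented here, and the comment syntax is assumed to be C-style.

```python
import re

INTEGER_ONLY = r"[0-9]+"
WITH_FRACTION = r"[0-9]+(?:\.[0-9]+)?"  # the fraction is part of the literal


def lex_numbers(source, number_pattern):
    """Lex numbers and dots, skipping whitespace and /* ... */ comments."""
    token_re = re.compile(
        rf"(?P<SKIP>\s+|/\*.*?\*/)|(?P<NUMBER>{number_pattern})|(?P<DOT>\.)",
        re.DOTALL,  # let a comment span several lines
    )
    tokens = []
    pos = 0
    while pos < len(source):
        match = token_re.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


tricky = "123 /* comment */ . \n /* other comment */ 456"

# With integer-only numbers, both inputs look identical to the parser.
kinds_pi = [kind for kind, _ in lex_numbers("3.1415926", INTEGER_ONLY)]
kinds_tricky = [kind for kind, _ in lex_numbers(tricky, INTEGER_ONLY)]
```

With `INTEGER_ONLY`, both `kinds_pi` and `kinds_tricky` come out as NUMBER, DOT, NUMBER. With `WITH_FRACTION`, "3.1415926" becomes a single NUMBER while the tricky input still yields three separate tokens, so the parser can tell them apart.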

As for lexing "-[0-9]+" as a NUMBER vs MINUS NUMBER, again, that depends upon your target language, but I would usually go with MINUS NUMBER. Otherwise you would end up lexing "A = 1-2-3-4" as SYMBOL = NUMBER NUMBER NUMBER NUMBER instead of SYMBOL = NUMBER MINUS NUMBER MINUS NUMBER MINUS NUMBER.
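To see the difference concretely, here is a small sketch that lexes the same input with both number patterns (`lex_expr` and the token names other than NUMBER, MINUS and SYMBOL are assumptions made for this example):

```python
import re


def lex_expr(source, number_pattern):
    """Return the token names for a tiny assignment/subtraction language."""
    token_re = re.compile(
        rf"(?P<SKIP>\s+)|(?P<NUMBER>{number_pattern})|(?P<SYMBOL>[A-Za-z_]\w*)"
        r"|(?P<EQUALS>=)|(?P<MINUS>-)"
    )
    names = []
    pos = 0
    while pos < len(source):
        match = token_re.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        if match.lastgroup != "SKIP":
            names.append(match.lastgroup)
        pos = match.end()
    return names


# The sign is swallowed by the number: the subtraction operators disappear.
signed = lex_expr("A = 1-2-3-4", r"-?[0-9]+")

# The sign is its own token: the parser sees every MINUS.
unsigned = lex_expr("A = 1-2-3-4", r"[0-9]+")
```

In the second form, a unary minus such as `-5` is left for the parser to recognize as MINUS followed by NUMBER.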

While we're on the topic, I'd strongly recommend the book Language Implementation Patterns, by Terence Parr, the author of ANTLR.

Craig Trader 2010-06-11 12:22:35

Accepted because it points out that lexing `A = 1-2-3-4` could pose a problem if lexed as `NUMBER` `NUMBER`....

René Nyffenegger 2010-06-11 13:19:15

Answer 5

+2 A:

From my experience with actual lexers:

Make sure to check if you actually need comment / whitespace tokens. Compilers typically don't need them, while IDEs often do (to color comments green, for example).
Usually there's no single "operator" token; instead, there's a token for each distinct operator. So there's a PLUS token, an AMPERSAND token, a LESSER_THAN token, etc. This means that you only care about the lexeme (the actual text matched) when the token is an identifier or some sort of literal.
Avoid splitting literals. If "hello world" is a string literal, parse it as a single token. If -3.058e18 is a float literal, parse it as a single token as well. Lexers usually rely on regular expressions, which are expressive enough for all these things, and more. Of course, if the literals are complex enough you have to split them (e.g. the block literal in Smalltalk).
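The three points above could be sketched roughly as follows. The token set, the `keep_trivia` flag and the string and comment syntax are all assumptions for illustration; whether a leading minus belongs to a numeric literal is a language design choice and is left out here.

```python
import re

# One entry per token type; literals are matched whole by a single regex.
TOKEN_SPEC = [
    ("COMMENT", r"/\*.*?\*/"),
    ("WHITESPACE", r"\s+"),
    ("FLOAT", r"[0-9]+\.[0-9]+(?:[eE][+-]?[0-9]+)?"),
    ("INTEGER", r"[0-9]+"),
    ("STRING", r'"[^"\n]*"'),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("PLUS", r"\+"),
    ("AMPERSAND", r"&"),
    ("LESSER_THAN", r"<"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,
)
TRIVIA = {"COMMENT", "WHITESPACE"}
OPERATORS = {"PLUS", "AMPERSAND", "LESSER_THAN"}


def lex(source, keep_trivia=False):
    """Return (kind, lexeme) pairs; operators carry no lexeme."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        kind = match.lastgroup
        # A compiler would drop trivia; an IDE would pass keep_trivia=True.
        if keep_trivia or kind not in TRIVIA:
            lexeme = None if kind in OPERATORS else match.group()
            tokens.append((kind, lexeme))
        pos = match.end()
    return tokens
```

Calling `lex('x < "hello world" + 3.058e18')` yields one token each for the string and the float, while the operator tokens are identified by their kind alone.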

Oak 2010-06-11 12:26:33


Tokenizing numbers for a parser
