I am writing my first parser and have a few questions concerning the tokenizer.
Basically, my tokenizer exposes a nextToken() function that is supposed to return the next token. These tokens are distinguished by a token-type. I think it would make sense to have the following token-types:
- SYMBOL (such as <, :=, ( and the like)
- WHITESPACE (tab, newlines, spaces...)
- REMARK (a comment between /* ... */ or after // through the end of the line)
- NUMBER
- IDENT (such as the name of a function or a variable)
- STRING (Something enclosed between "....")
Now, do you think this makes sense?
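To make the list concrete, here is a rough sketch of how these token types might be represented, written in Python purely for illustration. The names TokenType and Token are made up for this example and are not taken from any existing code.

```python
from dataclasses import dataclass
from enum import Enum, auto


class TokenType(Enum):
    """The six token types proposed in the list above."""
    SYMBOL = auto()      # <, :=, ( and the like
    WHITESPACE = auto()  # tabs, newlines, spaces
    REMARK = auto()      # /* ... */ or // comments
    NUMBER = auto()
    IDENT = auto()       # function and variable names
    STRING = auto()      # text enclosed in double quotes


@dataclass(frozen=True)
class Token:
    """Hypothetical value that nextToken() would return."""
    type: TokenType
    text: str  # the exact source text the token covers


example = Token(TokenType.IDENT, "foo")
```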
Also, I am struggling with the NUMBER token-type. Do you think it makes more sense to split it further into a NUMBER and a FLOAT token-type? Without a FLOAT token-type, I'd receive a NUMBER (e.g. 402), a SYMBOL (.), and then another NUMBER (e.g. 203) when parsing a float.
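The two alternatives can be compared with a small regex-based sketch. This is only an assumption about how such a tokenizer could be written: the function tokenize_numeric and its with_float_type flag are hypothetical, and it handles only digits and the dot.

```python
import re


def tokenize_numeric(src, with_float_type):
    """Split src into (type, text) pairs, with or without a FLOAT type."""
    # FLOAT must be tried before NUMBER, otherwise "402" would match first.
    float_part = r"(?P<FLOAT>\d+\.\d+)|" if with_float_type else ""
    pattern = re.compile(float_part + r"(?P<NUMBER>\d+)|(?P<SYMBOL>\.)")
    tokens = []
    pos = 0
    while pos < len(src):
        m = pattern.match(src, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        # lastgroup is the name of the alternative that matched.
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


without_float = tokenize_numeric("402.203", with_float_type=False)
with_float = tokenize_numeric("402.203", with_float_type=True)
```

Here without_float holds three tokens (NUMBER, SYMBOL, NUMBER), while with_float holds a single FLOAT token covering the whole input.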
Finally, what do you think makes more sense for the tokenizer to return when it encounters a -909? Should it return the SYMBOL - first, followed by the NUMBER 909, or should it return a NUMBER -909 right away?
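As a sketch of the two behaviours, assuming a regex-based tokenizer restricted to digits and the minus sign (the function tokenize_signed and its sign_in_number flag are hypothetical):

```python
import re


def tokenize_signed(src, sign_in_number):
    """Split src into (type, text) pairs; optionally fold '-' into NUMBER."""
    number = r"-?\d+" if sign_in_number else r"\d+"
    pattern = re.compile(rf"(?P<NUMBER>{number})|(?P<SYMBOL>-)")
    tokens = []
    pos = 0
    while pos < len(src):
        m = pattern.match(src, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


separate = tokenize_signed("-909", sign_in_number=False)
combined = tokenize_signed("-909", sign_in_number=True)
```

With this sketch, separate is SYMBOL - followed by NUMBER 909, and combined is the single NUMBER -909. Under the same sketch, an input like 5-909 with sign_in_number=True comes out as NUMBER 5 followed by NUMBER -909, with no SYMBOL between them.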