What are some examples of errors a lexical analyzer could detect in a given piece of code in a language like Java, C++ or C?
A lexer can detect sequences of characters that have no possible meaning (where meaning is determined by the parser). For example, in Java, the sequence bana"na
cannot be an identifier, a keyword, an operator, etc.
However, a lexer cannot detect that a sequence of lexically valid tokens is meaningless or ungrammatical. So a Java lexer, for example, would happily return the token sequence final "banana" final "banana"
, seeing a keyword, a string constant, a keyword, and a string constant, respectively.
I haven't double-checked the grammar but I think that a string like "2cat", for example, isn't any kind of valid/expected/categorizable token.
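To make the examples above concrete, here is a minimal sketch of a scanner's token classifier (not any real compiler's code; the token names are invented for illustration). It classifies the token starting at s and reports LEX_ERROR for character sequences, like "2cat" or an unterminated string, that fit no token class:

```c
#include <ctype.h>

/* Token classes for this sketch; LEX_ERROR marks a lexical error. */
enum Kind { TOK_IDENT, TOK_NUMBER, TOK_STRING, LEX_ERROR };

/* Classify the token starting at s; *end receives the position
 * just past the consumed characters. */
enum Kind scan(const char *s, const char **end) {
    if (isalpha((unsigned char)*s) || *s == '_') {      /* identifier or keyword */
        while (isalnum((unsigned char)*s) || *s == '_') s++;
        *end = s;
        return TOK_IDENT;
    }
    if (isdigit((unsigned char)*s)) {                   /* number literal */
        while (isdigit((unsigned char)*s)) s++;
        *end = s;
        /* "2cat": digits immediately followed by a letter fit no
         * token class, so the lexer can reject the sequence outright. */
        if (isalpha((unsigned char)*s)) return LEX_ERROR;
        return TOK_NUMBER;
    }
    if (*s == '"') {                                    /* string literal */
        s++;
        while (*s && *s != '"') s++;
        if (*s != '"') { *end = s; return LEX_ERROR; }  /* unterminated string */
        *end = s + 1;
        return TOK_STRING;
    }
    *end = s;
    return LEX_ERROR;                                   /* e.g. a stray byte */
}
```

On the earlier example, this scanner reads bana"na as the identifier bana followed by "na, which is an unterminated string literal and therefore a lexical error.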
In addition to the cases mentioned below, most compilers also handle comments in the lexer. So errors related to comments (improperly nested, not closed) could also be detected here.
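As a sketch of the comment case (illustrative only, not a real compiler's routine), a check for an unterminated C-style block comment might look like this:

```c
/* Return 1 if the input starting at s is either not a block comment
 * or a properly closed one; return 0 for a comment that reaches end
 * of input without its closing marker -- a lexical error many
 * compilers report explicitly. */
int comment_closed(const char *s) {
    if (s[0] != '/' || s[1] != '*')
        return 1;                       /* not a block comment at all */
    s += 2;
    while (*s) {
        if (s[0] == '*' && s[1] == '/')
            return 1;                   /* found the closing marker */
        s++;
    }
    return 0;                           /* unterminated comment */
}
```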
Another issue is the case of user-defined data types, which need to be handled together by the lexer and the parser. Consider the following code:
typedef int myinteger;
myinteger x;
In the second statement, myinteger is a data type, and the lexer should return myinteger as a type name, not as an identifier. This is generally done by cross-referencing a potential identifier against a list of user-defined data types that the parser has previously populated.
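This cross-referencing (often called the "lexer hack") can be sketched as follows; the table layout and function names here are illustrative, not taken from any real compiler:

```c
#include <string.h>

/* A small table of typedef names, filled in by the parser and
 * consulted by the lexer. */
#define MAX_TYPEDEFS 64
static const char *typedef_names[MAX_TYPEDEFS];
static int typedef_count = 0;

/* Called by the parser when it reduces a typedef declaration,
 * e.g. after seeing "typedef int myinteger;". */
void parser_registers_typedef(const char *name) {
    if (typedef_count < MAX_TYPEDEFS)
        typedef_names[typedef_count++] = name;
}

/* Called by the lexer for each potential identifier: returns 1 if the
 * name should be tokenized as a type name, 0 for a plain identifier. */
int lexer_sees_type_name(const char *name) {
    for (int i = 0; i < typedef_count; i++)
        if (strcmp(typedef_names[i], name) == 0)
            return 1;
    return 0;
}
```

With this in place, after the parser registers myinteger, the lexer returns a type-name token for the second occurrence rather than an identifier.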
A third issue concerns the context of the token. In a context-sensitive language like C++, the same token (e.g. <) can have different meanings (less than, or the beginning of a template parameter list). This also needs to be handled in cooperation with the parser, which can give feedback to the lexer on its current state.