I'm learning lex, and in the process I'm generating tokens for the C language. I'm trying to recognize single-line comments ("//"), but they conflict with the division operator.

[1-9][0-9]*|0x[0-9a-fA-F][0-9a-fA-F]*           return NUMBER;
[a-zA-Z][a-zA-Z0-9]*                            return IDENT;
/                                               {return DIVIDE;}

[ \t\r\n]
[//]

But when I run the example and enter //, it recognizes them as two division operators. Where should I modify the code? Any suggestions?

Edit:

Lex Code:

%{
#include "y.tab.h"
%}
%array
%%
if                                              {return IF;}
while                                           {return WHILE;}
else                                            {return ELSE;}
int                                             {return INT;}
return                                          {return RETURN;}
\/\/[^\r\n]*
[1-9][0-9]*|0x[0-9a-fA-F][0-9a-fA-F]*           return NUMBER;
[a-zA-Z][a-zA-Z0-9]*                            return IDENT;

[+]                                             {return ADD;}
[-]                                             {return SUB;}
[<]                                             {return LESS;}
[>]                                             {return GREAT;}
[*]                                             {return MULT;}
[/]                                             {return DIVIDE;}
[;]                                             {return SEMICOLON;}

\{                                              return LBRACE;
\}                                              return RBRACE;

[ \t\r\n]

\(                                              return LPAREN;

\)                                              return RPAREN;

.                                               return BADCHAR;
%%

The following is the header file I use

typedef enum {END=0, WHILE, IF, ELSE,RETURN, IDENT, LPAREN, RPAREN,INT,LBRACE,RBRACE, SEMICOLON, EQUALITY, DIVIDE, MULT, LESS, GREAT,
 ADD, SUB, NUMBER,BADCHAR} Token;

The following is the input I'm running:

//
/
p
Token 16, text /
Token 16, text /
Token 16, text /
Token 5, text p

When I run it, comments are consumed and even the divide operator is ignored. But when I enter p, it reports the operators listed above, which it shouldn't be doing.

Note: I'm trying to ignore tabs, newline characters and single-line comments.

Note 2: \/\/[^\r\n]* is the rule I ended up with. I now understand where I made the mistake and wanted to share this.
+4  A: 

According to the Lex manual:

The lexical analysis programs written with Lex accept ambiguous specifications and choose the longest match possible at each input point. If necessary, substantial lookahead is performed on the input, but the input stream will be backed up to the end of the current partition, so that the user has general freedom to manipulate it.

So you should not need to do anything special - // is longer than / so it will prefer a comment over a division operator when it sees two. However, you didn't post your comment rule - where is it?

Edit: never mind, I see it. [//] is a character class. Remove the square brackets. Also, you will want to match to the end of the line - otherwise you will only allow empty comments. So your regex should be something like:

//[^\r\n]*\r\n (adjust as necessary for the newline characters you are supporting - this one requires that a newline be exactly \r\n).

Edit 2: @tur1ng brings up a good point - the last line in your file may not end with a newline. I looked it up and Lex supports <<EOF>> in its regexes also (see http://pltplp.net/lex-yacc/lex.html.en). So you could change to:

//[^\r\n]*((\r\n)|<<EOF>>)
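For what it's worth, here is a minimal sketch of a complete scanner that skips // comments, under a few assumptions. In lex an unquoted / is the trailing-context operator, so the slashes are quoted here (escaping them as \/\/ does the same thing). The comment rule stops before the newline and leaves it to the whitespace rule, which also covers a comment on the last line of a file that has no trailing newline. Flex, at least, only accepts <<EOF>> as a rule on its own, not inside a larger pattern. The token codes, the yywrap() stub and the main() driver are made up for this example; real code would take the codes from y.tab.h.

```lex
%{
#include <stdio.h>
/* Hypothetical token codes for this sketch only;
   a real scanner would get these from y.tab.h. */
enum { DIVIDE = 1, IDENT, BADCHAR };
%}
%%
"//"[^\r\n]*            { /* comment: discard everything up to, not including, the newline */ }
"/"                     { return DIVIDE; }
[a-zA-Z][a-zA-Z0-9]*    { return IDENT; }
[ \t\r\n]+              { /* skip whitespace, including the newline that ends a comment */ }
.                       { return BADCHAR; }
%%
/* Report "no more input" at end of file so no lex library is needed. */
int yywrap(void)
{
    return 1;
}

/* Illustrative driver: print each token until yylex() returns 0 at EOF. */
int main(void)
{
    int tok;
    while ((tok = yylex()) != 0)
        printf("Token %d, text %s\n", tok, yytext);
    return 0;
}
```

Because lex prefers the longest match, "//" followed by the rest of the line wins over a single "/" whenever two slashes appear together, so the order of these two rules does not matter. With the input from the question, this sketch should report one DIVIDE for the lone / and one IDENT for p, and nothing for the comment line.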

danben
I'm trying to ignore tabs, newline characters and single-line comments.
Right, I know - see the edit in my answer.
danben
You are right. Shadowing of the match was not the issue; I've removed my answer.
John Knoeller
What happens if the last line contains something like '// ...'? Will '//[^\r\n]*\r\n' fail, and would '//[^\r\n]*' be the right way?
tur1ng
The first time I run the example:
//
s
Token 16, text /
Token 16, text /
Token 5, text s
The first time I enter a comment it is consumed, but when I next enter an s, it recognizes division twice and then it recognizes an s.
Please post your updated code.
danben
Also, please post the test case in your question with proper formatting so I can see exactly what your input is and what the result is.
danben