I'm experimenting to learn flex and would like to match string literals. My code currently looks like:

"\""([^\n\"\\]*(\\[.\n])*)*"\""        {/*matches string-literal*/;}

I've been struggling with variations for an hour or so and can't get it working the way it should. I'm essentially hoping to match a string literal that can't contain a new-line (unless it's escaped) and supports escaped characters.

I am probably just writing a poor regular expression or one incompatible with flex. Please advise!

+3  A: 

You'll find these links helpful

codaddict
I registered for the site, but it still won't let me up-vote since this was my first question.
Thomas
@Thomas: Oh I see, not a problem, I'll do that on your behalf :)
codaddict
+5  A: 

A string consists of a quote mark

"

followed by zero or more of either an escaped anything

\\.

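or a character that is neither a quote nor a backslash (so that a backslash can only ever begin an escape)

[^"\\]

and finally a terminating quote

"

Put it all together, escaping the delimiting quotes (an unescaped " starts a quoted literal in a flex pattern), and you've got

\"(\\.|[^"\\])*\"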
Jonathan Feinberg
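For anyone who wants to see what a rule of this shape accepts without running flex, here is the same logic spelled out by hand in C: an opening quote, then any number of "backslash plus one character" escapes or ordinary characters, then a closing quote. It is only an illustrative sketch, not what flex generates. The name match_string_literal is made up for this example, and two choices are assumptions added here: an ordinary character may not be a backslash (so a backslash always starts an escape), and a raw newline is rejected while an escaped one is allowed, as the question asks.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not generated by flex: returns the length of the
 * string literal that starts at s[0] (quotes included), or 0 if no
 * well-formed literal starts there.
 */
size_t match_string_literal(const char *s)
{
    size_t i = 0;

    if (s[i] != '"')                /* opening quote */
        return 0;
    i++;

    for (;;) {
        char c = s[i];

        if (c == '\0' || c == '\n')
            return 0;               /* unterminated, or a raw newline */
        if (c == '"')
            return i + 1;           /* terminating quote */
        if (c == '\\') {            /* escaped anything: backslash + one char */
            if (s[i + 1] == '\0')
                return 0;           /* backslash at end of input */
            i += 2;
        } else {
            i++;                    /* ordinary character */
        }
    }
}
```

The interesting input is a literal that ends in an escaped backslash, such as "a\\" followed by more quoted text: because a backslash is never treated as an ordinary character, the scan stops at the first real closing quote instead of running on into the next literal.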
+1, for the clear explanation of what's going on.
codaddict
This doesn't handle escaping, unfortunately. So this would incorrectly lex `"\""`
Paul Biggar
You must have missed "zero or more of an escaped anything"?
Jonathan Feinberg
+1  A: 

How about using a start state...

%{
int enter_dblquotes = 0;
%}

%x DBLQUOTES
%%

\"  { BEGIN(DBLQUOTES); enter_dblquotes++; }

<DBLQUOTES>(\\(.|\n)|[^"\\\n])*\"  {
   /* yytext holds the string body plus the closing quote */
   if (enter_dblquotes){
       handle_this_dblquotes(yytext);
       BEGIN(INITIAL); /* revert back to normal */
       enter_dblquotes--;
   }
}
         ...more rules follow...

Something to that effect. Flex uses %s (inclusive) or %x (exclusive) to declare start states. When the scanner sees a quote, it switches to another state and keeps lexing until it reaches the closing quote, at which point it reverts to the normal state.
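If the state switching is hard to picture, here is a rough hand-written C sketch of the same idea: one variable remembers whether we are inside quotes, a quote flips it, and everything seen while inside is collected. This is not flex output and not a replacement for the rules above. scan_strings, the 64-byte buffers and the simple "drop the backslash, keep the next character" unescaping are all assumptions made up for the illustration.

```c
#include <stddef.h>

enum state { INITIAL, DBLQUOTES };  /* mirrors the two flex start states */

/*
 * Hypothetical sketch: walks over src, copies the body of each
 * double-quoted string into out[] (bodies are truncated to 63 chars),
 * and returns how many complete strings were found (at most max).
 */
int scan_strings(const char *src, char out[][64], int max)
{
    enum state st = INITIAL;
    int count = 0;
    size_t len = 0;

    for (const char *p = src; *p != '\0'; p++) {
        if (st == INITIAL) {
            if (*p == '"' && count < max) {   /* like BEGIN(DBLQUOTES) */
                st = DBLQUOTES;
                len = 0;
            }
        } else if (*p == '"') {               /* closing quote */
            out[count][len] = '\0';
            count++;
            st = INITIAL;                     /* like BEGIN(INITIAL) */
        } else {
            if (*p == '\\' && p[1] != '\0')
                p++;                          /* escape: skip the backslash */
            if (len < 63)
                out[count][len++] = *p;       /* keep the character as-is */
        }
    }
    return count;                             /* an unterminated string is not counted */
}
```

Called on the text x = "a\"b" + "c"; with room for four results, this finds two strings, a"b and c. The escaped quote inside the first one does not end it, which is the whole point of tracking the state.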

Hope this helps, Best regards, Tom.

tommieb75
Overly complex, isn't it?
samoz
@Samoz: Not really. It's actually used in languages that have string literals: it eats up what's between the opening quote and the closing quote, even if there are extra quotes inside, hence the switching of states to chew up the quotes...
tommieb75