views:

2481

answers:

2

I've been looking through the ANTLR v3 documentation (and my trusty copy of "The Definitive ANTLR reference"), and I can't seem to find a clean way to implement escape sequences in string literals (I'm currently using the Java target). I had hoped to be able to do something like:

fragment 
ESCAPE_SEQUENCE
    : '\\' '\'' { setText("'"); }
    ;

STRING  
    : '\'' (ESCAPE_SEQUENCE | ~('\'' | '\\'))* '\''
      { 
        // strip the quotes from the resulting token
        setText(getText().substring(1, getText().length() - 1));
      } 
    ;

For example, I would want the input token "'Foo\'s House'" to become the String "Foo's House".

Unfortunately, the setText(...) call in the ESCAPE_SEQUENCE fragment sets the text for the entire STRING token, which is obviously not what I want.

Is there a way to implement this grammar without adding a method to go back through the resulting string and manually replace escape sequences (e.g., with something like setText(escapeString(getText())) in the STRING rule)?

+4  A: 

Here is how I accomplished this in the JSON parser I wrote.

STRING   
@init{StringBuilder lBuf = new StringBuilder();}
    : 
           '"' 
           ( escaped=ESC {lBuf.append(escaped.getText());} | 
             normal=~('"'|'\\'|'\n'|'\r')     {lBuf.appendCodePoint(normal);} )* 
           '"'     
           {setText(lBuf.toString());}
    ;

fragment
ESC
    : '\\'
     ( 'n'    {setText("\n");}
     | 'r'    {setText("\r");}
     | 't'    {setText("\t");}
     | 'b'    {setText("\b");}
     | 'f'    {setText("\f");}
     | '"'    {setText("\"");}
     | '\''   {setText("\'");}
     | '/'    {setText("/");}
     | '\\'   {setText("\\");}
     | ('u')+ i=HEX_DIGIT j=HEX_DIGIT k=HEX_DIGIT l=HEX_DIGIT   {setText(ParserUtil.hexToChar(i.getText(),j.getText(),k.getText(),l.getText()));}

     )
    ;
Bruno Ranschaert
I used this approach, but note that I had to append "getText()" instead of "escaped.getText()" at each step. The fragment writes the unescaped text to the entire STRING token, which getText() returns. For me, escaped.getText() returns the original fragment with backslashes intact.
CapnNefarious
+2  A: 

I needed to do just that, but my target was C and not Java. Here's how I did it based on answer #1 (and comment), in case anyone needs something alike:

QUOTE   :      '\'';
STR
@init{ pANTLR3_STRING unesc = GETTEXT()->factory->newRaw(GETTEXT()->factory); }
        :       QUOTE ( reg = ~('\\' | '\'') { unesc->addc(unesc, reg); }
                        | esc = ESCAPED { unesc->appendS(unesc, GETTEXT()); } )+ QUOTE { SETTEXT(unesc); };

fragment
ESCAPED :       '\\'
                ( '\\' { SETTEXT(GETTEXT()->factory->newStr8(GETTEXT()->factory, (pANTLR3_UINT8)"\\")); }
                | '\'' { SETTEXT(GETTEXT()->factory->newStr8(GETTEXT()->factory, (pANTLR3_UINT8)"\'")); }
                )
        ;

HTH.