I'm creating a grammar using JavaCC and have run into a small problem. I'm trying to allow any valid character in the extended ASCII set to be recognized by the resulting compiler. After looking at some of the JavaCC examples (primarily the one showing the JavaCC grammar itself), I set up the following token to recognize my characters:

< CHARACTER:
  (   (~["'"," ","\\","\n","\r"])
    | ("\\"
        ( ["n","t","b","r","f","\\","'","\""]
        | ["0"-"7"] ( ["0"-"7"] )?
        | ["0"-"3"] ["0"-"7"] ["0"-"7"]
        )
      )
  )
>

If I'm understanding this correctly, it should match the octal representation of every ASCII character, from 0 to 377 octal (0 to 255 decimal), which covers all 256 characters in the extended ASCII set. This works as expected for all keyboard characters (a-z, 0-9, ?, ., / etc.) and even for most special characters (©, ¬, ®). However, whenever I attempt to parse the trademark symbol (™), my parser throws an End of File exception, indicating that it is unable to recognize the symbol. Is there some obvious way to enhance my definition of a character so that the trademark symbol is accepted?
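To make the range arithmetic concrete, here is a small Java sketch that checks which of the symbols mentioned above have a code point that fits in the 0 to 377 octal range. The class and method names (`CodePointCheck`, `fitsOctalEscape`) are made up for illustration and are not part of JavaCC or of the grammar.

```java
public class CodePointCheck {
    // Largest value a three-digit octal escape (first digit 0-3) can express: octal 377.
    static final int MAX_OCTAL_ESCAPE = Integer.parseInt("377", 8);

    // Hypothetical helper: true if the code point can be written as an octal escape.
    static boolean fitsOctalEscape(int codePoint) {
        return codePoint >= 0 && codePoint <= MAX_OCTAL_ESCAPE;
    }

    public static void main(String[] args) {
        // Copyright sign, not sign, registered sign, trademark sign.
        String[] samples = { "\u00A9", "\u00AC", "\u00AE", "\u2122" };
        for (String s : samples) {
            int cp = s.codePointAt(0);
            System.out.printf("U+%04X fits in octal 0-377: %b%n", cp, fitsOctalEscape(cp));
        }
    }
}
```

In Java, where characters are Unicode, ©, ¬ and ® sit at U+00A9, U+00AC and U+00AE, while ™ sits at U+2122, well above 255.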

+1  A: 

It turns out that what I wanted my grammar to do was accept all valid Unicode characters, not just ASCII characters. The ™ symbol (U+2122) is part of the Unicode specification and not in an extended ASCII character set. Changing my token for a valid character as outlined below solved my problem. (An escaped Unicode character is written as a backslash, then u or U, a plus sign, and four hex digits, e.g. \u+00FF.)

< CHARACTER:(   (~["'"," ","\\","\n","\r"])
| ("\\"
    ( ["n","t","b","r","f","\\","'","\""]
    | ["u","U"]["+"]["0"-"9","a"-"f","A"-"F"]["0"-"9","a"-"f","A"-"F"]["0"-"9","a"-"f","A"-"F"]["0"-"9","a"-"f","A"-"F"]
    )
  ) )>
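For a quick sanity check outside JavaCC, the token above can be approximated with a `java.util.regex` pattern. This is only a sketch under the assumption that the regex mirrors the token's two alternatives; it is not generated by JavaCC, and the names `CharacterTokenSketch` and `isCharacterToken` are hypothetical.

```java
import java.util.regex.Pattern;

public class CharacterTokenSketch {
    // Alternative 1: any single character except quote, space, backslash, LF, CR.
    // Alternative 2: a backslash followed by a simple escape letter, or by
    //                u/U, a plus sign, and exactly four hex digits.
    private static final Pattern CHARACTER = Pattern.compile(
        "[^' \\\\\\n\\r]"
        + "|\\\\(?:[ntbrf\\\\'\"]|[uU]\\+[0-9a-fA-F]{4})");

    // Hypothetical helper: true if the whole input is one CHARACTER token.
    static boolean isCharacterToken(String input) {
        return CHARACTER.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = { "a", "\u2122", "\\n", "\\u+2122", "\\U+00ff", "\\u+212", "\\231", " " };
        for (String s : samples) {
            System.out.println("[" + s + "] -> " + isCharacterToken(s));
        }
    }
}
```

Note that, as written, the revised token no longer accepts the octal escapes from the original definition, so an input such as a backslash followed by 231 is rejected by this sketch.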
RGordon1982