I'm trying to make a Bison parser to handle UTF-8 characters. I don't want the parser to actually interpret the Unicode character values, but I want it to parse the UTF-8 string as a sequence of bytes.

Right now, Bison generates the following code which is problematic:

  if (yychar <= YYEOF)
    {
      yychar = yytoken = YYEOF;
      YYDPRINTF ((stderr, "Now at end of input.\n"));
    }

The problem is that many bytes of a UTF-8 string have a negative value when they are held in a signed char, and Bison treats any value less than or equal to YYEOF as end of input and stops.

Is there a way around this?

+5  A: 

Bison yes, flex no. The one time I needed a Bison parser to work with UTF-8 encoded files, I ended up writing my own yylex function.

edit: To help, I used a lot of the Unicode operations available in glib (there's a gunichar type and some file/string manipulation functions that I found useful).

eduffy
Well, my lexer handles the UTF-8 chars just fine, but the Bison parser stops parsing as soon as it sees a negative value. Please advise.
Martin Cote
Are you reading your file 1 byte at a time? or 1 utf-8 encoded character at a time?
eduffy
1 byte at a time.
Martin Cote
Then that's the problem. The high bit that makes a signed 'char' negative is the same bit that marks a byte as part of a multi-byte UTF-8 sequence (IIRC). You need to use something other than fgetc.
eduffy
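
To make the sign issue from the comments concrete, here is a minimal sketch in C. It assumes the lexer hands bytes to the parser one at a time from an in-memory buffer. The names `input`, `pos`, `byte_to_token` and `next_byte_token` are hypothetical stand-ins for whatever the real `yylex` uses. The idea is to widen each byte through `unsigned char`, so that bytes 0x80 to 0xFF arrive as 128 to 255 instead of negative numbers that fail the `yychar <= YYEOF` test.

```c
#include <stddef.h>

/* Hypothetical input buffer and cursor; a real lexer would read from a
   file or from its own buffer instead. */
static const char *input = "";
static size_t pos = 0;

/* Widen a byte without sign extension: the result is always 0..255,
   even on platforms where plain char is signed. */
static int byte_to_token(char c)
{
    return (unsigned char)c;
}

/* Hypothetical stand-in for a byte-at-a-time yylex. */
static int next_byte_token(void)
{
    char c = input[pos];
    if (c == '\0')
        return 0;               /* 0 is what the parser treats as end of input */
    pos++;
    return byte_to_token(c);    /* never negative, so never mistaken for EOF */
}
```

With this in place, every byte of a multi-byte UTF-8 sequence reaches the parser as a positive value. Whether the grammar can then match those values as tokens depends on how the tokens are declared, which this sketch does not cover.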
+2  A: 

Since flex is the issue here, you might want to take a look at zlex.

chaos
That's an interesting project, but wouldn't exactly solve the problem addressed in this question. 16-bit characters are different from UTF-8 encoded characters (for one thing UTF-8 can be up to 4 bytes in length).
eduffy