We are processing IBM Enterprise Japanese COBOL source code.
The rules that describe exactly what is allowed in G-type literals, and what is allowed in identifiers, are unclear.
The IBM manual indicates that a G'....' literal must have a SHIFT-OUT as the first character inside the quotes, and a SHIFT-IN as the last character before the closing quote. Our COBOL lexer "knows" this, but objects to G literals found in real code. Conclusion: the IBM manual is wrong, or we are misreading it. The customer won't let us see the code, so it is pretty difficult to diagnose the problem.
EDIT: Revised/extended the text below for clarity:
Does anyone know the exact rules of G literal formation, and how they do (or don't) match what the IBM reference manuals say? The ideal answer would be a regular expression for the G literal. This is what we are using now (coded by another author, sigh):
#token non_numeric_literal_quote_g [STRING]
"<G><squote><ShiftOut> (
(<NotLineOrParagraphSeparatorNorShiftInNorShiftOut>|<squote><squote>|<ShiftOut>)
(<NotLineOrParagraphSeparator>|<squote><squote>)
| <ShiftIn> ( <NotLineOrParagraphSeparatorNorApostropheNorShiftInNorShiftOut>|
<ShiftIn>|<ShiftOut>)
| <squote><squote>
)* <ShiftIn><squote>"
where <name> is a macro that is another regular expression. Presumably they are named well enough so you can guess what they contain.
Here is the IBM Enterprise COBOL Reference. Chapter 3 "Character Strings", subheading "DBCS literals", page 32, is the relevant reading. I'm hoping that by providing the exact reference, an experienced IBMer can tell us how we misread it :-{ I'm particularly unclear on what the phrase "DBCS-characters" means when it says "one or more characters in the range X'00...X'FF for either byte". How can DBCS-characters be anything but pairs of 8-bit character codes? If you examine the existing RE, it matches three kinds of character pairs.
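To make the "pairs of 8-bit codes" reading concrete, here is a minimal sketch in Python of that interpretation, working on raw bytes rather than decoded text. It is only my literal reading of the manual text quoted above, not something verified against the compiler. It assumes Shift-Out is X'0E' and Shift-In is X'0F', and it deliberately ignores any doubled-apostrophe rule. The names `G_LITERAL` and `match_g_literal` are made up for illustration.

```python
import re

# Assumption: Shift-Out is byte 0x0E and Shift-In is byte 0x0F.
# Pattern: G, apostrophe, SO, one or more 2-byte units (lazy), SI, apostrophe.
# Consuming the payload two bytes at a time means an SI byte that is the
# first or second half of a pair does not end the literal.
G_LITERAL = re.compile(rb"G'\x0e((?:[\x00-\xff]{2})+?)\x0f'")


def match_g_literal(data: bytes, pos: int = 0):
    """Return the payload bytes between SO and SI, or None if no match at pos."""
    m = G_LITERAL.match(data, pos)
    return m.group(1) if m else None
```

With this reading, a payload with an odd number of bytes is rejected, because the closing SI and apostrophe never land on a pair boundary.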
One answer below suggests that the <squote><squote> pairing is wrong. OK, I might believe that, but that means the RE would only reject literal strings containing single <squote>s. I don't believe that's the problem we are having as we seem to trip over every instance of a G literal.
Similarly, COBOL identifiers can apparently be composed of DBCS characters. What is allowed for an identifier, exactly? Again, a regular expression would be ideal.
EDIT2: I'm beginning to think the problem might not be the RE. We are reading Shift-JIS encoded text. Our reader converts that text to Unicode as it goes. But DBCS characters are really not Shift-JIS; rather, they are binary-coded data. Likely what is happening is that the DBCS data is getting translated as if it were Shift-JIS, and that would muck up the ability to recognize "two bytes" as a DBCS element. For instance, if a DBCS character pair were :81 :1F, a Shift-JIS reader would convert this pair into a single Unicode character, and its two-byte nature is then lost. If you can't count pairs, you can't find the end quote. If you can't find the end quote, you can't recognize the literal. So the problem would appear to be that we need to switch input-encoding modes in the middle of the lexing process. Yuk.
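To illustrate what "switching input-encoding modes" might look like, here is a rough sketch in Python. It is not our actual reader. It assumes SO is X'0E' and SI is X'0F', that G literals can be located in the raw byte stream before decoding, and it uses the byte pair 81 40 (a valid Shift-JIS double-byte code) instead of the pair mentioned above, purely for demonstration. `G_SPAN` and `split_source` are hypothetical names.

```python
import re

# Two bytes go in, one Unicode character comes out, so the pair count is lost.
pair = b"\x81\x40"
decoded = pair.decode("shift_jis")
pair_collapsed = (len(pair) == 2 and len(decoded) == 1)

# Assumption: SO = 0x0E, SI = 0x0F; the payload is consumed in 2-byte units.
G_SPAN = re.compile(rb"G'\x0e(?:[\x00-\xff]{2})+?\x0f'")


def split_source(data: bytes):
    """Yield ('g_literal', raw_bytes) or ('text', decoded_str) segments in order."""
    pos = 0
    for m in G_SPAN.finditer(data):
        if m.start() > pos:
            # Ordinary source text: decode as Shift-JIS as the reader does today.
            yield ("text", data[pos:m.start()].decode("shift_jis"))
        # G literal: keep the raw bytes so the lexer can still count pairs.
        yield ("g_literal", m.group(0))
        pos = m.end()
    if pos < len(data):
        yield ("text", data[pos:].decode("shift_jis"))
```

For example, `list(split_source(b"MOVE G'\x0e\x81\x40\x0f' TO X"))` yields a text segment, the raw literal, and another text segment. One known weakness of this sketch: it scans raw bytes for `G'`, so it could be fooled by a Shift-JIS trail byte that happens to equal `G`, or by `G'` appearing inside a comment or another literal. A real lexer would have to make the switch at the token level.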