We are processing IBM Enterprise Japanese COBOL source code.

The rules that describe exactly what is allowed in G-type literals, and what is allowed in identifiers, are unclear.

The IBM manual indicates that a G'....' literal must have a SHIFT-OUT as the first character inside the quotes, and a SHIFT-IN as the last character before the closing quote. Our COBOL lexer "knows" this, but objects to G literals found in real code. Conclusion: the IBM manual is wrong, or we are misreading it. The customer won't let us see the code, so it is pretty difficult to diagnose the problem.

EDIT: Revised/extended below text for clarity:

Does anyone know the exact rules of G literal formation, and how they (don't) match what the IBM reference manuals say? The ideal answer would be a regular expression for the G literal. This is what we are using now (coded by another author, sigh):

#token non_numeric_literal_quote_g [STRING]
  "<G><squote><ShiftOut> (  
     (<NotLineOrParagraphSeparatorNorShiftInNorShiftOut>|<squote><squote>|<ShiftOut>)  
     (<NotLineOrParagraphSeparator>|<squote><squote>)

     | <ShiftIn> ( <NotLineOrParagraphSeparatorNorApostropheNorShiftInNorShiftOut>|
                   <ShiftIn>|<ShiftOut>)

     | <squote><squote>

 )* <ShiftIn><squote>"

where <name> is a macro that is another regular expression. Presumably they are named well enough so you can guess what they contain.

Here is the IBM Enterprise COBOL Reference. Chapter 3 "Character Strings", subheading "DBCS literals", page 32 is the relevant reading. I'm hoping that by providing the exact reference, an experienced IBMer can tell us how we misread it :-{ I'm particularly unclear on what the phrase "DBCS-characters" means when it says "one or more characters in the range X'00...X'FF for either byte". How can DBCS-characters be anything but pairs of 8-bit character codes? If you examine the existing RE, you'll see that it matches 3 types of character pairs.

One answer below suggests that the <squote><squote> pairing is wrong. OK, I might believe that, but that means the RE would only reject literal strings containing single <squote>s. I don't believe that's the problem we are having as we seem to trip over every instance of a G literal.

Similarly, COBOL identifiers can apparently be composed with DBCS characters. What is allowed for an identifier, exactly? Again a regular expression would be ideal.

EDIT2: I'm beginning to think the problem might not be the RE. We are reading Shift-JIS encoded text. Our reader converts that text to Unicode as it goes. But DBCS characters are really not Shift-JIS; rather, they are binary-coded data. Likely what is happening is that the DBCS data is getting translated as if it were Shift-JIS, and that would muck up the ability to recognize "two bytes" as a DBCS element. For instance, if a DBCS character pair were :81 :1F, a Shift-JIS reader would convert this pair into a single Unicode character, and its two-byte nature is then lost. If you can't count pairs, you can't find the end quote. If you can't find the end quote, you can't recognize the literal. So the problem would appear to be that we need to switch input-encoding modes in the middle of the lexing process. Yuk.
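To make the EDIT2 concern concrete, here is a minimal Python sketch of how decoding collapses byte pairs. It assumes SO/SI are the bytes 0x0E/0x0F and uses made-up sample bytes that happen to be valid Shift-JIS; the customer's real data may look nothing like this.

```python
# Sketch only: the SO/SI values and the sample bytes are assumptions.
SO, SI = 0x0E, 0x0F

# G' <SO> pair pair <SI> '   (0x47 is "G", 0x27 is the apostrophe)
raw = bytes([0x47, 0x27, SO, 0x82, 0xA0, 0x82, 0xA2, SI, 0x27])

# The body of the literal, between SO and SI.
body = raw[3:-2]

# Decoding as Shift-JIS turns each two-byte pair into one character,
# so a lexer working on the decoded text can no longer count "pairs".
decoded = body.decode("shift_jis")

print(len(body))     # 4
print(len(decoded))  # 2
```

A lexer rule that expects two units per DBCS character would be off by a factor of two on the decoded text, which would be consistent with every G literal failing.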

A: 

Try to add a single quote in your rule to see if it passes by making this change,

  <squote><squote> => <squote>{1,2}

If I remember it correctly, one difference between N and G literals is that G allows single quote. Your regular expression doesn't allow that.

EDIT: I thought you had all the other DBCS literals working and were just having issues with G-strings, so I just pointed out the difference between N and G. Now I took a closer look at your RE. It has problems. In the COBOL I used, you can mix ASCII with Japanese, for example,

  G"ABC<ヲァィ>" <> are Shift-out/shift-in

Your RE assumes DBCS only. I would loosen this restriction and try again.

I don't think it's possible to handle G literals entirely in a regular expression. There is no way to keep track of matching quotes and SO/SI with a finite state machine alone. Your RE is so complicated because it's trying to do the impossible. I would just simplify it and take care of mismatching tokens manually.

You could also face encoding issues. The code could be in EBCDIC (Katakana) or UTF-16; treating it as ASCII will not work. SO/SI are sometimes converted to 0x1E/0x1F on Windows.
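As a rough illustration of handling this manually, here is a hand-written scanner sketch in Python that works on raw bytes. Everything in it is an assumption rather than IBM's rule: SO/SI are taken to be 0x0E/0x0F, SBCS and DBCS segments may be mixed, bytes between SO and SI are consumed in pairs without inspection, and an undoubled SBCS apostrophe ends the literal. The name `scan_g_literal` is hypothetical.

```python
SO, SI, QUOTE = 0x0E, 0x0F, 0x27  # assumed byte values


def scan_g_literal(buf: bytes, start: int = 0):
    """Return the index just past the closing quote, or None if malformed."""
    if buf[start:start + 2] != b"G'":
        return None
    i = start + 2
    in_dbcs = False
    while i < len(buf):
        b = buf[i]
        if in_dbcs:
            if b == SI:            # SI at a pair boundary ends the DBCS segment
                in_dbcs = False
                i += 1
            elif i + 1 >= len(buf):
                return None        # dangling half of a pair
            else:
                i += 2             # skip one DBCS pair, whatever its bytes are
        elif b == SO:
            in_dbcs = True
            i += 1
        elif b == QUOTE:
            return i + 1           # closing quote outside any DBCS segment
        elif b in (0x0A, 0x0D):
            return None            # literal may not run past the line end
        else:
            i += 1                 # ordinary SBCS byte
    return None                    # ran off the end without a closing quote
```

Because pairs are skipped blindly, a 0x27 byte inside a DBCS pair is not mistaken for the closing quote, which is the part a decoded-text lexer cannot do.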

I am just trying to help you shoot in the dark without seeing the actual code :)

ZZ Coder
You mean as an opening or closing quote? The squote pair in midstring is intended to represent a squote in midstring, not one at the beginning or end. I'll go double check the syntax carefully, but are you sure?
Ira Baxter
According to my memory, you don't need to escape midstring quote in G-string. For N-string, you need to double it so your rule is for N-string. I threw my manual away years ago so I have no way to confirm this.
ZZ Coder
Ah, the light is beginning to dawn. To help you, I've pointed to the manual so you can read it again *grin*; I also restructured the RE I have to make it easier to understand but didn't change it. The manuals are conspicuously quiet about quote marks in G literals, but they clearly don't say the quotes should be doubled up, so I'm going to assume you're right on that part (tick!). Any further comments on my revised text?
Ira Baxter
See my edits ......... 15 yet?
ZZ Coder
The IBM manual clearly states that G literals *must* start with shift-out and end with shift-in. Your example shows that part of the manual is wrong (even if well intended, but it's supposed to be a reference manual). Other (nonspecial) string literals, according to the manual, can mix DBCS and SBCS the way you show. We're seeing the code in Shift-JIS, but our tool internally translates that to Unicode. I'll check on how the SI and SO characters are mapped.
Ira Baxter
Are you sure you read the manual correctly? It probably says DBCS must be surrounded by SO/SI but G literals can contain a mixture of SBCS and DBCS segments. If your RE is correct, I would have to write the mixed string as G"<>ABC<ヲァィ>". I don't remember doing that at all.
ZZ Coder
There's a link to the manual in my question. On page 32: "The opening delimiter must be followed immediately by a shift-out control character." Our RE is more general than implied by the manual, but that's an attempt to reconcile what the manual says with your observation that literals seem to allow mixed SBCS and DBCS.
Ira Baxter
*Problem not solved*. In the absence of a better answer, I'm giving you the credit anyway on the grounds you were at least close.
Ira Baxter
Thanks! Once you have a fail case, this should be an easy problem. Let me give you a suggestion. Ask if your customer is willing to share the data in masqueraded form. For example, Q for quote, O for Shift-Out, I for Shift-In, J for Japanese, A for ASCII etc. So you get something like "QOJJJIAAAOJJJJIQ". Should be secure enough.
ZZ Coder
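A sketch of the masquerading idea in Python, for anyone who wants to hand the customer a script. It assumes the text has already been decoded to Unicode, that SO/SI survive as U+000E/U+000F, and that anything outside 7-bit ASCII counts as "Japanese"; `masquerade` is a made-up name.

```python
SO, SI = "\x0e", "\x0f"  # assumed code points for shift-out / shift-in


def masquerade(text: str) -> str:
    """Replace every character by a class letter so no real data leaks."""
    out = []
    for ch in text:
        if ch in "'\"":
            out.append("Q")      # quote
        elif ch == SO:
            out.append("O")      # shift-out
        elif ch == SI:
            out.append("I")      # shift-in
        elif ord(ch) < 0x80:
            out.append("A")      # plain ASCII
        else:
            out.append("J")      # everything else, treated as Japanese
    return "".join(out)


sample = "'" + SO + "\u30f2\u30a1\u30a3" + SI + "ABC" + SO + "\u65e5\u672c\u8a9e\u3060" + SI + "'"
print(masquerade(sample))  # QOJJJIAAAOJJJJIQ
```

If the encoding question from EDIT2 turns out to matter, the same mapping would have to be run over raw bytes instead, since decoding already hides the pair structure.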
See EDIT2... might be input encoding problem.
Ira Baxter
Shift-JIS requires a stateful decoder. We had to switch to ICU to handle the SO/SI correctly.
ZZ Coder
@ZZ Coder: If by stateful you mean "process sometimes two bytes" we do that. What was it about SO/SI that required funny processing? (What's "ICU"?)
Ira Baxter
I tried to convert some COBOL literals to UTF-8 using Java years ago. Java didn't handle SO/SI correctly. ICU http://site.icu-project.org/ is developed by IBM so it knows about all of IBM's quirky encodings.
ZZ Coder
A: 

Does <NotLineOrParagraphSeparatorNorApostropheNorShiftInNorShiftOut> also include single and double quotation marks, or just apostrophes? That would be a problem, as it would consume the literal closing character sequence >' ...

I would check the definition of all other macros to make sure. The only obvious problem that I can see is the <squote><squote> that you already seem to be aware of.

lcv
It is ~[\u000d \u000a \u0009 \' \u0028 \u2029 \u000e \u000f]. It can't consume the closing <squote>.
Ira Baxter
How about \"? Is this supposed to match only constants of the type G'< ... >', or also of the type G"< ... >"?
lcv
Yes, there's an analogous one for G"<....>". If I get one right, the other is easy to fix.
Ira Baxter
Have you tried to simplify the rule? It seems it was copied from the definition of the other literals. Wouldn't something like "<AnythingButShiftIn><Anything> | <ShiftIn><Notsquote>" be sufficient for the inside portion (inside the ( ... )* )? Is there any possibility of any EBCDIC/ASCII DBCS conversions throwing a wrench in the works? By the way, which parser / lexer are you using? Is this an in-house development?
lcv
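For what it's worth, the simplified rule suggested above can be tried out quickly as a Python bytes regex. This is only a sketch: it assumes SO/SI are 0x0E/0x0F, that the literal is strictly SO, then byte pairs, then SI, and that matching is done on raw bytes before any Shift-JIS decoding. `G_LITERAL` is an illustrative name.

```python
import re

# G' SO ( <not SI><any byte> | <SI><not quote> )* SI '
G_LITERAL = re.compile(rb"G'\x0e(?:[^\x0f][\x00-\xff]|\x0f[^'])*\x0f'")

ok = G_LITERAL.fullmatch(b"G'\x0e\x82\xa0\x82\xa2\x0f'")    # two pairs
tricky = G_LITERAL.fullmatch(b"G'\x0e\x81'\x0f'")           # pair ending in 0x27
odd = G_LITERAL.fullmatch(b"G'\x0e\x82\x0f'")               # odd byte count

print(ok is not None, tricky is not None, odd is None)
```

The second case shows why pairs matter: the apostrophe byte inside a pair is consumed as data and does not end the literal.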
Not sure if it means anything, but just out of curiosity, I checked the Microfocus Object Cobol language reference to see how they handled DBCS literals, and they seemed to do it slightly differently. Rather than require the shift-in / shift-out character, they treated the G' as the beginning delimiter, and ' had to appear twice if it was a DBCS character, otherwise it was the closing delimiter. It may be something that's left up to the compiler, or it may be that your intuition was correct... URL: http://supportline.microfocus.com/documentation/books/ocds42/atdbcs.htm#s014
lcv
Further exploration shows that some COBOL compilers have an option associated with this: the SOSI option (ref: COBOL for AIX: http://publib.boulder.ibm.com/infocenter/comphelp/v7v91/index.jsp?topic=/com.ibm.aix.cbl.doc/up4515.htm ). The default is NOSOSI. Whether a shift-out / shift-in is required after the opening and before the closing quotation mark may depend on the compiler option; at least this is true for other compilers.
lcv
See EDIT2 revision to question; might be input encoding problem.
Ira Baxter
@lcv: The parser/lexer being used is the DMS Software Reengineering Toolkit; this is an engine for building program analysis and transformation tools, and we are finishing up handling IBM Enterprise COBOL. Yes, it's an in-house development, but no, it doesn't have to be, as DMS is available commercially. See http://www.semanticdesigns.com/Products/DMS/DMSToolkit.html
Ira Baxter
@EDIT2: That certainly looks like it could be it. I was about to suggest treating the encoding as bytes, but then I saw a previous question you posted regarding Shift-JIS and how it could appear in identifiers, and looked at the documentation. Perhaps treating it as you say may work, but there's a lingering doubt in my mind: the code you are given is supposedly Shift-JIS encoded, but what the Enterprise COBOL manual says is supported seems to be EBCDIC DBCS encoding in literals, comments, and user-defined words. What transformations has the code gone through?
lcv
Maybe you will get lucky, and the literals are already in the form G'blah' instead of G'<blah>' :)
lcv