ansaurus

Question

Answer 1

A:

java.util.regex ?

Donz 2010-10-01 23:18:40

How so? You must be able to justify your answer.

The Elite Gentleman 2010-10-01 23:40:38

I know the initial reaction is to just use Matcher.replaceAll("\r\n"), and that works fine when the replacement is hard-coded. But when it is a user-provided string, as in Matcher.replaceAll(user_string), and user_string contains the textual representation of \r\n, \uXXXX, \1 etc., I need to turn that textual version into the actual string it denotes. Yes, I could replace "\\r" with \r and so on, but I was hoping there would be an easier way.

Nisse 2010-10-01 23:46:15
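
To make the problem in this comment concrete, here is a minimal sketch (the class and method names are made up for illustration). It contrasts a hard-coded replacement, where the compiler has already turned the escapes into CR LF, with a replacement that arrives at run time as the four characters backslash, r, backslash, n. In a replacement string, String.replaceAll and Matcher.replaceAll treat a backslash as "take the next character literally", so the second call inserts the letters r and n rather than a line break.

```java
class ReplacementDemo {
    // Replaces every comma in the input using the given replacement string.
    static String replaceCommas(String input, String replacement) {
        return input.replaceAll(",", replacement);
    }

    public static void main(String[] args) {
        // Hard-coded: the compiler has already converted the escapes to CR LF.
        String hardCoded = replaceCommas("a,b", "\r\n");

        // User-provided: four characters (backslash, r, backslash, n).
        String userString = "\\r\\n";
        String fromUser = replaceCommas("a,b", userString);

        System.out.println(hardCoded.equals("a\r\nb")); // true
        System.out.println(fromUser.equals("arnb"));    // true
    }
}
```

This is why the user-provided text has to be decoded before it is handed to replaceAll.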

The easier way is to use an already existing library, as the others said. But I think such a library uses a hard-coded mapping of escaped strings to regex constructs.

Donz 2010-10-05 10:05:09

Answer 2

+2 A:

I'm not directly familiar with an 'easy' way of handling this (i.e. I am not aware of a built-in library that can handle it). But one way of doing this is to read up on the JLS specification of escape sequences and write a single-pass parser that finds and evaluates each escape sequence. (Check the JLS here).

Now, I know for a fact that there are a few oddities which you will quickly forget to handle. Octal escapes, for example, are tricky because they allow 1, 2 or 3 digits, and the number of digits determines which values belong to the escape and which are just ordinary digit characters.

Take the example string "\431": this is the octal escape '\43' concatenated with the character '1'. The first digit of the escape is 4, so it cannot be a full three-digit octal value, which only allows [0-3] as the first digit.
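
A quick way to check this rule is to let the compiler do the work. The sketch below (the class and constant names are only illustrative) stores two literals and prints what they actually contain: octal 43 is decimal 35, the character '#', and octal 101 is decimal 65, the character 'A'.

```java
class OctalLiteralDemo {
    // First digit is 4, so only two digits belong to the escape:
    // octal 43 ('#') followed by the ordinary character '1'.
    static final String TWO_DIGIT = "\431";

    // First digit is in [0-3], so all three digits form one escape: octal 101 ('A').
    static final String THREE_DIGIT = "\101";

    public static void main(String[] args) {
        System.out.println(TWO_DIGIT.length());   // 2
        System.out.println(TWO_DIGIT);            // #1
        System.out.println(THREE_DIGIT.length()); // 1
        System.out.println(THREE_DIGIT);          // A
    }
}
```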

Just about a year ago I was co-writing a Java compiler for a subset of the 1.3 specification, which does have escape sequences, and below I have included our code for handling escapes. You should be able to take this code as it is and include it in a utility class (maybe throw in a credit if you feel charitable):

private String processCharEscapes(String strVal) {
    // Loop helpers
    char[]          chrArr = strVal.toCharArray();
    StringBuilder   strOut = new StringBuilder(strVal.length());
    String          strEsc = "";    // Escape sequence, string buffer
    Character       chrBuf = null;  // Dangling character buffer

    // Control flags
    boolean inEscape    = false;    // In escape?
    boolean cbOctal3    = true;     // Can be octal 3-digit

    // Parse characters
    for(char c : chrArr) {
        if (!inEscape) {
            // Listen for start of escape sequence
            if (c == '\\') {
                inEscape = true;    // Enter escape
                strEsc = "";        // Reset escape buffer
                chrBuf = null;      // Reset dangling character buffer
                cbOctal3 = true;    // Reset cbOctal3 flag
            } else {
                strOut.append(c);   // Save to output
            }
        } else {
            // Determine escape termination
            if (strEsc.length() == 0) { // First character
                if (c >= 48 && c <= 55) {   // c is a digit [0-7]
                    if (c > 51) {   // c is a digit [4-7]
                        cbOctal3 = false;
                    }
                    strEsc += c;    // Save to buffer
                } else {    // c is a character
                    // Single-character escapes (will terminate escape loop)
                    if (c == 'n') {
                        inEscape = false;
                        strOut.append('\n');
                    } else if(c == 't') {
                        inEscape = false;
                        strOut.append('\t');
                    } else if(c == 'b') {
                        inEscape = false;
                        strOut.append('\b');
                    } else if(c == 'r') {
                        inEscape = false;
                        strOut.append('\r');
                    } else if(c == 'f') {
                        inEscape = false;
                        strOut.append('\f');
                    } else if(c == '\\') {
                        inEscape = false;
                        strOut.append('\\');
                    } else if(c == '\'') {
                        inEscape = false;
                        strOut.append('\'');
                    } else if(c == '"') {
                        inEscape = false;
                        strOut.append('"');
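                    } else {
                        // Saw illegal character after escape character '\'.
                        // (The original reported this through the compiler's own error type;
                        // an exception is used here so the method compiles on its own.)
                        throw new IllegalArgumentException("Illegal character escape sequence, unrecognised escape: \\" + c);
                    }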
                }
            } else if(strEsc.length() == 1) {   // Second character (possibly)
                if (c >= 48 && c <= 55) {   // c is a digit [0-7]
                    strEsc += c;    // Save to buffer
                    if (!cbOctal3) {    // Terminate since !cbOctal3
                        inEscape = false;
                    }
                } else {
                    inEscape = false;   // Terminate since c is not a digit
                    chrBuf = c;         // Save dangling character
                }
            } else if(strEsc.length() == 2) {   // Third character (possibly)
                if (cbOctal3 && c >= 48 && c <= 55) {
                    strEsc += c;        // Save to buffer
                } else {
                    chrBuf = c;         // Save dangling character
                }
                inEscape = false;       // Will always terminate after third character, no matter what
            }

            // Did escape sequence terminate, at character c?
            if (!inEscape && strEsc.length() > 0) {
                // strEsc is legal 1-3 digit octal char code, convert and add
                strOut.append((char)Integer.parseInt(strEsc, 8));

                if (chrBuf != null) {   // There was a dangling character
                    // Check for chained escape sequences (e.g. \10\10)
                    if (chrBuf == '\\') {
                        inEscape = true;    // Enter escape
                        strEsc = "";        // Reset escape buffer
                        chrBuf = null;      // Reset dangling character buffer
                        cbOctal3 = true;    // Reset cbOctal3 flag
                    } else {
                        strOut.append(chrBuf);
                    }
                }
            }
        }
    }

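    // Check for an escape sequence terminated by the end of the input (special case)
    if (inEscape) {
        if (strEsc.length() == 0) {
            // Nothing followed the backslash, so there is no octal code to convert
            throw new IllegalArgumentException("Illegal character escape sequence, incomplete escape at end of input");
        }
        // strEsc is legal 1-2 digit octal char code, convert and add
        strOut.append((char)Integer.parseInt(strEsc, 8));

        if (chrBuf != null) {   // There was a dangling character
            strOut.append(chrBuf);
        }
    }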

    return strOut.toString();
}

I hope this helps you.

micdah 2010-10-02 01:33:49

Am I right in thinking that this code doesn't deal with Unicode escapes? That's understandable for the original use-case for this code. But the OP's comment on another answer suggests that he needs unicode escapes to be translated as well.

Stephen C 2010-10-02 03:07:28

Ah yes, you are quite right. In the heat of the answer, I must have forgotten that one of the reductions was that we didn't support Unicode. So yes indeed, there is no Unicode support in the above, and the code can at most form the skeleton of an implementation that includes it. Great that you caught that. :-)

micdah 2010-10-02 05:08:03

Thanks for taking the time to reply and provide the code. Btw, for unicode I could add the code I have which deals with unicode escapes:

if (c == '\\') {
    if (i < len) {
        c = s.charAt(i++);
        if (c == 'u') {
            c = (char) Integer.parseInt(s.substring(i, i + 4), 16);
            i += 4;
        }
    }
}

Btw, aren't octals supposed to start with \o? I also need to be able to handle group replacements such as \1 and up.

Nisse 2010-10-02 10:16:59
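
For readers who want the Unicode part as a standalone step, here is a minimal sketch along the same lines. The class and method names are hypothetical, and it makes a few simplifying assumptions: exactly one 'u' followed by exactly four hex digits (the JLS also allows repeated 'u' characters), and any other backslash pair is copied through untouched so that a later pass can deal with it.

```java
class UnicodeEscapeDemo {
    // Replaces each backslash-u-plus-four-hex-digits sequence with the character it names.
    // Any other backslash pair (including an escaped backslash) is copied unchanged.
    static String decodeUnicodeEscapes(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                if (s.charAt(i + 1) == 'u') {
                    if (i + 6 > s.length()) {
                        throw new IllegalArgumentException("Truncated unicode escape at index " + i);
                    }
                    int code = 0;
                    for (int k = i + 2; k < i + 6; k++) {
                        int digit = Character.digit(s.charAt(k), 16);
                        if (digit < 0) {
                            throw new IllegalArgumentException("Bad hex digit at index " + k);
                        }
                        code = code * 16 + digit;
                    }
                    out.append((char) code);
                    i += 6;
                } else {
                    // Keep the pair together so an escaped backslash cannot start a unicode escape.
                    out.append(c).append(s.charAt(i + 1));
                    i += 2;
                }
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeUnicodeEscapes("\\u0041bc")); // Abc
    }
}
```

Copying other backslash pairs as a unit is a deliberate choice: it keeps an escaped backslash followed by the letter u from being mistaken for the start of a Unicode escape.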

You are welcome, but the reply by Stephen was nicely short and easy to use. :-) Regarding octal escapes: no, octal escapes are '\\[0-7]{1,2}|\\[0-3][0-7]{2}'. It's only for a Unicode escape that you need the 'u', to distinguish it from an octal one.

micdah 2010-10-02 11:01:59

Answer 3

+1 A:

The Apache commons StringEscapeUtils.unescapeJava(...) methods will do the job. Though it is not clear from the javadoc description, these methods handle unicode escapes as well as "ordinary" Java String escapes.

Stephen C 2010-10-02 01:46:44
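
As a side note for readers on a newer JDK: assuming Java 15 or later (which did not exist when this thread was written), the standard library has String.translateEscapes(), which covers the ordinary and octal escapes without an extra dependency. Unlike the Apache Commons method described in this answer, it does not translate Unicode escapes, so it is only a partial substitute. A minimal sketch, with an illustrative class name:

```java
class TranslateEscapesDemo {
    // Decodes ordinary and octal escapes using the standard library (Java 15+ assumed).
    // Unicode escapes are not handled by String.translateEscapes().
    static String unescape(String raw) {
        return raw.translateEscapes();
    }

    public static void main(String[] args) {
        System.out.println(unescape("a\\tb").equals("a\tb")); // true
        System.out.println(unescape("\\431").equals("#1"));   // true
    }
}
```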

ok this is exactly what I'm looking for. Thanks!!

Nisse 2010-10-02 09:54:13

Interpreting a Java String
