Hello all,

I have a Struts2 web application that accepts both POST and GET requests in many different charsets, converts them to UTF-8, displays the correct UTF-8 characters on screen, and then writes them to a UTF-8 database.

I have tried at least five different methods for a simple lossless charset conversion from windows-1250 to UTF-8 to start with, and none of them worked. Since UTF-8 is the "larger set", it should work without a problem (at least, that is my understanding).

Can you suggest how to do a charset conversion from windows-1250 to UTF-8? And is it possible that Struts2 is doing something weird with the params' charset, which would explain why I can't seem to get it right?

This is my latest attempt:

    String inputData = getSimpleParamValue("some_input_param_from_get");
    Charset inputCharset = Charset.forName("windows-1250");
    Charset utfCharset = Charset.forName("UTF-8");

    CharsetDecoder decoder = inputCharset.newDecoder();
    CharsetEncoder encoder = utfCharset.newEncoder();

    String decodedData = "";
    try {
        ByteBuffer inputBytes = ByteBuffer.wrap(inputData.getBytes()); // I've tried putting UTF-8 here as well, with no luck
        CharBuffer chars = decoder.decode(inputBytes);

        ByteBuffer utfBytes = encoder.encode(chars);
        decodedData = new String(utfBytes.array());

    } catch (CharacterCodingException e) {
        logger.error(e);
    }

Any ideas on what to try to get this working?

Thank you and best regards,

Bozo

A: 

I'm not sure about your scenario. In Java, a String is Unicode; charset conversion only comes into play when you convert between a String and a binary representation. In your example, by the time getSimpleParamValue("some_input_param_from_get") returns, inputData should already hold the "correct" String: the conversion from the stream of bytes (which travelled from the client browser to the web server) to a String should already have taken place (that is the responsibility of the web server plus the web layer of your application). To ensure this, I enforce UTF-8 for the web transmission by placing a filter in web.xml (before Struts), for example:

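    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class CharsetFilter implements Filter {

        public void destroy() {}

        public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain) throws IOException, ServletException {
            HttpServletRequest req = (HttpServletRequest) request;
            HttpServletResponse res = (HttpServletResponse) response;
            // Must run before any request parameter is read, otherwise it has no effect.
            req.setCharacterEncoding("UTF-8");

            chain.doFilter(req, res);
            // Note: this has no effect if getWriter() was already called
            // or the response was already committed further down the chain.
            String contentType = res.getContentType();
            if (contentType != null && contentType.startsWith("text/html")) {
                res.setCharacterEncoding("UTF-8");
            }
        }

        public void init(FilterConfig filterConfig) throws ServletException {}
    }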

If you cannot do this, and your getSimpleParamValue() "errs" in the charset conversion (e.g. it assumed the byte stream was UTF-8 when it was actually windows-1250), you now have an "incorrect" String, and you must try to recover it by undoing and redoing the byte-to-String conversion. For that you must know both the wrong and the correct charset and, worse, deal with the possibility of missing characters (if the input was interpreted as UTF-8, the decoder may have hit illegal byte sequences). If you have to deal with this inside a Struts2 action, I'd say you are in trouble; you should handle it explicitly before or after it (in the upper web layer, or in the database driver, the file encoding, or wherever).
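The undo-and-redo recovery described above can be sketched roughly as follows. This is only an illustration: `CharsetRepair` and `redecode` are hypothetical names, and the assumption that the container wrongly decoded the bytes as ISO-8859-1 is made up for the demo (that charset maps every byte to a character, so the round trip is lossless). The second part shows why a wrong guess of UTF-8 is worse.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of Struts2 or the servlet API.
final class CharsetRepair {

    /** Re-encodes with the charset that was wrongly assumed, then decodes with the real one. */
    static String redecode(String garbled, Charset assumed, Charset actual) {
        byte[] originalBytes = garbled.getBytes(assumed); // undo the wrong decoding
        return new String(originalBytes, actual);         // redo it with the right charset
    }

    public static void main(String[] args) {
        Charset win1250 = Charset.forName("windows-1250");
        byte[] wire = "\u010D".getBytes(win1250); // U+010D is a single byte in windows-1250

        // Assumption for this demo: the bytes were wrongly decoded as ISO-8859-1.
        String garbled = new String(wire, StandardCharsets.ISO_8859_1);
        String repaired = redecode(garbled, StandardCharsets.ISO_8859_1, win1250);
        System.out.println(repaired.equals("\u010D"));

        // If the wrong guess was UTF-8, the lone byte is a malformed sequence and is
        // replaced by U+FFFD, so the original byte can no longer be recovered.
        String lossy = new String(wire, StandardCharsets.UTF_8);
        System.out.println(lossy.equals("\uFFFD"));
    }
}
```

Once the String is correct there is nothing left to "convert to UTF-8" inside Java; UTF-8 only matters again when the String is turned back into bytes, for example with `repaired.getBytes(StandardCharsets.UTF_8)` or when the response or database driver writes it out.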

leonbloy
I get the charset name in an HTTP param, and from that I have to work out how to convert the input params (e.g. from win1250 to UTF-8).
bozo
BTW, I do have a similar filter in my web app, but it is not enough for me. So now I have put a PHP filter in front of Struts that does an iconv from the source charset to UTF-8, and this works perfectly. I cannot believe how complicated it is to get the same thing done in Java.
bozo