Usage scenario

We have implemented a web service that our web frontend developers use internally (via a PHP API) to display product data. On the website, the user enters something (e.g., a query string); internally, the website then calls the service via the API.

Note: We use Restlet, not Tomcat.

Original Problem

Firefox 3.0.10 seems to respect the encoding selected in the browser and encodes a URL according to that encoding. This results in different query strings for ISO-8859-1 and UTF-8.

Our website forwards the user's input without converting it (which it should), so it may end up calling the web service with a query string that contains German umlauts.

For example, for a query part like

    ...v=abcädef

if "ISO-8859-1" is selected, the sent query part looks like

...v=abc%E4def

but if "UTF-8" is selected, the sent query part looks like

...v=abc%C3%A4def
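
For reference, the difference can be reproduced in plain Java with URLEncoder (a minimal sketch; the class name is mine):

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class QueryEncodingDemo {
  public static void main(String[] args) throws UnsupportedEncodingException {
    // URLEncoder percent-encodes the bytes of the string in the given charset
    System.out.println(URLEncoder.encode("abcädef", "ISO-8859-1")); // abc%E4def
    System.out.println(URLEncoder.encode("abcädef", "UTF-8"));      // abc%C3%A4def
  }
}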

Desired Solution

Since we control the service (we implemented it ourselves), we want to check on the server side whether the call contains non-UTF-8 characters and, if so, respond with a 4xx HTTP status.
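
A rough sketch of that behavior (shown with the plain Servlet API purely for illustration, since our service actually runs on Restlet; isValidUtf8() is a placeholder for whatever check we settle on):

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class ProductQueryServlet extends HttpServlet {
  protected void doGet(HttpServletRequest request, HttpServletResponse response)
      throws IOException {
    String v = request.getParameter("v");
    // placeholder for any of the validation approaches discussed here
    if (v == null || !isValidUtf8(v)) {
      // reject the call with a 4xx status
      response.sendError(HttpServletResponse.SC_BAD_REQUEST,
          "query parameter 'v' is not valid UTF-8");
      return;
    }
    // ... normal processing ...
  }

  private boolean isValidUtf8(String s) {
    return true; // placeholder
  }
}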

Current Solution In Detail

For each character (i.e. string.substring(i, i+1)), check:

  1. whether character.getBytes()[0] equals 63, the byte for '?'
  2. whether Character.getType(character.charAt(0)) returns OTHER_SYMBOL

Code

protected List<String> getNonUnicodeCharacters( String s ) {
  final List<String> result = new ArrayList<String>();
  for ( int i = 0, n = s.length(); i < n; i++ ) {
    final String character = s.substring( i, i + 1 );
    // check 2: the character class is OTHER_SYMBOL
    final boolean isOtherSymbol =
      (int) Character.OTHER_SYMBOL
        == Character.getType( character.charAt( 0 ) );
    // check 1: the first byte is 63 ('?');
    // note that getBytes() uses the platform default charset here
    final boolean isNonUnicode = isOtherSymbol
      && character.getBytes()[ 0 ] == (byte) 63;
    if ( isNonUnicode )
      result.add( character );
  }
  return result;
}

Question

Will this catch all invalid (non-UTF-8-encoded) characters? Does anyone have a better (easier) solution?

Note: I checked URLDecoder with the following code:

final String[] test = new String[]{
  "v=abc%E4def",
  "v=abc%C3%A4def"
};
for ( int i = 0 , n = test.length ; i < n ; i++ ) {
    System.out.println( java.net.URLDecoder.decode(test[i],"UTF-8") );
    System.out.println( java.net.URLDecoder.decode(test[i],"ISO-8859-1") );
}

This prints:

v=abc?def
v=abcädef
v=abcädef
v=abcÃ¤def

and it does not throw an IllegalArgumentException (sigh).

+1  A: 

URLDecoder will decode to a given encoding, which should flag errors appropriately. However, the documentation states:

There are two possible ways in which this decoder could deal with illegal strings. It could either leave illegal characters alone or it could throw an IllegalArgumentException. Which approach the decoder takes is left to the implementation.

So you should probably try it. Note also (from the decode() method documentation):

The World Wide Web Consortium Recommendation states that UTF-8 should be used. Not doing so may introduce incompatibilities.

so there's something else to think about!

EDIT: Apache Commons Codec's URLCodec claims to throw appropriate exceptions for bad encodings.

Brian Agnew
I know of the Recommendation, but what about the browser (here Firefox 3.0.10) violating it? As long as it is recommended and not required, you have to make sure that there are no illegal entities, don't you?
dhiller
So I would try decoding using the URLDecoder and choosing the appropriate encoding. I would be interested (!) to see if the URLDecoder *does* throw exceptions on illegally encoded characters (easy to test outside the browser/server environment)
Brian Agnew
Sorry. Just saw your edited question re. illegal chars
Brian Agnew
A: 
daniel
string.getBytes() combined with new String() is a classic bug that should be avoided
Dennis Cheung
+6  A: 

I asked the same question:

http://stackoverflow.com/questions/1233076/handling-character-encoding-in-uri-on-tomcat

I recently found a solution and it works pretty well for me. You might want to give it a try. Here is what you need to do:

  1. Leave your URI encoding as Latin-1. On Tomcat, add URIEncoding="ISO-8859-1" to the Connector in server.xml.
  2. If you have to URL-decode manually, use Latin-1 as the charset there as well.
  3. Use the fixEncoding() function to fix up encodings.

For example, to get a parameter from the query string:

  String name = fixEncoding(request.getParameter("name"));

You can always do this; a string that is already correctly encoded is not changed.

The code is attached. Good luck!

public static String fixEncoding(String latin1) {
  try {
    // With a Latin-1 URI encoding, chars map 1:1 to the raw request bytes
    byte[] bytes = latin1.getBytes("ISO-8859-1");
    if (!validUTF8(bytes))
      return latin1; // not UTF-8, so the Latin-1 interpretation stands
    // The bytes form valid UTF-8: the string was mis-decoded, re-decode it
    return new String(bytes, "UTF-8");
  } catch (UnsupportedEncodingException e) {
    // Impossible: ISO-8859-1 and UTF-8 are required on every JVM
    throw new IllegalStateException("No Latin1 or UTF-8: " + e.getMessage());
  }
}

public static boolean validUTF8(byte[] input) {
  int i = 0;
  // Skip a UTF-8 byte order mark (EF BB BF) if present
  if (input.length >= 3 && (input[0] & 0xFF) == 0xEF
      && (input[1] & 0xFF) == 0xBB && (input[2] & 0xFF) == 0xBF) {
    i = 3;
  }

  int end;
  for (int j = input.length; i < j; ++i) {
    int octet = input[i];
    if ((octet & 0x80) == 0) {
      continue; // ASCII
    }

    // Check for a UTF-8 leading byte and compute the index of the
    // last trailing byte of the sequence
    if ((octet & 0xE0) == 0xC0) {
      end = i + 1; // 2-byte sequence
    } else if ((octet & 0xF0) == 0xE0) {
      end = i + 2; // 3-byte sequence
    } else if ((octet & 0xF8) == 0xF0) {
      end = i + 3; // 4-byte sequence
    } else {
      // Java only supports BMP so 3 is max
      return false;
    }

    // A sequence truncated at the end of the input is invalid
    // (and would otherwise run past the array)
    if (end >= input.length) {
      return false;
    }

    while (i < end) {
      i++;
      octet = input[i];
      if ((octet & 0xC0) != 0x80) {
        // Not a valid trailing byte
        return false;
      }
    }
  }
  return true;
}
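
For illustration (my own example, not from the original post): UTF-8 bytes for "abcädef" mis-decoded as Latin-1 show up as "abcÃ¤def"; fixEncoding() restores them, while a string that needs no fixing passes through unchanged:

System.out.println(fixEncoding("abc\u00C3\u00A4def")); // "abcÃ¤def" -> abcädef
System.out.println(fixEncoding("abcdef"));             // plain ASCII -> abcdef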

EDIT: Your approach doesn't work, for various reasons. When there are encoding errors, you can't count on what you get from Tomcat. Sometimes you get � or ?. Other times you get nothing at all: getParameter() returns null. And say you can check for "?": what happens when your query string contains a legitimate "?"?

Besides, you shouldn't reject any request. This is not your user's fault. As I mentioned in my original question, the browser may encode the URL in either UTF-8 or Latin-1, and the user has no control over that. You need to accept both. Switching your servlet to Latin-1 preserves all the characters, even the wrong ones, which gives us a chance to fix them up or throw them away.

The solution I posted here is not perfect, but it's the best one we have found so far.

ZZ Coder
Nice one! But I have to object to your comment "Java only supports BMP". The four-byte limit on UTF-8 byte sequences was imposed by the Unicode Consortium, and it's sufficient to handle the complete range of characters (U+0000..U+10FFFF), not just the BMP.
Alan Moore
The correct comment probably should be "We only care about the BMP". My impression was that surrogate pairs don't work well in Java.
ZZ Coder
Well, I asked in May ;-) Anyway, what does the above code do? Does it convert from ISO to UTF-8? I would not want to convert the string, just check whether the encoding is right and throw an error if it's not. Please see my solution above again and check if it's correct, will you?
dhiller
Your solution is not going to work. If the wrong encoding is used, you will get question marks instead of an exception. Just use my function validUTF8(): if it returns true, the input MOST LIKELY is UTF-8; otherwise, it's Latin-1. You have to use Latin-1 encoding everywhere in the server for this check to work.
ZZ Coder
Yes, as I stated: 1. check if character.getBytes()[0] equals 63 for '?', 2. check if Character.getType(character.charAt(0)) returns OTHER_SYMBOL. And this _does_ work for me. If you can prove the opposite, please let me know...
dhiller
See my edit.
ZZ Coder
@ZZ Coder: your code correctly detects four-byte UTF-8 sequences, which is the maximum allowed by the Unicode spec, so that comment doesn't really make sense. When the text is converted to Java strings, those four-byte sequences will become surrogate pairs, which Java handles correctly--just not transparently.
Alan Moore
@ZZ Coder: First of all, thank you for your time. There seems to have been some misunderstanding because of my imprecise question, which I've tried to clarify; please see my edits. Second: I disagree with your "you shouldn't reject any..." proposal, because we are at the interface level. I have to make sure that the service user always uses the correct encoding. If my solution is wrong, how else can I achieve that?
dhiller
@ZZ Coder: Could you please add some comments to your code to help me understand what you are doing?
dhiller
+1  A: 

I've been working on a similar "guess the encoding" problem. The best solution involves knowing the encoding. Barring that, you can make educated guesses to distinguish between UTF-8 and ISO-8859-1.

To answer the general question of how to detect if a string is properly encoded UTF-8, you can verify the following things:

  1. No byte is 0x00, 0xC0, 0xC1, or in the range 0xF5-0xFF.
  2. Tail bytes (0x80-0xBF) are always preceded by a head byte 0xC2-0xF4 or another tail byte.
  3. Head bytes should correctly predict the number of tail bytes (e.g., any byte in 0xC2-0xDF should be followed by exactly one byte in the range 0x80-0xBF).

If a string passes all those tests, then it's interpretable as valid UTF-8. That doesn't guarantee that it is UTF-8, but it's a good predictor.
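
A minimal sketch of those three tests in Java (my illustration; the method name is mine, and like the tests themselves it is a predictor, not a full validator, e.g. it does not reject overlong forms beyond test 1):

static boolean passesUtf8Tests(byte[] b) {
  for (int i = 0; i < b.length; ) {
    int head = b[i] & 0xFF;
    int tails;
    if (head == 0x00 || head == 0xC0 || head == 0xC1 || head >= 0xF5) {
      return false;      // test 1: these byte values never occur in UTF-8
    } else if (head <= 0x7F) {
      tails = 0;         // ASCII
    } else if (head <= 0xBF) {
      return false;      // test 2: stray tail byte without a head byte
    } else if (head <= 0xDF) {
      tails = 1;         // 0xC2-0xDF: exactly one tail byte follows
    } else if (head <= 0xEF) {
      tails = 2;         // 0xE0-0xEF: exactly two tail bytes follow
    } else {
      tails = 3;         // 0xF0-0xF4: exactly three tail bytes follow
    }
    i++;
    // test 3: the head byte must predict the number of tail bytes (0x80-0xBF)
    for (; tails > 0; tails--, i++) {
      if (i >= b.length || (b[i] & 0xC0) != 0x80) {
        return false;
      }
    }
  }
  return true;
}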

Legal input in ISO-8859-1 will likely have no control characters (0x00-0x1F and 0x80-0x9F) other than line separators. Looks like 0x7F isn't defined in ISO-8859-1 either.

(I'm basing this off of Wikipedia pages for UTF-8 and ISO-8859-1.)

Adrian McCarthy
A: 

The following regular expression might be of interest to you:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/185624

I use it in Ruby as follows:

module Encoding
  UTF8RGX = /\A(
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
    |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*\z/x unless defined? UTF8RGX

  # Returns true only if every line of the file is valid UTF-8;
  # offending lines are printed with their line numbers.
  def self.utf8_file?(fileName)
    valid = true
    count = 0
    File.open(fileName).each do |l|
      count += 1
      unless utf8_string?(l)
        puts count.to_s + ": " + l
        valid = false
      end
    end
    valid
  end

  def self.utf8_string?(a_string)
    UTF8RGX === a_string
  end
end

dimus
+3  A: 

You can use a CharsetDecoder configured to throw an exception if invalid chars are found:

CharsetDecoder utf8Decoder = Charset.forName("UTF-8").newDecoder()
    .onMalformedInput(CodingErrorAction.REPORT);

See CodingErrorAction.REPORT
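
A short usage sketch (my example, building on the line above): decode the raw bytes and treat a CharacterCodingException as "not valid UTF-8":

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class Utf8Probe {
  public static boolean isValidUtf8(byte[] bytes) {
    CharsetDecoder utf8Decoder = Charset.forName("UTF-8").newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT);
    try {
      utf8Decoder.decode(ByteBuffer.wrap(bytes)); // throws on malformed input
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }
}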

ante
A: 

Try to use UTF-8 as the default everywhere you can touch: database, memory, and UI.

A single charset encoding avoids a lot of problems, and it can actually speed up your web server: a lot of processing power and memory gets wasted on encoding and decoding.

Dennis Cheung
A: 

You might want to include a known parameter in your requests, e.g. "...&encTest=ä€", to safely differentiate between the different encodings.
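
A sketch of the check (my illustration; getParameter() stands in for however your framework exposes query values, and this assumes the server decodes the query string as UTF-8):

String marker = request.getParameter("encTest");
// a correctly UTF-8-encoded request round-trips the marker unchanged;
// note that "€" cannot even be encoded in ISO-8859-1
if (!"ä€".equals(marker)) {
  // the client used some other encoding -> reject with a 4xx status
}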

mfx