I need to get UTF-8 working in my Java webapp (servlets + JSP, no framework used) to support äöå etc. for regular Finnish text and Cyrillic alphabets like ЦжФ for special cases.

My setup is the following:

  • Development environment: Windows XP
  • Production environment: Debian

Database used: MySQL 5.x

Users mainly use Firefox2 but also Opera 9.x, FF3, IE7 and Google Chrome are used to access the site.

How to achieve this?

+56  A: 

Answering myself as the FAQ of this site encourages it. This works for me:

Mostly the characters äåö are not problematic, as the default character set used by browsers and Tomcat/Java for webapps is latin1, i.e. ISO-8859-1, which "understands" those characters.

Getting UTF-8 working under Java+Tomcat+Linux/Windows+MySQL requires the following:

Configuring Tomcat's server.xml

The connector must be configured to use UTF-8 when decoding URL (GET request) parameters:

<Connector port="8080" maxHttpHeaderSize="8192"
 maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
 enableLookups="false" redirectPort="8443" acceptCount="100"
 connectionTimeout="20000" disableUploadTimeout="true" 
 compression="on" 
 compressionMinSize="128" 
 noCompressionUserAgents="gozilla, traviata" 
 compressableMimeType="text/html,text/xml,text/plain,text/css,text/javascript,application/x-javascript,application/javascript"
 URIEncoding="UTF-8"
/>

The key part is URIEncoding="UTF-8" in the above example. This guarantees that Tomcat handles all incoming GET parameters as UTF-8 encoded. As a result, when the user writes the following to the address bar of the browser:

 https://localhost:8443/ID/Users?action=search&name=*ж*

the character ж is handled as UTF-8 and is encoded (usually by the browser, before the request even reaches the server) as %D0%B6.

POST requests are not affected by this.
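
To see the percent-encoding described above outside of Tomcat, here is a minimal Java sketch. The class name UrlEncodingDemo and its helper methods are made up for illustration; only java.net.URLEncoder and java.net.URLDecoder from the standard library are used, and the decode step is only an approximation of what the container does with the query string.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class, not part of the webapp.
class UrlEncodingDemo {

    // Percent-encode a parameter value using the given charset name.
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    // Decode a percent-encoded value using the given charset name.
    static String decode(String value, String charset) {
        try {
            return URLDecoder.decode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // Cyrillic zhe, written as an escape so the source file encoding does not matter.
        String zhe = "\u0436";
        String encoded = encode(zhe, "UTF-8");
        System.out.println(encoded);                               // %D0%B6
        System.out.println(decode(encoded, "UTF-8").equals(zhe));  // true
    }
}
```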

CharsetFilter

Then it's time to force the java webapp to handle all requests and responses as UTF-8 encoded. This requires that we define a character set filter like the following:

  package fi.foo.filters;

  import java.io.IOException;
  import javax.servlet.Filter;
  import javax.servlet.FilterChain;
  import javax.servlet.FilterConfig;
  import javax.servlet.ServletException;
  import javax.servlet.ServletRequest;
  import javax.servlet.ServletResponse;

  public class CharsetFilter implements Filter {

      private String encoding;

      public void init(FilterConfig config) throws ServletException {
          encoding = config.getInitParameter("requestEncoding");
          if (encoding == null) {
              encoding = "UTF-8";
          }
      }

      public void doFilter(ServletRequest request, ServletResponse response, FilterChain next)
              throws IOException, ServletException {
          // Respect the client-specified character encoding
          // (see HTTP specification section 3.4.1)
          if (null == request.getCharacterEncoding()) {
              request.setCharacterEncoding(encoding);
          }

          // Set the default response content type and encoding
          response.setContentType("text/html; charset=UTF-8");
          response.setCharacterEncoding("UTF-8");

          next.doFilter(request, response);
      }

      public void destroy() {}
  }

This filter makes sure that if the browser hasn't specified the encoding used in the request, it is set to UTF-8.

The other thing done by this filter is to set the default response encoding, i.e. the encoding of the returned HTML or other content. The alternative is to set the response encoding etc. in each controller of the application.

This filter has to be added to web.xml, the deployment descriptor of the webapp:

 <!--CharsetFilter start--> 

  <filter>
    <filter-name>CharsetFilter</filter-name>
    <filter-class>fi.foo.filters.CharsetFilter</filter-class>
      <init-param>
        <param-name>requestEncoding</param-name>
        <param-value>UTF-8</param-value>
      </init-param>
  </filter>

  <filter-mapping>
    <filter-name>CharsetFilter</filter-name>
    <url-pattern>/*</url-pattern>
  </filter-mapping>

The instructions for making this filter are found at the tomcat wiki (http://wiki.apache.org/tomcat/Tomcat/UTF-8)
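
As a rough illustration of what the filter protects against, the following Java sketch simulates a request body that was sent as UTF-8 but decoded as latin1 (the default mentioned earlier). The class MojibakeDemo and the method decodeUtf8BytesAs are hypothetical names used only for this example; no servlet API is involved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the webapp.
class MojibakeDemo {

    // Encodes the text as UTF-8 (what the browser sends) and decodes the
    // bytes with the given charset (what the server assumes).
    static String decodeUtf8BytesAs(String text, Charset assumedCharset) {
        byte[] wire = text.getBytes(StandardCharsets.UTF_8);
        return new String(wire, assumedCharset);
    }

    public static void main(String[] args) {
        String name = "\u0436"; // Cyrillic zhe as an escape, independent of source encoding

        // Decoded as latin1, the two UTF-8 bytes D0 B6 become two separate characters.
        String wrong = decodeUtf8BytesAs(name, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length());      // 2
        System.out.println(wrong.equals(name));  // false

        // Decoded as UTF-8, the original character comes back.
        String right = decodeUtf8BytesAs(name, StandardCharsets.UTF_8);
        System.out.println(right.equals(name));  // true
    }
}
```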

JSP page encoding

All JSP-pages of the webapp need to have the following at the top of them:

 <%@page pageEncoding="UTF-8" contentType="text/html; charset=UTF-8"%>

If some kind of a layout with different JSP-fragments is used, then this is needed in all of them.

HTML meta tags

The JSP page encoding tells the JVM to handle the characters in the JSP page in the correct encoding. Then it's time to tell the browser which encoding the HTML page uses:

This is done with the following at the top of each xhtml page produced by the webapp:

   <?xml version="1.0" encoding="UTF-8"?>
   <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
   <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="fi">
   <head>
   <meta http-equiv='Content-Type' content='text/html; charset=UTF-8' />
   ...

JDBC-connection

When using a db, it has to be defined that the connection uses UTF-8 encoding. This is done in context.xml or wherever the JDBC connection is defined as follows:

      <Resource name="jdbc/AppDB" 
        auth="Container"
        type="javax.sql.DataSource"
        maxActive="20" maxIdle="10" maxWait="10000"
        username="foo"
        password="bar"
        driverClassName="com.mysql.jdbc.Driver"
        url="jdbc:mysql://localhost:3306/ID_development?useUnicode=true&amp;characterEncoding=UTF-8"
    />
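
If the connection URL is built in Java code rather than in XML, the ampersand is written unescaped. The sketch below only assembles the URL string; JdbcUrlDemo and mysqlUtf8Url are hypothetical names, and useUnicode and characterEncoding are assumed to be the MySQL Connector/J properties that apply here.

```java
// Hypothetical helper, shown only to illustrate the URL format.
class JdbcUrlDemo {

    // Builds a MySQL JDBC URL that asks the driver to use UTF-8.
    // In Java source the separator is a plain '&', not '&amp;' as in XML.
    static String mysqlUtf8Url(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database
                + "?useUnicode=true&characterEncoding=UTF-8";
    }

    public static void main(String[] args) {
        System.out.println(mysqlUtf8Url("localhost", 3306, "ID_development"));
    }
}
```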

MySQL database and tables

The database used must use UTF-8 encoding. This is achieved by creating the database with the following:

   CREATE DATABASE `ID_development` 
   /*!40100 DEFAULT CHARACTER SET utf8 COLLATE utf8_swedish_ci */;

Then, all of the tables need to be in UTF-8 also:

   CREATE TABLE  `Users` (
    `id` int(10) unsigned NOT NULL auto_increment,
    `name` varchar(30) collate utf8_swedish_ci default NULL,
    PRIMARY KEY  (`id`)
   ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_swedish_ci ROW_FORMAT=DYNAMIC;

The key part is CHARSET=utf8.

MySQL server configuration

The MySQL server has to be configured as well. Typically this is done on Windows by modifying the my.ini file and on Linux by modifying the my.cnf file. In those files it should be defined that all clients connected to the server use utf8 as the default character set and that the default charset used by the server is also utf8.

   [client]
   port=3306
   default-character-set=utf8

   [mysql]
   default-character-set=utf8

MySQL procedures and functions

These also need to have the character set defined. For example:

   DELIMITER $$

   DROP FUNCTION IF EXISTS `pathToNode` $$
   CREATE FUNCTION `pathToNode` (ryhma_id INT) RETURNS TEXT CHARACTER SET utf8
   READS SQL DATA
   BEGIN

    DECLARE path VARCHAR(255) CHARACTER SET utf8;

   SET path = NULL;

   ...

   RETURN path;

   END $$

   DELIMITER ;

GET requests: latin1 and UTF-8

If and when it's defined in tomcat's server.xml that GET request parameters are encoded in UTF-8, the following GET requests are handled properly:

   https://localhost:8443/ID/Users?action=search&name=Petteri
   https://localhost:8443/ID/Users?action=search&name=ж

Because ASCII-characters are encoded in the same way both with latin1 and UTF-8, the string "Petteri" is handled correctly.

The Cyrillic character ж cannot be represented at all in latin1. Because Tomcat is instructed to handle request parameters as UTF-8, it correctly decodes the bytes %D0%B6 as that character.

If and when browsers are instructed to read the pages in UTF-8 encoding (with request headers and html meta-tag), at least Firefox 2/3 and other browsers from this period all encode the character themselves as %D0%B6.

The end result is that all users with name "Petteri" are found and also all users with the name "ж" are found.

But what about äåö?

The HTTP specification defines that by default URLs are encoded as latin1. This results in Firefox 2, Firefox 3 etc. encoding the following

    https://localhost:8443/ID/Users?action=search&name=*Päivi*

into the encoded version

    https://localhost:8443/ID/Users?action=search&name=*P%E4ivi*

In latin1 the character ä is encoded as %E4, even though the page/request/everything is defined to use UTF-8. The UTF-8 encoded version of ä is %C3%A4.

The result of this is that it's quite impossible for the webapp to correctly handle the request parameters from GET requests, as some characters are encoded in latin1 and others in UTF-8. Notice: POST requests do work, as browsers encode all request parameters from forms completely in UTF-8 if the page is defined as being UTF-8.
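
The mismatch can be reproduced with a short Java sketch. Latin1VsUtf8Demo and its helpers are hypothetical names for this example only; the decoding is done with java.net.URLDecoder, which is merely assumed to behave similarly to the container when URIEncoding="UTF-8" is set.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class, not part of the webapp.
class Latin1VsUtf8Demo {

    // Percent-encode the value with the named charset.
    static String percentEncode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    // Percent-decode the value with the named charset.
    static String percentDecode(String value, String charset) {
        try {
            return URLDecoder.decode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String paivi = "P\u00e4ivi"; // "Päivi", with the umlaut written as an escape

        System.out.println(percentEncode(paivi, "ISO-8859-1")); // P%E4ivi
        System.out.println(percentEncode(paivi, "UTF-8"));      // P%C3%A4ivi

        // A latin1-encoded parameter read as UTF-8 does not give the name back.
        System.out.println(percentDecode("P%E4ivi", "UTF-8").equals(paivi));      // false
        System.out.println(percentDecode("P%E4ivi", "ISO-8859-1").equals(paivi)); // true
    }
}
```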

Stuff to read

A very big thank you to the writers of the following for giving the answers to my problem:

  • http://tagunov.tripod.com/i18n/i18n.html
  • http://wiki.apache.org/tomcat/Tomcat/UTF-8
  • http://java.sun.com/developer/technicalArticles/Intl/HTTPCharset/
  • http://dev.mysql.com/doc/refman/5.0/en/charset-syntax.html
  • http://cagan327.blogspot.com/2006/05/utf-8-encoding-fix-tomcat-jsp-etc.html
  • http://cagan327.blogspot.com/2006/05/utf-8-encoding-fix-for-mysql-tomcat.html
  • http://jeppesn.dk/utf-8.html
  • http://www.nabble.com/request-parameters-mishandle-utf-8-encoding-td18720039.html
  • http://www.utoronto.ca/webdocs/HTMLdocs/NewHTML/iso_table.html
  • http://www.utf8-chartable.de/
kosoant
These steps also work with Struts/tiles and a postgres database.
kosoant
cool. nice answer.
anjanb
In the server.xml there is a mistake. It is not "compession" it is "compression".
Julien Chastang
fantastic! Excellent answer which solved my encoding problem. Thanks!
Mads Mobæk
Wow. Such a nice summary, gonna store this for future reference. Thanks!
miek
Two comments: 1) in **HMTL-meta tags** you included a xml declaration. Remove it, it would only trigger browsers in quirks mode, you don't want to have that. Also, the HTML meta tags are in fact already implicitly done by JSP `pageEncoding`, so you could even leave it away. 2) in **MySQL database and tables** you used `utf8_swedish_si`, this should have been `utf8_unicode_ci`. You could even leave the collation away, just `CHARACTER SET utf8` is enough.
BalusC
A: 

I think you summed it up quite well in your own answer.

In the process of UTF-8-ing(?) from end to end you might also want to make sure java itself is using UTF-8. Use -Dfile.encoding=utf-8 as parameter to the JVM (can be configured in catalina.bat).
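
A quick way to check whether the JVM flag took effect is to print the default charset. This is only a sketch; DefaultCharsetCheck is a made-up class name, and the output depends on the JVM version and how it was started.

```java
import java.nio.charset.Charset;

// Hypothetical helper for inspecting the JVM's encoding settings.
class DefaultCharsetCheck {

    // Reports the file.encoding system property and the charset the JVM uses by default.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```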

stian
A: 

This is for Greek encoding in MySQL tables when we want to access them using Java:

Use the following connection setup in your JBoss connection pool (mysql-ds.xml)

<connection-url>jdbc:mysql://192.168.10.123:3308/mydatabase</connection-url>
<driver-class>com.mysql.jdbc.Driver</driver-class>
<user-name>nts</user-name>
<password>xaxaxa!</password>
<connection-property name="useUnicode">true</connection-property>
<connection-property name="characterEncoding">greek</connection-property>

If you don't want to put this in a JNDI connection pool, you can configure it as a JDBC-url like the next line illustrates:

jdbc:mysql://192.168.10.123:3308/mydatabase?characterEncoding=greek

For me and Nick, so we never forget it and waste time anymore.....

Mike Mountrakis
I would still prefer UTF-8 above Greek (and convert your current Greek data to UTF-8) so that your application is ready for world domination.
BalusC
A: 

Also, when acquiring the connection, you can use the following code:

DriverManager.registerDriver(new com.mysql.jdbc.Driver());
Connection conn = DriverManager.getConnection("jdbc:mysql://192.168.1.1:3308/nts?characterEncoding=greek", "myuser", "mypass");
Mike Mountrakis
`Greek` isn't `UTF-8`. Also, any decent JDBC driver will just use the DB table's encoding.
BalusC
A: 

If you have specified this in the connection pool (mysql-ds.xml), you can open the connection in your Java code as follows:

DriverManager.registerDriver(new com.mysql.jdbc.Driver());
Connection conn = DriverManager.getConnection("jdbc:mysql://192.168.1.12:3308/mydb?characterEncoding=greek", "Myuser", "mypass");

Mike Mountrakis
A: 

Nice detailed answer. Just wanted to add one more thing which will definitely help others to see the UTF-8 encoding on URLs in action.

Follow the steps below to enable UTF-8 encoding on URLs in firefox.

  1. type "about:config" in the address bar.

  2. Use the filter input type to search for "network.standard-url.encode-query-utf8" property.

  3. the above property will be false by default, turn that to TRUE.
  4. restart the browser.

UTF-8 encoding on URLs works by default in IE6/7/8 and chrome.

Jay
A: 

Great post, it surely helps a lot of people. I also want to add that, from http://wiki.netbeans.org/FaqI18nProjectEncoding, this part solved my UTF problem: runtime.encoding=<encoding>. However, I still haven't set up the required filter. OK, it's easy to create the file with the class code, but what should I do afterwards so that Tomcat recognises it? Do I have to compile this file and put it in the bin folder of Tomcat?

John