Some pseudocode:

String a = "A bunch of text"; //UTF-16
saveTextInDb(a); //Write to Oracle VARCHAR(15) column
String b = readTextFromDb(); //UTF-16
out.write(b); //Write to http response

When you save the Java String (UTF-16) to Oracle VARCHAR(15) does Oracle also store this as UTF-16? Does the length of an Oracle VARCHAR refer to number of Unicode characters (and not number of bytes)?

When we write b to the ServletResponse is this being written as UTF-16 or are we by default converting to another encoding like UTF-8?

+3  A: 

Instead of UTF-16, think of the 'internal representation' of your string. A String in Java is a sequence of characters, and you don't need to care which encoding is used internally. Encoding becomes relevant when you interact with the outside of the program. In your example, saveTextInDb, readTextFromDb and write do that. Every time you exchange strings with the outside, an encoding is used for the conversion. saveTextInDb and readTextFromDb look like self-made methods (at least I don't know them), so you should look up which encoding these methods use. The write method of a Writer always produces bytes in the encoding associated with that Writer. If you get your Writer from an HttpServletResponse, the associated encoding is the one used for outputting the response (the one that will be sent in the headers).

response.setCharacterEncoding("UTF-8");
Writer out = response.getWriter();

This code gives you a Writer, out, that translates strings into UTF-8. It works similarly when you write to a file:

Writer fileout = new OutputStreamWriter(new FileOutputStream(myfile), "ISO-8859-1");
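
The effect of the Writer's encoding can be checked without a servlet container or a file. The following is only a minimal sketch: it writes the same String through two OutputStreamWriter instances backed by in-memory streams and compares the number of bytes produced. The class name WriterEncodingDemo and the helper encode are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WriterEncodingDemo {

    // Hypothetical helper: writes text through a Writer bound to the given
    // charset and returns the bytes that reached the underlying stream.
    public static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(sink, charset)) {
            out.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café", 4 characters
        // Same String, different byte counts depending on the Writer's charset.
        System.out.println(encode(text, StandardCharsets.UTF_8).length);      // 5
        System.out.println(encode(text, StandardCharsets.ISO_8859_1).length); // 4
    }
}
```

The String itself never changes; only the charset handed to the Writer decides what ends up on the wire or on disk.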

If you access a DB, the framework you use should ensure a consistent exchange of strings with the database.

Dishayloo
+2  A: 

The ServletResponse will use ISO 8859-1 (Latin 1) by default. UTF-8 is the most common encoding used for HTTP responses that require Unicode, but you have to set that encoding specifically.

According to this document Oracle can support either UTF-8 or UTF-16 in the database. Your methods that read/write Oracle will need to use the appropriate encoding that matches how the database is set up, and translate that to/from the Java internal representation.

David Gelhar
+3  A: 

The ability of Oracle to store (and later retrieve) Unicode text from the database relies only on the character set of the database (usually specified during database creation). Choosing AL32UTF8 as the character set is recommended for storage of Unicode text in CHAR datatypes (including VARCHAR/VARCHAR2).

Assuming this is done, the Oracle JDBC driver is responsible for converting UTF-16 encoded data into AL32UTF8. This "automatic" conversion between encodings also happens when data is read from the database. As for the length of a VARCHAR: a VARCHAR2 column in Oracle uses byte semantics by default, so VARCHAR2(n) defines a column that can store n bytes (the default is controlled by the NLS_LENGTH_SEMANTICS parameter of the database). If you need to define the size in characters, use VARCHAR2(n CHAR).
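
To get a feel for what byte semantics means for the VARCHAR(15) column in the question, here is a rough sketch that measures how many bytes a String needs in UTF-8. It assumes the database character set is AL32UTF8 and that this corresponds to standard UTF-8; the class VarcharLengthDemo and the method fitsByteColumn are hypothetical and only mimic the length check, they are not part of any Oracle API.

```java
import java.nio.charset.StandardCharsets;

public class VarcharLengthDemo {

    // Bytes the text would occupy when encoded as UTF-8
    // (assumption: this matches storage in an AL32UTF8 database).
    public static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Hypothetical check mirroring VARCHAR2(n) with byte semantics.
    public static boolean fitsByteColumn(String text, int maxBytes) {
        return utf8Bytes(text) <= maxBytes;
    }

    public static void main(String[] args) {
        String ascii = "A bunch of text";         // 15 chars, 15 bytes
        String accented = "A bunch of t\u00e9xt"; // 15 chars, 16 bytes
        System.out.println(fitsByteColumn(ascii, 15));    // true
        System.out.println(fitsByteColumn(accented, 15)); // false
    }
}
```

Both strings have 15 characters, but the accented one needs 16 bytes, so under byte semantics it would not fit a 15-byte column.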

The encoding of the data written to the ServletResponse object depends on the default character encoding, unless one is specified via the ServletResponse.setCharacterEncoding() or ServletResponse.setContentType() API calls. All in all, for a complete Unicode solution involving an Oracle database, one must have knowledge of:

  1. The encoding of the incoming data (i.e. the encoding of the data read via the ServletRequest object). This can be controlled by specifying the accepted encoding in the HTML forms via the accept-charset attribute. If the encoding is unknown, the application could attempt to set it to a known value via the ServletRequest.setCharacterEncoding() method. This method doesn't change the actual encoding of the data, though: if the input stream is in ISO-Latin1, specifying a different encoding does not re-encode the data.
  2. The encoding of the data read from streams, as opposed to data created within the JVM. This is quite important, since the encoding of data read from streams cannot be changed. Therefore, the encoding of String objects created from an incoming HTTP request cannot be changed. New String objects, however, can be created with a defined encoding.
  3. The database character set of the Oracle instance. As indicated previously, data will be stored in the Oracle database, in the defined character set (for CHAR datatypes). The Oracle JDBC driver takes care of conversion of data between UTF-16 and AL32UTF8. If another character set is involved, an additional level of conversion is performed transparently by the JDBC driver.
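
The first two points can be illustrated with plain byte arrays. This is only a sketch under stated assumptions: the bytes stand in for a request body that was really sent as ISO-8859-1, and the class DecodeDemo and its decode helper are invented for the example. Declaring a different charset when decoding does not re-encode anything; it just interprets the same bytes differently.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {

    // Hypothetical helper: builds a String from raw bytes using the
    // charset the caller claims the bytes are in.
    public static String decode(byte[] raw, Charset declared) {
        return new String(raw, declared);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "café"
        // Stand-in for incoming data that is actually ISO-8859-1.
        byte[] raw = original.getBytes(StandardCharsets.ISO_8859_1);

        String right = decode(raw, StandardCharsets.ISO_8859_1);
        String wrong = decode(raw, StandardCharsets.UTF_8);

        System.out.println(right.equals(original)); // true
        System.out.println(wrong.equals(original)); // false: the bytes were misread
    }
}
```

Once the bytes have been decoded with the wrong charset, the resulting String may already have lost information, which is why the real encoding of the incoming stream has to be known up front.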
Vineet Reynolds