views: 737
answers: 10

I have a CSV file that contains both ASCII and Unicode characters, e.g. "ÅÔÉA". I am not sure about the encoding of this file, but when I open it in Notepad, it shows "ANSI" as the encoding.

I read the contents of the CSV as UTF-8:

fr = new InputStreamReader(new FileInputStream(fileName),"UTF-8");

But when I store the data in the database, these special characters (except "A") are not stored properly; the characters get scrambled.

I want all the characters to be stored properly. Any ideas?

A: 

Does your database column support Unicode? In MS SQL the column type must be nvarchar rather than varchar. Which database are you using?

Sam
A: 

I use Oracle 10g. But when I insert these characters manually, they are stored properly.

So I suspect the problem lies somewhere in the Java encoding.

How do you insert the data: with a PreparedStatement, or do you compose the SQL INSERT as a plain String?
kd304
+2  A: 

"ANSI" in "Notepad" means whatever codepage your windows is using. Try ISO8859-1, it work in most case.

J-16 SDiZ
+1  A: 

I suggest writing a small program which reads from the file and prints out the Unicode value of each character read, so you can check that the values are correct. Code charts are available at http://www.unicode.org/charts/ - you can probably make do with the Basic Latin and Latin-1 charts.
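
A minimal sketch of such a program (the file name and the charset to try are placeholders):

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;

public class DumpCodePoints {
    public static void main(String[] args) throws Exception {
        // adjust the file name and the charset ("ISO-8859-1", "Cp1252", ...) as needed
        Reader in = new InputStreamReader(new FileInputStream("input.csv"), "ISO-8859-1");
        int c;
        while ((c = in.read()) != -1) {
            // print each character together with its Unicode code point
            System.out.printf("%c = U+%04X%n", (char) c, c);
        }
        in.close();
    }
}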

My guess is that the encoding is the native Windows encoding. In that case you can omit the "UTF-8" parameter entirely and let Java use the default platform encoding.

Thorbjørn Ravn Andersen
+1  A: 

I had this problem. You need two things: NVARCHAR2 columns and an Oracle-specific method call on the PreparedStatement to tell Oracle about the string encoding:

import java.sql.PreparedStatement;

import oracle.jdbc.OraclePreparedStatement;

/**
 * Sets a statement parameter as NCHAR. Use before setting the field value.
 * @param pstmt the prepared statement
 * @param index the parameter index
 */
public static void setNChar(PreparedStatement pstmt, int index) {
    // cast to the Oracle-specific interface shipped with the Oracle JDBC driver
    OraclePreparedStatement opstmt = (OraclePreparedStatement) pstmt;
    opstmt.setFormOfUse(index, OraclePreparedStatement.FORM_NCHAR);
}

If you use a plain SQL string with Unicode characters, that works, because Oracle receives all SQL commands in UTF-8: the driver translates automatically. However, with prepared statements you need to tell Oracle explicitly.

You could also try PreparedStatement.setNString() if you run Java 6 and have the ojdbc6 driver. (In my case we had to use Java 5 with the version 4 driver - don't ask why.)

(Note: I know this is vendor lock-in, as you are forced to use concrete Oracle classes instead of the JDBC interfaces.)
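
For illustration, a hypothetical usage of the helper above (the table and column names are made up):

String sql = "INSERT INTO customers (id, name) VALUES (?, ?)";
PreparedStatement pstmt = connection.prepareStatement(sql);
pstmt.setInt(1, 42);
setNChar(pstmt, 2);          // mark parameter 2 as NCHAR before setting its value
pstmt.setString(2, "ÅÔÉA");
pstmt.executeUpdate();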

kd304
+2  A: 

First of all, you need to know the encoding of the file. Open it with a hex editor. How many bytes does a character occupy? If it is only one, then the file is not in UTF-8 but more likely in some ISO-8859 or similar Windows encoding (e.g. windows-1252). As mentioned before, chances are that ISO-8859-1 is the right encoding. For Eastern European languages, ISO-8859-2 would be the right choice.

The second issue would be the character set your database supports for character columns (this parameter is set during installation / creation of a new instance), but since you can insert those characters directly, it won't be a problem in this case.

Which JDBC driver do you use? The thin driver should not cause any problems in that regard, while the OCI driver could add an additional layer of problems if the client's NLS_LANG setting doesn't match the database's character encoding.

ammoQ
I think Oracle 10g allows you to use either UTF-8 or UTF-16 for the national character set column storage format. By default it is UTF-16.
kd304
kd304: you can also specify UTF-8 as the character set for CHAR (VARCHAR, VARCHAR2) columns, so every string in the database is UTF-8.
ammoQ
A: 

You need to encode that in ISO 8859-1 and not in UTF-8.

If he/she receives data in UTF-8, why should he/she transcode it into a lesser format? Oracle is quite capable of handling Unicode text.
kd304
A: 

You can manually compare a hex dump of the characters against various encodings (sample code), though this may be a laborious process. Alternatively, you can use the ICU library to try to detect the encoding, though this is hardly a foolproof method.
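
For illustration, a minimal sketch (the sample string and the candidate charsets are assumptions) that prints the byte sequence the characters from the question would have under a few encodings, so you can compare them against a hex dump of the actual file:

import java.io.UnsupportedEncodingException;

public class EncodingDump {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String sample = "ÅÔÉA";  // the characters from the question
        String[] encodings = { "UTF-8", "ISO-8859-1", "Cp1252" };
        for (String enc : encodings) {
            StringBuilder hex = new StringBuilder();
            for (byte b : sample.getBytes(enc)) {
                // mask to 0..255 so negative bytes print as two hex digits
                hex.append(String.format("%02X ", b & 0xFF));
            }
            System.out.println(enc + ": " + hex);
        }
    }
}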

How was the CSV file produced? If it was created by a Windows application on an English OS, then the encoding is likely to be windows-1252 (canonical name "Cp1252" in Java); on a Polish system, it might be windows-1250. The best approach is to find out for certain what encoding the file was saved in.

McDowell
Good idea. You can use Oracle's DUMP() function to view how a string is stored in the column byte by byte and compare it to the original file's byte sequence.
kd304
A: 

//opstmt.setFormOfUse(index, OraclePreparedStatement.FORM_NCHAR);

I am using java.sql.PreparedStatement.

Is there any equivalent functionality in PreparedStatement?

As I mentioned in my comment, you could try pstmt.setNString() for this, but I don't know whether it works correctly with Oracle.
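
A hedged sketch of what that would look like (the connection, table and column names are made up; requires Java 6 and a JDBC 4.0 driver such as ojdbc6):

String sql = "INSERT INTO customers (id, name) VALUES (?, ?)";
PreparedStatement pstmt = connection.prepareStatement(sql);
pstmt.setInt(1, 42);
pstmt.setNString(2, "ÅÔÉA");  // JDBC 4.0 standard call; whether the Oracle driver honours it is untested here
pstmt.executeUpdate();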
kd304
A: 

The first thing to do is a System.out.println() of the string after you've loaded it from the file. If it's corrupted in the console, your file isn't actually UTF-8; if it looks fine, you have a problem with the way you're saving it to the database :)

Spyder