An application I am working on reads information from files to populate a database. Some of the characters in the files are non-English, for example accented French characters.

The application works fine on Windows, but on our Solaris machine it fails to recognise the special characters and throws an exception. For example, when it encounters the accented e in "Gérer" it says:

      Encountered: "\u0161" (353), after : "\'G\u00c3\u00a9rer les mod\u00c3"

(an exception which is thrown from our application)

I suspect that in order to stop this from happening I need to change the file.encoding property of the JVM. I tried to do this via System.setProperty() but it has not stopped the error from occurring.

Are there any suggestions for what I could do? I was thinking about setting the basic locale of the Solaris platform in /etc/default/init to UTF-8. Does anyone think this might help?

Any thoughts are much appreciated.

+3  A: 

Try to use

java -Dfile.encoding=UTF-8 ...

when starting the application on both systems.

Another way to solve the problem is to change the default encoding of both systems to UTF-8, but I prefer the first option (it is less intrusive on the system).
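
To confirm which default the JVM has actually picked up, a quick check (a two-line sketch, to be dropped somewhere in your startup code) is:

System.out.println(System.getProperty("file.encoding"));
System.out.println(java.nio.charset.Charset.defaultCharset());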

EDIT:

Check this answer on Stack Overflow; it might help too:

http://stackoverflow.com/questions/81323/changing-the-default-encoding-for-stringbyte
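
If the application is started through Ant rather than with a plain java command, the same flag can usually be passed via the ANT_OPTS environment variable, which Ant's standard launcher scripts forward to the JVM (a suggestion assuming a stock Ant install; a forked <java> task would instead need a nested <jvmarg>):

export ANT_OPTS=-Dfile.encoding=UTF-8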

sakana
Yeah, I have seen that before. My only problem is that I can't find where the java command is actually run, because the program uses Ant to run the code. Thanks for your answer though, I will try to put it to use.
Scottm
A: 

You can also set the encoding at the command line, like so: java -Dfile.encoding=UTF-8.

sblundy
+4  A: 

That looks like a file that was converted by native2ascii using the wrong parameters. To demonstrate, create a file with the contents

Gérer les modÚ

and save it as "a.txt" with the encoding UTF-8. Then run this command:

native2ascii -encoding windows-1252 a.txt b.txt

Open the new file and you should see this:

G\u00c3\u00a9rer les mod\u00c3\u0161

Now reverse the process, but specify ISO-8859-1 this time:

native2ascii -reverse -encoding ISO-8859-1 b.txt c.txt

Read the new file as UTF-8 and you should see this:

Gérer les modÀ\u0161

It recovers the "é" okay, but chokes on the "Ú", like your app did.

I don't know everything that is going wrong in your app, but I'm pretty sure incorrect use of native2ascii is part of it, and that was probably the result of letting the app use the system default encoding. You should always specify the encoding when you save text, whether it's to a file, a database, or anywhere else; never let it default. And if you don't have a good reason to choose something else, use UTF-8.
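
The same double-encoding effect can be reproduced in a few lines of Java (a minimal sketch of the mechanism, not your app's actual code):

import java.io.UnsupportedEncodingException;

public class MojibakeDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String original = "Gérer";
        // Encode correctly as UTF-8: 'é' (U+00E9) becomes the two bytes 0xC3 0xA9
        byte[] utf8 = original.getBytes("UTF-8");
        // Misinterpret those bytes as windows-1252: 0xC3 -> 'Ã', 0xA9 -> '©'
        String mangled = new String(utf8, "windows-1252");
        System.out.println(mangled); // prints "GÃ©rer", as in the exception message
    }
}

Incidentally, the \u0161 in your error message is the windows-1252 interpretation of the byte 0x9A, which is the second UTF-8 byte of 'Ú' (U+00DA).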

Alan Moore
good answer - I will look into your suggestion. Thanks
Scottm
+1  A: 

Instead of setting the system-wide character encoding, it might be easier and more robust to specify the character encoding when reading and writing the specific text data. How is your application reading the files? All the readers and writers in the Java I/O package accept a character-encoding name to use when converting text to or from bytes. If you don't specify one, they use the platform default encoding, which is likely what you are experiencing.

Some databases are surprisingly limited in the text encodings they can accept. If your Java application reads the files as text in the proper encoding, it can then write the text to the database in whatever encoding the database needs. If your database doesn't support any encoding whose character repertoire includes your non-ASCII characters, then you may need to encode the non-English text first, for example into UTF-8 bytes, and then Base64-encode those bytes as ASCII text, as in the sketch below.
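
A sketch of that encode-then-Base64 step (using java.util.Base64, which exists from Java 8 onward; older JVMs would need a library such as commons-codec):

import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Turn non-ASCII text into pure-ASCII text that a limited database can store
String original = "Gérer";
byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);    // explicit encoding, never the default
String ascii = Base64.getEncoder().encodeToString(utf8);    // safe to store as ASCII

// ...and decode it again when reading the value back out
String restored = new String(Base64.getDecoder().decode(ascii), StandardCharsets.UTF_8);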

PS: Never use String.getBytes() with no character encoding argument for exactly the reasons you are seeing.

Dov Wasserman
A: 

Hi Scott, I think we'll need more information to be able to help you with your problem:

  1. What exception are you getting exactly, and which method are you calling when it occurs?
  2. What is the encoding of the input file? UTF-8? UTF-16? ISO-8859-1?

It'll also be helpful if you could provide us with relevant code snippets.

Also, a few things I want to point out:

  1. The problem isn't occurring at the 'é' but later on.
  2. It sounds like the character encoding may be hard coded in your application somewhere.
Jack Leow
The exception is one that is defined in our software; it is thrown when the parser has tried everything but still does not recognise the character. The encoding it is using is the system default, which was set to en_GB.ISO8859-15. I'm looking for a way to force the application to read UTF-8.
Scottm
A: 

Also, you may want to verify that the operating system packages that support UTF-8 (SUNWeulux, SUNWeuluf, etc.) are installed.
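
On Solaris you can list the installed packages and filter for them, for example:

pkginfo | grep SUNWeu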

Jack Leow
+1  A: 

Hi all. I managed to get past this error by running the command

export LC_ALL='en_GB.UTF-8'

This command set the locale for the shell I was in. Because LC_ALL overrides all of the LC_* locale categories, the JVM then picked up UTF-8 as its default file encoding.
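
You can confirm the setting with the locale command, which should then report en_GB.UTF-8 for every LC_* category:

locale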

Many thanks for all of your suggestions.

Scottm
A: 

Java uses the operating system's default encoding when reading and writing files, and one should never rely on that; it is always good practice to specify the encoding explicitly.

In Java you can use the following for reading and writing:

Reading:

BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(inputPath),"UTF-8"));

Writing:

PrintWriter pw = new PrintWriter(new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputPath), "UTF-8")));
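
On Java 7 and later the same pattern can be written with try-with-resources so that the streams are always closed, even on errors (a sketch assuming java.io.* imports and that inputPath and outputPath are file-path strings, as above):

try (BufferedReader br = new BufferedReader(
         new InputStreamReader(new FileInputStream(inputPath), "UTF-8"));
     PrintWriter pw = new PrintWriter(new BufferedWriter(
         new OutputStreamWriter(new FileOutputStream(outputPath), "UTF-8")))) {
    String line;
    while ((line = br.readLine()) != null) {
        pw.println(line); // each line is decoded and re-encoded explicitly as UTF-8
    }
}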
mohitsoni