I have a Unicode (UTF-8 without BOM) text file within a jar that's loaded as a resource.

// Decode explicitly as UTF-8, not with the platform default charset
URL resource = MyClass.class.getResource("datafile.csv");
InputStream stream = resource.openStream();
BufferedReader reader = new BufferedReader(
    new InputStreamReader(stream, Charset.forName("UTF-8")));

This works fine on Windows, but on Linux it appears not to read the file correctly: accented characters come out broken. I'm aware that different machines can have different default charsets, but I'm giving it the correct charset explicitly. Why would it not be using it?
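A quick way to rule out the file itself is to hex-dump the raw bytes of the resource and check that accented characters really are their two-byte UTF-8 sequences (0xC3 0xA9 for é). A minimal sketch, reusing MyClass and datafile.csv from above:

InputStream in = MyClass.class.getResourceAsStream("datafile.csv");
int b, count = 0;
// Print the first 64 bytes as hex; in UTF-8, é shows up as C3 A9
while ((b = in.read()) != -1 && count++ < 64) {
    System.out.printf("%02X ", b);
}
in.close();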

+1  A: 

I wonder if reviewing how UTF-8 is set up on Linux would help. It could be a configuration issue.

duffymo
I'm specifying the decoding scheme, which should mean the host machine's setup is irrelevant.
Marcus Downing
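(A minimal sketch of that point: decoding a fixed byte sequence with an explicit charset produces the same string on every platform, regardless of the default charset.)

byte[] utf8 = {(byte) 0xC3, (byte) 0xA9}; // the UTF-8 bytes for é
String s = new String(utf8, Charset.forName("UTF-8"));
System.out.println(s.equals("\u00E9")); // true on Windows and Linux alike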
+2  A: 

The reading part looks correct; I use that all the time on Linux.

I suspect you used the default encoding somewhere when you exported the text to the web page. Because the default encodings on Linux and Windows differ, you saw different results.

For example, you use the default encoding if you do anything like this in a servlet:

// getWriter() encodes with the response's charset, which defaults to ISO-8859-1
PrintWriter out = response.getWriter();
out.println(text);

You need to write in UTF-8 explicitly, like this:

response.setContentType("text/html; charset=UTF-8");
PrintWriter out = new PrintWriter(
    new OutputStreamWriter(response.getOutputStream(), "UTF-8"), true);
out.println(text);
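To see why the default matters, consider what happens when UTF-8 bytes are interpreted as ISO-8859-1 (the servlet default when no charset is set). A minimal sketch of one direction of the mismatch:

String text = "café";
byte[] utf8 = text.getBytes("UTF-8");            // é becomes C3 A9
String garbled = new String(utf8, "ISO-8859-1"); // C3 -> Ã, A9 -> ©
System.out.println(garbled);                     // prints "cafÃ©"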
ZZ Coder