views: 455
answers: 3

I'm using Spring's JdbcTemplate class to build and execute SQL queries in Java against a Postgres database. However, I'm having trouble executing queries that contain foreign/accented characters.

For example the (trimmed) code:

JdbcTemplate select = new JdbcTemplate( postgresDatabase );

String query = "SELECT id FROM province WHERE name = 'Ontario';";

Integer id = select.queryForObject( query, Integer.class );

will retrieve the province id, but if I instead use name = 'Québec' the query returns no results (the value is in the database, so the problem isn't that it's missing).

I believe the source of the problem is that the database I am required to use has the default client encoding set to SQL_ASCII, which according to this prevents automatic character set conversions. (The Java environment's encoding is set to 'UTF-8', while I'm told the database uses 'LATIN1' / 'ISO-8859-1'.)

As a solution to a previous problem of a similar nature, I was able to manually indicate the encoding when the result sets contained values with foreign characters.

Ex:

String provinceName = new String ( resultSet.getBytes( "name" ), "ISO-8859-1" );

But now that the foreign characters are part of the query itself, this approach hasn't been successful. (I suppose since the query has to be saved in a String before being executed anyway, breaking it down into bytes and then changing the encoding only muddles the characters further.)
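That muddling can be demonstrated without the database at all: encoding "Québec" as UTF-8 and then reinterpreting those bytes in a different charset produces mojibake rather than a usable literal. A minimal stand-alone sketch (plain Java, no Spring or Postgres involved):

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String name = "Québec"; // 6 characters

        byte[] utf8   = name.getBytes(StandardCharsets.UTF_8);      // 'é' -> 0xC3 0xA9 (7 bytes)
        byte[] latin1 = name.getBytes(StandardCharsets.ISO_8859_1); // 'é' -> 0xE9      (6 bytes)

        // Reinterpreting the UTF-8 bytes as Latin-1 splits 'é' into two characters:
        String mojibake = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(mojibake);      // prints "QuÃ©bec"
        System.out.println(utf8.length);   // 7
        System.out.println(latin1.length); // 6
    }
}
```

So whichever single byte sequence the driver ends up transmitting, it can only match the stored row if it was produced with the same encoding the row was originally written with.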

Is there a way around this without having to change the properties of the database or reconstruct it?

PostScript: I found this function on StackOverflow while making up a title; it didn't seem to work (I might not have used it correctly, but even if it did work, it doesn't seem like it would be the best solution):

Edit: I have selected my own answer for this, since it will be what I am using for now; however, as mentioned in a comment below I would be happy to look at other suggestions that may be better, for as long as I have access to the database.

A: 

In fact, if your database is "SQL_ASCII" encoded, it basically understands ASCII and nothing else. That means the word "Québec" has been stored "as provided": as the raw set of bytes produced by whatever encoding was used by the tool that executed the INSERT or UPDATE statement at that moment. Consequently, when you try to select such values, you must use that same encoding, and you have to know beforehand which one it is.

With that first matter handled, you then need a way to express that your request should use this encoding.

Let's say that it has been stored with ISO-8859-1 encoding.

I'm not sure whether it will work, but I would try something like this:

String myReq = "SELECT id FROM province WHERE name = 'Québec';";
byte[] iso8859sequence = myReq.getBytes("ISO-8859-1");
String myReqAscii = new String(iso8859sequence, "US-ASCII");
Integer id = select.queryForObject( query, Integer.class );
zim2001
Hmm, tried it and didn't get any results. Even tried it with 'UTF-8'. (I assume you meant to use 'myReqAscii' as the query instead of 'query').
FromCanada
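The reason this particular transformation can't succeed is also visible without the database: US-ASCII has no byte for 'é', so decoding the ISO-8859-1 bytes as US-ASCII replaces 0xE9 with the Unicode replacement character, and the resulting query no longer contains 'Québec' at all. A quick stand-alone check (plain Java, assuming nothing about the database):

```java
import java.nio.charset.StandardCharsets;

public class AsciiRoundTrip {
    public static void main(String[] args) {
        String myReq = "SELECT id FROM province WHERE name = 'Québec';";

        byte[] iso8859sequence = myReq.getBytes(StandardCharsets.ISO_8859_1); // 'é' -> 0xE9
        String myReqAscii = new String(iso8859sequence, StandardCharsets.US_ASCII);

        // 0xE9 is not valid US-ASCII, so the decoder substitutes U+FFFD:
        System.out.println(myReqAscii.contains("Québec"));      // false
        System.out.println(myReqAscii.contains("Qu\uFFFDbec")); // true
    }
}
```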
+1  A: 

Hmm okay, after slogging through the PostgreSQL documentation, I found a solution in the String Functions and Operators section.

I used the convert(string bytea, src_encoding name, dest_encoding name) function and managed to get the province id for Québec.

Ex.

String query = "SELECT id FROM province WHERE name = convert( 'Québec', 'UTF-8', 'ISO-8859-1' );"; 
FromCanada
Still, I would fix all the layers to use UTF-8 throughout, or at least one and the same character encoding that supports all the characters the application has to handle (maybe ISO-8859-1 is sufficient). Using SQL functions and/or programmatic workarounds to massage the strings only makes it all cumbersome, unmaintainable and unportable.
BalusC
I agree that the best approach would be to have the same character encoding for all layers, but I don't have any control over the database, so I essentially have to use it as is. The piece of code I'm writing is part of a larger project, where we agreed to use UTF-8 for all other purposes. I'm just a student on co-op, so I didn't really question the decision, but I guess I could always ask whether it's absolutely necessary.
FromCanada
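In byte terms, what that convert() call does can be mimicked client-side: the JDBC driver transmits the literal as UTF-8 bytes, the SQL_ASCII server stores and compares bytes verbatim, and convert(..., 'UTF-8', 'ISO-8859-1') re-encodes the UTF-8 byte sequence into Latin-1 bytes so it matches what the table actually holds. A rough Java illustration of the same byte-level transformation (just the idea, not a replacement for the SQL function):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ConvertIdea {
    public static void main(String[] args) {
        // Bytes the driver sends for the literal, given a UTF-8 client:
        byte[] asSent = "Québec".getBytes(StandardCharsets.UTF_8); // 7 bytes

        // convert(..., 'UTF-8', 'ISO-8859-1'): decode as UTF-8, re-encode as Latin-1.
        String decoded  = new String(asSent, StandardCharsets.UTF_8);
        byte[] asStored = decoded.getBytes(StandardCharsets.ISO_8859_1); // 6 bytes

        // These Latin-1 bytes match what a LATIN1-written row contains:
        System.out.println(Arrays.equals(asStored,
                "Québec".getBytes(StandardCharsets.ISO_8859_1))); // true
    }
}
```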
+1  A: 

If you are connecting from Java with encoding UTF-8 and the database is ISO-8859-1, then you should run this SQL command just after you initially connect to the DB:

SET client_encoding = 'UTF8';

PostgreSQL will then interpret all input as UTF-8 and then convert it to ISO-8859-1 on the server side. You shouldn't have to do anything other than that.

Matthew Wood
Could this cause issues if someone else needed to use the database for another application?
FromCanada
If you set it manually like this (i.e. not in the postgresql.conf file as the default), then it is only active within that particular session/connection. The only case I can see where this could cause a problem would be if you were sharing a pool of database connections across multiple applications that were a mix of LATIN-1 and UTF-8.
Matthew Wood
I tried it and got this error: `The server's client_encoding parameter was changed to UTF8. The JDBC driver requires client_encoding to be UNICODE for correct operation.` Changing it to UNICODE stopped the error, but the same problem of not getting the id was still there.
FromCanada
Okay, I didn't realize the JDBC driver now treats changing client_encoding as an error. Also, "UNICODE" and "UTF8" are the same thing on the PostgreSQL side, so it shouldn't make a difference. I will have to contemplate this further...
Matthew Wood
Well I have found something that gets the job done, so don't let it trouble you too much. However, I will be happy to test any other suggestions you wish to put forward.
FromCanada