I have a JSP page retrieving data and when single or double quotes are in the text they are displayed as this symbol ”.

JSP Code:

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
        <title>General</title>
    </head>
    <body>
        <h1> <%= order.getDescription() %> </h1>
    </body>
</html>

Example: An order's description should look like this,

"20 - 4" x 6" widgets"

but I am getting this,

"20 - 4” x 6” widgets"

NOTE: I cannot make modifications to the database.

[ EDIT ]

I used the commons-lang-2.4.jar to escape the characters and these are the primary characters giving me trouble:

  1. &#145; -> ‘
  2. &#146; -> ’
  3. &#147; -> “
  4. &#148; -> ”
  5. &#150; -> –

I am sure other characters in the same format would give me issues. For now I just did a replace on these characters as a temporary fix, and I am currently testing the suggestions below.

[ CODE FOR SOLUTION ]

This is probably not the best way to do it, but it got the job done. The code below is in the backing bean, after the data is retrieved from the database.

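// Escape first so the stray control characters become numeric entities.
description = StringEscapeUtils.escapeHtml(description);

// Single smart quotes (cp1252 bytes 0x91, 0x92) become a straight apostrophe.
description = description.replace("&#145;", "&#39;");
description = description.replace("&#146;", "&#39;");
// Double smart quotes (0x93, 0x94) become a straight double quote.
description = description.replace("&#147;", "&quot;");
description = description.replace("&#148;", "&quot;");
// En dash (0x96) becomes a plain hyphen.
description = description.replace("&#150;", "-");

description = StringEscapeUtils.unescapeHtml(description);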
+1  A: 

These are probably non-standard characters in your database...perhaps directional quotes instead of the straight up-and-down ones?

A straightforward way to handle this, since you can't change the data in the database, would be to use a replace or regex to swap out the "bad" characters for ones that will display correctly.

Beska
This is not the exact answer, but it led to my solution.
Berek Bryan
A: 
Ned Batchelder
+6  A: 

That's character U+0094, which is a largely-unused control code. You will usually get characters in this range by accident if you use ISO-8859-1 to decode bytes that are actually in Windows codepage 1252 (Western European). They are similar encodings and often confused with each other, but the symbols in the range 0x80-0x9F are different. Windows cp1252 uses some of those for things like smart quotes, which is what you probably expected here: a double-close-quote (”, U+201D RIGHT DOUBLE QUOTATION MARK).
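To make the difference concrete, here is a minimal Java sketch that decodes the single byte 0x94 under both encodings. The class and method names (DecodeDemo, decode) are made up for illustration, and it assumes the JVM ships the windows-1252 charset, which standard JDKs normally do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // Decode a single byte with the given charset and return the resulting code point.
    static int decode(byte b, Charset cs) {
        return new String(new byte[] { b }, cs).codePointAt(0);
    }

    public static void main(String[] args) {
        byte quote = (byte) 0x94;
        Charset cp1252 = Charset.forName("windows-1252");
        // ISO-8859-1 maps every byte straight to the same code point: U+0094.
        System.out.printf("ISO-8859-1:   U+%04X%n", decode(quote, StandardCharsets.ISO_8859_1));
        // windows-1252 maps 0x94 to the right double quotation mark: U+201D.
        System.out.printf("windows-1252: U+%04X%n", decode(quote, cp1252));
    }
}
```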

Such is the confusion that most web browsers, when told that a web page is ISO-8859-1, will actually use cp1252 instead and would render the quote. So this probably isn't a markup-side issue.

What you probably have is a database that contains CP1252, and a data access layer that is converting the bytes out of it to a String using ISO-8859-1, perhaps because this is the server's default encoding. Ideally you'd want to configure the database to store Unicode strings natively, but if you can't do that you'll need a way to configure your database connector to use the CP1252 encoding instead of ISO-8859-1. How you do this depends on what you're connecting with and to; you might have to set a property, or include a parameter in a connection string.

If you can't do that with your data layer, about the only thing left is to manually go over all the string values you get from the database and transcode them back to what they should be, by encoding with ISO-8859-1 and then decoding the resulting bytes as CP1252. This would be a real pain to do, but as a last resort it would work.
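A rough sketch of that last-resort round trip in Java might look like the following. Cp1252Repair and repair are hypothetical names, not part of any library, and the sketch assumes every character in the input came from an ISO-8859-1 mis-decode; anything outside Latin-1 would be turned into '?' by the encoding step.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Cp1252Repair {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Undo an ISO-8859-1 mis-decode: recover the original bytes, then decode them as cp1252.
    static String repair(String misdecoded) {
        if (misdecoded == null) {
            return null;
        }
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, CP1252);
    }
}
```

With this, a description such as "20 - 4\u0094 x 6\u0094 widgets" would come back with U+201D in place of each U+0094, while plain ASCII text passes through unchanged.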

[Side-issue: close-double-quote is the incorrect character for denoting inches. ″ (Unicode U+2033 DOUBLE PRIME) would be best, but if you're limited to legacy encodings, a straight " double-quote will do.]

bobince
I think your diagnosis is slightly off - looking at the result, he's got the right Unicode data in his string, but that gets encoded to Cp1252 but decoded using UTF-8 as per the metadata - see my answer for more.
McDowell
That was my immediate reaction but I don't think it actually is what's happening. If you include an invalid sequence such as a lone 0x94 byte in a UTF-8 page, most browsers will give you a replacement character, such as ‘?’ or ‘�’, not the actual control character ‘”’ as posted in the question. Of course it's always a bit tricky with questions like these as these kinds of characters can easily get mangled again before being pasted here...
bobince
Ah, yes, you are correct; I recant.
McDowell
Your answer does address a very common case, which might be useful to keep for stumbling googlers unrecanted. Derecanted? Decanted?
bobince
Great write-up, very helpful. I was not able to get CP1252 to work, though.
Berek Bryan
hmm... try "windows-1252", I think that may be its name under Java.
bobince
A: 

U+0094, as pointed out, is not the straight double quote. Not that there is a problem with using a different quote, but U+0094 is not available in most fonts; only some East Asian fonts seem to have a glyph for it. In fact, it is the CANCEL CHARACTER, which falls in the control character category, not the initial quote or final quote punctuation categories.

It is also a relatively unused character, although it is present in the Latin-1 supplement Unicode block. So you could impose a filter (input or output) to handle this character.

The input filter would simply impose a whitelist of characters that your application will store, and obviously support in display.
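As a rough illustration, an input check along these lines could be sketched in Java as below. InputWhitelist, isAllowed, and accepts are hypothetical names, and the allowed set (everything except control characters, plus tab and line breaks) is only an assumption; a real application would pick its own list.

```java
class InputWhitelist {
    // Allow tab and line breaks, and anything that is not a C0/C1 control character.
    static boolean isAllowed(int codePoint) {
        if (codePoint == '\t' || codePoint == '\n' || codePoint == '\r') {
            return true;
        }
        return !Character.isISOControl(codePoint);
    }

    // A string is accepted only if every code point in it is allowed.
    static boolean accepts(String input) {
        return input.codePoints().allMatch(InputWhitelist::isAllowed);
    }
}
```

Under these assumptions a value containing U+0094 would be rejected at input time, before it ever reaches the database.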

The output filter would basically replace Unicode characters that give you problems, with better variants.
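One possible shape for such an output filter in Java is sketched below. OutputFilter and clean are illustrative names, and the choice of replacements (straight quotes and a hyphen) is an assumption based on the characters listed in the question.

```java
class OutputFilter {
    // Replace stray C1 controls and typographic quotes with plain ASCII equivalents.
    static String clean(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '\u0091': case '\u0092': case '\u2018': case '\u2019':
                    out.append('\'');   // single quotes
                    break;
                case '\u0093': case '\u0094': case '\u201C': case '\u201D':
                    out.append('"');    // double quotes
                    break;
                case '\u0096': case '\u2013':
                    out.append('-');    // en dash
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }
}
```

The result still needs normal HTML escaping before it is written into the page.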

Vineet Reynolds