Our application takes text from a web form and sends it via email to an appropriate user. However, when someone copy/pastes in the infamous "smart quotes" or other special characters from Word, things get hairy.

The user types in

he said “hello” to me—isn’t that nice?

But when the message appears in Outlook 2003, it comes out like this:

he said “hello” to me—isn’t that nice?

The code for this was:

Session session = Session.getInstance(props, new MailAuthenticator());
Message msg = new MimeMessage(session);

//removed setting to/from addresses to simplify

msg.setSubject(subject);
msg.setText(text);
msg.setHeader("X-Mailer", MailSender.class.getName());
msg.setSentDate(new Date());
Transport.send(msg);

After a little research, I figured this was probably a character encoding issue and attempted to move things to UTF-8. So, I updated the code thusly:

Session session = Session.getInstance(props, new MailAuthenticator());
MimeMessage msg = new MimeMessage(session);

//removed setting to/from addresses to simplify

msg.setHeader("X-Mailer", MailSender.class.getName());
msg.addHeader("Content-Type", "text/plain");
msg.addHeader("charset", "UTF-8");
msg.setSentDate(new Date());
Transport.send(msg);

This got me closer, but no cigar:

he said “hello” to me—isn’t that nice?

I can't imagine this is an uncommon problem--what have I missed?

A: 

Why don't you replace the nice quotes with regular prime quotes?
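A minimal sketch of that approach, assuming only the curly quotes and the em dash need handling. The class and method names are hypothetical, and the choice of replacements (for example `--` for the em dash) is an assumption, not something from the question:

```java
final class SmartQuoteReplacer {
    /** Replaces common Word punctuation with plain ASCII equivalents. */
    static String toPlainPunctuation(String text) {
        return text
                .replace('\u2018', '\'')   // left single quote
                .replace('\u2019', '\'')   // right single quote / apostrophe
                .replace('\u201C', '"')    // left double quote
                .replace('\u201D', '"')    // right double quote
                .replace("\u2014", "--");  // em dash (assumed replacement)
    }

    public static void main(String[] args) {
        String typed = "he said \u201Chello\u201D to me\u2014isn\u2019t that nice?";
        System.out.println(toPlainPunctuation(typed));
    }
}
```

This only covers the characters listed; anything else pasted from Word passes through unchanged, so it hides the encoding problem rather than fixing it.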

Daniel A. White
That's certainly an option, but if I can avoid having to create a map of "replace <character x> with <character y>" rules, I'd like to.
abeger
A: 

Is the page with your form also using UTF-8, or a different charset? If you don't specify the webpage charset, the format of data coming to your script is anyone's guess.


Edit: the charset in the message should be set like this:

msg.addHeader("Content-Type", "text/plain; charset=UTF-8");

since charset is not a separate header but a parameter of Content-Type.
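To see why the reader needs to be told the charset, here is a small sketch using only the JDK (no JavaMail). It encodes text as UTF-8 and then decodes the same bytes under two different assumptions. The class and method names are made up for illustration, and it assumes the JVM ships the windows-1252 charset, which common JDK builds do:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetLabelDemo {
    /** Encodes text as UTF-8, then decodes the bytes with the charset the reader assumes. */
    static String decodeAs(String text, Charset assumedByReader) {
        byte[] wire = text.getBytes(StandardCharsets.UTF_8);
        return new String(wire, assumedByReader);
    }

    public static void main(String[] args) {
        String typed = "\u201Chello"; // left double quotation mark + "hello"
        // Reader was told the right charset: text survives.
        System.out.println(decodeAs(typed, StandardCharsets.UTF_8));
        // Reader guesses windows-1252: each UTF-8 byte becomes its own character.
        System.out.println(decodeAs(typed, Charset.forName("windows-1252")));
    }
}
```

Decoded as UTF-8, the left quote comes back intact. Decoded as windows-1252, its three UTF-8 bytes turn into the three characters â€œ, which is the usual symptom of a missing or wrong charset label.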

Piskvor
Try setting the page's charset to UTF-8. I think it is up to Explorer to convert the pasted characters. “test”
KarlP
A: 

I would check that the data being received from the browser is correct - dump the Unicode code points and check them against the charts:

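  public static void printCodepoints(char[] s) {
    for (int i = 0; i < s.length; i++) {
      int codePoint = s[i];
      // Combine a high surrogate with the low surrogate that follows it, if any
      if (Character.isHighSurrogate(s[i]) && i + 1 < s.length
          && Character.isLowSurrogate(s[i + 1])) {
        codePoint = Character.toCodePoint(s[i], s[++i]);
      }
      System.out.println(Integer.toHexString(codePoint));
    }
  }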

For example, the symbol LEFT DOUBLE QUOTATION MARK (“) is character U+201C.
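If you are on Java 8 or later, the same dump can be sketched with `String.codePoints()`, which pairs up surrogates for you. The class and method names here are hypothetical, and the `U+XXXX` formatting is just my choice to make the output easy to compare with the charts:

```java
import java.util.List;
import java.util.stream.Collectors;

final class CodePointDump {
    /** Returns each code point of the text formatted as U+XXXX. */
    static List<String> describe(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in for the text received from the form
        System.out.println(describe("\u201Chi\u201D"));
    }
}
```

If the list shows U+201C and U+201D for the quotes, the form data arrived intact and the damage is happening later, when the mail is encoded or displayed.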

It has been a long time since I used the mail API, but the MimeMessage.setText(text, charset) method might be worth a look. The documentation on setText(String) says it uses the default character set (probably windows-1252 if you're using English/Latin-1 Windows).
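As a rough illustration of why depending on a default charset is risky, this JDK-only sketch (no JavaMail involved) checks whether a given charset can represent the text at all and shows what the encoded bytes look like when it cannot. The class and method names are hypothetical:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodableCheck {
    /** True if every character of the text can be represented in the charset. */
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String quote = "\u201C"; // left double quotation mark
        System.out.println("default charset: " + Charset.defaultCharset());
        Charset[] candidates = {
            StandardCharsets.US_ASCII, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8
        };
        for (Charset cs : candidates) {
            // getBytes substitutes the charset's replacement byte for unmappable characters
            byte[] bytes = quote.getBytes(cs);
            System.out.println(cs + ": encodable=" + canEncode(quote, cs)
                    + ", bytes=" + bytes.length);
        }
    }
}
```

When the charset cannot represent the quote, `getBytes` silently substitutes a `?` byte, so the original character is lost before the message is even sent. Passing an explicit charset such as UTF-8 avoids depending on whatever the server's default happens to be.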

McDowell
A: 

IIRC, MS Office quotes are found in the character set "iso-8859-1".

Cheers!

Dave

dave wanta