Apparently, encoding Japanese emails is somewhat challenging, which I am slowly discovering myself. In case there are any experts (even those with limited experience will do), can I please have some guidelines on how to do it, how to test it and how to verify it?

Bear in mind that I've never set foot anywhere near Japan; it is simply that the product I'm developing is used there, among other places.

What (I think) I know so far is the following:
- Japanese emails should be encoded in ISO-2022-JP (Japanese JIS, code page 50220) or possibly Shift_JIS (code page 932)
- Email transfer encoding should be set to Base64 for plain text and 7Bit for HTML
- The email subject should be encoded separately and start with "=?ISO-2022-JP?B?" (I don't know what this is supposed to mean). I've tried encoding the subject with

"=?ISO-2022-JP?B?" + Convert.ToBase64String(Encoding.Unicode.GetBytes(subject))

which basically gives the encoded string as expected, but it doesn't get presented as Japanese text in an email program
- I've tested in Outlook 2003, Outlook Express and GMail

Any help would be greatly appreciated

+4  A: 

Check http://en.wikipedia.org/wiki/MIME#Encoded-Word for a description on how to encode header fields in MIME-compliant messages. You seem to be missing a “?=” at the end of your subject.

Bombe
+1  A: 

=?ISO-2022-JP?B?TEXTTEXT...

ISO-2022-JP means that the string is encoded in the ISO-2022-JP code page (i.e. not Unicode). B means that the string is base64 encoded.

In your example, you should just supply your string in ISO-2022-JP instead of Unicode.

dmajkic
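A minimal sketch of the point made in the two answers above, written in Python rather than the question's C#. The helper name `encode_subject` is made up for illustration, and the sketch assumes the subject is short enough to fit in a single encoded-word. The subject is converted to ISO-2022-JP bytes first, then base64-encoded, then wrapped in `=?ISO-2022-JP?B?` and `?=`; the standard library's `decode_header` is used to check the round trip.

```python
import base64
from email.header import decode_header


def encode_subject(subject: str, charset: str = "iso-2022-jp") -> str:
    """Build one RFC 2047 'B' encoded-word (hypothetical helper)."""
    raw = subject.encode(charset)                      # bytes in the declared charset
    b64 = base64.b64encode(raw).decode("ascii")        # base64 text, pure ASCII
    return "=?%s?B?%s?=" % (charset.upper(), b64)


subject = "日本語のテスト"
good = encode_subject(subject)

# The mistake from the question: UTF-16 bytes labelled as ISO-2022-JP.
bad = "=?ISO-2022-JP?B?%s?=" % base64.b64encode(
    subject.encode("utf-16-le")).decode("ascii")

# Verify by decoding the header the way a mail client would.
raw_bytes, declared_charset = decode_header(good)[0]
decoded = raw_bytes.decode(declared_charset)
print(decoded == subject)
```

Decoding the header again like this is also a cheap way to test generated subjects before sending anything to a real mail client.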
A: 

Ok, so to post a short update, thanks to the two helpful answers, I've managed to get the right format and encoding. Now, Outlook gives something that resembles the correct subject:
=?iso-2022-jp?B?6 Japanese test に各々の視点で語ってもらった。 6相当の防水?=

However, the exact same email in Outlook Express gives subject like this:
=?iso-2022-jp?B?6 Japanese test 縺ォ蜷・€・・隕也せ縺ァ隱槭▲縺ヲ繧ゅi縺」縺溘€・ 6逶ク蠖薙・髦イ豌エ?=

Furthermore, when viewed in the Inbox view in Outlook Express, the email subject is even more weird, like this:
=?iso-2022-jp?B?6 Japanese test ??????????????? 6???????=

Gmail seems to be working in a similar fashion to Outlook, which looks correct.

I just can't get my head around this one.

danijels
It sounds like you've got the encoding mechanics worked out, given that it works in two places. Are you sure you have Outlook Express set to the right encoding? From what you describe, it sounds like the two Outlook Express views are decoding the mail as UTF-8 and ISO-8859-1 (Latin-1) respectively. Maybe you can just go to View -> Encoding and change that?
blackkettle
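The different renderings in the update above are what you would expect when the same bytes are decoded with different charsets. The following Python sketch reproduces the effect on a single character. It assumes the subject bytes were UTF-8 and that the wrong decoder was code page 932; both are guesses based on how the garbled text looks, not something confirmed in this thread.

```python
original = "に"
utf8_bytes = original.encode("utf-8")   # b'\xe3\x81\xab'

# UTF-8 bytes decoded as Shift-JIS (code page 932) give kanji-looking garbage.
as_cp932 = utf8_bytes.decode("cp932", errors="replace")

# A view that cannot represent the character at all falls back to "?".
as_ascii = original.encode("ascii", errors="replace").decode("ascii")

print(repr(as_cp932), repr(as_ascii))
```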
A: 

I am having a very similar problem. In the XP Control Panel, I have set my Regional and Language Options to East Asian, including the language for non-Unicode programs. I have to process PST files and preserve the true metadata, and I am having trouble with the subject line and sometimes the To: and Cc: fields. The message body shows Japanese fine, but I get gibberish in the subject, as shown below:

CC FIELD: cc. │ᄄネ￧ヤᄏ ̄タタ₩ンノ¥ᄆᄆ₩ルᄎ₩チメ

SUBJECT FIELD: Re: ä¸‰è±ï¼¬ï¼£ï¼¤æ’¤é€€ã«é–¢ã™ã‚‹æƒ…å ±åŠã³åŒ—ç±³æ¶²æ™¶çŠ¶æ³

MESSAGE BODY: 佐藤さんへ:情報ありがとうございます。この機に是非とも三菱パークをリプレースしたいものです。 ところでこのシニアマネージャーはどうされたのですか?内も苦しいですが。

中村マネージャー:ADIはCPTへ売却打診中とのこと。うーん。

I am not a programmer, so please simplify any recommendations you have as to how I can fix the subject line. FYI, I am using Outlook 2007 Pro and Windows XP Pro, and the PST files are preexisting, so they are being opened via File --> Open Outlook Data File. Please HELP!
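The subject line above looks like UTF-8 text that was decoded as Windows-1252: for example, `ä¸‰` is how the three UTF-8 bytes of `三` appear in that code page. Assuming that diagnosis is right, here is a small Python sketch of reproducing and reversing the damage. The function name `repair_mojibake` is hypothetical, and the reversal only works if no bytes were lost along the way.

```python
def repair_mojibake(garbled: str, wrong: str = "cp1252", right: str = "utf-8") -> str:
    """Undo one round of 'decoded with the wrong charset' (hypothetical helper)."""
    return garbled.encode(wrong).decode(right)


# Reproduce the damage: UTF-8 bytes read as Windows-1252.
garbled = "三".encode("utf-8").decode("cp1252")

# Reverse it: get the original bytes back, then decode them as UTF-8.
repaired = repair_mojibake(garbled)
print(garbled, "->", repaired)
```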

A: 

First of all you should be using:

Encoding.GetEncoding("ISO-2022-JP")

to convert your subject line into bytes that will be processed by Convert.ToBase64String().

=?ISO-2022-JP?B?TEXTTEXT...?= tells the receiving mail client which encoding was used on the sender's side to convert Japanese "letters" into a byte stream.

Currently you're using UTF-16 (which is what Encoding.Unicode means in .NET) to encode, but declaring ISO-2022-JP for decoding. These are two different encodings, just as ISO-8859-1 differs from UTF-16: most extended Western European characters take one byte in ISO-8859-1 but two bytes in UTF-16.

I'm not sure what you mean about UTF-8 being second-class citizen. As long as the receiving mail client understands UTF-8 and is able to convert it to the current japanese locale, everything is fine.

liggett78
UTF-8 is definitely a second-class citizen in Japan — depressingly, as their own standards are bloody terrible. Cell phone web browsers have only just caught up to supporting it and there are still web mail providers that can't understand incoming UTF-8 mail. It's absolutely pathetic.
bobince
A: 

Hey Buddy,

I have some experience composing and sending email in Japanese. You have to be aware of which encoding the operating system uses and how you store your Japanese strings! My mail objects are normally encoded as follows:

    string s = "V‚µ‚¢ŠwK–@‚Ì‚²’ñˆÄ"; // Our japanese are shift-jis encoded, so it appears like garbled
    MailMessage message = new MailMessage();
    message.BodyEncoding = Encoding.GetEncoding("iso-2022-jp");
    message.SubjectEncoding = Encoding.GetEncoding("iso-2022-jp");
    message.Subject = s.ToEncoding(Encoding.GetEncoding("Shift-Jis")); // Change the encoding to whatever your source is
    message.Body = s.ToEncoding(Encoding.GetEncoding("Shift-Jis")); // Change the encoding to whatever your source is

Then I have an extension method that does the conversion for me:

    public static string ToEncoding(this string s, Encoding targetEncoding)
    {
        // 1252 is the Windows OS code page
        return s == null ? null : targetEncoding.GetString(Encoding.GetEncoding(1252).GetBytes(s));
    }
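One caveat about the round trip through code page 1252 in `ToEncoding`: a few byte values are undefined in that code page, so Shift-JIS bytes that land on them may not survive being held in a string. The Python sketch below illustrates this with `新`, whose Shift-JIS encoding starts with the byte 0x90. Python's strict `cp1252` codec rejects that byte; other runtimes may substitute or drop it instead, and that this is what happened to the sample string above is only an assumption.

```python
sjis = "新しい".encode("shift_jis")   # b'\x90V\x82\xb5\x82\xa2'

# Try to hold the Shift-JIS bytes in a string via code page 1252.
try:
    sjis.decode("cp1252")
    survives = True
except UnicodeDecodeError:
    survives = False          # 0x90 has no character in cp1252

# Keeping the bytes as bytes and decoding once with the real charset is lossless.
text = sjis.decode("shift_jis")
print(survives, text)
```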
+11  A: 

I've been dealing with Japanese encodings for almost 20 years and so I can sympathize with your difficulties. Websites that I've worked on send hundreds of emails daily to Japanese customers so I can share with you what's worked for us.

  • First of all, do not use Shift-JIS. I personally receive tons of Japanese emails and almost never are they encoded using Shift-JIS. I think an old (circa Win 98?) version of Outlook Express encoded outgoing mail using Shift-JIS, but nowadays you just don't see it.

  • As you've figured out, you need to use ISO-2022-JP as your encoding for at least anything that goes in the mail header. This includes the Subject, To line, and CC line. UTF-8 will also work in most cases, but it will not work on Yahoo Japan mail, and as you can imagine, many Japanese users use Yahoo Japan mail.

  • You can use UTF-8 in the body of the email, but it is recommended that you base64 encode the UTF-8 encoded Japanese text and put that in the body instead of raw UTF-8 text. However, in practice, I believe that raw UTF-8 text will work fine these days, for the body of the email.

  • As I alluded to above, you need to at least test on Outlook (Exchange), Outlook Express (IMAP/POP3), and Yahoo Japan web mail. Yahoo Japan is the trickiest because I believe they use EUC for the encoding of their web pages, and so you need to follow the correct standards for your emails or they won't work (ISO-2022-JP is the standard for sending Japanese emails).

  • Also, your subject line should not exceed 75 characters per line. That is, 75 characters after you've encoded in ISO-2022-JP and base64, not 75 characters before conversion. If you exceed 75 characters, you need to break your encoded subject into multiple lines, starting with "=?iso-2022-jp?B?" and ending with "?=" on each line. If you don't do this, your subject might get truncated (depending on the email reader, and also the content of your subject text). According to RFC 2047:

"An 'encoded-word' may not be more than 75 characters long, including 'charset', 'encoding', 'encoded-text', and delimiters. If it is desirable to encode more text than will fit in an 'encoded-word' of 75 characters, multiple 'encoded-word's (separated by CRLF SPACE) may be used."

  • Here's some sample PHP code to encode the subject:

 // Convert Japanese subject to ISO-2022-JP (JIS is essentially ISO-2022-JP)

 $subject = mb_convert_encoding ($subject, "JIS", "SJIS");

 // Now, base64 encode the subject

 $subject = base64_encode ($subject);

 // Add the encoding markers to the subject

 $subject = "=?iso-2022-jp?B?" . $subject . "?=";

 // Now, $subject can be placed as-is into the raw mail header.
  • See RFC 2047 for a complete description of how to encode your email header.
保田ジェフリー
Great reply! Not only does Yahoo mail have trouble supporting UTF-8, but most Japanese cell phones still do not support receiving email in UTF-8, so you are stuck with iso-2022-jp
Elijah
Actually I've found the webmail services and cellphones support Shift-JIS fine. It's the most compact of the available encodings, so we go for that and haven't had any problems yet.
bobince
Definitely best to stick with iso-2022-jp, at least for the subject, as this is by far the most widely supported, especially in the case of cell phones. In most cases new phones (especially SoftBank) now support UTF-8, but anything more than a year old will almost certainly not. Also be careful to stay away from iso-2022-jp-ext. It's *almost* the same as iso-2022-jp, but in my experience the extended characters are often not supported by cell phones.
blackkettle
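The 75-character limit on encoded-words described in the answer above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the names `encode_word` and `fold_subject` are made up, and the splitting is done one character at a time so that every chunk is a complete ISO-2022-JP string that starts and ends in ASCII mode.

```python
import base64

PREFIX = "=?ISO-2022-JP?B?"
SUFFIX = "?="
MAX_LEN = 75  # RFC 2047 limit for one encoded-word, delimiters included


def encode_word(text: str) -> str:
    """Encode one chunk as a single 'B' encoded-word."""
    raw = text.encode("iso-2022-jp")
    return PREFIX + base64.b64encode(raw).decode("ascii") + SUFFIX


def fold_subject(subject: str) -> str:
    """Split a subject into encoded-words of at most MAX_LEN characters."""
    words, chunk = [], ""
    for ch in subject:
        # Measure the length AFTER encoding, as the answer points out.
        if chunk and len(encode_word(chunk + ch)) > MAX_LEN:
            words.append(encode_word(chunk))
            chunk = ch
        else:
            chunk += ch
    if chunk:
        words.append(encode_word(chunk))
    return "\r\n ".join(words)  # CRLF SPACE between encoded-words


subject = "日本語の件名が長い場合のテストです。" * 3
print(fold_subject(subject))
```

Re-encoding the growing chunk for every character is quadratic, which is acceptable for subject lines but would be wasteful for long texts.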
A: 

Something like this should get the job done in Python:


#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import smtplib
import sys
from email.header import Header
from email.mime.text import MIMEText
from email.utils import formatdate

def send_from_gmail(from_addr, to_addr, subject, body, password, encoding="iso-2022-jp"):
    # MIMEText and Header convert the strings to the given charset themselves.
    msg = MIMEText(body, 'plain', encoding)
    msg['Subject'] = Header(subject, encoding)
    msg['From'] = from_addr
    msg['To'] = to_addr
    msg['Date'] = formatdate()

    s = smtplib.SMTP('smtp.gmail.com', 587)
    s.ehlo(); s.starttls(); s.ehlo()

    s.login(from_addr, password)
    s.sendmail(from_addr, to_addr, msg.as_string())
    s.quit()
    return "Sent mail to: %s" % to_addr


if __name__ == "__main__":
    # Python 3 already decodes sys.argv to str, so no manual decoding is needed.
    if len(sys.argv) == 6:
        print(send_from_gmail(*sys.argv[1:6]))
    elif len(sys.argv) == 7:
        print(send_from_gmail(*sys.argv[1:6], encoding=sys.argv[6]))
    else:
        sys.exit("SYNTAX: %s <from_addr> <to_addr> <subject> <body> <password> [encoding]" % sys.argv[0])

Blatantly stolen/adapted from:

http://mtokyo.blog9.fc2.com/blog-entry-127.html

blackkettle
A: 
<?php

function sendMail($to, $subject, $body, $from_email, $from_name)
{
    $headers  = "MIME-Version: 1.0 \n";
    $headers .= "From: " .
        "" . mb_encode_mimeheader(mb_convert_encoding($from_name, "ISO-2022-JP", "AUTO")) . "" .
        "<" . $from_email . "> \n";
    $headers .= "Reply-To: " .
        "" . mb_encode_mimeheader(mb_convert_encoding($from_name, "ISO-2022-JP", "AUTO")) . "" .
        "<" . $from_email . "> \n";

    $headers .= "Content-Type: text/plain;charset=ISO-2022-JP \n";

    /* Convert body to same encoding as stated
       in Content-Type header above */
    $body = mb_convert_encoding($body, "ISO-2022-JP", "AUTO");

    /* Mail, optional parameters. */
    $sendmail_params = "-f$from_email";

    mb_language("ja");
    $subject = mb_convert_encoding($subject, "ISO-2022-JP", "AUTO");
    $subject = mb_encode_mimeheader($subject);

    $result = mail($to, $subject, $body, $headers, $sendmail_params);

    return $result;
}
avanish
A: 

I have one question. Why shouldn't quoted-printable encoding (Q encoding) be used with the iso-2022-jp charset in email? Why do we have to stick to Base64 encoding (B encoding)? Please clarify.

Sathish
Base64 is more space-efficient if less than 5/6 of the original bytes are ASCII.
dan04
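The 5/6 figure in the comment above follows from simple arithmetic: base64 turns every 3 bytes into 4 characters, while quoted-printable leaves a plain byte as 1 character and turns an escaped byte into 3 (`=XX`). With a fraction p of plain bytes, the sizes are equal when 4/3 = p + 3(1 - p), that is p = 5/6. A Python sketch of that model, ignoring line breaks and soft line breaks (the function names are illustrative only):

```python
def base64_size(n_bytes: int) -> int:
    """Characters produced by base64 for n_bytes of input (with padding)."""
    return 4 * ((n_bytes + 2) // 3)


def qp_size(n_plain: int, n_escaped: int) -> int:
    """Characters produced by quoted-printable: 1 per plain byte, 3 per escaped byte."""
    return n_plain + 3 * n_escaped


total = 600
for n_plain in (400, 500, 550):   # below, at, and above 5/6 of 600
    print(n_plain, qp_size(n_plain, total - n_plain), base64_size(total))
```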
A: 

Japanese encoding was introduced to e-mail on JUNET (a UUCP-based nationwide network) in the early 1990s.

RFC 1468 was defined at that time. If you follow RFC 1468 for plain-text mail, there should be no problem.

If you want to handle HTML mail, RFC 1468 is of no use except for the header parts.

kmugitani
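RFC 1468 describes ISO-2022-JP as a 7-bit encoding that switches between ASCII and the Japanese character sets with escape sequences, which is why a plain-text body in this charset can be sent with a 7bit transfer encoding. A quick Python check of those properties, using the behaviour of Python's built-in `iso-2022-jp` codec (the sample text is arbitrary):

```python
body = "日本語のメール本文"
raw = body.encode("iso-2022-jp")

# Every byte fits in 7 bits, so no 8-bit-clean transport is required.
is_7bit = all(b < 0x80 for b in raw)

# The text switches into JIS X 0208 with ESC $ B ...
starts_in_jis = raw.startswith(b"\x1b$B")
# ... and must return to ASCII with ESC ( B at the end.
ends_in_ascii = raw.endswith(b"\x1b(B")

print(is_7bit, starts_in_jis, ends_in_ascii)
```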