msgfmt reports an “invalid multibyte sequence” error on Polish text. I can fix it by manually editing the charset in the MIME Content-Type header of the template file. Is there a command or option in the xgettext, msginit, msgfmt sequence that sets the charset instead?

cat >plt.cxx <<EOF
// plt.cxx
#include <libintl.h>
#include <locale.h>
#include <iostream>
int main (){
    setlocale(LC_ALL, "");
    bindtextdomain("plt", ".");
    textdomain( "plt");
    std::cout << gettext("Invalid input. Enter a string at least 20 characters long.") << std::endl;
}
EOF
g++ -o plt plt.cxx
xgettext --package-name plt --package-version 1.2 --default-domain plt --output plt.pot plt.cxx
sed --in-place plt.pot --expression='s/CHARSET/UTF-8/'
msginit --no-translator --locale pl_PL --output-file plt_polish.po --input plt.pot
sed --in-place plt_polish.po --expression='/#: /,$ s/""/"Nieprawidłowo wprowadzone dane. Wprowadź ciąg przynajmniej 20 znaków."/'
mkdir --parents ./pl_PL.utf8/LC_MESSAGES
msgfmt --check --verbose --output-file ./pl_PL.utf8/LC_MESSAGES/plt.mo plt_polish.po
LANGUAGE=pl_PL.utf8 ./plt
A:

There is no option for setting the output character encoding directly, but in practice this should not be a problem: your PO editor will automatically pick an appropriate character encoding when saving the PO file (one that supports all the characters used in the translation) and replace CHARSET in the file with the name of that encoding. If it doesn’t, file a bug.
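
If you want to catch a leftover placeholder before msgfmt complains, a small check along these lines might help. This is only a sketch: `po_charset` is a made-up helper name, not part of gettext, and the sample file is built by hand just to have something to run it on.

```shell
# po_charset is a hypothetical helper (not part of gettext): it prints the
# charset declared in the Content-Type header line of a PO or POT file.
po_charset() {
    sed -n 's/^"Content-Type: .*charset=\([^\\"]*\).*/\1/p' "$1" | head -n 1
}

# Build a tiny sample header to try it on; a freshly generated POT looks
# like this until something replaces the CHARSET placeholder.
sample=$(mktemp)
printf '%s\n' 'msgid ""' 'msgstr ""' \
    '"Content-Type: text/plain; charset=CHARSET\n"' > "$sample"

if [ "$(po_charset "$sample")" = "CHARSET" ]; then
    echo "charset is still the CHARSET placeholder; set it before running msgfmt"
fi
```

Run against the plt.pot from the question right after xgettext, `po_charset plt.pot` would print CHARSET, since that source contains only ASCII strings.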

The only problem would be if the POT file contained non-ASCII characters, but xgettext does have a --from-code argument for this, which specifies the encoding of the input files. If the input contains non-ASCII characters and --from-code is set to the correct encoding, the output POT file will have the character encoding set to UTF-8 (this need not be equal to the input character encoding). However, if the input files only contain ASCII characters, --from-code=UTF-8 will unfortunately have no effect.
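
To see the --from-code behaviour for yourself, something like the following should do. It is a sketch under a few assumptions: `demo.cxx` and its message are invented for illustration, the file is saved as UTF-8, and the whole thing is skipped if xgettext is not installed.

```shell
# Sketch only: skipped entirely when xgettext is not installed.
if command -v xgettext >/dev/null 2>&1; then
    workdir=$(mktemp -d)
    # demo.cxx is a made-up source file whose message contains non-ASCII text.
    cat > "$workdir/demo.cxx" <<'EOF'
#include <libintl.h>
const char *greeting() { return gettext("Wprowadź ciąg znaków."); }
EOF
    # --from-code declares the encoding of the input file.
    xgettext --from-code=UTF-8 --output="$workdir/demo.pot" "$workdir/demo.cxx"
    # Show the charset that ended up in the POT header.
    grep 'charset=' "$workdir/demo.pot"
fi
```

Swap the message for an ASCII-only one and the header should go back to the CHARSET placeholder, which is the case the question runs into.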

msginit does in fact set the character encoding automatically to something ‘appropriate’ for the chosen target locale. However, its list of locale-to-encoding mappings seems outdated; UTF-8 is now really the best choice for all languages.

An alternative would be to use pot2po instead of msginit. This always uses UTF-8 automatically, AFAICS. However, unlike msginit, it does not automatically fill out the plural forms of the PO file, which may or may not be a problem (some think it is the job of the PO editor to do this).

Karl Ove Hufthammer
A: 

Just give the full locale name, including the encoding, and msginit will set the charset correctly:

msginit --no-translator --input=xx.pot --locale=ru_RU.UTF-8

results in

"Language: ru\n"
"Content-Type: text/plain; charset=UTF-8\n"
Anatoly