views: 429
answers: 5

I am not that good with encodings, and I am falling over with the basics here.

I am trying to create a file that is recognised as UTF-8:

header("Content-Type: text/plain; charset=utf-8");
header("Content-disposition: attachment; filename=test.txt");
echo "test";
exit();

also tried

header("Content-Type: text/plain; charset=utf-8");
header("Content-disposition: attachment; filename=test.txt");
echo utf8_encode("test");
exit();

I then open the file with Notepad++ and it says its current encoding is ANSI, not UTF-8. What am I missing? How should I be outputting this file?

I will eventually be outputting an XML file of products for the Affiliate Window program. Also, if it helps, my web server runs CentOS, Apache 2, and PHP 5.2.8.

Thanks in advance for any help!

+5  A: 

"test" is all ASCII, so there is no need to use UTF-8 for that.

In fact, the first 128 characters of the Unicode charset are the same as ASCII's, and UTF-8 uses the same byte values for those characters as ASCII does — so an all-ASCII file is also valid UTF-8. See Wikipedia's description of UTF-8 for further information.

Gumbo
Are you saying that I should output a lot more data and see if the encoding appears correct then?
Lizard
@Lizard: You need to use characters that are not in the ASCII charset to see a difference.
Gumbo
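
A quick way to verify Gumbo's point is a test script like the following sketch (the file name and text are made up; the byte escapes spell out "Bücher über Ägypten" in UTF-8, so the editor has non-ASCII sequences to detect):

```php
<?php
// "test" looks identical in ASCII, ISO-8859-1 and UTF-8; a word with
// umlauts does not. \xC3\xBC is the two-byte UTF-8 sequence for "ü".
$text = "B\xC3\xBCcher \xC3\xBCber \xC3\x84gypten"; // "Bücher über Ägypten"

header("Content-Type: text/plain; charset=utf-8");
header("Content-disposition: attachment; filename=test.txt");
echo $text;
```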
+5  A: 

Once you download the file, it no longer carries the information about the encoding, so Notepad++ has to guess it from the contents. There's a thing called a Byte Order Mark (BOM) which allows specifying the UTF encodings by a prefix in the contents.

See question "When a BOM is used, is it only in 16-bit Unicode text?".

I would imagine that using something like echo "\xEF\xBB\xBF" before writing the actual contents will force Notepad++ to recognise the file correctly.
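
As a sketch of that idea (assuming a standalone download script; the file name and body text are invented), the three BOM bytes simply go out before everything else:

```php
<?php
$bom  = "\xEF\xBB\xBF";        // UTF-8 byte order mark: EF BB BF
$body = "Gr\xC3\xBC\xC3\x9Fe"; // "Grüße", written as explicit UTF-8 bytes

header("Content-Type: text/plain; charset=utf-8");
header("Content-disposition: attachment; filename=test.txt");
echo $bom . $body;             // an editor sees the BOM and picks UTF-8
```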

Filip Navara
I wouldn't recommend using BOM. It's fairly exotic.
troelskn
UTF-8 is designed to be detectable from its byte sequences, so you do not need the BOM. In fact, it is only implemented by Microsoft, and using it only creates problems for cross-platform use.
bucabay
While BOMs are certainly not useful everywhere and their use should be carefully considered, stating that they are exotic or that non-Microsoft systems don't support them is plain wrong. They are supported by many text editors on Mac OS X and Linux/BSD/Unix as well.
Filip Navara
+2  A: 

There is no such thing as headers for downloaded txt files — once saved, the HTTP headers are gone. As you want to create XML files in the end anyway, and you can specify the charset in the XML declaration, try creating a simple XML structure and saving/opening that. Then it should work, as long as the OS has UTF-8 support, which any modern Linux distribution has.

Residuum
+4  A: 

As Filip said, encoding is not an intrinsic attribute of a file; it's implicit. This means that unless you know what encoding a file is to be interpreted in, there is no way to determine it. The best you can do is make a guess. This is presumably what programs such as Notepad++ do. Since the actual data that you have sent can be interpreted in many different encodings, it just picks the candidate that it likes best. For Notepad++ this appears to be ANSI (which in itself is a rather inaccurate classification), while other programs might default to something else.

The reason why you have to specify the charset in an HTTP header is exactly that the file itself doesn't contain this information, so the browser needs to be told about it. Once you have saved the file to disk, this information is thus unavailable.

If the file you're going to serve is an XML document, you have the option of putting the encoding information inside the actual document. That way it is preserved after the file is saved to disk. E.g. if you are using UTF-8, you should put this at the top of your document:

<?xml version="1.0" encoding="utf-8" ?>
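
A minimal sketch of such a feed in PHP (the product data here is invented for illustration):

```php
<?php
// Build the document first, so the declaration is guaranteed to be
// the very first thing in the output.
$xml  = '<?xml version="1.0" encoding="utf-8" ?>' . "\n";
$xml .= "<products>\n";
$xml .= "  <product><name>B\xC3\xBCcher</name></product>\n"; // "Bücher"
$xml .= "</products>\n";

header("Content-Type: application/xml; charset=utf-8");
echo $xml;
```

Unlike the HTTP header, the encoding declaration survives a save-to-disk.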

Note that apart from getting the meta-information about the charset across, you also need to make sure that the data you are serving actually is UTF-8 encoded. This is much the same scenario: you need to know implicitly what encoding your data is in. The function utf8_encode is (despite the name) specifically meant for converting ISO-8859-1 into UTF-8. Thus, if you use it on already UTF-8-encoded data, you'll get it double-encoded, with garbled data as the result.
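
The double-encoding effect is easy to demonstrate (note that utf8_encode() is deprecated in modern PHP, but it was current in the PHP 5.2 mentioned in the question):

```php
<?php
$latin1 = "\xFC";                // "ü" in ISO-8859-1: a single byte
$utf8   = @utf8_encode($latin1); // "\xC3\xBC" - the correct UTF-8 form

// Feeding already-UTF-8 data through it treats each byte as Latin-1:
$double = @utf8_encode($utf8);   // "\xC3\x83\xC2\xBC" - renders as "Ã¼"
```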

Charsets aren't that complicated in themselves. The problem is that if you aren't careful about keeping things straight, you'll mess it up. Whenever you have a string, you should be absolutely certain that you know which encoding it is in. Otherwise it's not a string - it's just a blob of binary data.

troelskn
Just changed my accepted answer to yours, as you have given me the most comprehensive answer. Thanks for all that, it has helped my understanding a lot! The function utf8_encode actually helped me a lot, because of the way I was storing data in the DB. Thanks again!
Lizard
A: 

I refer you to Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets".