I have a script that combines a number of files into one, and it breaks when one of the files is UTF-8 encoded. I figure I should be calling utf8_decode() when reading the files, but I don't know how to tell which ones need decoding.

My code is basically:

$output = '';
foreach ($files as $filename) {
    $output .= file_get_contents($filename) . "\n";
}
file_put_contents('combined.txt', $output);

Currently, at the start of a UTF8 file, it adds these characters in the output: 

+1  A: 

How are you going to handle the non-ASCII characters from a UTF-8, UTF-16, or UTF-32 file?

I ask because I think you may have a design issue here.

I would convert your output file to UTF-8 (or UTF-16 or UTF-32) instead of the other way around.

Then you won't have this problem.

Have you also considered the security issues that may arise from converting an escaped UTF8 code? See this comment:

Detecting multi-byte encoding

Figure out what encoding your source file is in, then convert it to UTF8 and you should be good to go.
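
For what it's worth, here is a minimal sketch of that approach, assuming you already know each file's encoding and that the mbstring extension is loaded. The function name combine_as_utf8() and the filename => encoding map are made up for illustration; they are not part of the original script.

```php
<?php
// Hypothetical helper: $files maps each filename to the encoding it is
// known to be in, e.g. ['a.txt' => 'ISO-8859-1', 'b.txt' => 'UTF-8'].
function combine_as_utf8(array $files): string
{
    $output = '';
    foreach ($files as $filename => $encoding) {
        $data = file_get_contents($filename);
        // Only convert files that are not already UTF-8.
        if ($encoding !== 'UTF-8') {
            $data = mb_convert_encoding($data, 'UTF-8', $encoding);
        }
        $output .= $data . "\n";
    }
    return $output;
}
```

The result can then be passed to file_put_contents('combined.txt', ...) as before.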

cbrulak
OK, so I should convert the non-UTF-8 files into UTF-8... how do I tell which need a call to utf8_encode()?
nickf
You don't. You have to know which encoding the data is in. If you don't know, there is no reliable way to determine it.
troelskn
+2  A: 

Try using the mb_detect_encoding function. It examines your string and attempts to "guess" what its encoding is. You can then convert it as desired. As cbrulak suggested, however, you're probably better off converting to UTF-8 rather than from it, to preserve the data you're transmitting.
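
A rough sketch of the guess-then-convert idea, assuming the mbstring extension is available. The helper name to_utf8_guess() and the default candidate list are assumptions for illustration; detection is only a guess, so keep the candidate list as short as your data allows.

```php
<?php
// Hypothetical helper: guess the encoding from a short candidate list and
// convert the data to UTF-8. Returns null when no candidate matches.
function to_utf8_guess(string $data, array $candidates = ['UTF-8', 'ISO-8859-1']): ?string
{
    // Strict mode (third argument) rejects candidates for which the
    // byte sequence is invalid.
    $encoding = mb_detect_encoding($data, $candidates, true);
    if ($encoding === false) {
        return null;
    }
    return mb_convert_encoding($data, 'UTF-8', $encoding);
}
```

Single-byte encodings such as ISO-8859-1 accept almost any byte sequence, so a wrong guess produces wrong characters rather than an error.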

Ben Blank
I give up. I have *no idea* why SO is destroying my links. Looks fine in preview. :-/
Ben Blank
yeah wow... wtf is going on there. Cheers for the link though
nickf
Seems to have been the relative link confusing it... weird
bobince
@bobince — I *swear* I've used them before without problems. Guess I won't anymore. ;-)
Ben Blank
A: 

Your answer is here: http://en.wikipedia.org/wiki/Utf8#Byte-order_mark

Save the files as UTF-8 without a BOM when they are going to be read by PHP.
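
If you can't control how the files are saved, one option is to strip the mark while reading. This is only a sketch: strip_utf8_bom() and combine_files() are made-up names, and it assumes the stray characters are the UTF-8 byte-order mark (bytes EF BB BF) at the very start of a file.

```php
<?php
// Hypothetical helper: remove a leading UTF-8 BOM (bytes EF BB BF) if present.
function strip_utf8_bom(string $data): string
{
    if (strncmp($data, "\xEF\xBB\xBF", 3) === 0) {
        return (string) substr($data, 3);
    }
    return $data;
}

// Hypothetical helper: the same loop as in the question, minus the BOMs.
function combine_files(array $files): string
{
    $output = '';
    foreach ($files as $filename) {
        $output .= strip_utf8_bom(file_get_contents($filename)) . "\n";
    }
    return $output;
}
```

This leaves every other byte untouched, so it does not convert anything between encodings.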