ansaurus

Question

Problem writing UTF-8 encoded file in PHP

Answer 1

A:

You can do it as follows:

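<?php
// utf8_encode() converts ISO-8859-1 (Latin-1) to UTF-8, so this is only correct
// when this script file is itself saved as Latin-1. If the string is already
// UTF-8, write $s directly; otherwise the text ends up double-encoded.
$s = "This is a string éèàç and it is in utf-8";
$f = fopen('myFile', "w");
fwrite($f, utf8_encode($s));
fclose($f);
?>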

Aman Kumar Jain 2010-08-20 16:34:20

Answer 2

+1 A:

First off, don't depend on mb_detect_encoding. It's not great at figuring out the encoding unless the input contains a number of encoding-specific byte sequences (meaning sequences that are invalid in other encodings).

Try getting rid of the mb_detect_encoding line altogether.

Oh, and utf8_encode turns a Latin-1 string into a UTF-8 string (not from an arbitrary charset to UTF-8, which is what you really want)... You want iconv, but you need to know the source encoding (and since you can't really trust mb_detect_encoding, you'll need to figure it out some other way).

Or you can try using iconv with an empty input encoding: $str = iconv('', 'UTF-8', $str); (which may or may not work)...
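As a minimal sketch of the iconv approach when the source encoding is known: the example below assumes the input is ISO-8859-1, and the sample string and output path are hypothetical, chosen only for illustration.

```php
<?php
// Assumption: the input bytes are known to be ISO-8859-1 (Latin-1).
$latin1 = "caf\xE9";                              // "café" encoded as Latin-1
$utf8   = iconv('ISO-8859-1', 'UTF-8', $latin1);  // explicit source encoding

// Hypothetical output path, for illustration only.
$path = sys_get_temp_dir() . '/utf8_demo.txt';
file_put_contents($path, $utf8);

// Print the resulting bytes in hex to confirm the conversion.
echo bin2hex($utf8), "\n"; // 636166c3a9
```

The single Latin-1 byte `e9` becomes the two-byte UTF-8 sequence `c3 a9`. If the input were already UTF-8, this conversion would double-encode it, so it only makes sense once the source encoding has been established.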

ircmaxell 2010-08-20 16:38:44

When I used iconv with an empty input encoding, I got the notice "Detected an illegal character in input string" and the output is still garbled. How would I go about figuring out the source encoding?

2010-08-20 16:47:04

You could try using [`mb_detect_encoding`](http://www.php.net/manual/en/function.mb-detect-order.php) and feeding the result into `iconv`. But realize there are some limitations to the charsets that it can detect. If it returns `false`, your other option is to bust out a hex editor, find the bytes for a known multi-byte character, and then search for that glyph on the internet to figure out what encoding it is. Where did you get the files from? That may provide a hint...
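If a hex editor is not at hand, the same inspection can be sketched in PHP. This assumes the mbstring extension is loaded; `hexPrefix` is a hypothetical helper name and the sample bytes are illustrative, not taken from the actual file.

```php
<?php
// Return the first $n bytes of a string as space-separated hex pairs.
function hexPrefix(string $bytes, int $n = 16): string
{
    return implode(' ', str_split(bin2hex(substr($bytes, 0, $n)), 2));
}

// Illustrative bytes: a lone 0xEF lead byte followed by ASCII "AAA".
$sample = "\xEF" . "AAA";

echo hexPrefix($sample), "\n";                        // ef 41 41 41
var_dump(mb_check_encoding($sample, 'UTF-8'));        // bool(false)
var_dump(mb_check_encoding("caf\xC3\xA9", 'UTF-8'));  // bool(true)
```

In practice you would pass the first few bytes read from the file instead of `$sample`. `mb_check_encoding` only says whether the bytes are valid UTF-8; it does not tell you what the real encoding is.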

ircmaxell 2010-08-20 16:53:16

@ircmaxell, I got the file from geonames.org (http://download.geonames.org/export/dump/readme.txt). According to the website the encoding is UTF-8.

2010-08-20 16:58:46

Well, it's likely not valid `UTF-8` then. Even if it had a [BOM (Byte Order Mark)](http://en.wikipedia.org/wiki/Byte_order_mark) it would be treated like a normal character... So something else is going on. The only thing I could say is to use [Wikipedia](http://en.wikipedia.org/wiki/UTF-8) as a reference with a hex editor and try to find out where the invalid character(s) is(are)...

ircmaxell 2010-08-20 17:06:53

The output I got was: ef 41 41 41 41 41 41 41 41 41. What does that mean exactly?

2010-08-20 17:07:31

Well, `ef` is the start of a three-byte sequence. `41` is a single-byte encoded `A` character. So that's the invalid encoding... Are there really a bunch of `A` characters at the start of the file? I'd suggest re-downloading the file and seeing if it's still corrupt. If that doesn't help, you COULD try running this: `$string = str_replace(chr(239).chr(65).chr(65), chr(239).chr(187).chr(191).chr(65).chr(65), $string)` which will try to manually expand the `ef` to a full BOM (`ef bb bf` in hex, `239 187 191` in decimal)...
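For reference, a complete UTF-8 BOM can be detected and removed with a few lines. This is only a sketch: `UTF8_BOM` and `stripUtf8Bom` are hypothetical names, and it does not repair a truncated BOM like the one described above.

```php
<?php
const UTF8_BOM = "\xEF\xBB\xBF"; // ef bb bf in hex, 239 187 191 in decimal

// Remove a leading UTF-8 BOM if present; otherwise return the input unchanged.
function stripUtf8Bom(string $s): string
{
    return strncmp($s, UTF8_BOM, 3) === 0 ? substr($s, 3) : $s;
}

echo bin2hex(UTF8_BOM), "\n";               // efbbbf
echo stripUtf8Bom(UTF8_BOM . "AAA"), "\n";  // AAA
echo stripUtf8Bom("AAA"), "\n";             // AAA
```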

ircmaxell 2010-08-20 17:12:39

Answer 3

A:

It appears that your source file is not, in fact, in UTF-8. You might want to try using the same approach you've been using, but with a different encoding, such as UTF-16 perhaps.

dvanaria 2010-08-20 16:39:49
