Hi all,

I have a large file containing world countries/regions that I'm separating into smaller files, one per country/region. The original file contains entries like:

  EE.04 Järvamaa
  EE.05 Jõgevamaa
  EE.07 Läänemaa

However when I extract that and write it to a new file, the text becomes:

  EE.04  Järvamaa
  EE.05  Jõgevamaa
  EE.07  Läänemaa

To save my files I'm using the following code:

mb_detect_encoding($text, "UTF-8") == "UTF-8" ? : $text = utf8_encode($text);
$fp = fopen(MY_LOCATION,'wb');
fwrite($fp,$text);
fclose($fp);

I tried saving the files with and without utf8_encode() and neither seems to work. How would I go about saving the original encoding (which is UTF8)?

Thank you!

A: 

You can do it as follows:

<?php
$s = "This is a string éèàç and it is in utf-8";
$f = fopen('myFile',"w");
fwrite($f, utf8_encode($s));
fclose($f);
?> 
Aman Kumar Jain
A: 

First off, don't depend on mb_detect_encoding. It's not great at figuring out what the encoding is unless the text contains a bunch of encoding-specific byte sequences (meaning sequences that are invalid in other encodings).

Try just getting rid of the mb_detect_encoding line altogether.

Oh, and utf8_encode turns a Latin-1 string into a UTF-8 string (not from an arbitrary charset to UTF-8, which is what you really want)... You want iconv, but you need to know the source encoding (and since you can't really trust mb_detect_encoding, you'll need to figure it out some other way).

Or you can try using iconv with an empty input encoding, $str = iconv('', 'UTF-8', $str); (which may or may not work)...
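
To illustrate the Latin-1 point, here is a minimal sketch of what happens when text that is already UTF-8 gets encoded a second time. The sample word comes from the question; the variable names are only illustrative. It uses iconv('ISO-8859-1', 'UTF-8', ...), which performs the same Latin-1 to UTF-8 conversion as utf8_encode (and utf8_encode is deprecated as of PHP 8.2).

```php
<?php
// "Järvamaa" written as raw UTF-8 bytes, so the example does not depend
// on the encoding of this script file.
$utf8 = "J\xC3\xA4rvamaa";

// Treating UTF-8 bytes as Latin-1 and converting them again garbles the text:
// each byte of the two-byte "ä" becomes its own two-byte character.
$double = iconv('ISO-8859-1', 'UTF-8', $utf8);

echo bin2hex($utf8), "\n";   // 4ac3a47276616d6161
echo bin2hex($double), "\n"; // 4ac383c2a47276616d6161

// With a known source encoding, the conversion is correct.
$latin1 = "J\xE4rvamaa";     // the same word in ISO-8859-1
$fixed  = iconv('ISO-8859-1', 'UTF-8', $latin1);
var_dump($fixed === $utf8);  // bool(true)
```

If the input really is UTF-8 already, the safest thing is to write the bytes out untouched.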

ircmaxell
When I used iconv with an empty input encoding, it gave me the notice "Detected an illegal character in input string" and the output is still messed up. How would I go about trying to figure out the source encoding?
You could try using [`mb_detect_encoding`](http://www.php.net/manual/en/function.mb-detect-order.php) and feeding the result into `iconv`. But realize there are some limitations to the charsets that it can detect. If it returns `false`, your other option is to bust out a hex editor and find the entities for a known multi-byte character and then try searching for that glyph on the internet to try to figure out what encoding it is. Where did you get the files from (That may provide a hint)...
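
If you don't have a hex editor handy, PHP can dump the leading bytes for you. This is only a sketch: `hex_head` is a made-up helper name, and the temporary file stands in for your real data file.

```php
<?php
// Return the first $n bytes of a file as space-separated hex pairs.
function hex_head(string $path, int $n = 16): string
{
    $bytes = file_get_contents($path, false, null, 0, $n);
    if ($bytes === false) {
        throw new RuntimeException("Cannot read $path");
    }
    return implode(' ', str_split(bin2hex($bytes), 2));
}

// Demo on a temporary file that starts with a UTF-8 BOM (ef bb bf).
$tmp = tempnam(sys_get_temp_dir(), 'enc');
file_put_contents($tmp, "\xEF\xBB\xBF" . "EE.04\tJ\xC3\xA4rvamaa\n");
echo hex_head($tmp, 8), "\n"; // ef bb bf 45 45 2e 30 34
unlink($tmp);
```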
ircmaxell
@ircmaxell, I got the file from geonames.org (http://download.geonames.org/export/dump/readme.txt). According to the website the encoding is UTF-8.
Well, it's likely not valid `UTF-8` then. Even if it had a [BOM (Byte Order Mark)](http://en.wikipedia.org/wiki/Byte_order_mark) it would be treated like a normal character... So something else is going on. The only thing I could say is to use [Wikipedia](http://en.wikipedia.org/wiki/UTF-8) as a reference with a hex editor and try to find out where the invalid character(s) is(are)...
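
As an alternative to scanning by eye, something like the following could locate the first bad byte. It is a rough sketch, not a tuned validator: `first_invalid_utf8_offset` is a hypothetical helper, and it leans on PCRE's `/u` modifier (which rejects invalid UTF-8 subjects) to check each sequence.

```php
<?php
// Return the byte offset of the first invalid UTF-8 sequence, or null if
// the whole string is valid UTF-8.
function first_invalid_utf8_offset(string $s): ?int
{
    $len = strlen($s);
    for ($i = 0; $i < $len; $i += $n) {
        $b = ord($s[$i]);
        // Expected sequence length, judged from the lead byte.
        if ($b < 0x80) {
            $n = 1;
        } elseif ($b >= 0xC2 && $b <= 0xDF) {
            $n = 2;
        } elseif ($b >= 0xE0 && $b <= 0xEF) {
            $n = 3;
        } elseif ($b >= 0xF0 && $b <= 0xF4) {
            $n = 4;
        } else {
            return $i; // stray continuation byte or illegal lead byte
        }
        // preg_match with /u returns false when the subject is not valid UTF-8,
        // which also catches truncated sequences and bad continuation bytes.
        if (preg_match('//u', substr($s, $i, $n)) !== 1) {
            return $i;
        }
    }
    return null;
}

var_dump(first_invalid_utf8_offset("J\xC3\xA4rvamaa")); // NULL
var_dump(first_invalid_utf8_offset("ab\xFF" . "cd"));   // int(2)
```

Feeding the reported offset into a hex dump of the file should show exactly which bytes are the problem.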
ircmaxell
The output I got was: ef 41 41 41 41 41 41 41 41 41. What does that mean exactly?
Well, `ef` is the start of a three-byte sequence. `41` is a single-byte encoded `A` character. So that's the invalid encoding... Are there really a bunch of `A` characters at the start of the file? I'd suggest re-downloading the file and seeing if it's still corrupt. If it is, you COULD try running this: `$string = str_replace(chr(239).chr(65).chr(65), chr(239).chr(187).chr(191).chr(65).chr(65), $string)` which will try to manually expand the `ef` out to a full BOM (`ef bb bf` in hex, `239 187 191` in decimal)...
ircmaxell
A: 

It appears that your source file is not, in fact, in UTF-8. You might want to try using the same approach you've been using, but with a different encoding, such as UTF-16 perhaps.

dvanaria