views: 10245 · answers: 11

Hello!

I'm reading out lots of texts from various RSS feeds and inserting them into my database.

Of course, there are several different character encodings used in the feeds, e.g. UTF-8 and ISO-8859-1.

Unfortunately, there are sometimes problems with the encodings of the texts. Example:

1) The "ß" in "Fußball" should look like this in my database: "Ÿ". If it is a "Ÿ", it is displayed correctly.

2) Sometimes, the "ß" in "Fußball" looks like this in my database: "ß". Then it is displayed wrongly, of course.

3) In other cases, the "ß" is saved as a "ß" - so without any change. Then it is also displayed wrongly.

What can I do to avoid the cases 2 and 3?

How can I make everything the same encoding, preferably UTF-8? When must I use utf8_encode(), when must I use utf8_decode() (it's clear what they do, but when must I use them?), and when must I leave the input untouched?

Can you help me and tell me how to make everything the same encoding? Perhaps with the function mb_detect_encoding()? Can I write a function for this? So my problems are: 1) How to find out what encoding the text uses, and 2) How to convert it to UTF-8, whatever the old encoding is.

Thanks in advance!

EDIT: Would a function like this work?

function correct_encoding($text) {
    $current_encoding = mb_detect_encoding($text, 'auto');
    $text = iconv($current_encoding, 'UTF-8', $text);
    return $text;
}

I've tested it but it doesn't work. What's wrong with it?

+1  A: 

It's simple: when you get something that's not UTF-8, you must ENCODE that INTO UTF-8.

So, when you're fetching a feed that's ISO-8859-1, parse it through utf8_encode().

However, if you're fetching a UTF-8 feed, you don't need to do anything.
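A minimal sketch of that rule (the function name is made up, and the declared encoding is assumed to be known already, e.g. from the HTTP header; mb_convert_encoding() is used because it behaves like utf8_encode() for ISO-8859-1 input):

```php
<?php
// Convert a feed's text to UTF-8 based on its declared encoding.
// $declared is assumed to come from the Content-Type header or the
// XML declaration; this sketch only handles the two cases discussed here.
function feedToUtf8($text, $declared) {
    if (strcasecmp($declared, 'ISO-8859-1') === 0) {
        // equivalent to utf8_encode(), which only handles ISO-8859-1
        return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
    }
    // UTF-8 (and plain ASCII, which is a subset of both) needs no conversion
    return $text;
}
```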

Seb
Thanks! OK, I can find out how the feed is encoded by using mb_detect_encoding(), right? But what can I do if the feed is ASCII? utf8_encode() is just for ISO-8859-1 to UTF-8, isn't it?
ASCII is a subset of both ISO-8859-1 AND UTF-8, so using utf8_encode() should not make a change - IF it's actually just ASCII.
Michael Borgwardt
So I can always use utf8_encode() if it's not UTF-8? This would be really easy. The text which was ASCII according to mb_detect_encoding() contained "&auml;". Is this an ASCII character? Or is it HTML?
That's HTML. Actually it's encoded as an entity so that it shows up OK when you print it in a given page. If you want, you can first utf8_encode() and then html_entity_decode().
Seb
Yes, html_entity_decode() works in this case. But: The German "ß" emerges in different forms: Sometimes "Ÿ", sometimes "ß" and sometimes "ß". Why?
The character ß is encoded in UTF-8 with the byte sequence 0xC3 0x9F. Interpreted with Windows-1252, that sequence represents the two characters Ã (0xC3) and Ÿ (0x9F). And if you encode this byte sequence again with UTF-8, you’ll get 0xC3 0x83 0xC2 0x9F, which represents ß in Windows-1252. So your mistake is to handle this UTF-8 encoded data as something with an encoding other than UTF-8. That this byte sequence is presented as the characters you’re seeing is just a matter of interpretation. If you use another encoding/charset, you’ll probably see other characters.
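This byte arithmetic can be reproduced in a few lines; the sketch below is just an illustration of the mistake, using ISO-8859-1 for the erroneous reinterpretation:

```php
<?php
// "ß" is the single byte 0xDF in ISO-8859-1.
$latin1 = "\xDF";

// First conversion: correct. 0xDF becomes the UTF-8 sequence 0xC3 0x9F.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
echo bin2hex($utf8), "\n";   // c39f

// Second conversion: the mistake. The UTF-8 bytes are treated as
// ISO-8859-1 characters and encoded again: 0xC3 0x83 0xC2 0x9F.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($double), "\n"; // c383c29f
```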
Gumbo
Thank you. First, I want to say that all UTF-8 characters are shown as interpreted with Windows-1252 in my phpMyAdmin, so I don't handle them wrong. "Ÿ" is displayed correctly as "ß". I do the same things with all RSS feeds, but some feeds are parsed as "Ÿ" and some are parsed as "ß". That's the problem. Couldn't I do the following: look for "Ã" in the text; if it is in the text, then it must be double UTF-8 encoded, so I simply decode it one time and everything is fine. Would this work? How could I code this?
That’s why you should take the declared encoding into account. Not all data is encoded with the same encoding using the same character set; there are plenty of different character sets. Just by looking at the byte sequences you cannot determine what character set has been used. Take the ISO 8859 character set family as an example: 15 different character sets that all use the same encoding.
Gumbo
Thank you for your help! You've definitely convinced me to use the standards way. Is this script correct? http://paste.bradleygill.com/index.php?paste_id=9651 (Sorry for posting it several times as a comment, but you shouldn't overlook it. One answer is enough for me. :)
+2  A: 

Working out the character encoding of RSS feeds seems to be complicated. Even normal web pages often omit, or lie about, their encoding.

So you could try to use the correct way to detect the encoding and then fall back to some form of auto-detection (guessing).

Kevin ORourke
I don't want to read out the encoding from the feed information, so it doesn't matter whether that information is wrong. I would like to detect the encoding from the text itself.
@marco92w: It’s not your problem if the declared encoding is wrong. Standards have not been established for fun.
Gumbo
@Gumbo: but if you're working in the real world you have to be able to deal with things like incorrectly declared encodings. The problem is that it's very difficult to guess (correctly) the encoding just from some text. Standards are wonderful, but many (most?) of the pages/feeds out there don't comply with them.
Kevin ORourke
@Kevin ORourke: Exactly right. That's my problem. @Gumbo: Yes, it's my problem. I want to read out the feeds and aggregate them, so I must correct the wrong encodings.
@marco92w: But you cannot correct the encoding if you don’t know the correct encoding and the current encoding. And that’s what the `charset`/`encoding` declaration is for: to describe the encoding the data is encoded in.
Gumbo
Oh, now I've understood it. I thought it would be possible because I can surely say that "Ã" can't appear but "Ÿ" does. Another method I had imagined was to utf8_decode() it and then check whether it is normal text. If there is any "Ã" after utf8_decode(), then it must be wrong.
@marco92w: Again, the character that’s shown to you depends on the character encoding/set that was used to interpret the data. If you interpret UTF-8 encoded data with something other than UTF-8, you will probably get some oddities (except if you’re just using ASCII characters).
Gumbo
Thank you for your help! You've definitely convinced me to use the standards way. Is this script correct? http://paste.bradleygill.com/index.php?paste_id=9651 (Sorry for posting it several times as a comment, but you shouldn't overlook it. One answer is enough for me. :)
A: 

php.net/mb_detect_encoding

echo mb_detect_encoding($str, "auto");

or

echo mb_detect_encoding($str, "UTF-8, ASCII, ISO-8859-1");

I really don't know what the results are, but I'd suggest you just take some of your feeds with different encodings and try whether mb_detect_encoding() works or not.

update
'auto' is short for "ASCII,JIS,UTF-8,EUC-JP,SJIS". It returns the detected charset, which you can use to convert the string to UTF-8 with iconv().

<?php
function convertToUTF8($str) {
    $enc = mb_detect_encoding($str);

    if ($enc && $enc != 'UTF-8') {
        // note: iconv() may still raise a notice if $str contains
        // bytes that are invalid in the detected encoding
        return iconv($enc, 'UTF-8', $str);
    } else {
        return $str;
    }
}
?>

I haven't tested it, so no guarantee. And maybe there's a simpler way.

Schnalle
Thank you. What's the difference between 'auto' and 'UTF-8, ASCII, ISO-8859-1' as the second argument? Does 'auto' feature more encodings? Then it would be better to use 'auto', wouldn't it? If it really works without any bugs, then I must only change "ASCII" or "ISO-8859-1" to "UTF-8". How?
Your function doesn't work well in all cases. Sometimes I get an error: Notice: iconv(): Detected an illegal character in input string in ...
+8  A: 

Detecting the encoding is hard.

mb_detect_encoding() works by guessing, based on a number of candidates that you pass it. In some encodings, certain byte sequences are invalid, and therefore it can distinguish between various candidates. Unfortunately, there are a lot of encodings where the same bytes are valid (but different). In these cases, there is no way to determine the encoding; you can implement your own logic to make guesses. For example, data coming from a Japanese site might be more likely to have a Japanese encoding.

As long as you only deal with Western European languages, the three major encodings to consider are UTF-8, ISO-8859-1 and CP-1252. Since these are defaults for many platforms, they are also the most likely to be reported wrongly. E.g. if people use other encodings, they are likely to be frank about it, since otherwise their software would break very often. Therefore, a good strategy is to trust the provider, unless the encoding is reported as one of those three. You should still double-check that it is indeed valid, using mb_check_encoding() (note that valid is not the same as correct - the same input may be valid for many encodings). If it is one of those three, you can then use mb_detect_encoding() to distinguish between them. Luckily that is fairly deterministic; you just need to use the proper detect sequence, which is UTF-8,ISO-8859-1,WINDOWS-1252.

Once you've detected the encoding, you need to convert it to your internal representation (UTF-8 is the only sane choice). The function utf8_encode() transforms ISO-8859-1 to UTF-8, so it can only be used for that particular input type. For other encodings, use mb_convert_encoding().
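A sketch of this strategy (the function name and return convention are my own; it trusts an unusual declared encoding after validating it, and only runs detection for the three Western defaults):

```php
<?php
// Sketch of the strategy described above. $declared is the encoding
// the feed itself claims; the function name is made up for illustration.
function resolveEncoding($data, $declared) {
    $western = ['UTF-8', 'ISO-8859-1', 'WINDOWS-1252'];
    if (!in_array(strtoupper($declared), $western, true)) {
        // Unusual declared encoding: trust it, but verify it is valid.
        return mb_check_encoding($data, $declared) ? $declared : false;
    }
    // One of the three common defaults: these are often misreported,
    // so detect among them in the order given above (strict mode).
    return mb_detect_encoding($data, $western, true);
}
```

Note that since ISO-8859-1 assigns a character to every byte, anything that is not valid UTF-8 will fall through to ISO-8859-1 here; distinguishing it from CP-1252 needs extra logic.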

troelskn
Thank you very much! What's better: mb_convert_encoding() or iconv()? I don't know what the differences are. Yes, I will only have to parse Western European languages, especially English, German and French.
I've just seen: mb_detect_encoding() is useless. It only supports UTF-8, UTF-7, ASCII, EUC-JP, SJIS, eucJP-win, SJIS-win, JIS and ISO-2022-JP. The most important ones for me, ISO-8859-1 and WINDOWS-1252, aren't supported. So I can't use mb_detect_encoding().
My, you're right. It's been a while since I've used it. You'll have to write your own detection code then, or use an external utility. UTF-8 can be fairly reliably detected, because its escape sequences are quite characteristic. CP-1252 and ISO-8859-1 can be distinguished because CP-1252 may contain bytes that are illegal in ISO-8859-1. Use Wikipedia to get the details, or look in the comments section of php.net, under various charset-related functions.
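A rough sketch of such hand-rolled detection (the function name is illustrative; the 0x80-0x9F range holds printable characters in CP-1252 but C1 control codes in ISO-8859-1, which is the distinction described above):

```php
<?php
// Rough guess between the three Western encodings. UTF-8 is checked
// first because its multi-byte sequences are characteristic; bytes in
// 0x80-0x9F are printable (e.g. € or Ÿ) in CP-1252 but control codes
// in ISO-8859-1, so their presence suggests CP-1252.
function guessWesternEncoding($data) {
    if (mb_check_encoding($data, 'UTF-8')) {
        return 'UTF-8';
    }
    if (preg_match('/[\x80-\x9F]/', $data)) {
        return 'WINDOWS-1252';
    }
    return 'ISO-8859-1';
}
```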
troelskn
I think you can distinguish the different encodings when you look at the forms in which the special signs emerge: the German "ß" emerges in different forms, sometimes "Ÿ", sometimes "ß" and sometimes "ß". Why?
Yes, but then you need to know the contents of the string before comparing it, and that kind of defeats the purpose in the first place. The German ß appears differently because it has different values in different encodings. Some characters happen to be represented in the same way in different encodings (e.g. all characters in the ASCII charset are encoded in the same way in UTF-8, ISO-8859-* and CP-1252), so as long as you use just those characters, they all look the same. That's why they are sometimes called ASCII-compatible.
troelskn
OK, then it's quite easy, isn't it? Can't I just look for "Ã" in the texts? This only emerges if the text is double UTF-8 encoded, i.e. encoded once too often. So I must only decode it one time, right? The "Ã" wouldn't appear if the text were correct, since "Â" doesn't normally appear in German or English texts. Would this be a good approach? How could I code this in PHP? Would it work?
You cannot always tell just from looking for such oddities whether some data is not properly encoded. There is always the possibility that they are intended. Take your own question as an example.
Gumbo
Yes, they might be intended. But it would be fine for me if 99% of the texts were displayed correctly and only 1% were displayed wrongly because the "strange" characters were intended. If there was a possibility to achieve this, I would like to use it.
@marco92w: Well, then I’d suggest trying the standards way. I’d say the error rate is not much higher than with your guessing method. But even if it’s higher, you would support the standards.
Gumbo
Thank you for your help! You've definitely convinced me to use the standards way. Is this script correct? http://paste.bradleygill.com/index.php?paste_id=9651 (Sorry for posting it several times as a comment, but you shouldn't overlook it. One answer is enough for me. :)
+15  A: 

You first have to detect what encoding has been used. As you’re parsing RSS feeds (probably via HTTP), you should read the encoding from the charset parameter of the Content-Type HTTP header field. If it is not present, read the encoding from the encoding attribute of the XML processing instruction. If that’s missing too, use UTF-8 as defined in the specification.


Edit: Here is what I probably would do:

I’d use cURL to send the request and fetch the response. That allows you to set specific header fields and read the response header as well. After fetching the response, you have to parse it and split it into header and body. The header should then contain the Content-Type header field with the MIME type and (hopefully) the charset parameter holding the encoding/charset. If not, we’ll analyse the XML PI for the presence of the encoding attribute and get the encoding from there. If that’s also missing, the XML specs define that UTF-8 is to be used as the encoding.

$url = 'http://www.lr-online.de/storage/rss/rss/sport.xml';

$accept = array(
    'type' => array('application/rss+xml', 'application/xml', 'application/rdf+xml', 'text/xml'),
    'charset' => array_diff(mb_list_encodings(), array('pass', 'auto', 'wchar', 'byte2be', 'byte2le', 'byte4be', 'byte4le', 'BASE64', 'UUENCODE', 'HTML-ENTITIES', 'Quoted-Printable', '7bit', '8bit'))
);
$header = array(
    'Accept: '.implode(', ', $accept['type']),
    'Accept-Charset: '.implode(', ', $accept['charset']),
);
$encoding = null;
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HEADER, true);
curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
$response = curl_exec($curl);
if (!$response) {
    // error fetching the response
} else {
    $offset = strpos($response, "\r\n\r\n");
    $header = substr($response, 0, $offset);
    if (!$header || !preg_match('/^Content-Type:\s+([^;]+)(?:;\s*charset=(.*))?/im', $header, $match)) {
        // error parsing the response
    } else {
        if (!in_array(strtolower($match[1]), array_map('strtolower', $accept['type']))) {
            // type not accepted
        }
        $encoding = trim($match[2], '"\'');
    }
    if (!$encoding) {
        $body = substr($response, $offset + 4);
        if (preg_match('/^<\?xml\s+version=(?:"[^"]*"|\'[^\']*\')\s+encoding=("[^"]*"|\'[^\']*\')/s', $body, $match)) {
            $encoding = trim($match[1], '"\'');
        }
    }
    if (!$encoding) {
        $encoding = 'utf-8';
    } else {
        if (!in_array($encoding, array_map('strtolower', $accept['charset']))) {
            // encoding not accepted
        }
        if ($encoding != 'utf-8') {
            $body = mb_convert_encoding($body, 'utf-8', $encoding);
        }
    }
    $simpleXML = simplexml_load_string($body, null, LIBXML_NOERROR);
    if (!$simpleXML) {
        // parse error
    } else {
        echo $simpleXML->asXML();
    }
}
Gumbo
Thanks. This would be easy. But would it really work? There are often wrong encodings given in the HTTP headers or in the attributes of XML.
Again: That’s not your problem. Standards were established to avoid such troubles. If others don’t follow them, it’s their problem, not yours.
Gumbo
Ok, I think you've finally convinced me now. :)
Thanks for the code. But why not simply use this? http://paste.bradleygill.com/index.php?paste_id=9651 Your code is much more complex; what's better about it?
Well, firstly you’re making two requests, one for the HTTP header and one for the data. Secondly, you’re looking for any appearance of `charset=` and `encoding=` and not just in the appropriate positions. And thirdly, you’re not checking whether the declared encoding is accepted.
Gumbo
Ok, thank you very much. So I will use your function. But one last question: Why doesn't it work correctly? Here you can see some question marks so the browser can't show these characters correctly. Something must fail while converting the charset: http://www.twem.de/1.php
You’re not sending any encoding information. Thus the default in HTML (ISO 8859-1) is used.
Gumbo
No, that's not the cause. In line 26 of your code there is an error: undefined offset 2: $encoding = trim($match[2], '"\''); Sometimes the characters are correct (ö instead of ö), sometimes they aren't (À instead of ä). So there must be something wrong in your code or in the feed I want to parse.
Well then add a line to check if `$match[2]` exists before using it.
Gumbo
If $match[2] is set, it's clear that everything is going on as normal. But what to do if $match[2] is not set? Return false?
No, just do nothing. If there is no encoding declared in the HTTP header, the encoding in the XML declaration is used. And if that’s missing too, the default encoding is used.
Gumbo
Yes, logical. :) My very last question: why is the following line there? if (!in_array($encoding, array_map('strtolower', $accept['charset']))) { // encoding not accepted } Can't I just leave it out?
That piece of code was intended to accept just the charsets/encodings `mb_convert_encoding` accepts (see `mb_list_encodings`). Otherwise `mb_convert_encoding` will probably throw an error.
Gumbo
But it doesn't block wrong encodings/charsets, since the following line is no elseif but a normal if, right? So the line can be deleted without changing anything, can't it?
Your code also gives this error message: Warning: mb_convert_encoding() [function.mb-convert-encoding]: Illegal character encoding specified
Then try to find out the cause of this error. It took me just ten minutes to write that code and I didn't test it well. It might have more errors than this one.
Gumbo
+1  A: 

Your encoding looks like you encoded into UTF-8 twice; that is, from some other encoding into UTF-8, and then again into UTF-8. As if you had ISO-8859-1 data, converted it from ISO-8859-1 to UTF-8, and then treated the new string as ISO-8859-1 for another conversion into UTF-8.

Here's some pseudocode of what you did:

$inputstring = getFromUser();
$utf8string = iconv($current_encoding, 'utf-8', $inputstring);  // correct
$flawedstring = iconv($current_encoding, 'utf-8', $utf8string); // $utf8string is already UTF-8!

You should try:

  1. detect encoding using mb_detect_encoding() or whatever you like to use
  2. if it's UTF-8, convert into iso-8859-1, and repeat step 1
  3. finally, convert back into UTF-8

That is presuming that in the "middle" conversion you used ISO-8859-1. If you used Windows-1252, then convert into Windows-1252 (latin1). The original source encoding is not important; the one you used in the flawed, second conversion is.

This is my guess at what happened; there's very little else you could have done to get four bytes in place of one extended ASCII byte.

German also uses ISO-8859-2 and Windows-1250 (Latin-2).
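The repair steps above can be sketched as follows, assuming (as this answer does) that the flawed intermediate encoding was ISO-8859-1; the function name is made up:

```php
<?php
// Undo spurious extra layers of UTF-8 encoding. Assumes the flawed
// intermediate conversion treated the data as ISO-8859-1; for
// Windows-1252 the same idea applies with that charset instead.
function undoDoubleUtf8($text) {
    while (mb_check_encoding($text, 'UTF-8')) {
        // Reinterpret the UTF-8 bytes as ISO-8859-1 characters (step 2).
        $once = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');
        // Stop once peeling another layer would no longer yield valid
        // UTF-8 (or changes nothing): $text is then singly encoded.
        if (!mb_check_encoding($once, 'UTF-8') || $once === $text) {
            break;
        }
        $text = $once;
    }
    return $text;
}
```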

Ivan Vučica
A: 

As already mentioned above: encoding issues can be quite tedious.

I've used the guide at http://www.phpwact.org/php/i18n/charsets (with a link to a dedicated UTF-8 guide), and it resolved my issues. The page is still under construction, but it does provide a very precise description of the relevant issues when using UTF-8.

It sounds like case 3 is what you actually want: the characters are correct in the database. Usually it is sufficient to apply utf8_encode once before displaying the string.

Martijn
Thanks for the link. It's not case 3, since the characters are NOT ALL correct in the database. Some characters are displayed as "Ÿ", which gives the correct output. But some are saved as "ß", which seems to be a sign that I've encoded it twice. By the way: are there any differences between utf8_general_ci and utf8_unicode_ci?
Probably there is some difference between utf8_general_ci and utf8_unicode_ci, but I have never seen any. I'm using utf8_unicode_ci myself. Regarding your comment about the three cases: the method I'm describing will display case 3 correctly. If you have a UTF-8 message and want to enter it into the database, make sure to use utf8_decode; this will ensure that you're in case 3. The only problem left is to figure out whether a message is in UTF-8 or not.
Martijn
+7  A: 
miek
A: 

I know this is an older question, but I figure a useful answer never hurts. I was having issues with my encoding between a desktop application, SQLite, and GET/POST variables. Some would be in UTF-8, some would be in ASCII, and basically everything would get screwed up when foreign characters got involved.

Here is my solution. It scrubs your GET/POST/REQUEST (I omitted cookies, but you could add them if desired) on each page load before processing. It works well in a header. PHP will throw warnings if it can't detect the source encoding automatically, so these warnings are suppressed with @'s.

//Convert everything in our vars to UTF-8 for playing nice with the database...
//Use some auto detection here to help us not double-encode...
//Suppress possible warnings with @'s for when encoding cannot be detected
try
{
    $process = array(&$_GET, &$_POST, &$_REQUEST);
    while (list($key, $val) = each($process)) { // note: each() is deprecated as of PHP 7.2
        foreach ($val as $k => $v) {
            unset($process[$key][$k]);
            if (is_array($v)) {
                $process[$key][@mb_convert_encoding($k,'UTF-8','auto')] = $v;
                $process[] = &$process[$key][@mb_convert_encoding($k,'UTF-8','auto')];
            } else {
                $process[$key][@mb_convert_encoding($k,'UTF-8','auto')] = @mb_convert_encoding($v,'UTF-8','auto');
            }
        }
    }
    unset($process);
}
catch(Exception $ex){}
jocull
Thanks for the answer, jocull. The function mb_convert_encoding() is what we've already had here, right? ;) So the only new thing in your answer is the loop that changes the encoding in all variables.
A: 

A really nice way to implement an isUTF8-function can be found on php.net:

function isUTF8($string) {
    return (utf8_encode(utf8_decode($string)) == $string);
}
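One caveat with this trick: utf8_decode() maps every character outside the Latin-1 range to "?", so the round trip only validates strings limited to Latin-1 (and utf8_encode()/utf8_decode() were deprecated in PHP 8.2). A sketch of an alternative using mbstring:

```php
<?php
// mb_check_encoding() validates the byte sequence directly and is not
// limited to the Latin-1 range, unlike the utf8_encode/utf8_decode trick.
function isUTF8($string) {
    return mb_check_encoding($string, 'UTF-8');
}
```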
harpax
+2  A: 

If you apply utf8_encode() to an already-UTF-8 string, it will return garbled UTF-8 output.

I made a function that addresses all these issues. It's called forceUTF8().

You don't need to know what the encoding of your strings is. It can be Latin-1 (ISO-8859-1) or UTF-8, or the string can have a mix of the two. forceUTF8() will convert everything to UTF-8.

I did it because a service was giving me a feed of data that was all messed up, mixing UTF-8 and Latin-1 in the same string.

Usage:

$utf8_string = forceUTF8($utf8_or_latin1_or_mixed_string);

$latin1_string = forceLatin1($utf8_or_latin1_or_mixed_string);

Download:

http://dl.dropbox.com/u/186012/PHP/forceUTF8.zip

Update:

I've included another function, fixUTF8(), which will fix every UTF-8 string that looks garbled.

Usage:

$utf8_string = fixUTF8($garbled_utf8_string);

Examples:

echo fixUTF8("Fédération Camerounaise de Football");
echo fixUTF8("Fédération Camerounaise de Football");
echo fixUTF8("FÃÂédÃÂération Camerounaise de Football");
echo fixUTF8("Fédération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Sebastián Grignoli
Thank you very much, this is exactly what I was looking for :) But it would be best to have only one single function which does everything. So forceUTF8() should include fixUTF8()'s skills.
Well, if you look at the code, fixUTF8() simply calls forceUTF8() again and again until the string is returned unchanged. One call to fixUTF8() takes at least twice the time of a call to forceUTF8(), so it's a lot less performant. I made fixUTF8() just to create a command-line program that would fix "encode-corrupted" files, but in a live environment it's rarely needed.
Sebastián Grignoli
How does this convert non-UTF8 characters to UTF8, without knowing what encoding the invalid characters are in to begin with?
philfreo
It assumes ISO-8859-1; the answer already says this. The only difference between forceUTF8() and utf8_encode() is that forceUTF8() recognizes UTF-8 characters and keeps them unchanged.
Sebastián Grignoli