views: 466 · answers: 4
I would like to work only with UTF-8. The problem is that I don't know the charset of every webpage. How can I detect it and convert to UTF-8?

<?php
$url = "http://vkontakte.ru";
$ch = curl_init($url);
$options = array(
    CURLOPT_RETURNTRANSFER => true,
);
curl_setopt_array($ch, $options);
$data = curl_exec($ch);
curl_close($ch);

// $data = magic($data);

print $data;

See this at: http://paulisageek.com/tmp/curl-utf8

What is magic()?

A: 

You can try mb_detect_encoding() and mb_convert_encoding():

http://www.php.net/manual/en/function.mb-detect-encoding.php

http://www.php.net/manual/en/function.mb-convert-encoding.php

Although this is not foolproof.
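A minimal sketch of how those two functions fit together. The candidate list and fallback are assumptions you would tune to the sites you scrape; order matters, because earlier entries win, and strict mode (the third argument) avoids some false positives:

```php
// Best-effort conversion using mbstring alone (hypothetical helper).
// Candidate order matters: ISO-8859-1 accepts almost any byte sequence,
// so anything listed after it is effectively unreachable.
function to_utf8_guess($data) {
    $encoding = mb_detect_encoding($data, array('UTF-8', 'ISO-8859-1'), true);
    if ($encoding === false)
        $encoding = 'ISO-8859-1'; // give up and assume the HTTP default
    return mb_convert_encoding($data, 'UTF-8', $encoding);
}
```

This is exactly the guesswork the other answers warn about, so treat it as a last resort, not a first step.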

Alec Smart
+2  A: 

The converting is easy; the detecting is the hard part. You could try mb_detect_encoding(), but that is a very shaky method: it literally "guesses" the encoding. As @troelskn highlights in the comments, it can at best tell rough differences apart (is it a multi-byte encoding?) but fails to distinguish nuances of similar character sets.

The proper way would be IMO:

  • Interpreting any content-type Meta tags in the page
  • Interpreting any content-type headers sent by the server
  • If that yields nothing, try to "sniff" the encoding using mb_detect_encoding()
  • If that yields nothing, fall back to a defined default (maybe ISO-8859-1, maybe UTF-8).

Contrary to the order given in the guidelines in @Gumbo's answer, I personally think Meta tags should have priority over server headers, because if a Meta tag is present, it is a more reliable indicator of the actual encoding of the page than a server setting some site operators don't even know how to change. The correct way, however, seems to be to treat Content-Type headers with higher priority.

For the former, you can use get_meta_tags(). The latter you should already be getting from curl; you just have to parse it. Here is a full example on how to systematically process response headers served by cURL.
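For that header parsing step, a sketch of pulling the charset out of the Content-Type value curl hands you (the helper name is made up):

```php
// Hypothetical helper: extract the charset parameter from a Content-Type
// value such as the one curl_getinfo($ch, CURLINFO_CONTENT_TYPE) returns.
// Returns null when no charset parameter is present.
function charset_from_content_type($content_type) {
    if (is_string($content_type) &&
        preg_match('@;\s*charset=["\']?([^\s"\';]+)@i', $content_type, $m))
        return $m[1];
    return null;
}
```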

The conversion would then be using iconv:

$new_content = iconv("incoming-charset", "utf-8", $content);
Pekka
Don't other people have to do this? I can't be the first to run across this problem. Isn't there existing code that detects this well?
Paul Tarjan
@Paul very good question! There ought to be a library, but I don't know any. If nothing else comes up, your best bet may be looking at PHP "Browser simulator" classes, whether any of those has this implemented well.
Pekka
HTTP headers should probably be given higher priority than meta tags.
troelskn
.. and `mb_detect_encoding` is bordering on unusable, since it can't distinguish between various `iso-8859-X` encodings. In particular, it won't be able to tell the difference between `iso-8859-1` and `cp-1252`.
troelskn
@troelskn Re Meta tags: Convention says yes, I say no for the reason outlined above - everyone take their pick. Re encoding detection: Very good point, I'm updating my answer accordingly.
Pekka
I was told by an experienced PHP programmer that a machine on the way from server to client may re-encode the page for some reason (I personally can't see any). When it does so, it does not modify the content, so the meta tags stay the same, but the HTTP headers are changed. That is why we should prefer HTTP headers over meta tags. Does that make sense?
Petr Peller
@Petr that indeed makes sense as a possibility, yes. I don't know how widespread that practice would be, though. Maybe worth a question of its own....
Pekka
A: 

There is a defined order for determining the character encoding of an HTML document:

[…] conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):

  1. An HTTP "charset" parameter in a "Content-Type" field.
  2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
  3. The charset attribute set on an element that designates an external resource.

If no character encoding declaration is present, HTTP defines ISO 8859-1 as default character encoding. You can either use that as default character encoding for HTML too or simply refuse to process the response.

For XHTML you additionally have the XML declaration as source for the encoding:

In an XML document, the character encoding of the document is specified on the XML declaration (e.g., <?xml version="1.0" encoding="EUC-JP"?>). In order to portably present documents with specific character encodings, the best approach is to ensure that the web server provides the correct headers. If this is not possible, a document that wants to set its character encoding explicitly must include both an XML declaration with an encoding declaration and a meta http-equiv statement (e.g., <meta http-equiv="Content-type" content="text/html; charset=EUC-JP" />). In XHTML-conforming user agents, the value of the encoding declaration of the XML declaration takes precedence.

If no character encoding declaration is present, XML defines UTF-8 and UTF-16 as the default character encodings:

Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16.

So, to sum up, the order is:

  1. An HTTP "charset" parameter in a "Content-Type" field.
  2. XML declaration with encoding attribute.
  3. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".

If no character encoding declaration is present, you may assume ISO 8859-1 as default encoding for HTML and must assume UTF-8 or UTF-16 as default encoding for XHTML.
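A sketch of that lookup order in PHP (the function name and regexes are mine; the regexes are deliberately simplistic and not a substitute for real HTML/XML parsing):

```php
// Applies the priority order above: 1. HTTP Content-Type charset,
// 2. XML declaration, 3. META declaration, then the HTML default.
function detect_charset($content_type_header, $body) {
    if (preg_match('@charset=([^\s";]+)@i', (string)$content_type_header, $m))
        return $m[1];                                  // 1. HTTP header
    if (preg_match('@<\?xml[^>]+encoding="([^\s"]+)@i', $body, $m))
        return $m[1];                                  // 2. XML declaration
    if (preg_match('@<meta\s+http-equiv="Content-Type"[^>]*charset=([^\s">]+)@i', $body, $m))
        return $m[1];                                  // 3. META declaration
    return 'ISO-8859-1';                               // HTML default
}
```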

Gumbo
Wonderful. Is there a library for this protocol? I would like to do the curl and character conversion together and have UTF8 just returned
Paul Tarjan
@Paul Tarjan: You can get the *Content-Type* header field with `curl_getinfo`.
Gumbo
I put your advice in a function, how does it look?
Paul Tarjan
A: 

Going by Gumbo and Pekka's advice, I wrote curl_exec_utf8

/** The same as curl_exec except tries its best to convert the output to utf8 **/
function curl_exec_utf8($ch) {
    $data = curl_exec($ch);
    if (!is_string($data)) return $data;

    unset($charset);
    $content_type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

    /* 1: HTTP Content-Type: header */
    preg_match( '@([\w/+]+)(;\s*charset=(\S+))?@i', $content_type, $matches );
    if ( isset( $matches[3] ) )
        $charset = $matches[3];

    /* 2: <meta> element in the page */
    if (!isset($charset)) {
        preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s*charset=([^\s"]+))?@i', $data, $matches );
        if ( isset( $matches[3] ) )
            $charset = $matches[3];
    }

    /* 3: <?xml?> declaration in the page */
    if (!isset($charset)) {
        /* [^>]+ keeps the match inside the declaration; a greedy .+ with
           the /s flag could run past it to a later encoding="..." */
        preg_match( '@<\?xml[^>]+encoding="([^\s"]+)@i', $data, $matches );
        if ( isset( $matches[1] ) )
            $charset = $matches[1];
    }

    /* 4: PHP's heuristic detection */
    if (!isset($charset)) {
        $encoding = mb_detect_encoding($data);
        if ($encoding)
            $charset = $encoding;
    }

    /* 5: Default for HTML */
    if (!isset($charset)) {
        /* strpos, not strstr: strstr returns a string or false, never 0 */
        if (strpos($content_type, "text/html") === 0)
            $charset = "ISO-8859-1"; /* hyphenated name, as iconv expects */
    }

    /* Convert it if it is anything but UTF-8 */
    /* You can change "UTF-8"  to "UTF-8//IGNORE" to 
       ignore conversion errors and still output something reasonable */
    if (isset($charset) && strtoupper($charset) != "UTF-8")
        $data = iconv($charset, 'UTF-8', $data);

    return $data;
}

The regexes are mostly from http://nadeausoftware.com/articles/2007/06/php_tip_how_get_web_page_content_type

Paul Tarjan
Ooohh sweet! I'm going to test drive this when I find the time.
Pekka