Hello,

I would like to determine a remote page's encoding by detecting the Content-Type meta tag

<meta http-equiv="Content-Type" content="text/html; charset=XXXXX" />

if present.

I retrieve the remote page and use a regex to look for this setting. I am still learning, hence the problem below. Here is what I have:

    $EncStart = 'charset=';
    $EncEnd = '" \/\>';
    preg_match( "/$EncStart(.*)$EncEnd/s", $RemoteContent, $RemoteEncoding );
    echo $RemoteEncoding[ 1 ];

The above does echo the name of the encoding, but the match does not stop there: it prints the rest of the line and then most of the rest of the remote page. For example, when testing a remote Russian page it printed:

windows-1251" />
rest of page ....

This means that $EncStart was okay, but the $EncEnd part of the regex failed to stop the matching. This meta tag usually ends in one of three ways after the name of the encoding:

"> | "/> | " />

I do not know whether these can be used to mark the end of the match and, if so, how to escape them. I tried several ways of doing it, but none worked.

Thank you in advance for lending a hand.

A: 

Add a question mark to your pattern to make it non-greedy (there's also no need for the 's' modifier):

    preg_match( "/charset=\"(.+?)\"/", $RemoteContent, $RemoteEncoding );
    echo $RemoteEncoding[ 1 ];

Note that this won't handle charset = "..." or charset='...' and many other combinations.
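As a rough sketch of how those other combinations could be covered, the pattern below tolerates whitespace around the `=` and an optional single or double quote before the name. The function name `find_meta_charset` and the set of characters allowed in a charset name are assumptions made for illustration, and the pattern will match the first `charset=` anywhere in the input, not only inside a meta tag.

```php
<?php
// Hypothetical helper (not from the answer above): extract a charset name,
// tolerating spaces around "=" and an optional opening quote.
function find_meta_charset(string $html): ?string
{
    // "charset", optional whitespace, "=", optional whitespace, optional
    // quote, then the name: letters, digits, ".", "_", ":" or "-".
    $pattern = '/charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i';
    if (preg_match($pattern, $html, $m)) {
        return $m[1];
    }
    return null; // no charset declaration found
}

echo find_meta_charset('<meta http-equiv="Content-Type" content="text/html; charset=windows-1251" />'), "\n";
echo find_meta_charset("<meta charset = 'UTF-8'>"), "\n";
```

Because the capture group only accepts characters that can appear in the name, it stops at the closing quote by itself and needs neither a lazy quantifier nor an explicit end marker.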

stereofrog
That's what I needed. The only issue with your regex is that it expects a ["] after the [=], where there is none. After taking that out along with its backslash, it worked as required on a few examples. I'll keep your note in mind as I look at the other suggestions as well. Thank you.
Yallaa
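To make the exchange above concrete, here is a small sketch comparing a greedy capture with the lazy one once the extra quote after `=` is removed. The sample `$RemoteContent` string is made up for illustration and only mirrors the tag from the question.

```php
<?php
// Sample input assumed for illustration; it mirrors the tag in the question.
$RemoteContent = '<meta http-equiv="Content-Type" content="text/html; charset=windows-1251" />'
    . "\n<title>rest of page</title> <a href=\"x\">link</a>";

// Greedy with the "s" modifier: (.*) runs on to the LAST double quote,
// so the capture swallows the end of the tag and part of the page.
preg_match('/charset=(.*)"/s', $RemoteContent, $greedy);

// Lazy, without the quote after "=": (.+?) stops at the FIRST double quote.
preg_match('/charset=(.+?)"/', $RemoteContent, $lazy);

echo $lazy[1], "\n"; // windows-1251
echo strlen($greedy[1]) > strlen($lazy[1]) ? "greedy captured more\n" : "same length\n";
```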
A: 

Take a look at Simple HTML DOM Parser. With it, you can easily find the charset in the head without resorting to cumbersome regexes. But as David already commented, you should also examine the HTTP response headers for the same information and prioritize them if found.

Tested example:

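    require_once 'simple_html_dom.php';

    // Fetching over HTTP also populates $http_response_header in this scope.
    $source = file_get_contents('http://www.google.com');
    $dom = str_get_html($source);

    // Charset declared in the document's <meta http-equiv> tag, if any.
    $src_charset = null;
    $meta = $dom ? $dom->find('meta[http-equiv=content-type]', 0) : null;
    if ($meta && ($pos = stripos($meta->content, 'charset=')) !== false) {
        $src_charset = trim(substr($meta->content, $pos + 8));
    }

    // Charset declared in the HTTP Content-Type response header, if any.
    $hdr_charset = null;
    foreach ($http_response_header as $header) {
        $parts = explode(':', $header, 2);
        if (count($parts) === 2 && strtolower(trim($parts[0])) === 'content-type') {
            if (($pos = stripos($parts[1], 'charset=')) !== false) {
                $hdr_charset = trim(substr($parts[1], $pos + 8));
            }
            break;
        }
    }

    var_dump(
        $hdr_charset,
        $src_charset
    );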
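The advice to prioritize the header could be sketched roughly as follows. Both function names are hypothetical (they are not part of Simple HTML DOM Parser), and the `UTF-8` fallback is an assumption; pick whatever default suits your data.

```php
<?php
// Hypothetical helper: pull the charset out of a Content-Type value such as
// "text/html; charset=UTF-8". Returns null when no charset is declared.
function charset_from_content_type(?string $value): ?string
{
    if ($value !== null && preg_match('/charset\s*=\s*["\']?([^\s"\';]+)/i', $value, $m)) {
        return $m[1];
    }
    return null;
}

// Prefer the HTTP header, fall back to the meta tag, then to a default.
// The 'UTF-8' default is an assumption, not something the page tells us.
function pick_charset(?string $headerValue, ?string $metaContent, string $default = 'UTF-8'): string
{
    return charset_from_content_type($headerValue)
        ?? charset_from_content_type($metaContent)
        ?? $default;
}

echo pick_charset('text/html; charset=ISO-8859-1', 'text/html; charset=windows-1251'), "\n"; // ISO-8859-1
echo pick_charset('text/html', 'text/html; charset=windows-1251'), "\n"; // windows-1251
```

Here the first argument would be the value of the Content-Type response header and the second the `content` attribute of the meta tag, either of which may be null when missing.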
nikc
Also, I downloaded Simple HTML DOM Parser and am looking into that as well. Thank you for the suggestion.
Yallaa