Hi,

How can I get a web page's charset encoding from its HTML as a string, rather than through the DOM?

I get the HTML string like this:

$html = file_get_contents($url);
preg_match_all (string pattern, string subject, array matches, int flags)

but I don't know regex, and I need to find out the web page's charset (UTF-8, windows-1255, etc.). Thanks,

A: 

you could use

mb_detect_encoding($html);

but it is generally a bad idea. It is better to use curl and look at the Content-Type header.
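A minimal sketch of the curl approach, assuming the server actually sends a charset parameter in its Content-Type header. The function names `charset_from_content_type` and `fetch_with_charset` are hypothetical helpers made up for this example, not part of PHP or curl.

```php
<?php
// Hypothetical helper: extract the charset parameter from a Content-Type value.
// Returns the charset in lower case, or null when none is declared.
function charset_from_content_type(?string $contentType): ?string
{
    if ($contentType === null) {
        return null;
    }
    // Matches e.g. `text/html; charset=UTF-8` or `text/html; charset="utf-8"`.
    if (preg_match('~;\s*charset\s*=\s*"?([^;"\s]+)"?~i', $contentType, $m)) {
        return strtolower($m[1]);
    }
    return null;
}

// Hypothetical helper: fetch a URL with curl and return the body together
// with the charset declared in the response's Content-Type header.
function fetch_with_charset(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $body = curl_exec($ch);
    $contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

    return [
        'body'    => $body === false ? null : $body,
        'charset' => charset_from_content_type(is_string($contentType) ? $contentType : null),
    ];
}
```

Calling `fetch_with_charset($url)` gives both the HTML and the declared charset in one request, so the body is still available for further processing.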

mvds
I know that mb_detect_encoding($html) does not work well.
Yosef
Then maybe *"use curl instead and look at the Content-Type header"*
mvds
+1  A: 

First, check the Content-Type header.

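// Error handling is omitted for brevity; fopen() returns false on failure.
$charset = null;
$f = fopen($url, "r");
$md = stream_get_meta_data($f);
$wd = $md["wrapper_data"]; // for http:// URLs, the raw response header lines
foreach ($wd as $response) {
    if (preg_match('~^content-type: .+?/.+?;\\s?charset=([^;"\\s]+|"[^;"]+")~i',
            $response, $matches)) {
        $charset = trim($matches[1], '"'); // drop optional surrounding quotes
        break;
    }
}
$data = stream_get_contents($f);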

You can then fall back on the meta element. That's been answered before here.
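A rough sketch of such a meta fallback working on the HTML string, without the DOM. The function name `charset_from_meta` is hypothetical, and the regex is a simplification that assumes reasonably well-formed markup; it strips HTML comments first so a commented-out meta tag is not picked up.

```php
<?php
// Hypothetical helper: find the charset declared in a <meta> tag of an HTML string.
// Handles both <meta charset="..."> and
// <meta http-equiv="Content-Type" content="text/html; charset=...">.
function charset_from_meta(string $html): ?string
{
    // Remove HTML comments so commented-out meta tags are ignored.
    $clean = preg_replace('~<!--.*?-->~s', '', $html) ?? $html;

    // Only look for "charset=" inside a <meta ...> tag, not anywhere in the page text.
    if (preg_match('~<meta\b[^>]*?\bcharset\s*=\s*["\']?\s*([-a-z0-9_:.]+)~i', $clean, $m)) {
        return strtolower($m[1]);
    }
    return null;
}
```

A charset found in the HTTP header should still take priority; this function is only meant for the case where the header did not declare one.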

More complex version of header parsing to please the audience:

if (preg_match('~^content-type: .+?/[^;]+?(.*)~i', $response, $matches)) {
    if (preg_match_all('~;\\s?(?P<key>[^()<>@,;:\"/[\\]?={}\\s]+)'.
            '=(?P<value>[^;"\\s]+|"[^;"]+")\\s*~i', $matches[1], $m)) {
        for ($i = 0; $i < count($m['key']); $i++) {
            if (strtolower($m['key'][$i]) == "charset") {
                $charset = trim($m['value'][$i], '"');
            }
        }
    }
}
Artefacto
what happened to pattern delimiters and case sensitivity?
mvds
regex has no delims and that greedy capture is gonna give a lot more than you want back
Crayon Violent
Why don't you use file_get_contents instead of fopen? I need the HTML for other tasks afterwards.
Yosef
@Crayon I forgot the delimiters, but I had non-greedy quantifiers there all the time.
Artefacto
@Yosef Because I needed to get the headers for the request. `file_get_contents` returns the body as a string immediately, so you need a different approach to fetch the headers.
Artefacto
really? well what do you call (.*) then?
Crayon Violent
@Crayon: greedy but it will not eat a newline.
mvds
@Crayon That's greedy, but it's the last thing in the expression; it doesn't make any difference.
Artefacto
but that's assuming that's the last thing on the line...
Crayon Violent
@Crayon It will be, unless the server is violating the HTTP protocol.
Artefacto
@Crayon I think you're mistaking HTTP headers for HTML data.
Artefacto
@Artefacto: can you point me to the RFC stating that there can be only one parameter in the Content-Type header? I can only find `media-type = type "/" subtype *( ";" parameter )` in section 3.7 of RFC 2616, where `*` denotes repetition. So in theory `(.*)` might break one day.
mvds
@mvds Damn you and your references :p All right, I'll fix it.
Artefacto
@Artefacto: since you're getting the points here.. is the charset always the first parameter? ;-)
mvds
@mvds Ah, I've already hit the cap a few hours ago. I'll fix it, though.
Artefacto
@mvds I hope you're happy now. If not, edit it yourself :p
Artefacto
@Artefacto I'm impressed!
mvds
+1  A: 

preg_match('~charset=([-a-z0-9_]+)~i',$html,$charset);

Crayon Violent
this seems to suppose that `$html` contains the http header, which it does not.
mvds
Please no. What if I happen to be parsing a page that explains how to define the encoding of a page?...
Artefacto
...then you find out what it will be encoded as anyways?
Crayon Violent
That's assuming it happens before. `meta` can come after the `title` tag, an old `meta` tag may be commented out, etc etc. This is also not a good solution because the HTTP headers have priority.
Artefacto
I will concede on commented-out tags, but overall, he asked for a regex given his current code, which uses file_get_contents() to get the HTML. That is what I gave him.
Crayon Violent
Thanks, this is exactly what I need. I checked and your regex works great!
Yosef