ansaurus

Question

How can I get a web page's charset encoding from its HTML as a string, rather than through the DOM?

Answer 1

A:

You could use

mb_detect_encoding($html);

but that is generally a bad idea, since it only guesses from the bytes. It is better to fetch the page with curl and look at the Content-Type header.
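A minimal sketch of the curl approach, in case it helps. The function names `charset_from_content_type` and `fetch_with_charset` are made up for illustration; the curl calls themselves (`curl_init`, `curl_setopt`, `curl_exec`, `curl_getinfo` with `CURLINFO_CONTENT_TYPE`) are standard. It assumes the curl extension is loaded and that no parameter value contains a quoted `;`.

```php
<?php
// Hypothetical helper: pull the charset parameter out of a Content-Type value
// such as 'text/html; charset="UTF-8"'. Returns null when none is present.
function charset_from_content_type(?string $contentType): ?string
{
    if ($contentType === null) {
        return null;
    }
    // Parameters follow the media type, separated by ';'.
    $parts = explode(';', $contentType);
    array_shift($parts); // drop "type/subtype"
    foreach ($parts as $part) {
        $pair = explode('=', $part, 2);
        if (count($pair) === 2 && strtolower(trim($pair[0])) === 'charset') {
            return trim(trim($pair[1]), "\"'");
        }
    }
    return null;
}

// Hypothetical helper: fetch $url and return [body, charset-or-null].
function fetch_with_charset(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // report the final response
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }
    // Content-Type of the last response; empty if the server sent none.
    $contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    return [$body, charset_from_content_type($contentType ?: null)];
}
```

Something like `[$html, $charset] = fetch_with_charset($url);` would then give both the page and the declared charset, with `$charset` left as null when the header does not name one.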

mvds 2010-07-31 21:24:54

I know that mb_detect_encoding($html) does not work well.

Yosef 2010-07-31 21:32:12

Then maybe *"use curl instead and look at the Content-Type header"*

mvds 2010-07-31 21:36:35

Answer 2

+1 A:

The first thing to check is the Content-Type header.

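// Add error handling as needed (fopen() returns false on failure).
$f = fopen($url, "r");
$md = stream_get_meta_data($f);
$wd = $md["wrapper_data"]; // raw response header lines from the http wrapper
$charset = null;
foreach ($wd as $response) {
    // '~' is the delimiter because the pattern itself contains a '/'.
    if (preg_match('~^content-type: .+?/.+?;\\s?charset=([^;"\\s]+|"[^;"]+")~i',
             $response, $matches)) {
         $charset = trim($matches[1], '"'); // drop quotes around a quoted value
         break;
    }
}
$data = stream_get_contents($f);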

You can then fall back on the meta element. That's been answered before here.
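As a rough illustration of that fallback, here is a sketch that works on the HTML string directly, with no DOM. The function name `charset_from_meta` is hypothetical, and the regex is deliberately simple: it assumes reasonably well-formed `<meta>` tags and could be fooled by a `content` attribute that merely mentions `charset=`.

```php
<?php
// Hypothetical helper: look for a charset declared in a <meta> tag of $html.
function charset_from_meta(string $html): ?string
{
    // Ignore commented-out markup so a disabled <meta> tag is not picked up.
    $html = preg_replace('~<!--.*?-->~s', '', $html) ?? $html;

    // Collect every <meta ...> tag in document order.
    if (!preg_match_all('~<meta\b[^>]*>~i', $html, $tags)) {
        return null;
    }
    foreach ($tags[0] as $tag) {
        // Matches both <meta charset="utf-8"> and
        // <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
        if (preg_match('~charset\s*=\s*["\']?\s*([-\w.:]+)~i', $tag, $m)) {
            return $m[1];
        }
    }
    return null;
}
```

A caller would typically use the header value when it is present and only then try `charset_from_meta($data)`, since the HTTP header takes priority.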

More complex version of header parsing to please the audience:

if (preg_match('~^content-type: .+?/[^;]+?(.*)~i', $response, $matches)) {
    if (preg_match_all('~;\\s?(?P<key>[^()<>@,;:\"/[\\]?={}\\s]+)'.
            '=(?P<value>[^;"\\s]+|"[^;"]+")\\s*~i', $matches[1], $m)) {
        for ($i = 0; $i < count($m['key']); $i++) {
            if (strtolower($m['key'][$i]) == "charset") {
                $charset = trim($m['value'][$i], '"');
            }
        }
    }
}

Artefacto 2010-07-31 21:29:51

what happened to pattern delimiters and case sensitivity?

mvds 2010-07-31 21:33:01

regex has no delims and that greedy capture is gonna give a lot more than you want back

Crayon Violent 2010-07-31 21:33:56

Why don't you use file_get_contents instead of fopen? I need the HTML for other tasks afterwards.

Yosef 2010-07-31 21:34:03

@Crayon I forgot the delimiters, but I had non-greedy quantifiers there all the time.

Artefacto 2010-07-31 21:34:49

@Yosef Because I needed to get the headers for the request. `file_get_contents` returns a string immediately, so you would have to change approach to fetch them.

Artefacto 2010-07-31 21:35:30
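For readers who would rather keep `file_get_contents`: with the http wrapper, PHP also fills the local variable `$http_response_header` with the raw response header lines after the call, so the headers remain reachable. The sketch below assumes that behaviour and `allow_url_fopen`; the helper name `header_value` is made up for illustration.

```php
<?php
// Hypothetical helper: return the value of the last header called $name in a
// list of raw header lines (the shape of $http_response_header), or null.
function header_value(array $headerLines, string $name): ?string
{
    $value = null;
    foreach ($headerLines as $line) {
        $pair = explode(':', $line, 2);
        if (count($pair) === 2 && strcasecmp(trim($pair[0]), $name) === 0) {
            $value = trim($pair[1]); // later responses (after redirects) win
        }
    }
    return $value;
}

// Usage sketch (needs network access and allow_url_fopen):
// $html = file_get_contents($url);
// $contentType = header_value($http_response_header, 'Content-Type');
```

The charset parameter can then be pulled out of `$contentType` with either of the regexes from the answer.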

really? well what do you call (.*) then?

Crayon Violent 2010-07-31 21:35:58

@Crayon: greedy but it will not eat a newline.

mvds 2010-07-31 21:37:44

@Crayon That's greedy, but it's the last thing in the expression; it doesn't make any difference.

Artefacto 2010-07-31 21:37:47

but that's assuming that's the last thing on the line...

Crayon Violent 2010-07-31 21:39:11

@Crayon It will be, unless the server is violating the HTTP protocol.

Artefacto 2010-07-31 21:41:24

@Crayon I think you're mistaking HTTP headers for HTML data.

Artefacto 2010-07-31 21:44:42

@Artefacto: can you point me to the RFC stating that there can be only one parameter in the Content-Type header? I can only find `media-type = type "/" subtype *( ";" parameter )` in section 3.7 of RFC 2616, where `*` denotes repetition. So in theory `(.*)` might break one day.

mvds 2010-07-31 21:48:36

@mvds Damn you and your references :p All right, I'll fix it.

Artefacto 2010-07-31 21:51:49

@Artefacto: since you're getting the points here.. is the charset always the first parameter? ;-)

mvds 2010-07-31 22:12:59

@mvds Ah, I've already hit the cap a few hours ago. I'll fix it, though.

Artefacto 2010-07-31 22:23:38

@mvds I hope you're happy now. If not, edit it yourself :p

Artefacto 2010-07-31 22:52:03

@Artefacto I'm impressed!

mvds 2010-07-31 23:03:17

Answer 3

+1 A:

preg_match('~charset=([-a-z0-9_]+)~i',$html,$charset);

Crayon Violent 2010-07-31 21:31:17

this seems to suppose that `$html` contains the http header, which it does not.

mvds 2010-07-31 21:38:52

Please no. What if I happen to be parsing a page that explains how to define the encoding of a page?...

Artefacto 2010-07-31 21:40:34

...then you find out what it will be encoded as anyways?

Crayon Violent 2010-07-31 21:41:34

That's assuming it happens before. `meta` can come after the `title` tag, an old `meta` tag may be commented out, etc etc. This is also not a good solution because the HTTP headers have priority.

Artefacto 2010-07-31 21:44:08

I will concede the point about commented-out tags, but overall, he asked for a regex given his current code, which uses file_get_contents() to get the HTML. That is what I gave him.

Crayon Violent 2010-07-31 21:46:21

Thanks, this is exactly what I need. I checked your regex and it works great!

Yosef 2010-07-31 22:08:35
