ansaurus

Question

PHP HTML parsing: how can I get the charset value of a web page with Simple HTML DOM Parser?

Answer 1

+1 A:

You'll have to match the string using a regular expression (I hope you have PCRE...).

$el = $html->find('meta[http-equiv=Content-Type]', 0);
$fullvalue = $el->content;
preg_match('/charset=(.+)/', $fullvalue, $matches);
echo $matches[1];

Not very robust, but should work.
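If you want something a little more defensive, here is a minimal sketch of the same idea. The helper name `extract_charset` is made up for illustration and is not part of Simple HTML DOM; it only assumes you already have the `content` string from the meta element. The pattern stops at whitespace, quotes, or a semicolon, so trailing parameters are not captured, and it ignores case.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): pull the charset
// token out of a Content-Type value such as "text/html; charset=UTF-8".
function extract_charset(string $contentType): ?string
{
    // Match case-insensitively and stop at whitespace, quotes, or ';'.
    if (preg_match('/charset\s*=\s*["\']?([^\s"\';]+)/i', $contentType, $matches)) {
        return $matches[1];
    }
    return null; // no charset parameter present
}

echo extract_charset('text/html; charset=ISO-8859-1'), "\n";
```

Returning `null` when nothing matches lets the caller choose a fallback instead of reading an undefined `$matches[1]`.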

MvanGeest 2010-07-28 18:29:35

Thanks! I tweaked it a bit and it works; see the fix in my answer.

$html = file_get_html('http://www.google.com/');
$el=$html->find('meta[content]',0);
$fullvalue = $el->content;
preg_match('/charset=(.+)/', $fullvalue, $matches);
echo substr($matches[0], strlen("charset="));

Yosef 2010-07-28 18:49:29

**Don't do that**, I made a mistake. It should be `$matches[1]`. That makes it a lot faster and more reliable.

MvanGeest 2010-07-28 18:52:22

Answer 2

+1 A:

$dd = new DOMDocument;
$dd->loadHTML($data);
foreach ($dd->getElementsByTagName("meta") as $m) {
    if (strtolower($m->getAttribute("http-equiv")) == "content-type") {
        $v = $m->getAttribute("content");
        // Use a separate variable so the loop variable $m is not overwritten.
        if (preg_match("#.+?/.+?;\\s?charset\\s?=\\s?(.+)#i", $v, $matches))
            echo $matches[1];
    }
}

Note that the DOM extension implicitly converts all the data to UTF-8.
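To illustrate that conversion, here is a small sketch under stated assumptions: the sample markup and the variable names are invented for the example, and it relies on the DOM extension being installed. The input contains the single ISO-8859-1 byte `\xE9` ("é"), and the meta tag declares that encoding.

```php
<?php
// Sample input (made up for illustration): "\xE9" is "é" in ISO-8859-1.
$data = '<html><head><meta http-equiv="Content-Type" '
      . 'content="text/html; charset=ISO-8859-1"></head>'
      . "<body><p>caf\xE9</p></body></html>";

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dd = new DOMDocument;
$dd->loadHTML($data);

// Text read back from the DOM should now be UTF-8 encoded.
$text = $dd->getElementsByTagName('p')->item(0)->textContent;
echo bin2hex($text), "\n"; // hex dump makes the byte-level encoding visible
```

If the page declares no charset, or declares the wrong one, the parser has to guess, so it is worth checking the result on real pages.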

Artefacto 2010-07-28 18:30:15

Now that's a bit more robust than what I wrote... :)

MvanGeest 2010-07-28 18:31:12

Thanks for this option; it's very important to have UTF-8 data.

Yosef 2010-07-28 18:34:14

@Mva yeah, Content-Type is sometimes written "Content-type". At least in the http headers, case doesn't matter.

Artefacto 2010-07-28 18:35:13

DOMDocument does not always convert the text to UTF-8 properly. I'm still working on this problem.

Yosef 2010-07-30 13:48:39

Answer 3

A:

Thanks to MvanGeest for the answer. I just tweaked it a bit and it works perfectly.

$html = file_get_html('http://www.google.com/');
$el=$html->find('meta[content]',0);
$fullvalue = $el->content;
preg_match('/charset=(.+)/', $fullvalue, $matches);
echo substr($matches[0], strlen("charset="));

Yosef 2010-07-28 18:48:46

**Don't do that** - see the correction of my answer.

MvanGeest 2010-07-28 18:53:01

Your fix is not working.

Yosef 2010-07-28 19:49:01

Weird... it's working for me. You don't need the `substr` though... just `$matches[1]`. I tested it using Google.

MvanGeest 2010-07-28 22:01:39
