I'm trying to parse HTML pages and find the paragraphs (<p>) using getElementsByTagName('p').

The problem is that $element->nodeValue returns garbled characters. The document is first fetched into $html with curl and then loaded into a DOMDocument.

I'm sure it has to do with charsets.

Here's an example of a response: "aujourd’hui" (it should read "aujourd’hui").

Thanks in advance.

+1  A: 

This is an encoding issue. Try explicitly setting the encoding to UTF-8.

This should help: http://devzone.zend.com/article/8855
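
One common way to set the encoding explicitly is to prepend an XML encoding declaration to the markup before calling loadHTML(). The following is only a minimal sketch, not necessarily what the linked article describes: the sample $html string is a hypothetical stand-in for the curl response, and it assumes the fetched bytes really are UTF-8.

```php
<?php
// Hypothetical sample standing in for the curl response:
// UTF-8 bytes with no charset declared anywhere in the markup.
$html = "<html><body><p>aujourd\u{2019}hui</p></body></html>";

// Without a hint, the parser typically falls back to a single-byte
// encoding and splits the multi-byte apostrophe into several characters.
$naive = new DOMDocument();
@$naive->loadHTML($html);
$bad = $naive->getElementsByTagName('p')->item(0)->nodeValue;

// Prepending an XML encoding declaration tells the parser
// to read the input as UTF-8.
$doc = new DOMDocument();
@$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
$good = $doc->getElementsByTagName('p')->item(0)->nodeValue;

echo $good, PHP_EOL;
```

If the page turns out to be in another encoding (for example ISO-8859-1), the declaration would need to name that encoding instead, so it is worth checking the response headers or the meta tag first.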

prodigitalson
I already tried that and it didn't work. The funny thing is that if I call $doc->saveHTML(), the encoding of the returned HTML is completely correct.
Elie
What's the `<meta http-equiv="Content-type" ... />` specified in the HTML?
prodigitalson