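$dom = new DOMDocument();
// Must be set before loading; it has no effect once the markup is parsed.
$dom->preserveWhiteSpace = false;
$dom->loadHTML($string);
$elements = $dom->getElementsByTagName('span');
// Copy the live node list first, so removing nodes does not disturb iteration.
$spans = array();
foreach ($elements as $span) {
    $spans[] = $span;
}
foreach ($spans as $span) {
    $span->parentNode->removeChild($span);
}
return $dom->saveHTML();
//return $string;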

When I use this code to parse a string, the encoding changes and the symbols are no longer displayed the way they are when return $string is uncommented instead. Why does this happen, and how can I avoid the charset change?

Ile

+2  A: 

Try to set the encoding in the constructor or with DOMDocument->encoding:

$dom = new DOMDocument('1.0', '…');
// or
$dom = new DOMDocument();
$dom->encoding = '…';
Gumbo
A: 

Setting $dom->encoding = "utf-8"; doesn't work...

ile
Please use the comment function to respond to an answer.
Gumbo
I didn't know how to put a "code block" in a comment, so I used an answer. But thanks for the tip!
ile
+1  A: 

Unfortunately, it seems that DOMDocument will automatically convert all non-ASCII characters to HTML entities unless it knows the encoding of the original document.

Apparently, one option is to add a <meta> tag with the content type/encoding to the original string, but this means that it will be present in the output as well. Removing it might not be so easy.
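A minimal sketch of the <meta> approach, for illustration only. The function name removeSpansKeepUtf8 is hypothetical, and the sketch assumes the input is a UTF-8 fragment rather than a full page. Instead of stripping the tag afterwards, it serializes only the children of <body>, which should leave the <meta> (placed in <head> by the parser) out of the result. Passing a node to saveHTML() needs PHP 5.3.6 or later.

```php
<?php
// Hypothetical helper (not from the thread); assumes $html is a UTF-8 fragment.
function removeSpansKeepUtf8(string $html): string
{
    $dom = new DOMDocument();
    // Declare the charset so the parser does not have to guess the encoding.
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
    // Collect parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($meta . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list before removing anything from the tree.
    foreach (iterator_to_array($dom->getElementsByTagName('span')) as $span) {
        $span->parentNode->removeChild($span);
    }

    // Serialize only the children of <body>, so the wrapper tags are skipped.
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}
```

Calling removeSpansKeepUtf8('<p>Žluťoučký <span>x</span>kůň</p>') should return the paragraph without the span. Note that bare text at the top level of the fragment may come back wrapped in a <p> by the parser.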

Another option I can think of is manually decoding the HTML entities, using code like this:

$trans = array_flip(get_html_translation_table(HTML_ENTITIES));
unset($trans["&quot;"], $trans["&lt;"], $trans["&gt;"], $trans["&amp;"]);
echo strtr($dom->saveHTML(), $trans);

This is a seriously ugly solution, but I can't think of anything else, other than using a different HTML parser. :(
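To make the snippet above more concrete, here is a small sketch of what $trans ends up holding and what strtr() does with it. The sample string is made up for illustration, and UTF-8 is passed explicitly on the assumption that this is the target encoding.

```php
<?php
// Entity => character map, e.g. '&eacute;' => 'é'. The table itself maps
// characters to entities, so array_flip() turns it around.
$trans = array_flip(get_html_translation_table(HTML_ENTITIES, ENT_COMPAT | ENT_HTML401, 'UTF-8'));
// Leave the markup-significant entities escaped so the HTML stays valid.
unset($trans['&quot;'], $trans['&lt;'], $trans['&gt;'], $trans['&amp;']);

// Hypothetical sample standing in for the output of $dom->saveHTML().
$sample = '<p>caf&eacute; &amp; cr&egrave;me &lt;b&gt;</p>';
$decoded = strtr($sample, $trans);

echo $decoded; // <p>café &amp; crème &lt;b&gt;</p>
```

One caveat: this table only covers named entities, so numeric references such as &#269; in the output would be left untouched.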

Lukáš Lalinský
Definitely, I must store the data in the database in UTF-8 encoding. That's the only situation in which DOMDocument works. Btw, I'm not quite sure how to use this solution of yours. What does the $trans variable actually contain? Thanks, Ile
ile
A: 

There is also one interesting thing I noticed today. I don't understand why it happens, but the behavior is very strange. The code from the top is wrapped in a function. After the function processes the string, the returned string sometimes has <doctype...> <html><body>STRING</body></html> wrapped around it, in cases I can't explain: when data loaded from the database is passed directly to the function, the extra tags are not added, but when the data is first stored in a variable and the function is called somewhere further down, the extra tags are added. Another strange thing: in one case I called this function to process a string and, a few lines below, added a call to trim, and I got an error from the DOM function. When I deleted the trim call (which came AFTER the DOM function call), the error disappeared. Is there any reasonable explanation?
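On the extra tags: as far as I know, loadHTML() always builds a complete document, so saveHTML() on the whole document returns the doctype and the <html>/<body> wrappers. A hedged sketch of one way around this uses the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD options (PHP 5.4 or later with a recent enough libxml). The function name removeSpansFromFragment and the wrapping <div> are assumptions added here so that the fragment has a single root element. The sketch does not address the encoding problem.

```php
<?php
// Hypothetical helper; the wrapper <div> is an assumption that gives the
// fragment a single root element, which NOIMPLIED handles more predictably.
function removeSpansFromFragment(string $html): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // NOIMPLIED: do not add <html>/<body>; NODEFDTD: do not add a default doctype.
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list before removing anything from the tree.
    foreach (iterator_to_array($dom->getElementsByTagName('span')) as $span) {
        $span->parentNode->removeChild($span);
    }
    return $dom->saveHTML();
}
```

The returned string still contains the wrapper <div>, which the caller would have to strip if it is not wanted.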

ile