<?php
    $string = '
    Some photos<br>
    <span class="naslov_slike">photo_by_ile_IMG_1676-01</span><br />
    <span class="naslov_slike">photo_by_ile_IMG_1699-01</span><br />
    <span class="naslov_slike">photo_by_ile_IMG_1697-01</span><br />
    <span class="naslov_slike">photo_by_ile_IMG_1695-01</span><br /> 
    ';

    $dom = new DOMDocument();
    // preserveWhiteSpace only has an effect if it is set before loading
    $dom->preserveWhiteSpace = false;
    $dom->loadHTML($string);
    $elements = $dom->getElementsByTagName('span');
    $spans = array();
    foreach($elements as $span) {
     $spans[] = $span;
    }
    foreach($spans as $span) {
     $span->parentNode->removeChild($span);
    }
    echo $dom->saveHTML();


?>

I'm using this code to parse strings. When the string is returned, it has some extra tags added:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Some photos<br><br><br><br><br></p></body></html>

Is there any way to avoid this and get a clean string back? The input string here is just an example; in practice it can be any HTML string.

Ile

A: 

I'm not sure whether either of these will actually work, but you could try using DOMImplementation::createDocument when constructing your DOMDocument. Its third argument is the DOCTYPE you wish to use.

Also, instead of saveHTML(), you could try saveXML().
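
For what it's worth, saveXML() also accepts an optional node argument, and when a node is passed only that node is serialized, without the XML declaration or DOCTYPE. A minimal sketch, assuming a shortened version of the string from the question; the variable name $out is just for illustration. Note that the output is XML-style (so <br> comes back as <br/>), and the <p> that loadHTML() adds around the leading text is still there:

```php
<?php
$string = 'Some photos<br><span class="naslov_slike">photo_by_ile_IMG_1676-01</span><br />';

$dom = new DOMDocument();
$dom->loadHTML($string);

// Serialize only the children of <body>, one node at a time,
// instead of the whole document.
$body = $dom->getElementsByTagName('body')->item(0);
$out = '';
foreach ($body->childNodes as $child) {
    $out .= $dom->saveXML($child);
}

echo $out;
```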

nickf
If you're not sure they'll work, don't post them.
Nick Stinemates
It doesn't work. saveXML() adds an XML doctype.
ile
geez, sorry for trying to *help*, but I don't have the time to set up tests and recreate the situations described by every question I try to answer. If it's so important to you, why not test out these methods *your*self before downvoting?
nickf
hey kids, it's time for sleep :)))
ile
A: 

You could always just use a regex to strip that first bit out:

echo preg_replace("/<!DOCTYPE [^>]+>/", "", $dom->saveHTML());
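
The same idea could probably be extended to the <html> and <body> wrappers, since saveHTML() emits them in a predictable shape. A rough sketch, assuming the fragment itself never contains <html> or <body> tags that need to survive; the helper name stripWrappers is made up for illustration. The <p> that the parser wraps around the leading text is not touched here:

```php
<?php
// Hypothetical helper: strip the DOCTYPE and the <html>/<body> wrappers
// that saveHTML() adds around a fragment.
function stripWrappers($html) {
    // Drop the DOCTYPE declaration.
    $html = preg_replace('/<!DOCTYPE[^>]*>/i', '', $html);
    // Drop opening and closing <html> and <body> tags, with or without attributes.
    $html = preg_replace('~</?(?:html|body)\b[^>]*>~i', '', $html);
    return trim($html);
}

$dom = new DOMDocument();
$dom->loadHTML('Some photos<br><span class="naslov_slike">photo_by_ile_IMG_1676-01</span><br />');

echo stripWrappers($dom->saveHTML());
```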
nickf
That would probably solve the problem, but then there's no point in using DOM parsing. I used it in the first place to avoid regular expressions, although I think I'll be forced to use one. Thanks for your answer.
ile
well, generally you want to avoid using regexes for xml/html because the markup syntax is rather lax and writing a regex which would take it all into consideration is difficult. In this case however, you have a very definite structure with known output, so it's very easy to work with. Don't just blindly follow the "HTML + REGEX == BAD!!" crowd if it doesn't make sense.
nickf
+1  A: 

I'm actually looking for the same solution. I've been using an innerHTML method to do this; however, the <p> around the text node will still be added when you do loadHTML. I don't think there's a way to get around that without using another parser, unless there's some hidden flag to tell it not to do that.

This code:

<?php

function innerHTML($node){
  $doc = new DOMDocument();
  foreach ($node->childNodes as $child)
    $doc->appendChild($doc->importNode($child, true));

  return $doc->saveHTML();
}

$string = '
    Some photos<br>
    <span class="naslov_slike">photo_by_ile_IMG_1676-01</span><br />
    <span class="naslov_slike">photo_by_ile_IMG_1699-01</span><br />
    <span class="naslov_slike">photo_by_ile_IMG_1697-01</span><br />
    <span class="naslov_slike">photo_by_ile_IMG_1695-01</span><br />
    ';

$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($string);
$elements = $dom->getElementsByTagName('span');

// Collect the spans first: the node list is live, so removing
// nodes while iterating over it would skip elements.
$spans = array();
foreach ($elements as $span) {
    $spans[] = $span;
}
foreach ($spans as $span) {
    $span->parentNode->removeChild($span);
}

// documentElement is <html>; its first child is <body>.
echo innerHTML($dom->documentElement->firstChild);

Will output:

<p>Some photos<br><br><br><br><br></p>

Of course, this solution does not keep the markup 100% intact, but it's close.
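
One way to sidestep the extra <p> might be to wrap the fragment in a container element before loading it, so the parser has no bare text to wrap, and then serialize only that container's children. A sketch, assuming PHP 5.3.6 or later (where saveHTML() accepts a node argument); the wrapper <div> and the function name stripSpans are illustrative and not part of the code above:

```php
<?php
// Hypothetical helper: remove all <span> elements from an HTML fragment
// and return the fragment without DOCTYPE, <html>, <body> or <p> wrappers.
function stripSpans($html) {
    $dom = new DOMDocument();
    // Wrap the fragment so loose text is not wrapped in <p> by the parser.
    $dom->loadHTML('<div>' . $html . '</div>');

    // The wrapper <div> is the first child of <body>.
    $wrapper = $dom->getElementsByTagName('body')->item(0)->firstChild;

    // Copy the live node list into an array before removing anything.
    $spans = iterator_to_array($wrapper->getElementsByTagName('span'));
    foreach ($spans as $span) {
        $span->parentNode->removeChild($span);
    }

    // Serialize the wrapper's children only, skipping the wrapper itself.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo stripSpans('Some photos<br><span class="naslov_slike">photo_by_ile_IMG_1676-01</span><br />');
```

The trade-off is the same as above: the markup is re-serialized by the parser (for example, <br /> comes back as <br>), so the result is clean but not byte-for-byte identical to the input.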

meder
This is a very close and good solution. I tried to remove <p> and </p> from the beginning and end of the string using the trim function, but it only ever removes the opening tag; the closing </p> can't be removed. Thank you for the comment. I hope someone will have a solution for removing the 'p' tags.
ile