ansaurus

Question

How to avoid DOM parsing adding html doctype, head and body tags?

Answer 1

A:

I'm not sure if either of these will actually work, but you could try using DOMImplementation::createDocument when constructing your DOMDocument - the third argument is the DOCTYPE you wish to use.

Also, instead of saveHTML(), you could try saveXML()

nickf 2009-10-06 22:01:51

If you're not sure they'll work, don't post them.

Nick Stinemates 2009-10-06 22:10:08

It doesn't work. saveXML() adds an XML doctype.

ile 2009-10-06 22:17:14

geez, sorry for trying to *help*, but I don't have the time to set up tests and recreate the situations described by every question I try to answer. If it's so important to you, why not test out these methods *your*self before downvoting?

nickf 2009-10-06 22:18:07

hey kids, it's time for sleep :)))

ile 2009-10-06 22:23:09

Answer 2

A:

You could always just use a regex to strip that first bit out:

echo preg_replace("/<!DOCTYPE [^>]+>/", "", $dom->saveHTML());
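
The one-liner above only removes the doctype. A minimal sketch of the same idea extended to the html and body wrappers follows; the function name stripDocumentWrapper is hypothetical, and it assumes the input is the predictable output of saveHTML() rather than arbitrary markup. It does not touch a head element or its contents.

```php
<?php
// Hypothetical helper, not part of the DOM extension: removes the wrapper
// that saveHTML() adds around a parsed fragment.
function stripDocumentWrapper(string $html): string
{
    $patterns = [
        '/<!DOCTYPE[^>]*>/i',            // the doctype declaration
        '~</?(?:html|body)\b[^>]*>~i',   // opening and closing html/body tags
    ];

    // preg_replace() accepts an array of patterns and applies each in turn.
    return trim(preg_replace($patterns, '', $html));
}

$sample = "<!DOCTYPE html>\n<html><body><p>Some photos<br></p></body></html>\n";
echo stripDocumentWrapper($sample); // prints: <p>Some photos<br></p>
```

Because the wrapper tags are generated by the serializer, their shape is known in advance, which is what makes a regex acceptable here.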

nickf 2009-10-07 02:21:27

That would probably solve the problem, but then there's no point in using DOM parsing... I used it in the first place to avoid regular expressions. Still, I think I'll be forced to use one. Thanks for your answer.

ile 2009-10-07 09:26:37

well, generally you want to avoid using regexes for xml/html because the markup syntax is rather lax and writing a regex which would take it all into consideration is difficult. In this case however, you have a very definite structure with known output, so it's very easy to work with. Don't just blindly follow the "HTML + REGEX == BAD!!" crowd if it doesn't make sense.

nickf 2009-10-07 21:58:33

Answer 3

+1 A:

I'm actually looking for the same solution. I've been using an innerHTML method to do this, but the <p> around the text node is still added when you call loadHTML. I don't think there's a way around that without using another parser, unless there's some hidden flag that tells it not to do that.

This code:

<?php

// Serializes the children of $node by importing them into a fresh document.
function innerHTML($node) {
    $doc = new DOMDocument();
    foreach ($node->childNodes as $child) {
        $doc->appendChild($doc->importNode($child, true));
    }

    return $doc->saveHTML();
}

$string = '
    Some photos<br>
    <span class="naslov_slike">photo_by_ile_IMG_1676-01</span><br />
    <span class="naslov_slike">photo_by_ile_IMG_1699-01</span><br />
    <span class="naslov_slike">photo_by_ile_IMG_1697-01</span><br />
    <span class="naslov_slike">photo_by_ile_IMG_1695-01</span><br />
    ';

$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($string);

// Copy the spans into an array first: the node list is live, so removing
// nodes while iterating over it directly would skip elements.
$elements = $dom->getElementsByTagName('span');
$spans = array();
foreach ($elements as $span) {
    $spans[] = $span;
}
foreach ($spans as $span) {
    $span->parentNode->removeChild($span);
}

// documentElement is <html>; its first child is <body>.
echo innerHTML($dom->documentElement->firstChild);

Will output:

<p>Some photos<br><br><br><br><br></p>

Of course, this solution does not keep the markup 100% intact, but it's close.
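
A variant of the innerHTML idea, sketched under the assumption that PHP 5.3.6 or later is available: saveHTML() accepts a node argument and serializes only that subtree, so the second DOMDocument is not needed. The function name bodyInnerHtml is hypothetical. The <p> that the parser wraps around the leading text is still present in the result.

```php
<?php
// Hypothetical helper: returns the markup inside <body> without the
// doctype, html and body wrappers.
function bodyInnerHtml(string $html): string
{
    $dom = new DOMDocument();

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $out = '';
    foreach ($body->childNodes as $child) {
        // With a node argument, saveHTML() serializes only that subtree.
        $out .= $dom->saveHTML($child);
    }

    return $out;
}

echo bodyInnerHtml('Some photos<br><span class="naslov_slike">photo_by_ile_IMG_1676-01</span><br />');
```

On PHP 5.4+ with libxml 2.7.8+, passing LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD as the second argument to loadHTML() is another option, though its behavior on fragments that lack a single root element can be surprising.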

meder 2009-10-07 02:50:52

This is a very close and good solution. I tried to remove <p> and </p> from the beginning and end of the string using the trim function, but it only ever removes the opening tag; the closing </p> can't be removed... Thank you for the comment. I hope someone has a solution for removing the 'p' tags.

ile 2009-10-07 09:33:30
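
On the trim problem described above: trim() treats its second argument as a set of characters, not as a substring, and the serialized output ends with a newline, which is likely why the closing tag survives. A minimal sketch of one way around it, assuming the whole string is wrapped in exactly one <p> element (the name unwrapParagraph is hypothetical):

```php
<?php
// Hypothetical helper: removes a single <p>...</p> pair wrapping the string.
// If the string is not wrapped that way, it is returned trimmed but unchanged.
function unwrapParagraph(string $html): string
{
    // Drop surrounding whitespace first, including the trailing newline
    // that saveHTML() appends.
    $html = trim($html);

    // The "s" modifier lets "." match newlines inside the paragraph.
    return preg_replace('~^<p\b[^>]*>(.*)</p>$~is', '$1', $html);
}

// The character-mask behavior: the trailing "\n" stops trim() before </p>.
var_dump(trim("<p>Some photos</p>\n", '<p>') === "Some photos</p>\n"); // bool(true)

echo unwrapParagraph("<p>Some photos<br><br><br><br><br></p>\n");
// prints: Some photos<br><br><br><br><br>
```

If the output contains several sibling paragraphs, this pattern would strip only the outermost opening and closing tags, so it is suitable only for the single-paragraph case shown in the thread.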
