I'm using XPath to select a section from an HTML page. However, when I use XPath to extract the node, it selects only the text surrounding the HTML tags and not the HTML tags themselves.

Sample HTML

<body>
    <div>
      At first glance you may ask, &#8220;what <i>exactly</i>
      do you mean?&#8221; It means that we want to help <b>you</b> figure...
    </div>
</body>

I have the following XPath

/body/div

I get the following

At first glance you may ask, &#8220;what do you mean?&#8221; It means that we want to help figure...

I want

At first glance you may ask, &#8220;what <i>exactly</i> do you mean?&#8221; It means that we want to help <b>you</b> figure...

Note that the Sample HTML contains <i> and <b> tags within the content. The words inside those tags are "lost" when I extract the content.

I'm using SimpleXML in PHP if that makes a difference.

+2  A: 

Your XPath is fine, though you can remove the final /. as that's redundant:

/atom/content

All of the HTML is inside a <![CDATA[ ]]> section, so in the XML DOM you actually only have text there. The <i> and <b> tags will not be parsed as tags but will just show up as text. Using a CDATA section is exactly the same as if your XML were written like this:

<atom>
    <content>
      At first glance you may ask, &amp;#8220;what &lt;i&gt;exactly&lt;/i&gt;
      do you mean?&amp;#8221; It means that we want to help &lt;b&gt;you&lt;/b&gt; figure...
    </content>
</atom>

So, it's whatever you're doing with the <content> element afterwards that's dropping those tags. Are you later parsing the text as HTML, or running it through a filter, or something like that?
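
To check this on your side, here is a minimal sketch. The original XML was never posted, so the <atom>/<content> fragment below is an assumption made up for illustration. It shows that the CDATA payload comes back as a single string with the markup still in it, and that <content> has no child elements.

```php
<?php
// Hypothetical feed fragment: the real document was not posted,
// so this structure is assumed for illustration only.
$xml = <<<'XML'
<atom>
    <content><![CDATA[what <i>exactly</i> do you mean?]]></content>
</atom>
XML;

$atom = simplexml_load_string($xml);

// Select the element with the XPath from the answer.
$nodes   = $atom->xpath('/atom/content');
$content = $nodes[0];

// Casting to string returns the CDATA payload as plain text.
$text = (string) $content;

// The <i> tag is literal characters inside the text, not a child element.
$hasLiteralTag = strpos($text, '<i>exactly</i>') !== false;
$childCount    = count($content->children());

var_dump($hasLiteralTag, $childCount);
```

If the tags are present in $text here but missing later in your output, the loss happens in whatever processes the string afterwards, not in the XPath query.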

John Kugelman
Removed the trailing period... however the question has changed somewhat.
null
I don't think XPath is the problem, so can you post your PHP code?
John Kugelman
A: 

I don't know if SimpleXML is different, but it seems to me you need to make sure you're selecting all node types and not just text. In standard XPath, /body/div/node() selects every child node of the div, text and elements alike.
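
As a rough sketch of that idea, the snippet below uses PHP's DOM extension rather than SimpleXML (that swap is my assumption, since the question uses SimpleXML). The shortened ASCII-only sample string is also made up for the example. It runs the node() query and serializes each returned node in document order.

```php
<?php
// Shortened, ASCII-only stand-in for the sample HTML (an assumption for this sketch).
$html = '<body><div>what <i>exactly</i> do you mean? We want to help <b>you</b> figure...</div></body>';

$doc = new DOMDocument();
// The sample is well-formed, so the XML parser accepts it and <body> stays the root.
$doc->loadXML($html);

$xpath = new DOMXPath($doc);

// node() matches text nodes and element nodes, so nothing is skipped.
$inner = '';
foreach ($xpath->query('/body/div/node()') as $node) {
    $inner .= $doc->saveXML($node);
}

echo $inner, "\n";
```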

ChrisCM
+1  A: 

SimpleXML has no way to represent a text node as a standalone object, so you'll have to use a custom solution instead.

You can use asXML() on each div element and then remove the div tags, or you can convert the div elements to DOMNodes (with dom_import_simplexml()), then loop over $div->childNodes and serialize each child. Note that your HTML entities will most likely be replaced by the actual characters if available.
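
Both options could look roughly like this. It is a sketch under assumptions, not a drop-in solution: the shortened ASCII-only sample and the regular expression used to strip the outer tags are mine, and the regex only handles the simple case where the selected element is a div.

```php
<?php
// Shortened, ASCII-only stand-in for the sample HTML (an assumption for this sketch).
$body = simplexml_load_string(
    '<body><div>what <i>exactly</i> do you mean? We want to help <b>you</b> figure...</div></body>'
);

foreach ($body->xpath('/body/div') as $div) {
    // Option 1: serialize the whole element, then strip the outer <div> tags.
    $viaAsXml = preg_replace('~^<div[^>]*>|</div>$~', '', $div->asXML());

    // Option 2: switch to the DOM API and serialize each child node in turn.
    $domDiv = dom_import_simplexml($div);
    $viaDom = '';
    foreach ($domDiv->childNodes as $child) {
        $viaDom .= $domDiv->ownerDocument->saveXML($child);
    }

    echo $viaAsXml, "\n", $viaDom, "\n";
}
```

The DOM route avoids string surgery on the serialized markup, so it is probably the safer of the two if the selected elements can carry attributes or vary in name.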

Alternatively, you can take a look at the SimpleDOM project and use its innerHTML() method.

// Assumes the SimpleDOM library has already been loaded, e.g. include 'SimpleDOM.php';
$html =
'<body>
    <div>
      At first glance you may ask, &#8220;what <i>exactly</i>
      do you mean?&#8221; It means that we want to help <b>you</b> figure...
    </div>
</body>';

$body = simpledom_load_string($html);

foreach ($body->xpath('/body/div') as $div)
{
    var_dump($div->innerHTML());
}
Josh Davis