How can I select the string contents of the following nodes:

<span class="url">
 word
 <b class=" ">test</b>
</span>

<span class="url">
 word
 <b class=" ">test2</b>
 more words
</span>

I have tried a few things:

//span/text()

Doesn't get the bold tag

//span/string(.)

is invalid

string(//span)

only selects 1 node

I am using SimpleXML in PHP, and the only other option I can think of is to use //span, which returns:

Array
(
    [0] => SimpleXMLElement Object
        (
            [@attributes] => Array
                (
                    [class] => url
                )

            [b] => test
        )

    [1] => SimpleXMLElement Object
        (
            [@attributes] => Array
                (
                    [class] => url
                )

            [b] => test2
        )

)

Note that it also drops the "more words" text from the second span.

So I guess I could then flatten each item in the array using PHP somehow? XPath is preferred, but any other ideas would help too.

A: 
//span//text()

This may be the best you can do. You'll get multiple text nodes because the text is stored in separate nodes in the DOM. If you want a single string you'll have to just concatenate the text nodes yourself since I can't think of a way to get the built-in XPath functions to do it.

Using string() or concat() won't work because these functions expect string arguments. When you pass a node-set to a function expecting a string, the node-set is converted to a string by taking the text content of the first node in the node-set. The rest of the nodes are discarded.
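
A minimal sketch of the manual concatenation described above, assuming PHP's DOM extension and the question's markup wrapped in a hypothetical <root> element so that it parses as XML. It shows that string(//span) yields only the first span's text, while looping over the text nodes of each span collects everything:

```php
<?php
// Sample markup from the question; the <root> wrapper is an assumption
// added so the fragment is well-formed XML.
$xml = '<root><span class="url"> word <b class=" ">test</b></span>'
     . '<span class="url"> word <b class=" ">test2</b> more words</span></root>';

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// string() converts a node-set by taking only its first node.
$first = $xpath->evaluate('string(//span)');

// Concatenate the descendant text nodes of each span by hand.
$joined = [];
foreach ($xpath->query('//span') as $span) {
    $text = '';
    foreach ($xpath->query('.//text()', $span) as $textNode) {
        $text .= $textNode->nodeValue;
    }
    $joined[] = $text;
}

print_r($joined);
```

Grouping by span inside the loop avoids having to work out afterwards which span each text node came from.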

John Kugelman
+3  A: 
$xml = '<foo>
<span class="url">
 word
 <b class=" ">test</b>
</span>

<span class="url">
 word
 <b class=" ">test2</b>
 more words
</span>
</foo>';
$dom = new DOMDocument();
$dom->loadXML($xml); //or load an HTML document with loadHTML()
$x= new DOMXpath($dom);
foreach($x->query("//span[@class='url']") as $node) echo $node->textContent;
Wrikken
This is what I was looking for. Thanks.
spyderman4g63
+3  A: 

You don't even need XPath for this:

$dom = new DOMDocument;
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('span') as $span) {
    if(in_array('url', explode(' ', $span->getAttribute('class')))) {
        $span->nodeValue = $span->textContent;
    }
}
echo $dom->saveHTML();

EDIT after comment below

If you just want to fetch the string, you can do echo $span->textContent; instead of replacing the nodeValue. I understood you wanted to have one string for the span, instead of the nested structure. In this case, you should also consider whether simply running strip_tags on the span snippet wouldn't be the faster and easier alternative.
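
A small sketch of the strip_tags alternative, assuming the span is already available as a string (the $snippet value below is a hypothetical example, not taken from a parsed document):

```php
<?php
// Hypothetical span snippet, already extracted as a string.
$snippet = '<span class="url">word <b class=" ">test2</b> more words</span>';

// strip_tags() removes the markup and keeps the text between the tags.
$plain = strip_tags($snippet);

echo $plain, PHP_EOL;
```

Note that strip_tags works on raw strings and does not decode entities, so it is only a shortcut when the snippet is simple.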


With PHP 5.3 you can also register arbitrary PHP functions for use as callbacks in XPath queries. The following would fetch the content of all span elements and their child nodes and return it as a single string.

$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPHPFunctions();
echo $xp->evaluate('php:function("nodeTextJoin", //span)');

// Custom Callback function
function nodeTextJoin($nodes)
{
    $text = '';
    foreach($nodes as $node) {
        $text .= $node->textContent;
    }
    return $text;
}
Gordon
I'm not sure that's what the OP is asking for. What this does is print out the whole document with all markup under the <span> tags removed, i.e. the first span element is now `<span class="url">word test</span>` instead of `<span class="url">word <b class=" ">test</b></span>`.
Alexandre Jasmin
@Alexandre the OP's comment below the question reads *The main goal is to return 1 string for each span.* I interpreted this as replacing the original string, but now that you say it, yes, I might be wrong.
Gordon
Yeah, my main goal was to convert the contents of the span to a string. SimpleXML was taking the tags and converting them to an array.
spyderman4g63
Hmm, never really _needed_ the `registerPHPFunctions`, but it would have saved quite some time in the past. Noted!
Wrikken
@Wrikken I've yet to find a real need for them too. The main downside is having to write `php:function("functionname", ...` and `php:functionString("functionname", ...`, which is cumbersome. Your XPath queries will also no longer be portable to other languages. But since it's possible and not a well-known feature, I thought I'd add them here. @salathe wrote a blog entry about this at http://cowburn.info/2009/10/23/php-funcs-xpath/
Gordon
@Wrikken yup, actually I remembered the possibility to use PHP functions when Alejandro mentioned XPath2 in his answer.
Gordon
A: 

SimpleXML doesn't like mixing text nodes with other elements, which is why you're losing some content there. The DOM extension, however, handles that just fine. Luckily, DOM and SimpleXML are two sides of the same coin (libxml), so it's very easy to switch between them. For instance:

foreach ($yourSimpleXMLElement->xpath('//span') as $span)
{
    // will not work as expected
    echo $span;

    // will work as expected
    echo textContent($span);
}

function textContent(SimpleXMLElement $node)
{
    return dom_import_simplexml($node)->textContent;
}
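
For a self-contained illustration of the difference, here is a sketch that assumes the second span from the question wrapped in a hypothetical <root> element. It compares the plain string cast with the DOM import:

```php
<?php
// The <root> wrapper is an assumption so the question's markup parses as XML.
$sxe = simplexml_load_string(
    '<root><span class="url"> word <b class=" ">test2</b> more words</span></root>'
);

$span = $sxe->span[0];

// Casting to string leaves out the text of the nested <b> element.
$direct = (string) $span;

// Importing the same node into DOM exposes textContent,
// which includes the text of all descendants.
$full = dom_import_simplexml($span)->textContent;

echo $full, PHP_EOL;
```

Both objects refer to the same underlying libxml node, so the import does not copy the document.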
Josh Davis
Interesting. But it's simpler to just use the DOM for everything, as in @Wrikken's answer.
Alexandre Jasmin
DOM is an order of magnitude more complicated to use than SimpleXML but yeah, whatever works for you.
Josh Davis
@Josh Sorry. I don't mean we should use the DOM all the time. DOM code can get horribly verbose. But in the context of this simple task I don't see the point of mixing the two APIs. In fact you save a few keystrokes by not calling dom_import_simplexml() in this case
Alexandre Jasmin
A: 

How can I select the string contents of the following nodes:

First, I think your question is not clear.

You could select the descendant text nodes, as John Kugelman answered, with

//span//text()

I recommend using an absolute path (one not starting with //).

But then you would need to process the text nodes to find out which parent span each one belongs to. So it would be better to select just the span elements (for example, //span) and then process their string values.
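
In PHP, that approach could be sketched as follows, assuming the DOM extension and the <div>-wrapped sample input. Each span is selected with an absolute path, and the XPath 1.0 function normalize-space() is evaluated with that span as the context node:

```php
<?php
$xml = '<div>
<span class="url">
 word
 <b class=" ">test</b>
</span>

<span class="url">
 word
 <b class=" ">test2</b>
 more words
</span>
</div>';

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$strings = [];
foreach ($xpath->query('/div/span') as $span) {
    // Evaluate relative to the current span; normalize-space() trims the
    // string value and collapses runs of whitespace into single spaces.
    $strings[] = $xpath->evaluate('normalize-space(.)', $span);
}

print_r($strings);
```

This yields one string per span without having to track individual text nodes.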

With XPath 2.0 you could use:

string-join(//span, '.')

Result:

word test. word test2 more words

With XSLT 1.0, this input:

<div>
<span class="url">
 word
 <b class=" ">test</b>
</span>

<span class="url">
 word
 <b class=" ">test2</b>
 more words
</span>
</div>

With this stylesheet:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>
    <xsl:template match="span[@class='url']">
        <xsl:value-of select="concat(substring('.',1,position()-1),normalize-space(.))"/>
    </xsl:template>
</xsl:stylesheet>

Output:

word test.word test2 more words
Alejandro
[DOM uses libxml](http://www.php.net/manual/en/dom.requirements.php) and [libxml does not support XPath 2.0](http://xmlsoft.org/index.html)
Gordon
@Gordon: "but any other ideas would help too."
Alejandro
@Alejandro just saying, in case anybody tries it and wonders why it won't work
Gordon
@Gordon: And I added an XPath 2.0 solution because it would be good for more people to know its new features and update their platform or ask vendors to do so.
Alejandro
+2  A: 

Using XMLReader:

$xmlr = new XMLReader;
$xmlr->xml($doc);
while ($xmlr->read()) {
    if (($xmlr->nodeType == XMLReader::ELEMENT) && ($xmlr->name == 'span')) {
        echo $xmlr->readString();
    }
}

Output:

word
test

word
test2
more words
GZipp
A: 

Along the lines of Alejandro's XSLT 1.0 "but any other ideas would help too" answer...

XML:

<?xml version="1.0" encoding="UTF-8"?>
<div>
    <span class="url">
        word
        <b class=" ">test</b>
    </span>
    <span class="url">
        word
        <b class=" ">test2</b>
        more words
    </span>
</div>

XSL:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>
    <xsl:template match="span">
        <xsl:value-of select="normalize-space(data(.))"/>
    </xsl:template>
</xsl:stylesheet>

OUTPUT:

word test
word test2 more words
DevNull
Thanks. I'm pretty sure this would work if I were going to go with XSL, but the XPath example is better for the little thing I am doing. I've also gotten used to some custom extensions we use at work that are not in EXSLT.
spyderman4g63
@DevNull: `fn:data()` is XPath 2.0, so I think you should say this solution is **XSLT 2.0**
Alejandro