I am using Simple HTML DOM Parser to parse some HTML.

I have HTML like this:

<span class="UIStory_Message">
    Yeah, elixir of life!<br/>
   <a href="asdfasdf">
      <span>asdfsdfasdfsdf</span>
       <wbr/>
       <span class="word_break"/>
       61193133389&ref=nf
   </a>
</span>

My code is

$storyMessageNodes    = $story->find('span.UIStory_Message');
$storyMessage         = strip_tags($storyMessageNodes[0]->innertext);

I want to get only the text that sits directly inside the span "UIStory_Message", i.e. "Yeah, elixir of life!".

But the above code gives me all of the text inside the span, including the nested elements, i.e. "Yeah, elixir of life! asdfsdfasdfsdf 61193133389&ref=nf ".

How can I write this so that it returns only "Yeah, elixir of life!"?

A: 

You can do something like this:

$result = $story->find('span.UIStory_Message');

Then use substr() to cut the span's inner text off at the first <; another option is to write a simple regular expression.
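The substr() idea can be sketched roughly as follows. This is only an illustration that works on a plain string: the helper name textBeforeFirstTag() is made up for this example, and the sample string is a shortened version of the asker's markup.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): returns the text that
// precedes the first tag in an HTML fragment.
function textBeforeFirstTag(string $innerHtml): string
{
    $pos = strpos($innerHtml, '<');
    // If there is no tag at all, the whole fragment is plain text.
    $text = ($pos === false) ? $innerHtml : substr($innerHtml, 0, $pos);
    return trim($text);
}

// Shortened sample of the inner HTML of span.UIStory_Message.
$inner = "\n    Yeah, elixir of life!<br/>\n"
       . "   <a href=\"asdfasdf\"><span>asdfsdfasdfsdf</span></a>\n";

echo textBeforeFirstTag($inner); // Yeah, elixir of life!
```

With simple_html_dom you would presumably pass in $result[0]->innertext, since find() without an index returns an array of nodes. Note that this only captures text before the first child element; any text that follows a nested tag is dropped.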


I haven't tested this; it is just a guess based on the documentation. Since find() returns an array unless you pass an index, try:

$story->find('span.UIStory_Message', 0)->plaintext; // same result as strip_tags()?

Or:

$story->find('span.UIStory_Message', 0)->find('text');

If that doesn't work, try playing with these options.
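For comparison, here is the same "take only the text nodes" idea sketched with PHP's bundled DOM extension instead of simple_html_dom. This is a swapped-in technique, not the library's API: the helper name directText() is hypothetical, and the sample markup is a simplified version of the asker's HTML.

```php
<?php
// Hypothetical helper built on PHP's DOM extension (DOMDocument + DOMXPath),
// NOT on simple_html_dom. Returns the text that is a direct child of the
// first-level matches of span.<class>, ignoring text inside nested elements.
function directText(string $html, string $class): string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // "/text()" selects only text nodes that are direct children of the span.
    $nodes = $xpath->query("//span[@class='$class']/text()");

    $parts = [];
    foreach ($nodes as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $parts[] = $text;
        }
    }
    return implode(' ', $parts);
}

// Simplified sample markup.
$html = '<span class="UIStory_Message">Yeah, elixir of life!<br/>'
      . '<a href="asdfasdf"><span>asdfsdfasdfsdf</span> 61193133389</a></span>';

echo directText($html, 'UIStory_Message'); // Yeah, elixir of life!
```

Unlike cutting at the first <, this also keeps direct text that appears after a nested element, joined with single spaces.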

Alix Axel
I know that will work, but I want to know whether there is any direct method in simple_html_dom.php for doing this?
Jasim
A: 

I've written a method that removes unneeded elements from fetched DOM nodes. I've contacted the author, but Simple HTML DOM has not been active for two years, so I doubt he will include it in the distribution. Here it is:

/**
 * remove specified nodes from selected dom
 *
 * @param string $selector
 * @param int|array $limit (optional) possible values include:
 *   + positive integer - remove first denoted number of elements
 *   + negative integer - remove last denoted number of elements
 *   + array of ones and zeroes - remove the respective matches that equal to one
 *
 * eg.
 *   // will remove first two images found in node
 *   $dom->removeNodes('img',2);
 *
 *   // will remove last two images found in node
 *   $dom->removeNodes('img',-2);
 *
 *   // will remove all but the third image found in node
 *   $dom->removeNodes('img',array(1,1,0,1));
 *
 * [!!!] if there are more matches found than elements in array, the last array member will be used for processing
 *
 * eg.
 *   // will remove second and every following image
 *   $dom->removeNodes('img',array(0,1));
 *
 *   // will remove only the second image
 *   $dom->removeNodes('img',array(0,1,0));
 *
 * @return simple_html_dom_node
 */
public function removeNodes($selector, $limit = NULL)
{
    $elements = $this->find($selector);
    if ( empty($elements) ) return $this;


    if ( isset($limit) && is_int( $limit ) && $limit < 0 ) {
        $limit = abs( $limit );
        $elements = array_reverse( $elements );
    }

    foreach ( $elements as $element ) {

        if ( isset($limit) ) {

            if ( is_array( $limit ) ) {
                $current = current( $limit );
                if ( next( $limit ) === FALSE ) {
                    end( $limit );
                }
                if ( !$current ) {
                    continue;
                }
            } else {
                if ( --$limit === -1 ) {
                    return $this;
                }
            }
        }

        $element->outertext = '';

    }

    return $this;
}
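To make the $limit rules easier to check, here is a stand-alone sketch that mirrors the same logic on plain indexes rather than DOM nodes. The function name indexesToRemove() is hypothetical and exists only for this illustration; it reports which matches (0-based, in document order) the method above would blank out.

```php
<?php
// Hypothetical stand-alone mirror of the $limit handling in removeNodes().
// Returns the 0-based indexes of the matches that would be removed.
function indexesToRemove(int $matchCount, $limit = null): array
{
    $indexes = $matchCount > 0 ? range(0, $matchCount - 1) : [];

    // A negative integer counts from the end, so walk the matches backwards.
    if (is_int($limit) && $limit < 0) {
        $limit = abs($limit);
        $indexes = array_reverse($indexes);
    }

    $removed = [];
    foreach ($indexes as $index) {
        if (is_array($limit)) {
            $flag = current($limit);
            // Once the flags run out, keep reusing the last one.
            if (next($limit) === false) {
                end($limit);
            }
            if (!$flag) {
                continue;
            }
        } elseif (is_int($limit) && --$limit === -1) {
            break; // the requested number of matches has been removed
        }
        $removed[] = $index;
    }

    sort($removed);
    return $removed;
}

print_r(indexesToRemove(5, -2));           // indexes 3 and 4
print_r(indexesToRemove(5, [0, 1]));       // indexes 1, 2, 3 and 4
print_r(indexesToRemove(5, [0, 1, 0]));    // index 1 only
```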

Put it in the simple_html_dom_node class or in a class extending it. In the asker's case you'd use it like this:

$storyMessageNodes = $story->find('span.UIStory_Message');
$storyMessage = $storyMessageNodes[0]->removeNodes('a')->plaintext;
Raveren