ansaurus

Question

php xpath - get only tag attributes/remove inner tag contents

Answer 1

A:

EDIT: About the head element: if you want only the attributes of the head element, you can use xpath( "//head" ), which returns an array of matching elements, and then call attributes() on the first one, e.g. $heads[0]->attributes().
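As a minimal sketch of that suggestion, assuming the page is loaded through DOMDocument and SimpleXML as described further down: the sample markup and its attribute values are made up for illustration, and $heads and $attributes are just names chosen here.

```php
<?php
// Sample markup; the profile and data-theme attributes are made up for illustration.
$html = '<html><head profile="http://example.com/profile" data-theme="dark">'
      . '<title>Demo</title></head><body></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors( true );

$dom = new DOMDocument;
$dom->loadHTML( $html );
$document = simplexml_import_dom( $dom );

// xpath() returns an array of SimpleXMLElement objects, so take the first match.
$heads = $document->xpath( '//head' );

$attributes = array();
if ( !empty( $heads ) ) {
    // attributes() yields name => value pairs; cast each value to a string.
    foreach ( $heads[0]->attributes() as $name => $value ) {
        $attributes[$name] = (string) $value;
    }
}

print_r( $attributes );
```

Only the attributes of the head tag end up in $attributes; the title element and its text are never touched.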

I won't answer your question directly, since it is short on details; instead I will describe my own experience. I believe you can solve your problem once you understand the implications of the examples I give.

I understand from the tags that you want to use PHP for the job. I had a similar problem lately: I had to parse around 100 static HTML documents and extract parts of the information to store in a database. Initially I thought about regular expressions, but as I went along I saw that it would be a tedious task.

So I ended up messing with XPath and SimpleXML in PHP.

Here is how I ended up:

$file_contents = file_get_contents( $file );
$dom = new DOMDocument;
$dom->loadHTML( $file_contents );
$document = simplexml_import_dom( $dom );

Now I have a SimpleXML object which holds the HTML code. That is really great - here is how it rolls:

Suppose you have the following HTML code:

<div id="content">
<div class="description">
 <dl>
     <dt>Title</dt>
     <dd>
         <ul><li> first item </li> <li> second item</li></ul>
         <p> a paragraph.. </p>
     </dd>
 </dl>
</div>
</div>

Now you can iterate over all the <dl> elements that are children of div.description and grandchildren of div#content like this:

foreach( $document->xpath( "//div[@id='content']/div[@class='description']/dl" ) as $element )
    recurse( $element ); // walk each matching <dl> with the function shown next

and then all the children are parsed through a recursive function like this one:

function recurse( $parent )
{
    echo '<' . $parent->getName() . '>' . "\n";
    // echo $parent; // the element's own text; you might want to strip whitespace like \t and \n here

    foreach( $parent->children() as $child )
    {
        if( count( $child->children() ) > 0 )
        {
            // the child has children of its own: descend into it
            recurse( $child );
        }
        else
        {
            // leaf element: print its tag and its text content
            echo '<' . $child->getName() . '>';
            echo $child;
            echo '</' . $child->getName() . '>' . "\n";
        }
    }
    echo '</' . $parent->getName() . '>' . "\n";
}
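Since the question asks for the tags and their attributes without the inner contents, here is a hedged variant of the same traversal. tags_only() is a hypothetical helper name, not part of SimpleXML; it returns the tree as tags with attributes and drops every text node. The sample markup is a shortened copy of the HTML above.

```php
<?php
// Hypothetical helper: returns the element tree as tags with attributes, dropping all text.
function tags_only( SimpleXMLElement $element, $depth = 0 )
{
    $indent = str_repeat( '  ', $depth );
    $out = $indent . '<' . $element->getName();

    // Re-emit each attribute, escaping its value.
    foreach ( $element->attributes() as $name => $value ) {
        $out .= ' ' . $name . '="' . htmlspecialchars( (string) $value ) . '"';
    }
    $out .= ">\n";

    // Recurse into child elements only; text content is never read.
    foreach ( $element->children() as $child ) {
        $out .= tags_only( $child, $depth + 1 );
    }

    return $out . $indent . '</' . $element->getName() . ">\n";
}

$document = simplexml_load_string(
    '<div id="content"><div class="description"><dl><dt>Title</dt>'
    . '<dd><p> a paragraph.. </p></dd></dl></div></div>'
);

$result = tags_only( $document );
echo $result;
```

Because the function never echoes the element itself, there is no need to list the tags to ignore: text is skipped for every node type.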

I hope that I've been of help, good luck!

Petrunov 2009-05-02 09:27:53

Regarding your edit: yes, that is true, but it is part of a function that traverses the HTML tree structure, which means it will add the information whether I want it or not, unless I specify every type of tag (node) I want it to ignore (which is annoying to me :) )

EddyR 2009-05-02 15:57:53

Answer 2

+1 A:

In that case perhaps a preg_match like this one might be what you need?

preg_match( '/<head\b([^>]*)>/i', $file_contents, $matches ); // [^>]* stops at the first ">", \b avoids <header>
echo isset( $matches[1] ) ? trim( $matches[1] ) : '';
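A small sketch of how such a pattern behaves on a sample string. The input is made up for illustration. The capture group here uses [^>]* so that the match stops at the first ">", and \b so that a tag like <header> is not matched. This is an assumption-laden shortcut: an attribute value that itself contains ">" would still break it.

```php
<?php
// Sample input; the attribute values are made up for illustration.
$file_contents = '<html><head profile="http://example.com/p" lang="en">'
               . '<title>Demo</title></head></html>';

// [^>]* cannot run past the first ">", so only the opening tag's attributes are captured.
$attributes = '';
if ( preg_match( '/<head\b([^>]*)>/i', $file_contents, $matches ) ) {
    $attributes = trim( $matches[1] );
}

echo $attributes;
```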

Petrunov 2009-05-02 19:14:51

Answer 3

A:

which is irrelevant because I just need the tag attributes

I am not sure where the attributes are in your example, and I am no expert on PHP's XPath implementation.

However you may try the following:

Use the text() node test at the end of your expression (e.g. "/html/head/text()") to get only the text nodes, not the tags.
The XPath query should return a node list (DOMXPath::query() returns a DOMNodeList, for example). You should be able to use that to get an entire XML fragment.
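The two suggestions above can be sketched with DOMXPath as follows. The markup is made up for illustration, and the second query uses @* to select attribute nodes, which is an addition here for the "attributes only" case rather than something this answer proposed.

```php
<?php
// Sample markup, made up for illustration.
$html = '<html><head profile="http://example.com/p"><title>Demo</title></head>'
      . '<body><p class="intro">Hello</p></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors( true );

$dom = new DOMDocument;
$dom->loadHTML( $html );
$xpath = new DOMXPath( $dom );

// text() selects text nodes only, here the text inside <title>.
$texts = array();
foreach ( $xpath->query( '/html/head/title/text()' ) as $node ) {
    $texts[] = $node->nodeValue;
}

// @* selects attribute nodes, here every attribute of <head>.
$attributes = array();
foreach ( $xpath->query( '/html/head/@*' ) as $attr ) {
    $attributes[$attr->nodeName] = $attr->nodeValue;
}

print_r( $texts );
print_r( $attributes );
```

Note that "/html/head/text()" on its own selects only text nodes that are direct children of head, which in most documents is just whitespace.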

Vlagged 2009-05-03 21:15:28
