How do I:

  1. Hide everything between the head tags in an XPath "/html/head" query?

For example, "<html><head><title>some title</title>some text</head>..." produces nodeValue = "some title some text", which is irrelevant because I just need the tag attributes and I don't want to add irrelevant data to my database.

  2. Hide all child/descendant nodes in an XPath "/html/body" query?

For example, "<html><body><div>some anchor</div>some text</body>..." produces nodeValue = "some anchor some text". Here "some text" belongs to the body tag and I do need to keep it, but I want to get rid of everything else.

Also, I don't want to remove them from the DOM document altogether!

A: 

EDIT: About the head element: if you want only the attributes of the head element, you can use xpath( "//head" ) and then read the attributes of the node it returns ($head->attributes with DOM, or $head->attributes() with SimpleXML).
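A minimal sketch of that idea using DOMXPath, not a definitive implementation. The sample markup and its profile/lang attributes are made up for illustration, and the variable names are my own. Only the attributes of head are read, so its child nodes never reach the result and nothing is removed from the document.

```php
<?php
// Hypothetical sample markup; the attributes on <head> are invented for the example.
$html = '<html><head profile="http://example.com/p" lang="en"><title>some title</title></head><body></body></html>';

$dom = new DOMDocument;
$dom->loadHTML( $html );

$xpath = new DOMXPath( $dom );
$head  = $xpath->query( '/html/head' )->item( 0 );

// Collect only the attributes of <head>; its children are never read.
$attributes = array();
foreach ( $head->attributes as $attr ) {
    $attributes[ $attr->nodeName ] = $attr->nodeValue;
}

print_r( $attributes );
```

The same loop works for any element node returned by the query, so it can be dropped into a generic tree traversal without listing tag names.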

I won't answer your question directly, since it is short on details. Instead I will describe my own experience; I believe you can solve your problem once you understand the implications of the examples below.

I understand from the tags that you want to use PHP for the job. I had a similar problem lately: I had to parse around 100 static HTML documents and extract parts of the information to store in a database. Initially I thought about regular expressions, but as I went along I saw that this would be a tedious task.

So I ended up messing with XPath and SimpleXML in PHP.

Here is what I ended up with:

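// $file holds the path to the HTML document to parse
$file_contents = file_get_contents( $file );
$dom = new DOMDocument;
// loadHTML() copes with HTML that is not well-formed XML
$dom->loadHTML( $file_contents );
// Wrap the DOM tree in a SimpleXML object for easier traversal
$document = simplexml_import_dom( $dom );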

Now I have a SimpleXML object which holds the HTML code. That is really convenient. Here is how it works:

Suppose you have the following HTML code:

<div id="content">
<div class="description">
 <dl>
     <dt>Title</dt>
     <dd>
         <ul><li> first item </li> <li> second item</li></ul>
         <p> a paragraph.. </p>
     </dd>
 </dl>
</div>
</div>

Now you can iterate over all the <dl> elements in your code that are children of div.description and grandchildren of div#content like this:

foreach( $document->xpath( "//div[@id='content']/div[@class='description']/dl" ) as $element )
{
    recurse( $element );
}

and then all the children are parsed through a recursive function like this one:

function recurse( $parent )
{
    // Opening tag of the current element
    echo '<' . $parent->getName() . '>' . "\n";
    // echo $parent; // you might want to strip whitespace such as \t and \n here

    foreach( $parent->children() as $child )
    {
        if( count( $child->children() ) > 0 )
        {
            // The child has children of its own: descend into it
            recurse( $child );
        }
        else
        {
            // Leaf element: print its tag and its text content
            echo '<' . $child->getName() . '>';
            echo $child;
            echo '</' . $child->getName() . '>' . "\n";
        }
    }

    // Closing tag of the current element
    echo '</' . $parent->getName() . '>' . "\n";
}

I hope that I've been of help, good luck!

Petrunov
Regarding your edit: yes, that is true, but it is part of a function that traverses the HTML tree structure, which means it will add the information whether I want it or not, unless I specify every type of tag (node) I want it to ignore (which is annoying to me :) )
EddyR
A: 

In that case perhaps a preg_match like this one might be what you need?

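// [^>]* stops at the first ">", so only the attributes of the opening <head> tag are captured
preg_match( '/<head\s+([^>]*)>/i', $file_contents, $matches );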
echo ( isset( $matches[1] ) ) ? $matches[1] : '';
Petrunov
A: 

which is irrelevant because I just need the tag attributes

I am not sure where the attributes are in your example, and I am no expert on PHP's XPath implementation.

However you may try the following:

  • use the text() node test at the end of your expression (e.g. "/html/head/text()") to get only the text nodes that are direct children, not the tags
  • the XPath query should return a node list. You should use that to get an entire XML fragment; DOMXPath, for example, does just that.
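The first suggestion can be sketched roughly as follows, using the body example from the question. This is an illustration under assumptions, not a definitive implementation: the variable names are my own, and it relies on the parser keeping "some text" as a direct child of body. The query only reads the document, so the div and its contents stay in the DOM.

```php
<?php
// Markup taken from the question's body example.
$html = '<html><body><div>some anchor</div>some text</body></html>';

$dom = new DOMDocument;
$dom->loadHTML( $html );
$xpath = new DOMXPath( $dom );

// text() selects only the text nodes that are direct children of <body>.
$own_text = '';
foreach ( $xpath->query( '/html/body/text()' ) as $text_node ) {
    $own_text .= $text_node->nodeValue;
}
$own_text = trim( $own_text );

// For comparison: nodeValue of the element itself still includes descendant text.
$body = $xpath->query( '/html/body' )->item( 0 );

echo $own_text, "\n";
echo $body->nodeValue, "\n";
```

Replacing "/html/body" with the path of whichever node the traversal is visiting should give that node's own text without naming the tags to ignore.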
Vlagged