EDIT:
About the head element: if you only want the attributes of the head element, you can use xpath( "//head" ), take the first match as $head, and then call $head->attributes().
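Here is a minimal sketch of that idea. The sample markup and its lang and dir attributes are made up for illustration, and $head_attributes is just a name I picked. Note that xpath() returns an array of matches, so you need the first entry before asking for its attributes.

```php
<?php
// Hypothetical sample document; the lang and dir attributes are only for illustration.
$html = '<html><head lang="en" dir="ltr"><title>Demo</title></head><body><p>Hi</p></body></html>';

$dom = new DOMDocument;
$dom->loadHTML( $html );
$document = simplexml_import_dom( $dom );

// xpath() returns an array of matching elements, so take the first one.
$matches = $document->xpath( '//head' );
$head = $matches[0];

// attributes() is iterable as name => value pairs; cast values to plain strings.
$head_attributes = array();
foreach( $head->attributes() as $name => $value )
{
    $head_attributes[$name] = (string) $value;
    echo $name . ' = ' . $value . "\n";
}
```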
I won't answer your question directly, since it is short on details. Instead I will describe my own experience. I believe you can solve your problem once you understand the implications of the examples I am giving.
I understand from the tags that you want to use PHP for the job. I recently had a similar problem: I had to parse around 100 static HTML documents and extract parts of the information to store in a database. Initially I considered regular expressions, but as I went along I saw that it would be a tedious task.
So I ended up using XPath and SimpleXML in PHP.
Here is what I ended up with:
$file_contents = file_get_contents( $file );
$dom = new DOMDocument;
$dom->loadHTML( $file_contents );
$document = simplexml_import_dom( $dom );
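One thing to keep in mind: static HTML found in the wild is often not well formed, and loadHTML() may then emit PHP warnings. The following is only a sketch of one way to deal with that, assuming you would rather collect the parser messages than have them printed. The sample markup is made up; the libxml_* functions are part of PHP's libxml extension.

```php
<?php
// Hypothetical sloppy markup: the <b> and <p> tags are never closed.
$file_contents = '<html><body><p>Unclosed <b>bold text</body></html>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors( true );

$dom = new DOMDocument;
$dom->loadHTML( $file_contents );

// Keep whatever was collected (possibly nothing), then reset libxml's state.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors( $previous );

$document = simplexml_import_dom( $dom );

// Report any parser messages; a real script might log them per file instead.
foreach( $errors as $error )
{
    echo 'line ' . $error->line . ': ' . trim( $error->message ) . "\n";
}

// The parser repairs the markup, so the tree is still usable.
echo $document->body->p->b . "\n";
```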
Now I have a SimpleXML object that holds the HTML code. That is really convenient. Here is how it works:
Suppose you have the following HTML code:
<div id="content">
  <div class="description">
    <dl>
      <dt>Title</dt>
      <dd>
        <ul><li> first item </li> <li> second item</li></ul>
        <p> a paragraph.. </p>
      </dd>
    </dl>
  </div>
</div>
Now you can iterate over all the <dl> elements that are children of div.description and grandchildren of div#content like this:
foreach( $document->xpath( "//div[@id='content']/div[@class='description']/dl" ) as $element )
    recurse( $element );
Each matched element and its children are then printed by a recursive function like this one:
function recurse( $parent )
{
    echo '<' . $parent->getName() . '>' . "\n";
    #echo $parent # you might want to strip any white spaces like \t and \n here
    foreach( $parent->children() as $child )
    {
        if( count( $child->children() ) > 0 )
        {
            recurse( $child );
        }
        else
        {
            echo '<' . $child->getName() . '>';
            echo $child;
            echo '</' . $child->getName() . '>' . "\n";
        }
    }
    echo '</' . $parent->getName() . '>' . "\n";
}
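To tie the pieces together, here is a self-contained sketch that runs the same idea against the sample markup from above. It is a variant of my function, not the exact code I used: render_tree() is a name made up for this example, it builds a string instead of echoing, trims the text of leaf elements, and indents by depth.

```php
<?php
// The sample markup from above, loaded from a string instead of a file.
$html = <<<HTML
<div id="content">
  <div class="description">
    <dl>
      <dt>Title</dt>
      <dd>
        <ul><li> first item </li> <li> second item</li></ul>
        <p> a paragraph.. </p>
      </dd>
    </dl>
  </div>
</div>
HTML;

$dom = new DOMDocument;
$dom->loadHTML( $html );
$document = simplexml_import_dom( $dom );

// Hypothetical helper: returns the element tree as an indented string.
function render_tree( $parent, $depth = 0 )
{
    $indent = str_repeat( '  ', $depth );
    $name = $parent->getName();

    // Leaf element: print its trimmed text on a single line.
    if( count( $parent->children() ) == 0 )
    {
        return $indent . '<' . $name . '>' . trim( (string) $parent ) . '</' . $name . '>' . "\n";
    }

    // Element with children: recurse one level deeper for each child.
    $out = $indent . '<' . $name . '>' . "\n";
    foreach( $parent->children() as $child )
    {
        $out .= render_tree( $child, $depth + 1 );
    }
    return $out . $indent . '</' . $name . '>' . "\n";
}

$output = '';
foreach( $document->xpath( "//div[@id='content']/div[@class='description']/dl" ) as $element )
{
    $output .= render_tree( $element );
}
echo $output;
```

For the sample markup this prints the dl element with its dt, dd, ul, li and p descendants, one element per line. Like the function above, it only looks at element children, so text that sits next to child elements inside the same parent is skipped.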
I hope this helps. Good luck!