I need to parse an HTML definition list like the following:
<dl>
<dt>stuff</dt>
<dd>junk</dd>
<dd>things</dd>
<dd>whatnot</dd>
<dt>colors</dt>
<dd>red</dd>
<dd>green</dd>
<dd>blue</dd>
</dl>
So that I can end up with an associative array like this:
[definition list] =>
    [stuff] =>
        [0] => junk
        [1] => things
        [2] => whatnot
    [colors] =>
        [0] => red
        [1] => green
        [2] => blue
I am using DOMDocument->loadHTML() to import the HTML string into an object and then simplexml_import_dom() to use the SimpleXML extension, specifically xpath.
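A minimal sketch of that setup, for reference. The variable names ($html, $dom, $xml, $terms) are placeholders chosen for illustration, and the sample list is pasted inline:

```php
<?php
// Sample markup from the question, inlined so the sketch is self-contained.
$html = <<<'HTML'
<dl>
<dt>stuff</dt>
<dd>junk</dd>
<dd>things</dd>
<dd>whatnot</dd>
<dt>colors</dt>
<dd>red</dd>
<dd>green</dd>
<dd>blue</dd>
</dl>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);              // the fragment gets wrapped in <html><body>

$xml = simplexml_import_dom($dom);  // SimpleXMLElement for the root <html> element

// xpath() returns an array of SimpleXMLElement objects.
$terms = $xml->xpath('//dl/dt');
foreach ($terms as $dt) {
    echo (string) $dt, "\n";        // prints each term on its own line
}
```

Because loadHTML() wraps the fragment, the root of $xml is html rather than dl, which is why the query starts with // instead of a path from the root.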
The problem I'm having is with the XPath syntax for querying all <dd> elements that are consecutive and not broken by a <dt>.
Since <dd> elements are not considered children of <dt> elements, I can't simply loop through the results of a query for all dts and then query each one for its dds.
So I'm thinking I have to do a query for the first dd sibling of each dt and then all dd siblings of that first dd.
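One way the "consecutive siblings" idea might be expressed in XPath 1.0 is to count preceding dt siblings instead of chaining from the first dd: a dd belongs to the n-th dt exactly when it has n dt elements before it. The following is only a sketch under that assumption. It presumes every dt and dd sits directly inside the dl and that term names are unique; $result and $n are illustrative names.

```php
<?php
$html = <<<'HTML'
<dl>
<dt>stuff</dt>
<dd>junk</dd>
<dd>things</dd>
<dd>whatnot</dd>
<dt>colors</dt>
<dd>red</dd>
<dd>green</dd>
<dd>blue</dd>
</dl>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xml = simplexml_import_dom($dom);

$result = [];
foreach ($xml->xpath('//dl/dt') as $dt) {
    // 1-based position of this dt among the dt siblings in its dl.
    $n = count($dt->xpath('preceding-sibling::dt')) + 1;

    // dd siblings after this dt that have exactly $n dt elements before them,
    // so no later dt sits between this dt and the dd.
    $dds = $dt->xpath("following-sibling::dd[count(preceding-sibling::dt) = $n]");

    // Cast each SimpleXMLElement to its text content.
    $result[(string) $dt] = array_map(function ($dd) {
        return (string) $dd;
    }, $dds);
}

print_r($result);
```

The relative expressions are evaluated with each dt as the context node, so the same loop should also cope with several dl elements in one document.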
But I'm not clear from the XPath tutorials if this is possible. Can you say "consecutive matching siblings"? Or am I forced to loop through each child of the original dl and handle the dts and dds as they show up?
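For comparison, the loop-over-children fallback could look roughly like this. It is a sketch that stays in DOMDocument and skips XPath entirely; $current and $result are illustrative names, and a dd that appears before any dt is simply ignored.

```php
<?php
$html = <<<'HTML'
<dl>
<dt>stuff</dt>
<dd>junk</dd>
<dd>things</dd>
<dd>whatnot</dd>
<dt>colors</dt>
<dd>red</dd>
<dd>green</dd>
<dd>blue</dd>
</dl>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$dl = $dom->getElementsByTagName('dl')->item(0);

$result  = [];
$current = null;   // text of the most recent dt, or null before the first one

foreach ($dl->childNodes as $node) {
    if ($node->nodeType !== XML_ELEMENT_NODE) {
        continue;  // skip whitespace text nodes between the tags
    }
    if ($node->nodeName === 'dt') {
        $current = trim($node->textContent);
        $result[$current] = [];
    } elseif ($node->nodeName === 'dd' && $current !== null) {
        $result[$current][] = trim($node->textContent);
    }
}

print_r($result);
```

This makes a single pass over the children and keeps the grouping logic in PHP, at the cost of handling node types by hand.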