Hi,

I have been trying for several days to parse the following HTML (notice that there is no real hierarchical tree structure). Everything is pretty much on the same level.

<p><span class='one'>week number</span></p>

<p><span class='two'>day of the week</span></p>
<table class='spreadsheet'>
table data
</table>

<p><span class='two'>another day of the week</span></p>
<table class='spreadsheet'>
table data
</table>

<p><span class='one'>another week number</span></p>
ETC

What I basically want to do is go through each DOM element and check whether it is a week. If it is, add all the days of that week to it, and add each table's data to the corresponding day of the week. So something like the following structure:

array {
31 => array {
    monday => array {
        data => table data
    }
    tuesday => array {
        data => table data
    }   
}

32 => array {
    monday => array {
        data => table data
    }
    tuesday => array {
        data => table data
    }   
}
}

This is the PHP code I have so far.

$d = new DomDocument;
@$d->loadHtml($html);
$xp = new DomXpath($d);

$res = $xp->query( "//*[@class='one' or @class='two' or @class='spreadsheet']" ); 

foreach ($res as $dn) {
    $nodes = $dn->childNodes;
    foreach ($nodes as $node) {
        if ($node->nodeValue != "") {
            echo $node->nodeValue;
        }

    }
}

Some people here at Stack Overflow suggested that I use XPath to achieve this, but the above code handles each node separately. What I think I need to do is get all the "week" nodes, then get each one's next sibling and check whether it is a day. If so, add it to that week's array; if it is another "week" node, create a new array, and so on.

I have been tearing my hair out the past few days with this, so any help/push in the right direction would be very much appreciated!!!

Cheers, Dandoen

A:

Updated; see below.

It would help if you would tell us what the output is of the code you've tried so far. That would help us know what already works and what's still broken. However, here's what I see looking at your use of XPath and DOM. (Disclaimer: my expertise is in XPath and DOM, not PHP.)

$res = $xp->query( "//*[@class='one' or @class='two' or @class='spreadsheet']" ); 

This XPath query will give you all the <span> and <table> nodes in your sample, because those are the elements that have the classes you asked for.

foreach ($res as $dn) {

Iterating over the span and table elements. Inside this loop is where you probably want to say if ($dn->getAttribute("class") == "one") ... and if so start a new week in your array structure; if the class is "two", add a new week day to your current week, etc.

$nodes = $dn->childNodes;

Here you're asking for the child nodes of the current span or table element. For the span, the only child node you've shown is a text node such as "another day of the week". For the table element, we assume there are tr elements etc.

foreach ($nodes as $node) {

Iterating over the single text node in a span (or child elements of a table):

    if ($node->nodeValue != "") {
        echo $node->nodeValue;
    }

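Print the text content of a text node (child of a span element); or 'null' if we're looking at an element (like the tr child of a table), at least according to the W3C DOM. Note that PHP's DOM extension departs from the spec here: for an element, nodeValue returns the element's text content rather than null.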

So that's what the above code seems to be doing. If it's not behaving as described, post info about the actual output and we may be able to help. If it's behaving as described but you need help with the part about creating week array elements, let us know that.
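
The single-loop idea described above (checking the class of each matched node and deciding whether to start a week, start a day, or attach table data) could be sketched roughly as follows. This is a minimal sketch, not a tested solution for the real page: the sample markup, the labels 31, 32, monday, tuesday and the cell text are made up for illustration, and it assumes the matched nodes come back in document order and that every table follows the day it belongs to.

```php
<?php
// Hypothetical sample input modelled on the markup in the question.
$html = <<<HTML
<p><span class='one'>31</span></p>
<p><span class='two'>monday</span></p>
<table class='spreadsheet'><tr><td>a</td><td>b</td></tr></table>
<p><span class='two'>tuesday</span></p>
<table class='spreadsheet'><tr><td>c</td></tr></table>
<p><span class='one'>32</span></p>
<p><span class='two'>monday</span></p>
<table class='spreadsheet'><tr><td>d</td></tr></table>
HTML;

$d = new DOMDocument;
@$d->loadHTML($html);
$xp = new DOMXPath($d);

$result = [];
$week = null; // label of the week currently being filled
$day = null;  // label of the day currently being filled

// One pass over all week spans, day spans and tables.
$res = $xp->query("//*[@class='one' or @class='two' or @class='spreadsheet']");
foreach ($res as $dn) {
    $class = $dn->getAttribute('class');
    if ($class === 'one') {
        // A week marker: start a new week and forget the previous day.
        $week = trim($dn->textContent);
        $result[$week] = [];
        $day = null;
    } elseif ($class === 'two' && $week !== null) {
        // A day marker: start a new day inside the current week.
        $day = trim($dn->textContent);
        $result[$week][$day] = ['data' => null];
    } elseif ($class === 'spreadsheet' && $week !== null && $day !== null) {
        // A table: store its text content under the current day.
        $result[$week][$day]['data'] = trim($dn->textContent);
    }
}

print_r($result);
```

Here the table data is stored as one flat string per day; splitting it into rows and cells would need further queries on each table. Note also that PHP converts numeric string keys such as "31" to integer keys.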

Update:

I would suggest that you use this XPath query:

$weeks = $xp->query( "//*[@class='one']" ); 

to get the week number nodes. Then iterate over them:

foreach ($weeks as $week) {
    $weekNum = $week->firstChild->nodeValue;

This gets the week number out of the first child (a text node) of the week span.

Create an array entry for the new week. Then select the potential week day nodes for that week:

$spans = $xp->query( "following::span[@class='one' or @class='two']", $week );

The second argument to $xp->query() is the context node, from which the following:: axis begins.

Iterate over those:

foreach ($spans as $span) {

When you get to another week, stop:

    if ($span->getAttribute("class") == "one") break;

Otherwise double-check that it's a weekday:

    if ($span->getAttribute("class") == "two") {

then add the new weekday to your array. To get the table data (fixed a mistake):

        // query() returns a DOMNodeList, so take its first item to get the table element itself
        $table = $xp->query("following-sibling::table[1]", $span->parentNode)->item(0);

Update: To get at the table data, you'll want to set up more loops like the above. Something like:

    $rows = $xp->query("tr", $table);

to get the table rows. Then iterate through those with foreach, and within those,

    $cells = $xp->query("td", $row);

And when you iterate through cells, your data will be

    $cell->firstChild->nodeValue

i.e. the text of the child text node. Note this won't work properly if you have elements inside the <td> cells.

If you need help with creating and populating arrays in PHP, I'm not the person to advise you on that as I'm not a PHP developer.

Note this is all untested. HTH.
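
For reference, the steps above can be assembled into one script along these lines. It is only a sketch under assumptions: the sample markup and its labels are invented, the array layout follows the structure asked for in the question, and `.//tr` is used instead of `tr` so that rows wrapped in a `tbody` would also be found. It also assumes every day is followed by its own table; a day without a table would pick up the next day's table.

```php
<?php
// Hypothetical sample input modelled on the markup in the question.
$html = <<<HTML
<p><span class='one'>31</span></p>
<p><span class='two'>monday</span></p>
<table class='spreadsheet'><tr><td>a</td><td>b</td></tr></table>
<p><span class='two'>tuesday</span></p>
<table class='spreadsheet'><tr><td>c</td></tr></table>
<p><span class='one'>32</span></p>
<p><span class='two'>monday</span></p>
<table class='spreadsheet'><tr><td>d</td></tr></table>
HTML;

$d = new DOMDocument;
@$d->loadHTML($html);
$xp = new DOMXPath($d);

$result = [];
foreach ($xp->query("//span[@class='one']") as $week) {
    $weekNum = trim($week->nodeValue);
    $result[$weekNum] = [];

    // All later week/day spans, relative to the current week span.
    $spans = $xp->query("following::span[@class='one' or @class='two']", $week);
    foreach ($spans as $span) {
        if ($span->getAttribute('class') === 'one') {
            break; // reached the next week
        }
        $dayName = trim($span->nodeValue);

        // The table is a sibling of the <p> that wraps the span.
        $table = $xp->query("following-sibling::table[1]", $span->parentNode)->item(0);

        $rows = [];
        if ($table !== null) {
            foreach ($xp->query(".//tr", $table) as $row) {
                $cells = [];
                foreach ($xp->query("td", $row) as $cell) {
                    $cells[] = trim($cell->nodeValue);
                }
                $rows[] = $cells;
            }
        }
        $result[$weekNum][$dayName] = ['data' => $rows];
    }
}

print_r($result);
```

With this layout a single cell can be read without loops, for example `$result[31]['monday']['data'][0][1]` for the second cell of the first row.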

LarsH
Lars, thanks for your answer. You are right, the code is not behaving as described. The reason I posted this code is that I used it as a starting point to get what I described, and I have now come to the conclusion that it's not even getting close. The current code outputs each nodeValue (of the spans with class 'two' as well as 'one') and the table data as well. What I am basically trying to achieve is: get all the node values of the spans with class one, and from there get the next siblings, check whether each next sibling is a day or a week, and create or add to the array.
dandoen
@Dandoen: so the code is basically behaving as I described in my answer, except that it's also outputting the table data? What is the table data, is it elements like `tr`, or is it text?
LarsH
Yes it is indeed behaving as you described and the table elements are tr. I just posted the shortened version of the actual code.
dandoen
@dandoen, I just posted a big update. Make sure you refresh again after reading this comment, because I had a mistake in it at first.
LarsH
Lars, thank you soooooo much!!! that's exactly what I wanted!!!
dandoen
As for the table data, what would be the best way to get it into an array? Right now it is returned as DOMText. The table's structure is like <tr><td>data</td><td>data</td></tr><tr><td>more data</td></tr>. Again, thank you ever so much, I have been struggling with this for the past few days.
dandoen
Shouldn't $table->nodeName; output table? It is giving #text, and $table->nodeValue; is outputting nothing.
dandoen
@dandoen, I will update my answer to address these comments. You're right about $table - the nextSibling after the span must be a text node.
LarsH
Lars, once again you are the best. All works just fine now! Small question (last one, I promise): how do I get just one cell without having to use foreach loops? I don't need to go through every cell in the table. In an array it would be something like $cells[5]. I tried searching for nthChild, but could not find any tutorials out there. Once again, thank you dude.
dandoen
@dandoen, if you're wanting the 5th `td` in a given row, you should be able to use `$cell = $xp->query("td[5]", $row);`. If you only want one cell from the whole table, you could use `$cell = $xp->query("tr[1]/td[5]", $table);`.
LarsH
By doing so, $cell->firstChild->nodeValue or $cell->nodeValue is outputting nothing. How could that be?
dandoen
$cell->item(0)->nodeValue; worked! While xPath and dom remain as a big mystery for me, I would like to thank you a lot for all your help!
dandoen
@dandoen: ah... apparently the `$cell = $xp->query("td[5]", $row)` produced a nodelist (of one node), so you had to get first node using item(0). But I don't see why its nodeValue would work... I would have expected $cell->item(0)->firstChild->nodeValue. Unless you used a different XPath query, like "td[5]/text()". Anyway, glad it worked.
LarsH
Didn't use text() in query, but it still worked.. ah well :) it worked, phanks!
dandoen
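
To illustrate the last few comments: in PHP, `DOMXPath::query()` returns a `DOMNodeList` even when only one node matches, and `nodeValue` on an element gives its text content. The following is a small sketch with a made-up one-row table; the variable names are only for illustration.

```php
<?php
$d = new DOMDocument;
@$d->loadHTML("<table><tr><td>a</td><td><b>b</b>!</td></tr></table>");
$xp = new DOMXPath($d);

$row = $xp->query("//tr")->item(0);

// A positional predicate still yields a list, here with a single item.
$cells = $xp->query("td[2]", $row);
$cell = $cells->item(0); // the td element itself, or null if nothing matched

echo get_class($cells), "\n";           // class of the query result
echo $cell->nodeValue, "\n";            // text content of the whole cell
echo $cell->firstChild->nodeName, "\n"; // the first child here is an element, not text
```

Because `nodeValue` on the element already covers the text of all descendants, going through `firstChild` is only needed when just the first text node is wanted.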
A: 

Other approach, with this input:

<html>
    <p>
        <span class='one'>week number</span>
    </p>
    <p>
        <span class='two'>day of the week</span>
    </p>
    <table class='spreadsheet'>
        <tr>
            <td>Some data</td>
        </tr>
    </table>
    <p>
        <span class='two'>another day of the week</span>
    </p>
    <table class='spreadsheet'>
        <tr>
            <td>Other data</td>
        </tr>
    </table>
    <p>
        <span class='one'>another week number</span>
    </p>
</html>

This stylesheet:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:key name="kWeekByNumber" match="span[@class='one']" use="."/>
    <xsl:key name="kDayByWeek" match="span[@class='two']"
             use="generate-id(preceding::span[@class='one'][1])"/>
    <xsl:template match="text()"/>
    <xsl:template match="html">
        <weeks>
            <xsl:apply-templates/>
        </weeks>
    </xsl:template>
    <xsl:template match="span[@class='one']
                             [count(.|key('kWeekByNumber',.)[1])=1]">
        <week number="{.}">
            <xsl:apply-templates select="key('kDayByWeek',generate-id())"
                                     mode="days"/>
        </week>
    </xsl:template>
    <xsl:template match="span[@class='two']" mode="days">
        <day number="{.}">
            <xsl:copy-of select="following::table[1]"/>
        </day>
    </xsl:template>
</xsl:stylesheet>

Output:

<weeks>
    <week number="week number">
        <day number="day of the week">
            <table class="spreadsheet">
                <tr>
                    <td>Some data</td>
                </tr>
            </table>
        </day>
        <day number="another day of the week">
            <table class="spreadsheet">
                <tr>
                    <td>Other data</td>
                </tr>
            </table>
        </day>
    </week>
    <week number="another week number"></week>
</weeks>

Note: Maybe you could parse that output with SimpleXML to get an Array...
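
A rough sketch of that last step, assuming the transformed output is already available as a string: the XML below mirrors the output shown above, but with invented labels (31, 32, monday, tuesday), and the resulting array layout is just one possible choice.

```php
<?php
// Hypothetical transformation output, in the same shape as the result above.
$xml = <<<XML
<weeks>
    <week number="31">
        <day number="monday">
            <table class="spreadsheet"><tr><td>Some data</td></tr></table>
        </day>
        <day number="tuesday">
            <table class="spreadsheet"><tr><td>Other data</td></tr></table>
        </day>
    </week>
    <week number="32"></week>
</weeks>
XML;

$weeks = simplexml_load_string($xml);

$result = [];
foreach ($weeks->week as $week) {
    $weekNum = (string) $week['number'];
    $result[$weekNum] = [];
    foreach ($week->day as $day) {
        $rows = [];
        foreach ($day->table->tr as $tr) {
            $cells = [];
            foreach ($tr->td as $td) {
                $cells[] = trim((string) $td);
            }
            $rows[] = $cells;
        }
        $result[$weekNum][(string) $day['number']] = ['data' => $rows];
    }
}

print_r($result);
```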

Alejandro