I am trying to write a web scraper and want to get all the cells in a row. The row before the one I want has THOROUGHBRED MEETINGS as its plain-text value, and I can find that row successfully. What I can't figure out is how to get the next row's children, i.e. its cells (the <td> tags).

if ($foundTag = FindTagByText("THOROUGHBRED MEETINGS", $html))
{
    $cell = $foundTag->parent();
    $row = $cell->parent();
    $nextRow = $row->next_sibling();
    echo "Row: ".$row->plaintext."<br />\n";
    echo "Next Row: ".$nextRow->plaintext."<br />\n";
    $cells = $nextRow->children();

    foreach ($cells as $cell)
    {
        echo "Cell: ".$cell->plaintext."<br />\n";
    }
}

function FindTagByText($text, $html)
{
    // Use Simple_HTML_DOM special selector 'text'
    // to retrieve all text nodes from the document
    $textNodes = $html->find('text');
    $foundTag = null;

    foreach($textNodes as $textNode) 
    {
        if($textNode->plaintext == $text) 
        {
            // Get the parent of the text node
            // (A text node is always a child of
            //  its container)
            $foundTag = $textNode->parent();
            break;
        }
    }

    return $foundTag;
}

Here is the html I am trying to parse:

<tr valign=top>
<td colspan=16 bgcolor=#999999><b>THOROUGHBRED MEETINGS</b></td>

</tr>
<tr valign=top bgcolor="#ffffff">
<td><b>BR</b> <a href="meeting?mtg=br&day=today&curtype=0">SUNSHINE COAST</a></td>
<td>FINE/DEAD</b></td>
<td><font color=#cc0000><b>R1</b></font>@<b>12:30pm</b></td>
<td align=center bgcolor=#cc0000><a href="odds?mting=BR01000"><b><font color=#ffffff>1</a></font></td>
<td align=center><a href="odds?mting=BR02000"><b><font color=black>2</b></font></a></td>
<td align=center><a href="odds?mting=BR03000"><b><font color=black>3</b></font></a></td>

<td align=center><a href="odds?mting=BR04000"><b><font color=black>4</b></font></a></td>
<td align=center><a href="odds?mting=BR05000"><b><font color=black>5</b></font></a></td>
<td align=center><a href="odds?mting=BR06000"><b><font color=black>6</b></font></a></td>
<td align=center><a href="odds?mting=BR07000"><b><font color=black>7</b></font></a></td>
<td align=center><a href="odds?mting=BR08000"><b><font color=black>8</b></font></a></td>
<td bgcolor="#ffffff" colspan=4>&nbsp;</td>
</tr>

Here is my output:

Row: THOROUGHBRED MEETINGS
Next Row: BR SUNSHINE COAST FINE/DEAD R1@12:30pm 1 2 3 4 5 6 7 8   CR NEW ZEALAND FINE/DEAD R3@11:10am 1 2 3 4 5 6 7 8 9   DR HOBART OCAST/HVY R1@12:15pm 1 2 3 4 5 6 7   MR CRANBOURNE OCAST/SLOW R1@12:20pm 1 2 3 4 5 6 7 8   NR COFFS HARBOUR OCAST/SLOW R1@12:45pm 1 2 3 4 5 6 7 8   SR MORUYA FINE/GOOD R1@12:25pm 1 2 3 4 5 6 7 8   VR BENALLA OCAST/SLOW R1@12:35pm 1 2 3 4 5 6 7 8   XR KALGOORLIE FINE/GOOD R1@ 3:00pm 1 2 3 4 5 6 7     HARNESS MEETINGS DT LAUNCESTON SHWRY/GOOD R1@ 4:57pm 1 2 3 4 5 6 7 8 9 10   MT CRANBOURNE OCAST/GOOD R1@ 5:05pm 1 2 3 4 5 6 7 8     GREYHOUND MEETINGS AD GAWLER OCAST/GOOD R1@ 5:10pm 1 2 3 4 5 6 7 8 9 10 11   CD CANBERRA OCAST/GOOD R1@ 5:02pm 1 2 3 4 5 6 7 8 9 10 11   MD SALE FINE/GOOD R1@ 4:54pm 1 2 3 4 5 6 7 8 9 10 11 12
Cell: BR SUNSHINE COAST
Cell: FINE/DEAD
Cell: R1@12:30pm
Cell: 1 2 3 4 5 6 7 8   CR NEW ZEALAND FINE/DEAD R3@11:10am 1 2 3 4 5 6 7 8 9   DR HOBART OCAST/HVY R1@12:15pm 1 2 3 4 5 6 7   MR CRANBOURNE OCAST/SLOW R1@12:20pm 1 2 3 4 5 6 7 8   NR COFFS HARBOUR OCAST/SLOW R1@12:45pm 1 2 3 4 5 6 7 8   SR MORUYA FINE/GOOD R1@12:25pm 1 2 3 4 5 6 7 8   VR BENALLA OCAST/SLOW R1@12:35pm 1 2 3 4 5 6 7 8   XR KALGOORLIE FINE/GOOD R1@ 3:00pm 1 2 3 4 5 6 7     HARNESS MEETINGS DT LAUNCESTON SHWRY/GOOD R1@ 4:57pm 1 2 3 4 5 6 7 8 9 10   MT CRANBOURNE OCAST/GOOD R1@ 5:05pm 1 2 3 4 5 6 7 8     GREYHOUND MEETINGS AD GAWLER OCAST/GOOD R1@ 5:10pm 1 2 3 4 5 6 7 8 9 10 11   CD CANBERRA OCAST/GOOD R1@ 5:02pm 1 2 3 4 5 6 7 8 9 10 11   MD SALE FINE/GOOD R1@ 4:54pm 1 2 3 4 5 6 7 8 9 10 11 12 
A: 

You'll get the first td like this:

$firstTD = $nextRow->first_child();

After that you can get the subsequent ones with:

$firstTD->next_sibling()
Wouter van Nifterick
Fatal error: Call to undefined method simple_html_dom_node::child_nodes() in /var/www/php.php on line 37
Fatal error: Call to undefined method simple_html_dom_node::domnode_next_sibling() in /var/www/php.php on line 37
sorry.. it's `$firstTD->next_sibling();`
Wouter van Nifterick
I'm still getting the same problem with that code. It just mashes all the siblings together into one field. It is not separating the `<td>` tags.
+2  A: 

You will not like my answer.

Unfortunately, it seems that mismatched closing tags in the HTML you are parsing are confusing Simple_HTML_DOM. Take a look at this snippet:

<td align=center><a href="odds?mting=BR02000"><b><font color=black>2</b></font></a></td>

If you follow the order of tags of this snippet:

  • <td> is opened
  • <a> is opened
  • <b> is opened
  • <font> is opened

Technically, tags should be closed in the opposite order, but this is how they are closed:

  • </b> is closed
  • </font> is closed
  • </a> is closed
  • </td> is closed

The HTML you are trying to scrape is full of those mistakes, as well as closing tags for tags that were never opened. Simple_HTML_DOM doesn't parse such files properly.
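To make the nesting problem concrete, here is a minimal sketch of a checker that walks the tags with a stack and reports every closing tag that does not match the most recently opened one. The function name findMisnestedTags is hypothetical (it is not part of Simple_HTML_DOM), and the regex-based tag scan is a simplification: it ignores void elements such as <br>, comments, and optional closing tags, so treat it as a diagnostic aid rather than a parser.

```php
<?php
// Hypothetical helper: list closing tags that do not match the innermost open tag.
function findMisnestedTags($html)
{
    $problems = [];
    $stack = [];   // names of the tags that are currently open, innermost last

    // Rough tag scan: optional slash, tag name, then anything up to '>'
    preg_match_all('~<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>~', $html, $matches, PREG_SET_ORDER);

    foreach ($matches as $m) {
        $name = strtolower($m[2]);

        if ($m[1] !== '/') {
            $stack[] = $name;          // opening tag
            continue;
        }

        $top = end($stack);            // false when nothing is open
        if ($top === $name) {
            array_pop($stack);         // properly nested close
        } elseif (in_array($name, $stack, true)) {
            $problems[] = "</$name> closed while <$top> was still open";
            // Remove the most recent matching opener so later tags still line up
            $index = array_search($name, array_reverse($stack, true), true);
            unset($stack[$index]);
            $stack = array_values($stack);
        } else {
            $problems[] = "</$name> has no matching opening tag";
        }
    }

    return $problems;
}

// The cell quoted above, plus the stray </b> from the question's second cell
$snippet = '<td align=center><a href="odds?mting=BR02000"><b><font color=black>2</b></font></a></td>';
print_r(findMisnestedTags($snippet));
print_r(findMisnestedTags('<td>FINE/DEAD</b></td>'));
```

Running something like this over the whole page should give a rough idea of how many cells are affected before you decide how to handle them.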

I'm afraid that if you can't modify the HTML at its source, you'll have to parse the file manually, correcting any errors as you go.


As a note, I've tested your code against the following corrected HTML, and Simple_HTML_DOM parsed it successfully, and your code worked just fine.

<tr valign=top>
<td colspan=16 bgcolor=#999999><b>THOROUGHBRED MEETINGS</b></td>

</tr>
<tr valign=top bgcolor="#ffffff">
<td><b>BR</b> <a href="meeting?mtg=br&day=today&curtype=0">SUNSHINE COAST</a></td>
<td><b>FINE/DEAD</b></td>
<td><font color=#cc0000><b>R1</b></font>@<b>12:30pm</b></td>
<td align=center bgcolor=#cc0000><a href="odds?mting=BR01000"><b><font color=#ffffff>1</font></b></a></td>
<td align=center><a href="odds?mting=BR02000"><b><font color=black>2</font></b></a></td>
<td align=center><a href="odds?mting=BR03000"><b><font color=black>3</font></b></a></td>

<td align=center><a href="odds?mting=BR04000"><b><font color=black>4</font></b></a></td>
<td align=center><a href="odds?mting=BR05000"><b><font color=black>5</font></b></a></td>
<td align=center><a href="odds?mting=BR06000"><b><font color=black>6</font></b></a></td>
<td align=center><a href="odds?mting=BR07000"><b><font color=black>7</font></b></a></td>
<td align=center><a href="odds?mting=BR08000"><b><font color=black>8</font></b></a></td>
<td bgcolor="#ffffff" colspan=4> </td>
</tr>


Edit: As an alternative, you might want to check whether DOMDocument::loadHTML gives better results. It is available in PHP 5 without external libraries. Check the official documentation.
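A minimal sketch of that alternative, assuming the DOM extension is enabled. The fragment is abridged from the question (the first href is shortened), and using DOMXPath to locate the row after the header is my own choice for illustration, not something the question's code does. libxml_use_internal_errors() is there only to keep the parser's warnings about the bad markup out of the output.

```php
<?php
// Abridged copy of the question's markup, including the stray and misnested closing tags
$fragment = <<<'HTML'
<table>
<tr valign=top>
<td colspan=16 bgcolor=#999999><b>THOROUGHBRED MEETINGS</b></td>
</tr>
<tr valign=top bgcolor="#ffffff">
<td><b>BR</b> <a href="meeting?mtg=br">SUNSHINE COAST</a></td>
<td>FINE/DEAD</b></td>
<td align=center><a href="odds?mting=BR02000"><b><font color=black>2</b></font></a></td>
</tr>
</table>
HTML;

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($fragment);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// First <tr> that follows the row whose text is exactly the heading
$rows = $xpath->query('//tr[normalize-space(.) = "THOROUGHBRED MEETINGS"]/following-sibling::tr[1]');

$cells = [];
if ($rows->length > 0) {
    // Only the direct <td> children of that row
    foreach ($xpath->query('td', $rows->item(0)) as $td) {
        $cells[] = trim(preg_replace('/\s+/', ' ', $td->textContent));
    }
}

foreach ($cells as $cell) {
    echo "Cell: ".$cell."<br />\n";
}
```

On the real page you would load the full document rather than a fragment; the XPath expression assumes the heading row and the data rows are siblings, as they are in the markup shown in the question.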

Andrew Moore
How do I parse the file manually?
Proper HTML parsing is a rather complicated subject. I'm afraid I can't help you with that.
Andrew Moore
I added another alternative.
Andrew Moore
+1 for spotting the invalid HTML. I didn't notice that. Glen, I think you should either accept that invalid syntax just cannot be parsed properly or, if you really need to parse this page, hardcode something. If you first remove all <b> and </b> tags, you should be able to parse the remainder.
Wouter van Nifterick
**@Wouter van Nifterick:** Should... We don't know the rest of the page and how it might affect parsing. But for this snippet, it is a viable solution.
Andrew Moore
A: 

I got it to work by loading the page into a DOMDocument first, which corrects the malformed HTML, and then passing the repaired markup to Simple_HTML_DOM.

$url = "http://www.acttab.com.au/interbet/venues?day=today";

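// Collect libxml parse warnings internally instead of printing one per malformed tag
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTMLFile($url);
libxml_clear_errors();

// Serialize the repaired DOM and hand it to Simple_HTML_DOM
$html = str_get_html($doc->saveHTML());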
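
To see why this round trip helps, here is a small sketch that feeds one of the misnested cells from the question through DOMDocument and prints what comes back out. It works on an inline string rather than the live URL, and the <table><tr> wrapper is added only so the cell has a sensible parent; both are assumptions made for the sake of a self-contained example.

```php
<?php
// One misnested cell from the question, wrapped in a table for context
$malformed = '<table><tr>'
    . '<td align=center><a href="odds?mting=BR02000"><b><font color=black>2</b></font></a></td>'
    . '</tr></table>';

libxml_use_internal_errors(true);   // keep libxml's parse warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($malformed);
libxml_clear_errors();

// saveHTML() writes the parsed tree back out, so each closing tag
// now matches the element that was actually opened
$repaired = $doc->saveHTML();
echo $repaired;
```

Since the output is serialized from a tree, the closing tags come out in the reverse of the order the elements were opened, which is the nesting Simple_HTML_DOM needs in order to split the row into separate cells.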