I want to parse one site. Structure of the table:

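Inside <table> there are many <tr> rows (the number of <tr> rows changes whenever the site is updated). Inside each <tr> we have either <td> cells (the first kind with class="a", the second with class="b", the third with class="c", and the fourth with class="n", which is only empty space) or <th> cells (with class="d").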

Example with this structure:

<table>
<tbody>
<tr>
<td class="a"></td>
</tr>
<tr>
<th class="d"></th>
<th class="d"></th>
<th class="d"></th>
<th class="d"></th>
</tr>
<tr>
<td class="b"></td>
<td class="b"></td>
<td class="b"></td>
<td class="b"></td>
</tr>
<tr>
<td class="n"></td>
</tr>
<tr>
<td class="a"></td>
</tr>
<tr>
<th class="d"></th>
<th class="d"></th>
<th class="d"></th>
<th class="d"></th>
</tr>
<tr>
<td class="b"></td>
<td class="b"></td>
<td class="b"></td>
<td class="b"></td>
</tr>
<tr>
<td class="c"></td>
<td class="c"></td>
<td class="c"></td>
<td class="c"></td>
</tr>
<tr>
<td class="n"></td>
</tr>
...
</tbody>
</table>

Each block always has 1 <td> inside one <tr>, then 4 <th> inside the next <tr>, then 4 <td> inside the next <tr>. After that we sometimes have 4 more <td> inside another <tr>, or 1 <td> inside another <tr>.

I started with curl, dom and xpath:

<?php 
function get_page($url)
{
    $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
    $ch=curl_init();
    $timeout = 60;
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch,CURLOPT_URL,$url);
    curl_setopt($ch,CURLOPT_HEADER, 0);
    curl_setopt($ch,CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch,CURLOPT_MAXREDIRS, 10);
    curl_setopt($ch,CURLOPT_ENCODING, "");
    curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
    curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
    $data=curl_exec($ch);
    curl_close($ch);
    return $data;
}
$html=get_page('http://www.donbest.com/wnba/injuries/');
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

but then I don't know how to put this data into my own table. If the number of <tr> rows were always the same this wouldn't be a problem, but it keeps changing...
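
One possible way to cope with the changing number of rows is to not count rows at all, and instead start a new group every time a row with class="a" appears. The following is only a minimal sketch under that assumption: the function name group_rows, the array keys title/headers/rows, and the sample cell texts are all hypothetical and not taken from the real page.

```php
<?php
// Hypothetical helper: walks all <tr> rows and groups them into blocks,
// where every class="a" row opens a new block.
function group_rows(DOMXPath $xpath): array
{
    $groups  = [];
    $current = null;

    foreach ($xpath->query('//table//tr') as $tr) {
        // Collect the direct <td>/<th> cells of this row.
        $cells = $xpath->query('./td | ./th', $tr);
        if ($cells->length === 0) {
            continue;
        }
        $class  = $cells->item(0)->getAttribute('class');
        $values = [];
        foreach ($cells as $cell) {
            $values[] = trim($cell->textContent);
        }

        if ($class === 'a') {
            // A class="a" row starts a new group; store the previous one.
            if ($current !== null) {
                $groups[] = $current;
            }
            $current = ['title' => $values[0], 'headers' => [], 'rows' => []];
        } elseif ($current === null || $class === 'n') {
            continue; // spacer row, or a row before the first group
        } elseif ($class === 'd') {
            $current['headers'] = $values;
        } else {
            // class="b" or class="c": data rows, however many there are.
            $current['rows'][] = $values;
        }
    }
    if ($current !== null) {
        $groups[] = $current;
    }
    return $groups;
}

// Made-up sample with the same structure as the example above.
$sample = '<table><tbody>
<tr><td class="a">Group 1</td></tr>
<tr><th class="d">h1</th><th class="d">h2</th><th class="d">h3</th><th class="d">h4</th></tr>
<tr><td class="b">1</td><td class="b">2</td><td class="b">3</td><td class="b">4</td></tr>
<tr><td class="n"></td></tr>
<tr><td class="a">Group 2</td></tr>
<tr><th class="d">h1</th><th class="d">h2</th><th class="d">h3</th><th class="d">h4</th></tr>
<tr><td class="b">5</td><td class="b">6</td><td class="b">7</td><td class="b">8</td></tr>
<tr><td class="c">9</td><td class="c">10</td><td class="c">11</td><td class="c">12</td></tr>
<tr><td class="n"></td></tr>
</tbody></table>';

$doc = new DOMDocument();
@$doc->loadHTML($sample);
$groups = group_rows(new DOMXPath($doc));
print_r($groups);
```

With the real page, the $xpath object created from the downloaded HTML could be passed to the function instead of the sample. Each entry of $groups would then hold one title, one header row, and as many data rows as that block happens to contain, so it can be looped over and written into a table regardless of how many <tr> rows the site currently has.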