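I want to parse a site. The table is structured as follows:
Inside <table> there are many <tr> rows (when the site is updated, the number of <tr> rows is not the same). Inside each <tr> we have <td> cells (the first kind with class="a", the second with class="b", the third with class="c", and the fourth with class="n", which is only empty space) and <th> cells (with class="d").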
Example with this structure:
<table>
<tbody>
<tr>
<td class="a"></td>
</tr>
<tr>
<th class="d"></th>
<th class="d"></th>
<th class="d"></th>
<th class="d"></th>
</tr>
<tr>
<td class="b"></td>
<td class="b"></td>
<td class="b"></td>
<td class="b"></td>
</tr>
<tr>
<td class="n"></td>
</tr>
<tr>
<td class="a"></td>
</tr>
<tr>
<th class="d"></th>
<th class="d"></th>
<th class="d"></th>
<th class="d"></th>
</tr>
<tr>
<td class="b"></td>
<td class="b"></td>
<td class="b"></td>
<td class="b"></td>
</tr>
<tr>
<td class="c"></td>
<td class="c"></td>
<td class="c"></td>
<td class="c"></td>
</tr>
<tr>
<td class="n"></td>
</tr>
...
</tbody>
</table>
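We always have 1 <td class="a"> inside one <tr>, then 4 <th class="d"> inside another <tr>, then 4 <td class="b"> inside another <tr>. After that we sometimes have 4 <td class="c"> inside another <tr>, or 1 <td class="n"> inside another <tr>.
I started with cURL, DOM and XPath: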
<?php
// Fetch a page with cURL and return its body as a string.
function get_page($url)
{
    $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
    $timeout = 60;

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // Follow redirects (the constant is CURLOPT_FOLLOWLOCATION).
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
    // An empty string accepts every encoding cURL supports.
    curl_setopt($ch, CURLOPT_ENCODING, "");
    // Return the response instead of printing it.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);

    $data = curl_exec($ch);
    curl_close($ch);

    return $data;
}
$html=get_page('http://www.donbest.com/wnba/injuries/');
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
but then I don't know how to put this data into my table. If the number of <tr> rows were always the same, this wouldn't be a problem, but it keeps changing...
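A minimal sketch of one way the changing row count could be handled, assuming that every class="a" row starts a new group and that class="n" rows are only spacers. The function name group_rows and the sample cell texts are hypothetical, used only for illustration; the real page may need different XPath expressions.

```php
<?php
// Hypothetical helper: walks all rows in order and collects them into
// groups, starting a new group at every row whose first cell has class="a".
function group_rows(DOMXPath $xpath): array
{
    $groups = [];
    $current = null;

    foreach ($xpath->query('//table//tr') as $tr) {
        // Only the <td>/<th> element children of this row.
        $cells = $xpath->query('./td | ./th', $tr);
        if ($cells->length === 0) {
            continue;
        }
        $class = $cells->item(0)->getAttribute('class');

        if ($class === 'a') {
            // Store the previous group, then start a new one.
            if ($current !== null) {
                $groups[] = $current;
            }
            $current = ['a' => [], 'd' => [], 'b' => [], 'c' => []];
        }
        if ($current === null || $class === 'n') {
            continue; // rows before the first "a" row, or spacer rows
        }
        foreach ($cells as $cell) {
            $current[$class][] = trim($cell->textContent);
        }
    }
    if ($current !== null) {
        $groups[] = $current; // the last group has no following "a" row
    }
    return $groups;
}

// Sample markup with the structure from the question (placeholder texts).
$sample = <<<EOT
<table><tbody>
<tr><td class="a">Group 1</td></tr>
<tr><th class="d">h1</th><th class="d">h2</th><th class="d">h3</th><th class="d">h4</th></tr>
<tr><td class="b">b1</td><td class="b">b2</td><td class="b">b3</td><td class="b">b4</td></tr>
<tr><td class="n"></td></tr>
<tr><td class="a">Group 2</td></tr>
<tr><th class="d">h1</th><th class="d">h2</th><th class="d">h3</th><th class="d">h4</th></tr>
<tr><td class="b">b5</td><td class="b">b6</td><td class="b">b7</td><td class="b">b8</td></tr>
<tr><td class="c">c1</td><td class="c">c2</td><td class="c">c3</td><td class="c">c4</td></tr>
<tr><td class="n"></td></tr>
</tbody></table>
EOT;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($sample);
$groups = group_rows(new DOMXPath($dom));

echo count($groups), " groups\n";
```

Each element of $groups then has the keys 'a', 'd', 'b' and 'c', with 'c' left empty when the optional row is missing, so one group can be mapped to one record regardless of how many rows the page has.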