I am trying to use simple_html_dom to extract some data from a website. Unfortunately, the parser cuts off part of the markup in the middle of the data to be analyzed, so the data I want to extract is sometimes missing from the string I can analyze.

This is my code:

 <?php
 include_once('../../simple_html_dom.php');

 function mouser() {
   $return = array(); // initialize so an empty result is still an array
   $html = file_get_html('http://de.mouser.com/Search/Refine.aspx?Keyword=BSP75');
   foreach ($html->find('tr[class*=SearchResultsRow]') as $article) {
     // Debug output: dump each row, followed by a separator
     echo $article;
     echo "\n\n\n\n\n\n --------------------------------------- \n\n\n\n\n\n\n\n";
     // This is the lookup that comes back empty for the truncated rows
     $item['itemno'] = trim($article->find('a[id*=MfrPartNumberLink]', 0)->plaintext);
     $return[] = $item;
   }

   // Free the memory held by the parsed document
   $html->clear();
   unset($html);
   return $return;
 }

 $result = mouser();
 foreach ($result as $article) {
   echo 'Articleno: '.$article['itemno'].'<br>';
   echo '<hr />';
 }
 ?>

Unfortunately this is the result:

    Articleno: <br><hr />
    Articleno: <br><hr />
    Articleno: <br><hr />
    Articleno: BSP752R<br><hr />
    Articleno: BSP752T<br><hr />

I then analyzed why that is and found that the $html element is cut off at some point, which means the next find() can't find anything.

It cuts off somewhere at the </div></a> elements rather than continuing to the closing </tr>.

Can anyone help me with this?

+1  A: 

PHP's built-in DOM is much better.

<?php

// Parse the page with PHP's built-in DOM extension
$d = new DOMDocument;
$d->loadHTML(file_get_contents('http://gb.mouser.com/Search/Refine.aspx?Keyword=BSP75'));
$xp = new DOMXPath($d);

// Select every result row, odd and even
$res = $xp->query("//tr[@class='SearchResultsRowEven' or @class='SearchResultsRowOdd']", $d);
foreach ($res as $dn) {
    var_dump($dn);
}
?>
Robin
Thanks for the suggestion, but apart from this error simple_html_dom seems really nice, so I would love to make it work. Your sample code generates some warnings on my machine:
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: error parsing attribute name in Entity, line: 599 in C:\xampp\htdocs\domtest\example\scraping\test2.php on line 4
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Opening and ending tag mismatch: a and div in Entity, line: 599 in C:\xampp\htdocs\domtest\example\scraping\test2.php on line 4
(and so on)
Timothy
Those errors are validity errors in the HTML page you're trying to scrape, and they are most likely the reason simple_html_dom is getting things wrong: that old favourite, "shit in, shit out". I'd wager that the built-in DOM copes with invalid markup better than simple_html_dom, but if you think you can make it work with simple_html_dom then good luck!
Robin
Okay, thanks for this information. I will look closely into the built-in handling. Can you say something about the error I get?
Timothy
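
On the warnings themselves: they come from libxml, which DOMDocument uses for parsing, and they report problems in the page's markup rather than in the PHP code. A minimal sketch of one way to deal with them follows, assuming the goal is just to read the part numbers. The $markup string is a hypothetical stand-in for the real page (it reproduces the mis-nested a/div pair named in the warning), and the contains() XPath tests are an assumed equivalent of the simple_html_dom selectors in the question, not something taken from the site.

```php
<?php
// Hypothetical sample markup, not the real page: the first row closes
// </a> before </div>, the kind of mis-nesting the warning complains about.
$markup = <<<'HTML'
<table>
  <tr class="SearchResultsRowOdd">
    <td><a id="ctl00_MfrPartNumberLink" href="#"><div>BSP752R</a></div></td>
  </tr>
  <tr class="SearchResultsRowEven">
    <td><a id="ctl01_MfrPartNumberLink" href="#">BSP752T</a></td>
  </tr>
</table>
HTML;

// Buffer libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($markup);

// Keep the buffered errors for inspection, then restore the old setting.
$parseErrors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
$partNumbers = [];

// contains() matches both the Odd and the Even row class.
foreach ($xpath->query("//tr[contains(@class, 'SearchResultsRow')]") as $row) {
    // Look for the part-number link relative to the current row.
    $link = $xpath->query(".//a[contains(@id, 'MfrPartNumberLink')]", $row)->item(0);
    if ($link !== null) {
        $partNumbers[] = trim($link->textContent);
    }
}

foreach ($partNumbers as $partNumber) {
    echo 'Articleno: ' . $partNumber . "\n";
}
```

Each entry in $parseErrors is a LibXMLError object with message and line properties, so the problems can still be logged if needed. For the live page, the string passed to loadHTML() would be the result of file_get_contents() on the search URL, as in the answer above.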