+2  Q: 

Screen Scraping

Hi, I'm trying to implement a screen scraping scenario on my website and have the following so far. What I'm ultimately trying to do is replace "ResultsDetails.aspx?" with "results-scrape-details/" in every link in the $results variable, then output the result. Can anyone point me in the right direction?

<?php 
$url = "http://mysite:90/Testing/label/stuff/ResultsIndex.aspx";
$raw = file_get_contents($url);
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));
$start = strpos($content,"<div id='pageBack'");
$end = strpos($content,'</body>',$start) + 6;
$results = substr($content,$start,$end-$start);
$pattern = 'ResultsDetails.aspx?';
$replacement = 'results-scrape-details/';
preg_replace($pattern, $replacement, $results);
echo $results;
+8  A: 

Use a DOM tool like PHP Simple HTML DOM. With it you can find all the links you're looking for using a jQuery-like syntax.

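// Load the Simple HTML DOM library (adjust the path to wherever you saved it)
include 'simple_html_dom.php';
// Create DOM object from HTML source
$dom = file_get_html('http://www.domain.com/path/to/page');
// Iterate all links whose href starts with "ResultsDetails.aspx"
foreach ($dom->find('a[href^=ResultsDetails.aspx]') as $node) {
    // Swap the page name for the new path, keeping the rest of the href
    $node->href = str_replace('ResultsDetails.aspx?', 'results-scrape-details/', $node->href);
}
// Output modified DOM
echo $dom->outertext;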
nikc
Faster than me, so I'm deleting my answer. I will note, though, that he may want to use `->find('a[href*=...]');`, which means 'contains' rather than 'starts with', depending on where that string appears in the HREF value. Also, there's no ->outerhtml method, just outertext (I corrected it in your example).
Erik
Oops, too early in the morning, thanks for the edit :-)
nikc
This will only replace links with the non-prefixed, relative URL hardcoded.
symcbean
Perfecting the algorithm is up to the final user, as with any homework. I merely provide the mechanism.
nikc
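The 'contains' versus 'starts with' point from the comments can also be tried out with PHP's built-in DOM extension instead of Simple HTML DOM. This is only a minimal sketch: the sample markup is made up, and it assumes the DOM extension (DOMDocument, DOMXPath) is enabled.

```php
<?php
// Hypothetical markup standing in for the scraped page.
$html = '<div id="pageBack">'
      . '<a href="ResultsDetails.aspx?id=1">Relative</a>'
      . '<a href="/Testing/label/stuff/ResultsDetails.aspx?id=2">Prefixed</a>'
      . '<a href="Other.aspx">Other</a>'
      . '</div>';

$doc = new DOMDocument();
// Suppress the warnings that real-world, non-valid HTML tends to trigger.
@$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
// contains() matches the string anywhere in the href, not only at the start.
foreach ($xpath->query('//a[contains(@href, "ResultsDetails.aspx?")]') as $link) {
    $href = $link->getAttribute('href');
    // Rewrite only the matched part so any prefix and the query string survive.
    $link->setAttribute(
        'href',
        str_replace('ResultsDetails.aspx?', 'results-scrape-details/', $href)
    );
}

// Collect the resulting href values for inspection.
$hrefs = [];
foreach ($xpath->query('//a') as $link) {
    $hrefs[] = $link->getAttribute('href');
}
print_r($hrefs);
```

Here both the relative and the prefixed link are rewritten, while Other.aspx is left alone.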
A: 

The `?` character has special meaning in regexes. Either escape it and keep the same code, or replace the preg_replace with str_ireplace() (I'd recommend the latter approach, as it is also more efficient).

(and should the html_entity_decode call really be there?)
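Both options might look roughly like this. It is a sketch, not a drop-in fix: the sample $results string is made up, and the variable names $search, $viaRegex and $viaString are only illustrative. Note that preg_replace also needs pattern delimiters, and that both functions return the new string, so the result has to be assigned before echoing.

```php
<?php
// Hypothetical sample standing in for the scraped $results string.
$results = '<a href="ResultsDetails.aspx?id=1">One</a> <a href="About.aspx">About</a>';

$search      = 'ResultsDetails.aspx?';
$replacement = 'results-scrape-details/';

// Option 1: keep preg_replace, but add delimiters and escape the pattern.
// preg_quote() escapes "." and "?"; its second argument also escapes the "/" delimiter.
$pattern  = '/' . preg_quote($search, '/') . '/i';
$viaRegex = preg_replace($pattern, $replacement, $results);

// Option 2: plain case-insensitive string replacement, no regex involved.
$viaString = str_ireplace($search, $replacement, $results);

// Neither function modifies $results in place, so echo the returned value.
echo $viaString, PHP_EOL;
```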

C.

symcbean