I'm trying to write a script that loads a URL entered by a user and checks whether that page links back to my domain before their domain is published on my site. I'm not very experienced with regular expressions, and this is what I have so far:

$loaded = file_get_contents('http://localhost/small_script/page.php');
// $loaded holds the HTML of the site the user has submitted
$current_site = 'site2.com';
// $current_site is the domain of my site; this is the URL that must be found in the target site
$matches = Array();
$find = preg_match_all('/<a(.*?)href=[\'"](.*?)[\'"](.*?)\b[^>]*>(.*?)<\/a>/i', $loaded, $matches);

$c = count($matches[0]);
$z = 0;
while($z<$c){
  $full_link = $matches[0][$z];
  $href = $matches[2][$z];
  $z++;

  $check = strpos($href,$current_site);
    if($check === false) {

    }else{
    // The link must not carry rel="nofollow"; check for it and, if present, return a specific error
    $pos = strpos($full_link,'nofollow');

    if($pos === false) {
     echo $href;
    }
      else {
    //echo "rel=nofollow FOUND";
    }

    }

}

As you can see, it's pretty messy and I'm not entirely sure where it's headed. I was hoping someone could give me a small, fast, and concise script that does exactly what I've attempted:

  1. Load specified URL as entered by user
  2. Check if specified URL links back to my site (if not, return error code #1)
  3. If the link is there, check for rel="nofollow"; if found, return error code #2
  4. If everything is OK, set a variable to true, so I can continue with other functions (like displaying their link on my page)
A: 

Here is the code :) It was written with help from http://www.merchantos.com/makebeta/php/scraping-links-with-php/

<?php

$my_url = 'http://online.bulsam.net';
$target_url = 'http://www.bulsam.net';
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number:" .curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);


// grab all the <a> elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

// find result
$result = is_my_link_there($hrefs, $my_url);

if ($result == 1) {

    echo 'There is no link!!!';
} elseif ($result == 2) {

    echo 'There is, but it is NO FOLLOW !!!';
} else {

    // result is 3: a followed link was found, so continue (e.g. publish their link)
}

// used functions

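// Returns 1 if no link to $my_url exists, 2 if the first matching link
// is marked nofollow, and 3 if the first matching link is followed.
function is_my_link_there($hrefs, $my_url) {

    for ($i = 0; $i < $hrefs->length; $i++) {

        $href = $hrefs->item($i);

        $url = $href->getAttribute('href');

        // exact comparison: 'http://example.com' and 'http://example.com/' differ
        if ($my_url == $url) {

            // rel may hold several space-separated tokens, e.g. "nofollow noopener"
            $rel = preg_split('/\s+/', strtolower(trim($href->getAttribute('rel'))));

            if (in_array('nofollow', $rel, true)) {

                return 2;
            }

            return 3;
        }
    }

    return 1;
}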
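
The question asks for a link back to a domain rather than to one exact URL, so a variant that compares host names may be a closer fit. The following is only a sketch under a few assumptions: the function name backlink_status is hypothetical, subdomains of the given domain are counted as matches, and the return codes mirror the 1/2/3 convention used above. Unlike is_my_link_there, it keeps scanning after a nofollow match in case a followed link appears later on the page.

```php
<?php
// Hypothetical helper (not part of the original answer).
// Returns 1 = no link to $domain, 2 = only nofollow links, 3 = a followed link exists.
function backlink_status(string $html, string $domain): int
{
    if ($html === '') {
        return 1; // nothing to parse
    }

    $dom = new DOMDocument();
    // Collect libxml warnings from malformed HTML instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $domain = strtolower($domain);
    $suffix = '.' . $domain;
    $status = 1;

    foreach ($dom->getElementsByTagName('a') as $a) {
        $host = parse_url($a->getAttribute('href'), PHP_URL_HOST);
        if (!is_string($host)) {
            continue; // relative link, mailto:, or unparsable href
        }
        $host = strtolower($host);

        // Assumption: accept the bare domain and any subdomain of it.
        if ($host !== $domain && substr($host, -strlen($suffix)) !== $suffix) {
            continue;
        }

        // rel is a space-separated token list, e.g. "nofollow noopener".
        $rel = preg_split('/\s+/', strtolower(trim($a->getAttribute('rel'))));
        if (!in_array('nofollow', $rel, true)) {
            return 3; // found a followed link, no need to look further
        }
        $status = 2; // matching link found, but it is nofollow
    }

    return $status;
}
```

It could be called with the $html string fetched by cURL above, for example backlink_status($html, 'site2.com'), and the result handled by the same if/elseif chain.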
Irfan EVRENS