How can I get the blog ID from a given blogspot.com URL? I looked at the source code of a blogspot.com page, and it contains this:

<link rel="EditURI" type="application/rsd+xml" title="RSD" href="http://www.blogger.com/rsd.g?blogID=4899870735344410268" />

How can I parse this to get the number 4899870735344410268?

A: 
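// 'blogspot_url' is a placeholder for the blog's address
$pageContents = file_get_contents('blogspot_url');
// Capture the digits that follow blogID= in the EditURI link
if (preg_match('~<link rel="EditURI" type="application/rsd\+xml" title="RSD" href="http://www\.blogger\.com/rsd\.g\?blogID=([0-9]+)" />~', $pageContents, $matches)) {
  echo $matches[1];
}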
Sam Dark
-1: Do not use `file_get_contents()` with a URL. The recommended setting for `allow_url_fopen` is off, for security reasons, and it is best practice to keep it that way. http://phpsec.org/projects/phpsecinfo/tests/allow_url_fopen.html
Andrew Moore
Also, never parse HTML documents using regular expressions.
Andrew Moore
I know about file_get_contents(); I used it for simplicity. Why not regular expressions? HTML can be a mess, and even DOMDocument sometimes can't deal with it.
Sam Dark
The only problem `DOMDocument` had was with UTF-8 documents and that has been fixed recently. Even then, this particular example won't be affected by this. There are such things as best practices. It is universally accepted that regex should not be used to parse HTML documents. The only thing `DOMDocument` fails to do in bad HTML documents is to be silent about them, which can be easily fixed using `@`.
Andrew Moore
+2  A: 

Use DOMDocument to parse the document and then use its methods to retrieve the wanted element.

I cannot stress this enough: never use regular expressions to parse an HTML document.

function getBlogId($url) {
  // Fetch the page with cURL, following redirects
  $ch = curl_init($url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
  // Note: this turns off SSL certificate verification
  curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
  $page = curl_exec($ch);
  curl_close($ch);

  // curl_exec() returns false on failure; an empty page has nothing to parse
  if($page === false || $page === '') {
    return false;
  }

  $doc = new DOMDocument();
  @$doc->loadHTML($page); // @ silences warnings about malformed markup

  $links = $doc->getElementsByTagName('link');

  foreach($links as $link) {
    $rel = $link->attributes->getNamedItem('rel');

    if($rel && $rel->nodeValue == 'EditURI') {
      // getAttribute() returns '' when href is missing
      $href = $link->getAttribute('href');
      $query = parse_url($href, PHP_URL_QUERY);

      if($query) {
        $queryComp = array();
        parse_str($query, $queryComp);

        // !empty() avoids a notice when blogID is absent
        if(!empty($queryComp['blogID'])) {
          return $queryComp['blogID'];
        }
      }
    }
  }

  return false;
}

Example use:

$id = getBlogId('http://thehouseinmarrakesh.blogspot.com/');
echo $id; // 483911541311389592
Andrew Moore
OK, that would be nice too, thank you.
streetparade
OK, I'm waiting for your example :-)
streetparade
@streetparade: The example is up.
Andrew Moore
I get a lot of warnings: Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: expecting ';' in Entity, line: 19 in /var/www/blogger/blogger.php on line 212
streetparade
@streetparade: You can safely ignore them. Silence them using `@`.
Andrew Moore
@streetparade: I've updated my example to reflect the proper usage of `@`.
Andrew Moore
Using `@` can lead to silent errors that are really hard to hunt down. Try `libxml_use_internal_errors(true)` to get rid of the warnings.
Sam Dark
@Sam Dark: In the present case, those silent errors will not be important. Therefore `@` is sufficient. Using `libxml_use_internal_errors(true)` affects the whole script, while `@` affects only this instance. Thus using `libxml_use_internal_errors(true)` here would cause more harm than good.
Andrew Moore
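
On the scope concern in the last two comments: `libxml_use_internal_errors()` returns the previous setting, so one possible middle ground is to switch it on only around the parse and restore it afterwards. The following is a minimal sketch under that assumption, not code from either answer. The function name `extractBlogIdFromHtml` is hypothetical, and it expects HTML that has already been fetched (for example by the cURL calls in the answer above).

```php
<?php
// Hypothetical helper (not from the thread): pulls the blogID out of
// already-fetched HTML while keeping libxml warnings local to this call.
function extractBlogIdFromHtml($html)
{
    // Nothing to parse; loadHTML() rejects an empty string.
    if (!is_string($html) || $html === '') {
        return false;
    }

    // Collect parse errors internally instead of emitting warnings.
    // The return value is the previous setting, kept so it can be restored.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected errors and put the earlier setting back,
    // so the rest of the script is unaffected.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('link') as $link) {
        // getAttribute() returns '' when the attribute is missing.
        if ($link->getAttribute('rel') !== 'EditURI') {
            continue;
        }

        $query = parse_url($link->getAttribute('href'), PHP_URL_QUERY);
        if (!is_string($query)) {
            continue;
        }

        $params = array();
        parse_str($query, $params);

        if (!empty($params['blogID'])) {
            return $params['blogID'];
        }
    }

    return false;
}
```

Passing it the page source that `curl_exec()` returned should give the same ID as `getBlogId()`, without `@` and without leaving the libxml setting changed for later code.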