ansaurus

Question

Parsing BlogId from Blogspot.com in PHP using Regex

Answer 1

A:

$pageContents = file_get_contents('blogspot_url');
if (preg_match('~<link rel="EditURI" type="application/rsd\+xml" title="RSD" href="http://www.blogger.com/rsd.g\?blogID=([0-9]+)" />~', $pageContents, $matches)) {
  echo $matches[1];
}

Sam Dark 2010-02-16 19:18:28

-1: Do not use `file_get_contents()` with a URL. The recommended setting for `allow_url_fopen` is off, for security reasons, and it is best practice to keep it that way. http://phpsec.org/projects/phpsecinfo/tests/allow_url_fopen.html

Andrew Moore 2010-02-16 19:22:23

Also, never parse HTML documents using regular expressions.

Andrew Moore 2010-02-16 19:27:50

I know about file_get_contents(). I used it for simplicity. Why not regular expressions? HTML can be a mess, and even DOMDocument sometimes can't deal with it.

Sam Dark 2010-02-16 19:50:21

The only problem `DOMDocument` had was with UTF-8 documents and that has been fixed recently. Even then, this particular example won't be affected by this. There are such things as best practices. It is universally accepted that regex should not be used to parse HTML documents. The only thing `DOMDocument` fails to do in bad HTML documents is to be silent about them, which can be easily fixed using `@`.

Andrew Moore 2010-02-16 19:53:19

Answer 2

+2 A:

Use DOMDocument to parse the document and then use its methods to retrieve the wanted element.

I cannot stress this enough: never use regular expressions to parse an HTML document.

function getBlogId($url) {
  $ch = curl_init($url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
  curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
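  $page = curl_exec($ch);
  curl_close($ch);

  // curl_exec() returns false on failure; there is nothing to parse then.
  if($page === false || $page === '') {
    return false;
  }

  $doc = new DOMDocument();
  @$doc->loadHTML($page);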

  $links = $doc->getElementsByTagName('link');

  foreach($links as $link) {
    $rel = $link->attributes->getNamedItem('rel');

    if($rel && $rel->nodeValue == 'EditURI') {
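      // Guard against an EditURI link that has no href attribute.
      $hrefAttr = $link->attributes->getNamedItem('href');
      $href = $hrefAttr ? $hrefAttr->nodeValue : '';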
      $query = parse_url($href, PHP_URL_QUERY);

      if($query) {
        $queryComp = array();
        parse_str($query, $queryComp);

        if(!empty($queryComp['blogID'])) {
          return $queryComp['blogID'];
        }
      }
    }
  }

  return false;
}

Example use:

$id = getBlogId('http://thehouseinmarrakesh.blogspot.com/');
echo $id; // 483911541311389592

Andrew Moore 2010-02-16 19:21:05
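
The last step of the answer above, pulling `blogID` out of the link's `href`, can be tried on its own without any network access. The following is only a minimal sketch: `blogIdFromHref()` is a hypothetical helper name that does not appear in the answer, and the digits-only check is an added assumption based on the numeric IDs shown in this thread.

```php
<?php
// Hypothetical helper (not part of the original answer): extract the
// blogID query parameter from an RSD link href using parse_url()/parse_str().
function blogIdFromHref($href) {
  // parse_url() yields null when there is no query string and false
  // when the URL is seriously malformed, so accept strings only.
  $query = parse_url($href, PHP_URL_QUERY);
  if (!is_string($query)) {
    return false;
  }

  $params = array();
  parse_str($query, $params);

  // Assumption: blog IDs are purely numeric, as in the example ID above.
  if (isset($params['blogID'])
      && is_string($params['blogID'])
      && ctype_digit($params['blogID'])) {
    return $params['blogID'];
  }

  return false;
}

// Example call with the href format quoted in the first answer.
var_dump(blogIdFromHref('http://www.blogger.com/rsd.g?blogID=483911541311389592'));
```

Returning the ID as a string rather than casting it to an integer avoids overflow on 32-bit builds, where an 18-digit ID does not fit in a native int.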

OK, that would be nice too, thank you.

streetparade 2010-02-16 19:21:45

OK, I'm waiting for your example :-)

streetparade 2010-02-16 19:25:05

@streetparade: The example is up.

Andrew Moore 2010-02-16 19:28:43

I get a lot of warnings: Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: expecting ';' in Entity, line: 19 in /var/www/blogger/blogger.php on line 212

streetparade 2010-02-16 19:30:31

@streetparade: You can safely ignore them. Silence them using `@`.

Andrew Moore 2010-02-16 19:35:41

@streetparade: I've updated my example to reflect the proper usage of `@`.

Andrew Moore 2010-02-16 19:47:50

Using @ can lead to silent errors that are really hard to hunt down. Try libxml_use_internal_errors(true) to get rid of warnings.

Sam Dark 2010-02-16 19:51:47

@Sam Dark: In the present case, those silent errors will not be important. Therefore `@` is sufficient. Using `libxml_use_internal_errors(true)` affects the whole script, while `@` affects only this instance. Thus using `libxml_use_internal_errors(true)` here would cause more harm than good.

Andrew Moore 2010-02-16 19:57:39
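
The two positions in the comments above are not necessarily exclusive: `libxml_use_internal_errors()` returns the previous setting, so it can be switched on for a single parse and then restored. The following is a rough sketch of that idea, not the code from either commenter; `findEditUriHref()` is a hypothetical name, and it takes the page HTML as a string so it can be run offline.

```php
<?php
// Hypothetical variant of the parsing step: collect libxml warnings
// internally for this one call, then restore the caller's setting.
function findEditUriHref($html) {
  if (!is_string($html) || $html === '') {
    return false;
  }

  // Remember the previous mode so the rest of the script is unaffected.
  $previous = libxml_use_internal_errors(true);

  $doc = new DOMDocument();
  $loaded = $doc->loadHTML($html);

  // Discard the buffered parse warnings and put the old mode back.
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  if (!$loaded) {
    return false;
  }

  foreach ($doc->getElementsByTagName('link') as $link) {
    if ($link->getAttribute('rel') === 'EditURI') {
      return $link->getAttribute('href');
    }
  }

  return false;
}
```

If the warnings are of interest, `libxml_get_errors()` could be read before the call to `libxml_clear_errors()`, which is something `@` does not allow.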
