views:

93

answers:

3

I want to try figure out how to get the

<title>A common title</title>
<meta name="keywords" content="Keywords blabla" />
<meta name="description" content="This is the description" />

Even though if it's arranged in any order, I've heard of the PHP Simple HTML DOM Parser but I don't really want to use it. Is it possible for a solution except using the PHP Simple HTML DOM Parser.

preg_match will not be able to do it if it's invalid HTML?

Can cURL do something like this with preg_match?

Facebook does something like this but it's properly used by using:

<meta property="og:description" content="Description blabla" />

I want something like this so that it is possible when someone posts a link, it should retrieve the title and the meta tags. If there are no meta tags, then it it ignored or the user can set it themselves (but I'll do that later on myself).

+4  A: 

Your best bet is to bite the bullet use the DOM Parser - it's the 'right way' to do it. In the long run it'll save you more time than it takes to learn how. Parsing HTML with Regex is known to be unreliable and intolerant of special cases.

Joshua
+1, with the addition that if you just use the built-in DOM extension instead of Simple HTML DOM Parser you are probably a _lot_ faster, and you don't clutter code with a 3rd party library (although, you add a requirement to your server-environment, i.e. DOM enabled, which it is by default).
Wrikken
+4  A: 

This is the way it should be:

function file_get_contents_curl($url)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

    $data = curl_exec($ch);
    curl_close($ch);

    return $data;
}

$html = file_get_contents_curl("http://example.com/");

//parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');

//get and display what you need:
$title = $nodes->item(0)->nodeValue;

$metas = $doc->getElementsByTagName('meta');

for ($i = 0; $i < $metas->length; $i++)
{
    $meta = $metas->item($i);
    if($meta->getAttribute('name') == 'description')
        $description = $meta->getAttribute('content');
    if($meta->getAttribute('name') == 'keywords')
        $keywords = $meta->getAttribute('content');
}

echo "Title: $title". '<br/><br/>';
echo "Description: $description". '<br/><br/>';
echo "Keywords: $keywords";
shamittomar
Wow...impressive
d2burke
@d2burke, Thanks.
shamittomar
A: 

If you're working with PHP, check out the Pear packages at pear.php.net and see if you find anything useful to you. I've used the RSS packages effectively and it saves a lot of time, provided you can follow how they implement their code via their examples.

Specifically take a look at Sax 3 and see if it will work for your needs. Sax 3 is no longer updated but it might be sufficient.

Geekster