views:

1052

answers:

4

I'm building a bookmarking system and looking for the fastest (easiest) way to retrieve a page's title with PHP.

It would be nice to have something like $title = page_title($url)

Thanks in advance! =)

+4  A: 

Regex?

Use cURL to fetch the page's HTML into the $htmlSource variable.
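For the fetching step, here is a minimal sketch using PHP's cURL extension. The helper name fetch_html_source() and the timeout and redirect values are assumptions chosen for illustration, not part of the answer; it returns null whenever the transfer fails.

```php
<?php
// Hypothetical helper: fetch a URL's body with cURL, or return null on failure.
function fetch_html_source($url, $timeoutSeconds = 10)
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }

    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,               // assumed limit, adjust as needed
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds, // give up if connecting takes too long
        CURLOPT_TIMEOUT        => $timeoutSeconds, // give up if the whole transfer takes too long
    ));

    // curl_exec() returns false on failure when CURLOPT_RETURNTRANSFER is set.
    $body = curl_exec($ch);

    return ($body === false) ? null : $body;
}
```

Usage would look like $htmlSource = fetch_html_source($url); followed by a null check before running the regex on it.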

preg_match('/<title>(.*)<\/title>/iU', $htmlSource, $titleMatches);

print_r($titleMatches);

See what you have in that array.

Most people say, though, that for traversing HTML you should use a parser, as regexes can be unreliable.
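As a rough illustration of the parser route, the sketch below uses PHP's built-in DOMDocument on an HTML string that has already been fetched. The function name title_from_html() is hypothetical, and character-encoding handling is left out to keep it short.

```php
<?php
// Hypothetical helper: extract the <title> text from an HTML string with DOMDocument.
function title_from_html($htmlSource)
{
    // loadHTML() cannot parse an empty document, so bail out early.
    if (!is_string($htmlSource) || trim($htmlSource) === '') {
        return null;
    }

    // Collect libxml warnings internally instead of emitting them for sloppy markup.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($htmlSource);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    // The HTML parser normalises tag names, so <TITLE> is found as well.
    $titles = $doc->getElementsByTagName('title');
    if ($titles->length === 0) {
        return null;
    }

    return trim($titles->item(0)->textContent);
}
```

This is heavier than a single preg_match() call, but it does not depend on the case, attributes, or line breaks of the title tag.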

The other answers provide more detail :)

alex
perhaps it should be changed to non greedy to make it safer
alexeit
But how do I get hold of $htmlSource?
In this case I think it can be safely assumed that a parser would be overkill. /agree on the non-greedy matching
Will Bickford
You can grab $htmlSource with curl or fopen.
Will Bickford
i made some edits.. thanks for the input guys
alex
I was looking for a better way to do that, but it looks like most people use your proposed solution as a fast method to retrieve the title. Please consider using the 's' modifier; I've seen weird situations where a newline breaks the regex.
rmontagud
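To illustrate the 's' modifier point, here is a small sketch on a made-up HTML string (the sample markup and variable names are assumptions). Without 's', the dot does not match newlines, so a title that spans several lines is missed.

```php
<?php
// Sample markup where the title text sits on its own line.
$htmlSource = "<html><head><title>\n  My Page\n</title></head></html>";

// Without the 's' modifier, '.' does not match newlines, so nothing is found.
$withoutS = preg_match('/<title>(.*?)<\/title>/i', $htmlSource, $m1);

// With 's', '.' also matches newlines and the title is captured.
$withS = preg_match('/<title>(.*?)<\/title>/is', $htmlSource, $m2);

// Trim the surrounding whitespace that came from the line breaks.
$title = $withS ? trim($m2[1]) : null;
```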
+12  A: 
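<?php
    function page_title($url) {
        // file_get_contents() returns false on failure.
        $html = file_get_contents($url);
        if (!$html) 
            return null;

        // i: also match <TITLE>; s: let . span newlines; .*? stops at the first </title>.
        $res = preg_match("/<title>(.*?)<\/title>/is", $html, $title_matches);
        if (!$res) 
            return null; 

        // Strip whitespace left over from line breaks around the title text.
        return trim($title_matches[1]);
    }
?>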

Gave 'er a whirl on the following input:

print page_title("http://www.google.com/");

Output: Google

Hopefully general enough for your usage. If you need something more powerful, it might not hurt to invest a bit of time into researching HTML parsers.

EDIT: Added a bit of error checking. Kind of rushed the first version out, sorry.

Ed Carrel
Great! That's good.
I'm relatively sure that will produce an error if the pattern isn't found. Initialise $title first, assign preg_match() to a boolean and check for that before attempting to access the first element of the $title_matches array.
scronide
Oh. Too right. If preg_match doesn't get a result, the reference to $title_matches will barf. Will tidy up a bit.
Ed Carrel
+5  A: 

Or, making this simple function slightly more bulletproof:

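function page_title($url) {
    // @ suppresses the warning that file_get_contents() raises on failure.
    $page = @file_get_contents($url);
    if (!$page) return null;

    $matches = array();

    // i: also match <TITLE>; s: let . span newlines; .*? stops at the first </title>.
    if (preg_match('/<title>(.*?)<\/title>/is', $page, $matches)) {
        return $matches[1];
    }
    else {
        return null;
    }
}

echo page_title('http://google.com');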
alexeit
Yeah, I got caught once by a page with two title tags. Add the question mark after the asterisk.
AmbroseChapel
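The two-title case can be reproduced with a short sketch on invented markup (the sample string is an assumption). The greedy pattern runs on to the last closing tag, while the non-greedy one stops at the first.

```php
<?php
// Sample markup with two title elements on one line.
$htmlSource = '<head><title>First</title></head><body><svg><title>Second</title></svg></body>';

// Greedy: (.*) extends to the LAST </title> in the string.
preg_match('/<title>(.*)<\/title>/i', $htmlSource, $greedy);

// Non-greedy: (.*?) stops at the FIRST </title>.
preg_match('/<title>(.*?)<\/title>/i', $htmlSource, $lazy);
```

Here $greedy[1] contains everything between the first opening tag and the second closing tag, while $lazy[1] is just the first title.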
+1  A: 

I like using SimpleXML together with regexes. This is adapted from a solution I use to grab multiple link headers from a page in an OpenID library I've created; I've changed it to work with the title (even though there is usually only one).

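function getTitle($sFile)
{
    $sData = file_get_contents($sFile);

    // Isolate the <head> element so only that part is parsed.
    if(preg_match('/<head\b[^>]*>.*?<\/head>/is', $sData, $aHead))
    {   
        // Lowercase each tag; a callback is needed because a replacement
        // string is evaluated before any match is substituted into it.
        $sDataHtml = preg_replace_callback('/<[^>]+>/', function ($aTag) {
            return strtolower($aTag[0]);
        }, $aHead[0]);

        // loadHTML() is an instance method; @ hides warnings from malformed markup.
        $xDoc = new DOMDocument();
        @$xDoc->loadHTML($sDataHtml);
        $xTitle = simplexml_import_dom($xDoc);

        return (string)$xTitle->head->title;
    }
    return null;
}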

echo getTitle('http://stackoverflow.com/questions/399332/fastest-way-to-retrieve-a-title-in-php');

Ironically, this page has a "title tag" in the title tag, which is what sometimes causes problems with the pure regex solutions.

This solution is not perfect, as it lowercases the tags, which could cause a problem for nested tags if formatting or case were important (such as in XML), but there are somewhat more involved ways around that problem.

null