ansaurus

Question

Fastest way to retrieve a <title> in PHP

Answer 1

+4 A:

Regex?

Use cURL to get the $htmlSource variable's contents.

preg_match('/<title>(.*)<\/title>/iU', $htmlSource, $titleMatches);

print_r($titleMatches);

Then see what you have in that array: $titleMatches[0] is the whole match and $titleMatches[1] is the title text.

Most people say, though, that you should use a parser for traversing HTML, as regexes can be unreliable.

The other answers provide more detail :)

alex 2008-12-30 02:07:04

Perhaps it should be changed to non-greedy to make it safer.

alexeit 2008-12-30 02:11:05

But how do I get hold of $htmlSource?

2008-12-30 02:11:43

In this case I think it can be safely assumed that a parser would be overkill. Agreed on the non-greedy matching.

Will Bickford 2008-12-30 02:13:07

You can grab $htmlSource with curl or fopen.

Will Bickford 2008-12-30 02:13:41
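For the fopen route, here is a minimal sketch. It assumes allow_url_fopen is enabled when a URL is passed, and the function name fetch_head_source() and its two limits are made up for illustration. It stops reading as soon as the closing title tag has arrived, so a large page is not downloaded in full:

```php
<?php
// Hypothetical helper: read $url (or a local path) in chunks and stop as soon
// as the closing </title> tag has been seen, or once $maxBytes were read.
function fetch_head_source($url, $maxBytes = 65536, $chunkSize = 4096) {
    $handle = @fopen($url, 'rb'); // URLs require allow_url_fopen
    if ($handle === false) {
        return null;
    }

    $htmlSource = '';
    while (!feof($handle) && strlen($htmlSource) < $maxBytes) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break;
        }
        $htmlSource .= $chunk;
        // stripos() is case-insensitive, so </TITLE> also ends the read.
        if (stripos($htmlSource, '</title>') !== false) {
            break;
        }
    }
    fclose($handle);

    return $htmlSource;
}
```

The returned string can then be passed to preg_match() as $htmlSource. A cURL version would work the same way in principle, but is not shown here.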

I made some edits. Thanks for the input, guys.

alex 2008-12-30 02:26:29

I was looking for a better way to do this, but it looks like most people use your proposed solution as a fast method to retrieve the title. Please consider using the 's' modifier; I've seen weird situations where a newline breaks the regex.

rmontagud 2009-11-17 11:55:59
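To illustrate the newline problem, a small sketch with a made-up $htmlSource string whose title spans several lines. Without 's' the dot does not match newlines, so the pattern finds nothing:

```php
<?php
// Assumed sample input: the title text is wrapped onto its own line.
$htmlSource = "<html><head><title>\n  My Page\n</title></head></html>";

// Without 's', the dot stops at newlines, so there is no match.
$withoutS = preg_match('/<title>(.*)<\/title>/iU', $htmlSource, $m1);

// With 's' (PCRE_DOTALL), the dot matches newlines as well.
$withS = preg_match('/<title>(.*)<\/title>/isU', $htmlSource, $m2);

// trim() removes the surrounding whitespace and line breaks.
$title = $withS ? trim($m2[1]) : null;

var_dump($withoutS, $withS, $title);
```

Here $withoutS is 0, $withS is 1, and $title is "My Page".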

Answer 2

+12 A:

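<?php
    function page_title($url) {
        // file_get_contents() returns false on failure; reading a URL
        // requires allow_url_fopen to be enabled.
        $html = file_get_contents($url);
        if (!$html)
            return null;

        // preg_match() returns 1 on a match, 0 on no match, false on error.
        $res = preg_match("/<title>(.*)<\/title>/", $html, $title_matches);
        if (!$res)
            return null;

        return $title_matches[1];
    }
?>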

Gave 'er a whirl on the following input:

print page_title("http://www.google.com/");

Outputted: Google

Hopefully general enough for your usage. If you need something more powerful, it might not hurt to invest a bit of time into researching HTML parsers.
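As one possible starting point on the parser side, here is a sketch using PHP's bundled DOM extension. The helper name page_title_from_html() is hypothetical, and it takes an HTML string that has already been fetched (for example with file_get_contents()):

```php
<?php
// Hypothetical helper, not part of the answer above: extract the title from
// an HTML string with the DOM extension instead of a regex.
function page_title_from_html($html) {
    if (!is_string($html) || trim($html) === '') {
        return null;
    }

    // Collect libxml warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    $titles = $dom->getElementsByTagName('title');
    if ($titles->length === 0) {
        return null;
    }

    // textContent has entities such as &lt; already decoded.
    return trim($titles->item(0)->textContent);
}
```

This is slower than a single preg_match(), but it copes with upper-case tags, attributes on the title element, and escaped characters inside the title.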

EDIT: Added a bit of error checking. Kind of rushed the first version out, sorry.

Ed Carrel 2008-12-30 02:15:34

Great! That's good.

2008-12-30 02:26:18

I'm relatively sure that will produce an error if the pattern isn't found. Initialise $title first, assign preg_match() to a boolean and check for that before attempting to access the first element of the $title_matches array.

scronide 2009-01-02 19:46:45

Oh. Too right. If preg_match doesn't get a result, the reference to $title_matches will barf. Will tidy up a bit.

Ed Carrel 2009-01-07 01:12:07

Answer 3

+5 A:

Or, making this simple function slightly more bulletproof:

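function page_title($url) {
    // @ suppresses the warning on failure; the false return is handled below.
    $page = @file_get_contents($url);
    if (!$page) return null;

    $matches = array();

    // Non-greedy (.*?) stops at the first </title>; 'i' ignores tag case
    // and 's' lets the dot match newlines inside the title.
    if (preg_match('/<title>(.*?)<\/title>/is', $page, $matches)) {
        return $matches[1];
    }
    return null;
}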


echo page_title('http://google.com');

alexeit 2008-12-30 02:23:51

Yeah, I got caught once by a page with two title tags. Add the question mark after the asterisk.

AmbroseChapel 2008-12-30 12:06:59
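A quick sketch of what the question mark changes, using a made-up $page string with two title elements:

```php
<?php
// Assumed sample input: a page that (incorrectly) contains two title elements.
$page = '<title>First</title><p>text</p><title>Second</title>';

// Greedy: (.*) runs on to the LAST </title> in the input.
preg_match('/<title>(.*)<\/title>/', $page, $greedy);

// Non-greedy: (.*?) stops at the FIRST </title>.
preg_match('/<title>(.*?)<\/title>/', $page, $lazy);

echo $greedy[1], "\n"; // First</title><p>text</p><title>Second
echo $lazy[1], "\n";   // First
```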

Answer 4

+1 A:

I like using SimpleXML together with regexes. This is adapted from a solution I use to grab multiple link headers from a page in an OpenID library I've created; I've changed it to work with the title (even though there is usually only one).

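function getTitle($sFile)
{
    $sData = @file_get_contents($sFile);
    if ($sData === false) {
        return null;
    }

    // Isolate the <head> element so only a small fragment gets parsed.
    if (preg_match('/<head\b[^>]*>.*?<\/head>/is', $sData, $aHead))
    {
        // Lowercase every tag. strtolower() has to run per match; applied to
        // the literal '<$1>' it would leave the replacement unchanged.
        $sDataHtml = preg_replace_callback('/<[^>]*>/', function ($aTag) {
            return strtolower($aTag[0]);
        }, $aHead[0]);

        // loadHTML() is an instance method; @ hides warnings on sloppy markup.
        $xDom = new DOMDocument();
        if (@$xDom->loadHTML($sDataHtml)) {
            $xTitle = simplexml_import_dom($xDom);
            return (string)$xTitle->head->title;
        }
    }
    return null;
}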

echo getTitle('http://stackoverflow.com/questions/399332/fastest-way-to-retrieve-a-title-in-php');

Ironically, this page has a "title tag" in the title tag, which is what sometimes causes problems with the pure regex solutions.

This solution is not perfect, as it lowercases the tags, which could cause a problem for the nested tag if formatting or case was important (such as in XML), but there are more involved ways around that problem.
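For comparison, a pure regex approach can be made to cope with an escaped tag inside the title by decoding entities after the match. This is only a sketch: decoded_title() is a hypothetical helper, and the sample markup assumes the page escapes the nested tag as &lt;title&gt;:

```php
<?php
// Hypothetical helper: regex extraction followed by entity decoding.
function decoded_title($htmlSource) {
    if (!preg_match('/<title[^>]*>(.*?)<\/title>/is', $htmlSource, $matches)) {
        return null;
    }
    // The captured text is raw markup, so &lt;title&gt; is still escaped here.
    return html_entity_decode(trim($matches[1]), ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Assumed sample markup, modeled on this question's own title.
$sample = '<head><title>Fastest way to retrieve a &lt;title&gt; in PHP</title></head>';
echo decoded_title($sample), "\n";
```

If the nested tag were not escaped in the source, the non-greedy match would still stop at the first closing title tag, so the result is only as good as the page's markup.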

null 2008-12-31 08:09:28
