ansaurus

Question

PHP using preg_match to get title from article

Answer 1

A:

You may need to backslash-quote your backslashes.

PHP's string parser removes one layer of backslashes, and then the regular-expression engine consumes another layer, so (for example) recognizing a backslash requires FOUR of them in the source code.
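As a minimal sketch of the two layers described above (the variable names `$pattern` and `$subject` are only illustrative), the following shows four backslashes in the source ending up as a pattern that matches a single literal backslash:

```php
<?php
// Layer 1: PHP's string parser turns the four backslashes into two,
// so $pattern holds the five characters  /\\/
$pattern = '/\\\\/';

// Layer 2: the regex engine reads those two backslashes as one literal backslash.
$subject = 'C:\\temp'; // the actual string is C:\temp

var_dump(strlen('\\\\'));                 // int(2)
var_dump(preg_match($pattern, $subject)); // int(1)
```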

Beyond that, you might try taking advantage of the XML recognition stuff in PHP, or do less clever string handling. Usually when REGEXes break, it's because you're trying to be too clever with them. Consider looking only for the opening and closing title tags, removing the tags themselves, and then trimming whitespace from the ends of the string, and VOILA! A title.
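One way the "less clever string handling" idea might look, as a rough sketch only: `extractTitle` is a hypothetical helper name, and it assumes the first `<title` in the input is the one you want.

```php
<?php
// Hypothetical helper: pull the title out with plain string functions, no regex.
function extractTitle(string $html): string
{
    // Find the opening tag, case-insensitively (it may carry attributes).
    $open = stripos($html, '<title');
    if ($open === false) {
        return '';
    }
    // The title text starts right after the ">" that closes the opening tag.
    $start = strpos($html, '>', $open);
    if ($start === false) {
        return '';
    }
    $start++;
    // The title text ends where the closing tag begins.
    $end = stripos($html, '</title', $start);
    if ($end === false) {
        return '';
    }
    // Strip surrounding whitespace, including newlines.
    return trim(substr($html, $start, $end - $start));
}

echo extractTitle("<head><TITLE id=\"t\">\n  Hello World \n</title></head>"), "\n";
```

Because it never touches the regex engine, there is no backslash escaping to get wrong.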

See also http://php.net/manual/en/book.simplexml.php

Ian 2010-08-22 14:07:09

Answer 2

A:

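Try this:

// Group 1 is the opening tag, group 2 the title text, group 3 the closing tag.
// The "s" modifier lets "." match newlines, so multi-line titles are captured too.
if (preg_match('%(<title[^>]*>)(.+?)(</title\s*>)%is', $data, $matches)) {
    $title = trim($matches[2]);
} else {
    $title = "";
}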

droidgren 2010-08-22 14:09:50

Answer 3

+2 A:

If you still want to use regex and not DOM, here's what you can do:

if(preg_match("/<title>(.+)<\/title>/i", $data, $matches))
     print "The title is: $matches[1]";
else
     print "The page doesn't have a title tag";

shamittomar 2010-08-22 14:11:26

Thank you, this works. Guess I was just making it too complicated. Although I'm not sure why it would work in the tester and not in the actual script.

pfunc 2010-08-22 14:13:55

You're welcome. Just following the KISS principle.

shamittomar 2010-08-22 14:14:55

@pfunc, I tried this (quick and dirty) and it works fine and shows the title of the page. I guess you have to use `echo $matches[2];` to make it work:

$data = file_get_contents("http://localhost/");
preg_match('#(\<title.*?\>)(\n*\r*.+\n*\r*)(\<\/title.*?\>)#', $data, $matches);
echo $matches[2];

shamittomar 2010-08-22 14:19:32

Answer 4

A:

As with the other answers, the "use a parser, not regex" disclaimer applies. However, if you still want a regex, look at this:

$string = "<title>I am a title</title>";
$regex = "!(<title[^>]*>)(.*)(</title>)!i";
preg_match($regex, $string, $matches);
print_r($matches);

// Outputs (index 0 is the full match):
Array
(
    [0] => <title>I am a title</title>
    [1] => <title>
    [2] => I am a title
    [3] => </title>
)
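One thing to watch with the pattern above: `(.*)` is greedy, so if a second `</title>` appears later on the same line (an inline SVG title, say), the capture runs through to the last one. A small sketch of the difference, using a made-up input string and the illustrative variable names `$greedy` and `$lazy`:

```php
<?php
// Made-up input: a page title followed by an SVG <title> on the same line.
$html = '<title>Page</title><svg><title>Icon</title></svg>';

// Greedy: (.*) grabs as much as possible, up to the LAST </title>.
preg_match('!<title[^>]*>(.*)</title>!i', $html, $greedy);

// Lazy: (.*?) stops at the FIRST </title>.
preg_match('!<title[^>]*>(.*?)</title>!i', $html, $lazy);

echo $greedy[1], "\n"; // Page</title><svg><title>Icon
echo $lazy[1], "\n";   // Page
```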

Tim 2010-08-22 14:21:46

Answer 5

A:

Or you could use, you know, an HTML parser for HTML:

$dom = new DOMDocument();
$dom->loadHTML($HTML);

echo $dom->getElementsByTagName('title')->item(0)->nodeValue;

Erik 2010-08-22 14:37:50

I prefer to use the SimpleHTMLDOM extension myself, but this method doesn't require an external library.

Erik 2010-08-22 14:38:40

@Erik yes, but DOMDocument is pretty strict in regards to markup validity. It won't work on many pages.

Pekka 2010-08-22 14:39:51

Suppress errors when you use `->loadHTML()` and you'd be surprised how well it handles mangled HTML.

Erik 2010-08-22 16:30:42
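A sketch of what suppressing errors around `->loadHTML()` could look like. It uses `libxml_use_internal_errors()` rather than the `@` operator, which is one option among several; the sample markup and variable names are made up for illustration.

```php
<?php
// Made-up, deliberately sloppy markup: unclosed <p> and <div>, no </html>.
$html = '<html><head><title>Broken page</title></head><body><p>Unclosed<div></body>';

// Collect libxml parse warnings internally instead of emitting them.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Discard the collected warnings and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// item(0) is null when the document has no <title> element.
$node  = $dom->getElementsByTagName('title')->item(0);
$title = $node !== null ? trim($node->nodeValue) : '';

echo $title, "\n";
```

Checking `item(0)` for null avoids a fatal error on pages that have no title at all.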
