I am having a strange problem with preg_match. I am using a regular expression that grabs the title of an article; basically, it looks for the <title> tag:

preg_match('#(\<title.*?\>)(\n*\r*.+\n*\r*)(\<\/title.*?\>)#', $data, $matches)

When I print out the $matches array I get nothing. But when I try the same thing in a regular expression tester, it works fine. I have even tried putting in a string that would definitely match it in place of the $data variable, without any luck.

What am I doing wrong here?

A: 

You may need to backslash-quote your backslashes.

PHP's string parser removes one layer of backslashes, and then the regular-expression engine consumes another layer, so (for example) recognizing a backslash requires FOUR of them in the source code.
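As a rough sketch of the two layers described above (the sample string and variable names are made up for illustration), matching one literal backslash looks like this:

```php
<?php
// Hypothetical sample input: single quotes keep the backslash literal,
// so this string is the 7 characters C:\temp
$subject = 'C:\temp';

// Layer 1: PHP's string parser turns '\\\\' into two characters: \\
// Layer 2: the regex engine reads \\ as "one literal backslash".
$pattern = '/\\\\/';

var_dump(strlen('\\\\'));                 // int(2)
var_dump(preg_match($pattern, $subject)); // int(1)
```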

Beyond that, you might try taking advantage of PHP's XML parsing support, or do less clever string handling. Usually when regexes break, it's because you're trying to be too clever with them. Consider looking only for the opening and closing title tags, removing the tags themselves, and then stripping whitespace from what is left, and VOILA! A title.
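One way the "less clever" approach could look, using only plain string functions. The helper name `extractTitle` and the sample markup are assumptions for illustration, and the sketch does not try to handle every malformed page:

```php
<?php
// Hypothetical helper: find the title with string functions, no regex.
function extractTitle(string $html): string
{
    // Case-insensitive search for the start of the opening tag.
    $open = stripos($html, '<title');
    if ($open === false) {
        return '';
    }
    // Skip to the end of the opening tag (it may carry attributes).
    $start = strpos($html, '>', $open);
    if ($start === false) {
        return '';
    }
    $start++;
    // Find the closing tag after the opening one.
    $end = stripos($html, '</title', $start);
    if ($end === false) {
        return '';
    }
    // Strip surrounding whitespace and newlines from the title text.
    return trim(substr($html, $start, $end - $start));
}

echo extractTitle("<html><head><TITLE>\n  My page \n</TITLE></head></html>");
```

It returns an empty string when either tag is missing, so the caller can decide what to do with a page that has no title.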

See also http://php.net/manual/en/book.simplexml.php

Ian
A: 

Try this

if (preg_match('%(<title.*?>)(\n*\r*.+\n*\r*)(</title.*?>)%', $data, $matches)) {
    // Group 1 is the opening tag, group 2 the title text, group 3 the closing tag.
    $title = $matches[2];
} else {
    $title = "";
}
droidgren
+2  A: 

If you still want to use regex and not DOM, here's what you can do:

if(preg_match("/<title>(.+)<\/title>/i", $data, $matches))
     print "The title is: $matches[1]";
else
     print "The page doesn't have a title tag";
shamittomar
Thank you, this works. Guess I was just making it too complicated. Although I'm not sure why it would work in the tester and not in the actual script.
pfunc
You're welcome. Just following the KISS principle.
shamittomar
@pfunc, I tried this (quick and dirty) and it works fine, showing the title of the page. I guess you have to use `echo $matches[2];` to make it work: `$data = file_get_contents("http://localhost/"); preg_match('#(\<title.*?\>)(\n*\r*.+\n*\r*)(\<\/title.*?\>)#', $data, $matches); echo $matches[2];`
shamittomar
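To make the group numbering from the comment above concrete, here is a small sketch that runs the question's pattern against a hard-coded string (the sample markup is an assumption, standing in for the fetched page). In PCRE a backslash before a non-alphanumeric character such as `<` or `>` just makes it a literal, so the pattern can be used as written:

```php
<?php
// Hypothetical sample input standing in for the fetched page.
$data = "<html><head><title>My page</title></head></html>";

preg_match('#(\<title.*?\>)(\n*\r*.+\n*\r*)(\<\/title.*?\>)#', $data, $matches);

// $matches[0] holds the whole match; the numbered groups follow it.
echo $matches[1], "\n"; // opening tag
echo $matches[2], "\n"; // title text
echo $matches[3], "\n"; // closing tag
```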
A: 

Like everyone else's, this answer comes with the "use a parser, not regex" disclaimer. However, if you still want regex, look at this:

$string = "<title>I am a title</title>";
$regex = "!(<title[^>]*>)(.*)(</title>)!i";
preg_match($regex, $string, $matches);
print_r($matches);

// Should output:
// Array
// (
//     [0] => <title>I am a title</title>
//     [1] => <title>
//     [2] => I am a title
//     [3] => </title>
// )
Tim
A: 

Or you could use, you know, an HTML parser for HTML:

$dom = new DOMDocument;
$dom->loadHTML($HTML);

echo $dom->getElementsByTagName('title')->item(0)->nodeValue;
Erik
I prefer to use the SimpleHTMLDOM extension myself, but this method doesn't require an external library.
Erik
@Erik yes, but DOMDocument is pretty strict in regards to markup validity. It won't work on many pages.
Pekka
Suppress errors when you use `->loadHTML()` and you'd be surprised how well it will handle mangled HTML.
Erik
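A minimal sketch of the error-suppression idea from the last comment, assuming `libxml_use_internal_errors()` is the mechanism used (the comment does not say which one; the `@` operator would be another option). The sample markup is made up and deliberately broken:

```php
<?php
// Hypothetical sample markup with an unclosed <p> and a stray </div>.
$html = '<html><head><title>Broken page</title></head>'
      . '<body><p>Unclosed paragraph</div></body></html>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Discard the collected errors and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// item(0) is null when the page has no <title> element.
$node  = $dom->getElementsByTagName('title')->item(0);
$title = $node !== null ? trim($node->textContent) : '';

echo $title; // Broken page
```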