I'm trying to obtain the keywords from an HTML page that I'm scraping with PHP.

So, if the keywords tag looks like this:

<meta name="Keywords" content="MacUpdate, Mac Software, Macintosh Software, Mac Games, Macintosh Games, Apple, Macintosh, Software, iphone, ipod, Games, Demos, Shareware, Freeware, MP3, audio, sound, macster, napster, macintel, universal binary">

I want to get this back:

MacUpdate, Mac Software, Macintosh Software, Mac Games, Macintosh Games, Apple, Macintosh, Software, iphone, ipod, Games, Demos, Shareware, Freeware, MP3, audio, sound, macster, napster, macintel, universal binary

I've constructed a regex, but it's not doing the trick.

(?i)^(<meta name=\"keywords\" content=\"(.*)\">)

Any ideas?

A: 

(?i)<meta\\s+name=\"keywords\"\\s+content=\"(.*?)\">

In PHP, that would look something like:

preg_match('~<meta\\s+name=\"keywords\"\\s+content=\"(.*?)\">~i', $html, $matches);
JoostK
A: 

This is a simple regex that matches the first meta keywords tag. It only allows letters, digits, legal URL characters, HTML entities and spaces to appear inside the content attribute.

$matches = array();
preg_match("/<meta name=\"Keywords\" content=\"([\w\d;,\.: %&#\/\\\\]*)\"/", $html, $matches);
echo $matches[1];
gnud
+1  A: 

(.*) is greedy: it matches everything up to the LAST "> it can reach, which is obviously not what you want. Quantifiers are greedy by default. You need to use

content=\"(.*?)\"

or

content=\"([^\"]*)\"
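
A minimal sketch of the difference, assuming a made-up one-line $html sample with a second meta tag after the keywords tag (the sample string and variable names are only for illustration):

```php
<?php
// Hypothetical sample input: a second meta tag follows the keywords tag on the same line.
$html = '<head><meta name="Keywords" content="Mac, Apple"><meta name="author" content="Someone"></head>';

// Greedy: (.*) backs off only as far as the last "> it can find.
preg_match('~content="(.*)">~i', $html, $greedy);

// Lazy: (.*?) stops at the first closing quote.
preg_match('~content="(.*?)"~i', $html, $lazy);

// Negated class: [^"]* cannot run past a quote at all.
preg_match('~content="([^"]*)"~i', $html, $negated);

echo $greedy[1], "\n";   // spills over into the second meta tag
echo $lazy[1], "\n";     // Mac, Apple
echo $negated[1], "\n";  // Mac, Apple
```

The negated class is usually the safer of the two fixes, since it cannot cross a quote even when the rest of the pattern forces backtracking.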
yu_sha
That won't work completely: he uses `^`, so the meta element would need to be at the very beginning of the HTML, which should never be the case.
JoostK
+1  A: 

Use the function get_meta_tags();

Tutorial
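
A rough sketch of how that could look when the HTML is already in a string. get_meta_tags() reads from a filename or URL, so this assumes the markup is first written to a temporary file; the sample $html and the variable names are hypothetical:

```php
<?php
// Hypothetical sample page, standing in for HTML that was already fetched.
$html = '<html><head><meta name="Keywords" content="Mac Software, Apple, iphone"></head><body></body></html>';

// get_meta_tags() takes a filename or URL, not a string of markup,
// so store the markup in a temporary file first.
$file = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents($file, $html);

$tags = get_meta_tags($file);
unlink($file);

// The array keys are the lower-cased name attributes.
$keywords = isset($tags['keywords']) ? $tags['keywords'] : '';
echo $keywords, "\n";
```

If the page has not been downloaded yet, passing its URL straight to get_meta_tags() avoids the temporary file.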

Cups
When fetching stuff to work on, I am guessing that getting the keywords is only one operation, so I always do it in two bites: 1) get the file and store it locally, 2) do my post-fetch ripping. I just find that more reliable, as so much can go wrong when fetching from the web. But if you're only after the keywords, why bother getting the file? Just use get_meta_tags().
Cups
Was not aware of the get_meta_tags function. Awesome - thanks!
SerpicoLugNut
+3  A: 

I would use an HTML/XML parser like DOMDocument together with XPath to retrieve the nodes from the DOM:

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$keywords = $xpath->query('//meta[translate(normalize-space(@name), "KEYWORDS", "keywords")="keywords"]/@content');
foreach ($keywords as $keyword) {
    echo $keyword->value;
}

The translate function is necessary because PHP’s XPath implementation only supports XPath 1.0, which has no lower-case function.

Or you do the filtering with PHP:

$metas = $xpath->query('//meta');
foreach ($metas as $meta) {
    if ($meta->hasAttribute("name") && trim(strtolower($meta->getAttribute("name")))=='keywords' && $meta->hasAttribute("content")) {
        echo $meta->getAttribute("content");
    }
}
Gumbo
I would +1 if I had any daily votes left :(
meder
+1, except, there is get_meta_tags() built in.
Svante
@Svante: But `get_meta_tags` expects a filename and not the HTML source.
Gumbo
+1  A: 

Stop trying to parse HTML with regular expressions.

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

Ether