I was thinking of writing a PHP script that would analyse a CMS'd page's content (i.e. database field) and then auto-generate (X)HTML META description & keyword tags, but as always there's no point reinventing the wheel so I'm wondering if anyone knows of such a beastie?

The former I imagine would be something like a relatively straightforward regex to grab the first sentence or two, whereas the latter would probably involve elimination of words against a common-words dictionary and then weighting of frequency or similar.
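To make the description half concrete, here is a minimal sketch of the "grab the first sentence or two" idea. It is written in Python purely for brevity, and the same regexes should port to PHP's `preg_*` functions. The function names, the 160-character cap, and the very naive tag stripping are all assumptions for illustration, not an established tool.

```python
import html
import re


def make_description(content_html, max_sentences=2, max_chars=160):
    """Return the first sentence or two of a page body as plain text.

    Hypothetical helper: the limits are illustrative defaults only.
    """
    # Naive tag removal; a real CMS field may need a proper HTML parser.
    text = re.sub(r"<[^>]+>", " ", content_html)
    text = html.unescape(text)
    # Collapse runs of whitespace left behind by the removed tags.
    text = re.sub(r"\s+", " ", text).strip()
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    description = " ".join(sentences[:max_sentences])
    if len(description) > max_chars:
        # Cut at the last word boundary that leaves room for the ellipsis.
        cut = description[: max_chars - 3].rsplit(" ", 1)[0]
        description = cut.rstrip(",;:") + "..."
    return description


def meta_description_tag(content_html):
    """Wrap the description in an XHTML-style META tag, escaping quotes."""
    content = html.escape(make_description(content_html), quote=True)
    return '<meta name="description" content="%s" />' % content
```

Abbreviations such as "e.g." will fool the sentence splitter, so treat the output as a first draft that an editor can override.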

A: 

Why? There are only two classes of people who use META keyword tags, librarians and spammers. I'm not sure about the former. I do know that search engines ignore them (or, at least, place very little value on them).

Jeff Warnica
1). Because I want to (and I'm asking the question). 2). http://stackoverflow.com/questions/162158/are-meta-keywords-obsolete
da5id
A: 

The Yahoo Pipes Term Extractor module does something similar to what you want. Unfortunately I am not aware of the source to pipes modules being open.

Sparr
+3  A: 

The problems you're considering are twofold: one of keyword extraction and one of document summarization. The first, which I'd obviously use for keywords, has a very simple naive approach: pick the most frequent words in the content, minus all stopwords (look these up on Wikipedia if you don't know what they are). There are many more advanced methods, including weighting for the inclusion of synonyms, location in the text or markup, and more. There are a few examples of easy keyword extraction scripts in PHP that you can probably implement without trouble. Just Google something like "PHP keyword extraction" and you'll find a few. (or here's the number-one result)
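As a rough illustration of the naive approach described above (count words, drop stopwords, keep the most frequent), something like the following would do. It is a sketch in Python rather than PHP, and the tiny stopword list, the `min_length` filter, and the function names are assumptions made for the example; a real list would be far longer.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); real lists are much longer.
STOPWORDS = frozenset(
    "a an and are as at be but by for from has have in is it its of on or "
    "that the this to was were will with".split()
)


def extract_keywords(text, top_n=5, min_length=3):
    """Return the top_n most frequent non-stopword words, most frequent first."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(
        w for w in words if w not in STOPWORDS and len(w) >= min_length
    )
    # most_common() keeps first-seen order for words with equal counts.
    return [word for word, _ in counts.most_common(top_n)]


def meta_keywords_tag(text, top_n=5):
    """Join the extracted keywords into an XHTML-style META keywords tag."""
    return '<meta name="keywords" content="%s" />' % ", ".join(
        extract_keywords(text, top_n)
    )
```

The more advanced methods mentioned above (synonyms, position in the markup) would replace the plain `Counter` with some weighted score, but the overall shape stays the same.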

The second problem, on the other hand, is a little more difficult, and is still the subject of a lot of academic work. You'd need summarization for a very thorough meta description tag. It may actually not be worth your time unless you're looking for a large-scale AI project, and even then the output may still come off as rigid or incoherent. Another approach would be a simple heuristic which uses keyword extraction: "This article is about (first most common keyword), (second most common keyword), and (third most common keyword)." That way you at least get the benefit of fitting some of the content into both the keywords and the description. If you'd like to shake it up, use some synonyms instead. There is a semi-functional PHP implementation of WordNet, but I'd suggest outsourcing to the Natural Language Toolkit for Python for the heavy lifting there, as most of the work is already done for you.
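The template heuristic is simple enough to sketch directly. The function below is a hypothetical helper (the name and the handling of fewer than three keywords are my own assumptions) that takes an already-ranked keyword list, however you produced it, and fills in the sentence quoted above.

```python
def describe_from_keywords(keywords):
    """Build 'This article is about a, b, and c.' from ranked keywords.

    Only the first three keywords are used; an empty list gives ''.
    """
    top = list(keywords[:3])
    if not top:
        return ""
    if len(top) == 1:
        topics = top[0]
    elif len(top) == 2:
        topics = top[0] + " and " + top[1]
    else:
        topics = ", ".join(top[:-1]) + ", and " + top[-1]
    return "This article is about %s." % topics
```

For example, `describe_from_keywords(["php", "meta", "keywords", "tags"])` gives `"This article is about php, meta, and keywords."`. Swapping in synonyms would just mean transforming the list before passing it in.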

I'd like to take a brief moment to encourage your research in this area and ignore the naysaying from Mr. Warnica. Meta information is important both for document classification and information extraction in the area of search. It would be foolish not to have the data, and it is, in fact, worthwhile to automate it for large-scale content management systems. Good luck with your efforts.

Robert Elwell
Thanks for your considerate answer and thorough understanding of where I'm coming from. I voted you +1 but strangely someone else appears to have done the opposite - Mr Warnica perhaps?
da5id
A: 

Or people who are too lazy to generate the keywords themselves and do not rely on other people to use them correctly :)

A: 

Is there any way to generate meta tags through PHP? I have several pages to update. If anyone has a script or tutorial, that would be great and much appreciated.

hemant