ansaurus

Question

Extract Meta Keywords From Webpage?

Answer 1

+6 A:

BeautifulSoup is a great way to parse HTML with Python.

In particular, check out the findAll method (spelled find_all in BeautifulSoup 4): http://www.crummy.com/software/BeautifulSoup/documentation.html
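
As a rough illustration of that suggestion, here is a minimal sketch that assumes BeautifulSoup 4 (the bs4 package) is installed. The sample HTML and the helper name meta_keywords are made up for this example; they are not part of the library or of the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; a real one would come from urlopen(...).read().
html = """<html><head>
<meta name="Keywords" content="xml, tutorial, html">
<meta name="description" content="A sample page">
</head><body></body></html>"""

soup = BeautifulSoup(html, "html.parser")


def meta_keywords(soup):
    """Collect the comma-separated keywords from every <meta name="keywords"> tag."""
    keywords = []
    for tag in soup.find_all("meta"):
        # Compare the name case-insensitively: pages write Keywords, keywords, KEYWORDS...
        if tag.get("name", "").lower() == "keywords":
            content = tag.get("content", "")
            keywords.extend(k.strip() for k in content.split(",") if k.strip())
    return keywords


print(meta_keywords(soup))  # ['xml', 'tutorial', 'html']
```

The name comparison is done in Python rather than in the find_all call so that differently capitalised attribute values are all accepted.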

orangeoctopus 2010-07-09 19:17:55

Answer 2

+2 A:

lxml is faster than BeautifulSoup (I think) and offers more functionality, such as XPath queries, while remaining relatively easy to use. Example:

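>>> from urllib.request import urlopen  # Python 3; in Python 2: from urllib import urlopen
>>> from lxml import etree

>>> f = urlopen("http://www.google.com").read()
>>> tree = etree.HTML(f)        # parse the page with lxml's HTML parser
>>> m = tree.xpath("//meta")    # select every <meta> element in the document

>>> for i in m:
...     print(etree.tostring(i, encoding="unicode"))
...
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-2"/>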

Edit: another example.

>>> f = urlopen("http://www.w3schools.com/XPath/xpath_syntax.asp").read()
>>> tree = etree.HTML(f)
>>> # first <meta> whose name attribute is exactly "Keywords" (the comparison is case-sensitive)
>>> tree.xpath("//meta[@name='Keywords']")[0].get("content")
"xml,tutorial,html,dhtml,css,xsl,xhtml,javascript,asp,ado,vbscript,dom,sql,colors,soap,php,authoring,programming,training,learning,beginner's guide,primer,lessons,school,howto,reference,examples,samples,source code,tags,demos,tips,links,FAQ,tag list,forms,frames,color table,w3c,cascading style sheets,active server pages,dynamic html,internet,database,development,Web building,Webmaster,html guide"

BTW: XPath is worth knowing.

Another edit:

Alternatively, you can just use a regular expression:

>>> import re
>>> # decode the bytes to text first (assuming UTF-8; undecodable bytes are replaced)
>>> f = urlopen("http://www.w3schools.com/XPath/xpath_syntax.asp").read().decode("utf-8", "replace")
>>> re.search(r'<meta name="Keywords".*?content="([^"]*)"', f).group(1)
"xml,tutorial,html,dhtml,css,xsl,xhtml,javascript,asp,ado,vbscript,dom,sql, ...etc...

...but I find it less readable and more error-prone (although it needs only the standard library and still fits on one line).
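
To illustrate the "more error-prone" point, here is a small sketch of how that same pattern behaves on slightly different markup. The three HTML snippets are made up for this example and are not taken from the page above.

```python
import re

# The same pattern as in the one-liner above.
PATTERN = re.compile(r'<meta name="Keywords".*?content="([^"]*)"')

same_order = '<meta name="Keywords" content="xml,html">'
swapped = '<meta content="xml,html" name="Keywords">'    # attributes in the other order
lowercase = '<meta name="keywords" content="xml,html">'  # different capitalisation

print(PATTERN.search(same_order).group(1))  # xml,html
print(PATTERN.search(swapped))              # None
print(PATTERN.search(lowercase))            # None
```

An HTML parser treats all three tags the same way, which is the main reason to prefer one over a regular expression here.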

cji 2010-07-09 19:34:10

OK, but where are the keywords of the document? I need to check the keywords in the metadata against a list I have.

Zachary Brown 2010-07-09 19:51:42

As you can see, they are in the 'content' attribute of the `<meta>` tag whose 'name' attribute is 'Keywords' :)
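
For the "check against a list I have" part, one possible sketch follows. It swaps in the standard library's html.parser instead of lxml purely to stay dependency-free. The class MetaKeywordsParser, the function matching_keywords, the sample HTML, and the wanted list are all hypothetical names and data chosen for illustration.

```python
from html.parser import HTMLParser


class MetaKeywordsParser(HTMLParser):
    """Collects the keywords found in <meta name="keywords" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # Attribute values may be None when an attribute has no value.
        if (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.keywords.extend(
                k.strip().lower() for k in content.split(",") if k.strip()
            )


def matching_keywords(html, wanted):
    """Return the page keywords (lower-cased, in page order) that also appear in wanted."""
    parser = MetaKeywordsParser()
    parser.feed(html)
    wanted_set = {w.lower() for w in wanted}
    return [k for k in parser.keywords if k in wanted_set]


sample = '<html><head><meta name="Keywords" content="xml,tutorial,html,css"></head></html>'
print(matching_keywords(sample, ["HTML", "python", "css"]))  # ['html', 'css']
```

Both sides are lower-cased before comparing, so "HTML" in the list matches "html" on the page; drop the lower() calls if the comparison should be case-sensitive.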

cji 2010-07-09 20:07:30
