I want to fetch the title of a webpage that I open using urllib2. What is the best way to parse the HTML and find what I need (for now only the <title> tag, but I might need more in the future)?
Is there a good parsing lib for this purpose?
Here you will find some libraries for HTML/XML parsing; which one to choose depends on what you need:
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
Use Beautiful Soup.
# Python 2 with BeautifulSoup 3
import urllib2
from BeautifulSoup import BeautifulSoup

html = urllib2.urlopen("...").read()
soup = BeautifulSoup(html)
print soup.title.string
Try Beautiful Soup:
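import urllib2
from BeautifulSoup import BeautifulSoup

url = 'http://www.example.com'
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
title = soup.html.head.title
print title.contents  # list of the tag's children, i.e. the title text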
Yes, I would recommend BeautifulSoup.
If you're getting the title, it's simply:
soup = BeautifulSoup(html)
myTitle = soup.html.head.title
or
myTitle = soup('title')  # shorthand for soup.findAll('title'); returns a list of matching tags
Taken from the documentation.
It's very robust and will parse the HTML no matter how messy it is.
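If you can't install a third-party library, here is a rough sketch of the same title extraction using only the standard library. Note the assumptions: it targets Python 3 (where the module is html.parser, not the Python 2 setup used in the snippets above), the names TitleParser and extract_title are made up for illustration, and it swaps in HTMLParser instead of Beautiful Soup, so it will likely be less forgiving of badly broken markup.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Hypothetical helper: collects the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False  # True while we are between <title> and </title>
        self._done = False      # True once the first title has been closed
        self._parts = []        # text chunks seen inside the title

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> also matches here.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate and join later.
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Return None when no title text was found.
        return "".join(self._parts).strip() or None


def extract_title(html):
    """Return the page title from an HTML string, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

In Python 3 the page itself would be fetched with urllib.request.urlopen(url).read() and decoded to a string before being passed to extract_title. For anything beyond the title, Beautiful Soup as recommended above is probably still the more comfortable choice.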