Hello,

I want a regular expression to extract the title from an HTML page. Currently I have this:

title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
if title:
    title = title.replace('<title>', '').replace('</title>', '')

Is there a regular expression that will extract just the contents of <title> so I don't have to remove the tags?

thanks!

+2  A: 

Try:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)
Randy
If you really want to use regex for HTML parsing, don't call .group() directly on the result of re.search, since re.search may return None.
iElectric
You should use `.*?` in case there are multiple `</title>` tags in the document (unlikely, but you never know).
tonfa
@iElectric: you could put it in a try except block if you really want, right?
tonfa
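To illustrate the `.*?` comment above, here is a small sketch comparing the greedy and non-greedy patterns. The input string is made up for the demonstration (real pages rarely contain a second `</title>`), so treat it as an assumption rather than a realistic page:

```python
import re

# Hypothetical input with a stray second closing tag, only to show the difference.
html = "<title>First</title><p>stray</title>"

greedy = re.search(r"<title>(.*)</title>", html, re.IGNORECASE)
lazy = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE)

greedy_title = greedy.group(1)  # ".*" runs on to the last </title>
lazy_title = lazy.group(1)      # ".*?" stops at the first </title>

print(repr(greedy_title))
print(repr(lazy_title))
```

The greedy version captures everything up to the last closing tag, while the non-greedy one stops at the first.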
+2  A: 

re.search('<title>(.*)</title>', s, re.IGNORECASE).group(1)

Vinay Sajip
+4  A: 

Use ( ) and group(1) (re.search will return None if it doesn't find the result, so don't use group() directly):

title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)

if title_search:
    title = title_search.group(1)
krzyk
If you're not doing anything when no title is found, why would it be a bad thing to use group() directly? (you can catch the exception anyway)
tonfa
yeah, but most people forget about exceptions, and are really surprised when they see them at runtime :)
krzyk
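Putting the points from this answer and the earlier comments together, a minimal sketch of a None-safe helper might look like this. The names `TITLE_RE` and `extract_title` are made up for the example, and the `re.DOTALL` flag is an addition not mentioned in the thread (it lets `.` match newlines in case the title spans several lines):

```python
import re

# Non-greedy capture; DOTALL is an assumption so that "." also matches newlines.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the text between <title> and </title>, or None if there is no match."""
    match = TITLE_RE.search(html)
    if match is None:
        # re.search found nothing, so there is no group to read.
        return None
    return match.group(1).strip()


print(extract_title("<html><TITLE>\n Hi there \n</TITLE></html>"))
print(extract_title("<p>no title</p>"))
```

This still won't match a `<title>` tag that carries attributes, which is one of the reasons the answers below suggest a real parser.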
+2  A: 

Try using capturing groups:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)
Aaron Maenpaa
+2  A: 

Please, do NOT use regex to parse markup languages. Use lxml or beautifulsoup.

iElectric
It depends on the use case; sometimes a quick and dirty solution is desirable (especially if you don't want to handle every kind of possible input).
tonfa
It takes two minutes to write HTML on which those regexes will fail, or backtrack and thus eat CPU cycles.
iElectric
But when scraping a website, the owners don't usually change their HTML with the purpose of breaking your parser (and in some cases you already need to rely on the structure of the generated HTML, instead of just the HTML tree, to extract more information).
tonfa
@tonfa: I disagree. Many sites seem to go to great lengths to make it very difficult to scrape them. If you scrape the site, you miss out on the beautiful and lucrative advertising they want you to read.
TokenMacGuy
+1  A: 

Using regular expressions to parse the HTML is generally not a good idea. You can use any HTML parser like Beautiful Soup for that. Check out http://www.crummy.com/software/BeautifulSoup/documentation.html

Also remember that some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Cydork
http://www.google.com/search?q=site%3Astackoverflow.com+regex+"now+(he+OR+they)+(have+OR+has)+two+problems"
Alan Moore
A: 

Don't use regular expressions for HTML parsing in Python. Use an HTML parser! (Unless you're going to write a full parser, which would be a lot of extra work when various HTML, SGML and XML parsers are already in the standard library.)

If you're handling "real world" tag soup HTML (which frequently fails any SGML/XML validator), then use the BeautifulSoup package, which isn't in the standard library (yet) but is widely recommended for this purpose.

Jim Dennis
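For the standard-library route mentioned in this answer, here is a rough sketch using Python 3's `html.parser.HTMLParser`. The class name `TitleParser` and the helper `title_from_html` are invented for the example, and the choice to return None when no complete `<title>` element is seen is just one reasonable assumption:

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # The parser passes tag names in lower case.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # None unless a complete <title>...</title> element was seen.
        return "".join(self._parts).strip() if self._done else None


def title_from_html(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


print(title_from_html("<html><head><title>My Page</title></head><body></body></html>"))
```

For messy real-world pages, BeautifulSoup or lxml, as recommended above, will likely be more forgiving than this.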