ansaurus

Question

Answer 1

+2 A:

Try:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

Randy 2009-08-25 10:28:45

If you really want to use regex for HTML parsing, don't call .group() directly on the result of re.search, since re.search may return None.

iElectric 2009-08-25 10:37:41

You should use `.*?` in case there are multiple `</title>` tags in the document (unlikely, but you never know).

tonfa 2009-08-25 10:41:47
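To illustrate the difference between `.*` and `.*?`, here is a minimal sketch. The input string is made up for the example; it is not from the question.

```python
import re

# Hypothetical input containing two closing </title> tags.
html = "<title>First</title><p>stray</p></title>"

greedy = re.search(r"<title>(.*)</title>", html, re.IGNORECASE)
lazy = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE)

# The greedy group runs on to the LAST </title> in the string.
greedy_title = greedy.group(1)
# The non-greedy group stops at the FIRST </title>.
lazy_title = lazy.group(1)

print(repr(greedy_title))
print(repr(lazy_title))
```

With this input the greedy version captures the stray markup as part of the title, while the non-greedy version captures only the text up to the first closing tag.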

@iElectric: you could put it in a try/except block if you really want, right?

tonfa 2009-08-25 10:45:14
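A rough sketch of the try/except variant, assuming you simply want None when there is no title. The function name `title_or_none` is invented for this example.

```python
import re

def title_or_none(html):
    """Return the text of the first <title> element, or None if absent."""
    try:
        return re.search(r"<title>(.*?)</title>", html, re.IGNORECASE).group(1)
    except AttributeError:
        # re.search returned None, so .group() was called on None.
        return None

print(title_or_none("<TITLE>Hello</TITLE>"))
print(title_or_none("<p>no title here</p>"))
```

Calling .group() on None raises AttributeError, which is the only exception caught here, so other errors still surface.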

Answer 2

+2 A:

re.search('<title>(.*)</title>', s, re.IGNORECASE).group(1)

Vinay Sajip 2009-08-25 10:28:53

Answer 3

+4 A:

Use ( ) and group(1). re.search returns None if it doesn't find a match, so don't call group() directly on its result:

title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)

if title_search:
    title = title_search.group(1)

krzyk 2009-08-25 10:29:31

If you're not doing anything when no title is found, why would it be a bad thing to use group() directly? (you can catch the exception anyway)

tonfa 2009-08-25 10:52:57

yeah, but most people forget about exceptions, and are really surprised when they see them at runtime :)

krzyk 2009-08-25 18:30:21

Answer 4

+2 A:

Try using capturing groups:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

Aaron Maenpaa 2009-08-25 10:30:02

Answer 5

+2 A:

Please, do NOT use regex to parse markup languages. Use lxml or beautifulsoup.

iElectric 2009-08-25 10:31:31

It depends on the use case; sometimes a quick and dirty solution is desirable (especially if you don't want to handle every kind of possible input).

tonfa 2009-08-25 10:43:41

It takes two minutes to write HTML on which those regexes will fail, or backtrack and thus eat CPU cycles.

iElectric 2009-08-25 10:52:03
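As a quick illustration of how little it takes, here are two plausible inputs on which the pattern from the answers above finds nothing. Both fragments are invented for the example.

```python
import re

PATTERN = re.compile(r"<title>(.*)</title>", re.IGNORECASE)

# Hypothetical fragments: an attribute on the tag, and a title split across lines.
with_attribute = '<title lang="en">Home</title>'
multi_line = "<title>\n  Home\n</title>"

# No match: the pattern expects a bare "<title>" with no attributes.
attr_match = PATTERN.search(with_attribute)
# No match: "." does not match newlines unless re.DOTALL is passed.
multiline_match = PATTERN.search(multi_line)

print(attr_match, multiline_match)
```

Each case can be patched in the regex (for example with `<title[^>]*>` and `re.DOTALL`), but every patch covers only one more variation of the input.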

But when scraping a website, they don't usually change their HTML with the purpose of breaking your parser (and in some cases you already need to rely on the structure of the generated HTML, instead of just the HTML tree, to extract more information).

tonfa 2009-08-25 11:10:26

@tonfa: I disagree. Many sites seem to go to great lengths to make it very difficult to scrape them. If you scrape the site, you miss out on the beautiful and lucrative advertising they want you to read.

TokenMacGuy 2009-08-25 18:55:17

Answer 6

+1 A:

Using regular expressions to parse the HTML is generally not a good idea. You can use any HTML parser like Beautiful Soup for that. Check out http://www.crummy.com/software/BeautifulSoup/documentation.html

Also remember that some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Cydork 2009-08-25 10:35:08

http://www.google.com/search?q=site%3Astackoverflow.com+regex+"now+(he+OR+they)+(have+OR+has)+two+problems"

Alan Moore 2009-08-25 18:49:00

Answer 7

A:

Don't use regular expressions for HTML parsing in Python. Use an HTML parser! (Unless you're going to write a full parser, which would be a lot of extra work when various HTML, SGML and XML parsers are already in the standard libraries.)

If you're handling "real world" tag soup HTML (which frequently fails to conform to any SGML/XML validator), then use the BeautifulSoup package (which isn't in the standard libraries (yet) but is widely recommended for this purpose).

Jim Dennis 2009-08-25 10:35:29
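For the standard-library route, a minimal sketch might look like the following. It assumes Python 3, where the parser lives in `html.parser`; the names `TitleParser` and `extract_title` are made up for this example.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <TITLE> is handled too.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect the pieces.
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts).strip() if self._done else None


def extract_title(html):
    """Return the first title's text, or None if no <title> element was closed."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


print(extract_title('<html><head><TITLE lang="en">\n  Home\n</TITLE></head></html>'))
```

Unlike the regex versions, this copes with attributes on the tag, mixed case, and a title that spans several lines without any extra flags.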


python extract contents of regex
