ansaurus

Question

Python regex look-behind requires fixed-width pattern

Answer 1

+5 A:

Toss out the idea of parsing HTML with regular expressions and use an actual HTML parsing library instead. After a quick search I found this one. It's a much safer way to extract information from an HTML file.

Remember, HTML is not a regular language, so regular expressions are fundamentally the wrong tool for extracting information from it.
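As a rough illustration of the parser-based approach, here is a minimal sketch that pulls the title out of a page with Python 3's standard-library `html.parser`. It is not the library linked above; the names `TitleParser`, `extract_title`, and the `sample` string are made up for this example.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes are ignored here.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts).strip()


def extract_title(html_text):
    """Return the stripped title text, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title


# Hypothetical sample document, with attributes and newlines in the title tag.
sample = "<html><head><title lang='en'>\n  Example page\n</title></head><body><p>Hi</p></body></html>"
print(extract_title(sample))
```

Because the parser tracks tags rather than matching text, attributes on the title tag and line breaks inside it need no special handling.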

Welbog 2010-04-10 11:47:16

BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) is also a good option.

Matthew Flaschen 2010-04-10 11:52:39

Answer 2

+2 A:

Here's a famous answer on parsing HTML with regular expressions that does a great job of saying, "don't use regex to parse HTML."

Stephen Harmon 2010-04-10 13:01:49

Answer 3

+1 A:

If you just want the contents of the title tag:

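import urllib2  # Python 2; use urllib.request in Python 3

html = urllib2.urlopen("http://somewhere").read()
# Each chunk before a "</title>" may contain an opening "<title>" tag.
for item in html.split("</title>"):
    if "<title>" in item:
        # Skip past "<title>" (7 characters) and print the rest.
        print item[item.find("<title>") + 7:]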

ghostdog74 2010-04-10 13:04:30

Answer 4

+1 A:

What about something like:

import re

r = re.compile(r"(<title[^>]*>)([\s\S]*?)(</title>)")
title = r.search(page).group(2)
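A capturing group like this sidesteps the restriction in the question title: Python's `re` module rejects a look-behind whose width can vary, but a group can match the variable-width opening tag and capture only what follows. The sketch below (Python 3) shows both; the look-behind pattern is a guess at the kind of expression that triggers the error, and `title_re` and `title_from` are names invented for the example.

```python
import re

error_message = None

# A look-behind containing ".*" has no fixed width, so compiling it fails.
try:
    re.compile(r"(?<=<title.*>)[\s\S]*?(?=</title>)")
except re.error as exc:
    error_message = str(exc)
    print(error_message)

# Workaround: match the opening tag normally and capture only the contents.
title_re = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def title_from(page):
    """Return the stripped title text, or None when no title tag is found."""
    match = title_re.search(page)
    return match.group(1).strip() if match else None
```

This is still a regex over HTML, so it only suits quick one-off extraction from pages with a simple, well-formed title tag.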

Vojtech R. 2010-04-10 17:22:53
