ansaurus

Question

Python regex help needed

Answer 1

+7 A:

Don't parse HTML with regex.

Regex is not the right tool to use for this problem. Look up BeautifulSoup or lxml.
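
For anyone who wants to see what the parser route looks like without installing anything, here is a minimal sketch using Python 3's standard-library html.parser as a stand-in for BeautifulSoup or lxml. The class name NeededInfoParser and the helper extract_needed_info are made up for illustration, and the tag/attribute pairs are assumed from the markup quoted in the other answers.

```python
from html.parser import HTMLParser


class NeededInfoParser(HTMLParser):
    """Collect text inside <font color="red"> or bold <span> tags."""

    # (tag, attribute, value) combinations assumed from the question
    WANTED = {("font", "color", "red"), ("span", "style", "font-weight:bold;")}

    def __init__(self):
        super().__init__()
        self.stack = []    # one True/False entry per open font/span tag
        self.results = []  # text collected from wanted tags

    def handle_starttag(self, tag, attrs):
        if tag in ("font", "span"):
            wanted = any((tag, name, value) in self.WANTED for name, value in attrs)
            self.stack.append(wanted)

    def handle_endtag(self, tag):
        if tag in ("font", "span") and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # keep text only when the innermost open font/span tag is a wanted one
        if self.stack and self.stack[-1]:
            self.results.append(data)


def extract_needed_info(html):
    """Hypothetical helper: return the list of wanted text fragments."""
    parser = NeededInfoParser()
    parser.feed(html)
    parser.close()
    return parser.results
```

Calling extract_needed_info(web_source_code) returns a list of the matching text fragments. A real parser library will cope with messier markup than this sketch does.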

katrielalex 2010-08-01 15:28:32

ah - I love this.

Tim 2010-08-01 15:39:53

Answer 2

+1 A:

Regular expressions are not your best choice for parsing HTML, but for the sake of education, here is a possible answer to your question:

start = '<(?P<tag>font|tag) color="red">'
end = '</(?P=tag)>'
expression = start + '(.*?)' + end
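
A small runnable sketch of how this could be used, assuming Python's re module. The helper name find_info is hypothetical. The point to notice is that the named group "tag" takes up group 1, so the captured text ends up in group 2, and (?P=tag) forces the closing tag to match the opening one.

```python
import re

start = '<(?P<tag>font|tag) color="red">'
end = '</(?P=tag)>'  # backreference: must repeat whatever "tag" matched
expression = re.compile(start + '(.*?)' + end)


def find_info(html):
    """Return the captured text, or None when nothing matches."""
    match = expression.search(html)
    # group 1 is the named "tag" group, so the text is group 2
    return match.group(2) if match else None
```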

WoLpH 2010-08-01 15:31:18

Answer 3

+1 A:

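import re

expression = '(<font color="red">(.*?)</font>|<span style="font-weight:bold;">(.*?)</span>)'
match = re.compile(expression).search(web_source_code)
# group 2 holds the <font> text, group 3 the <span> text; only one of them is set
needed_info = match.group(2) if match.group(2) is not None else match.group(3)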

This would get the job done, but you shouldn't really be using regex to parse HTML.

Ed 2010-08-01 15:31:53

Answer 4

A:

I haven't used Python, but if you set expression to the following, it should work:

/(?P<open><(font|span)[^>]*>)(?P<info>[^<]+)(?P<close><\/(font|span)>)/gi

Then just access your needed info with the name "info".
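
In Python the /.../gi delimiters and flags are not part of the pattern, so a rough translation (a sketch, not tested against the real page) passes re.IGNORECASE and uses finditer in place of the g flag. The helper name all_info is hypothetical. Note that this pattern accepts any font or span tag regardless of its attributes.

```python
import re

pattern = re.compile(
    r'(?P<open><(font|span)[^>]*>)(?P<info>[^<]+)(?P<close></(font|span)>)',
    re.IGNORECASE,  # stands in for the "i" flag
)


def all_info(html):
    """Return the "info" group of every match (stands in for the "g" flag)."""
    return [m.group('info') for m in pattern.finditer(html)]
```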

PS - I also agree about the "not parsing HTML with regex" rule, but if you know that it will appear in either font or span tags, then so be it...

Also, why use the font tag? I haven't used a font tag since I learned CSS.

Tim 2010-08-01 15:31:57

Answer 5

+2 A:

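You can join two alternatives with a vertical bar. Wrap each alternation in a non-capturing group, (?:...), so that it stays intact when start and end are concatenated into a larger expression:

start = '(?:<font color="red">|<span style="font-weight:bold;">)'
end = '(?:</font>|</span>)'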

since you know that a font tag will always be closed by </font>, a span tag always by </span>.
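
A minimal runnable sketch of this alternation, assuming start and end are concatenated around a capture group as in the other answers. Each alternation is wrapped in a non-capturing group so that the | cannot spread across the whole concatenated expression. The function name needed_info is made up for illustration.

```python
import re

# non-capturing groups keep each alternation self-contained
start = '(?:<font color="red">|<span style="font-weight:bold;">)'
end = '(?:</font>|</span>)'
expression = re.compile(start + '(.*?)' + end)


def needed_info(web_source_code):
    """Return the first captured text, or None when nothing matches."""
    match = expression.search(web_source_code)
    return match.group(1) if match else None
```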

However, consider using a solid HTML parser such as BeautifulSoup rather than rolling your own regular expressions. HTML is, in general, particularly unsuitable for parsing with regular expressions.

Alex Martelli 2010-08-01 15:31:59

Thanks, it works.

Ando 2010-08-01 15:43:40

+1 unlike mine, an actually helpful answer =)

katrielalex 2010-08-01 15:45:39

Answer 6

+1 A:

Regex and HTML are not such a good match: HTML has too many potential variations that will trip up your regex. BeautifulSoup is the standard tool to employ here, but I find pyparsing can be just as effective, and sometimes even simpler to construct when trying to locate a particular tag relative to a particular previous tag.

Here is how to address your question using pyparsing:

html = """ need to get info from a website that outputs it between <font color="red">needed-info-here</font> OR <span style="font-weight:bold;">needed-info-here</span>, randomly.
<font color="white">but not this info</font> and 
<span style="font-weight:normal;">dont want this either</span>
"""

from pyparsing import makeHTMLTags, withAttribute, SkipTo

font,fontEnd = makeHTMLTags("FONT")
# only match <font> tags with color="red"
font.setParseAction(withAttribute(color="red"))
# only match <span> tags with given style
span,spanEnd = makeHTMLTags("SPAN")
span.setParseAction(withAttribute(style="font-weight:bold;"))

# define full match patterns, define "body" results name for easy access
fontpattern = font + SkipTo(fontEnd)("body") + fontEnd
spanpattern = span + SkipTo(spanEnd)("body") + spanEnd

# now create a single pattern, matching either of the other patterns
searchpattern = fontpattern | spanpattern

# call searchString, and extract body element from each match
for text in searchpattern.searchString(html):
    print(text.body)

Prints:

needed-info-here
needed-info-here

Paul McGuire 2010-08-01 15:47:05
