I need to get info from a website that outputs it between <font color="red">needed-info-here</font> OR <span style="font-weight:bold;">needed-info-here</span>, randomly.

I can get it when I use

import re

start = '<font color="red">'
end = '</font>'
expression = start + '(.*?)' + end
match = re.compile(expression).search(web_source_code)
needed_info = match.group(1)

but then I have to choose either <font> or <span>, and the match fails whenever the site uses the other tag.

How do I modify the regular expression so it would always succeed?

+7  A: 

Don't parse HTML with regex.

Regex is not the right tool to use for this problem. Look up BeautifulSoup or lxml.
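
To give an idea of what the parser route looks like, here is a rough sketch. BeautifulSoup or lxml would likely be shorter; this one uses the standard library's html.parser instead, only so that it runs without installing anything. The names NeededInfoParser and extract_needed_info are made up for illustration, and the sketch assumes the wanted tags are not nested inside each other.

```python
from html.parser import HTMLParser


class NeededInfoParser(HTMLParser):
    """Collects text inside <font color="red"> or <span style="font-weight:bold;">."""

    # (tag, attribute, value) combinations taken from the question
    WANTED = {("font", "color", "red"), ("span", "style", "font-weight:bold;")}

    def __init__(self):
        super().__init__()
        self.results = []      # extracted strings, in document order
        self._open_tag = None  # name of the wanted tag we are currently inside
        self._buffer = []      # text pieces seen since that tag opened

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over lowercased tag and attribute names
        if self._open_tag is None and any((tag, k, v) in self.WANTED for k, v in attrs):
            self._open_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._open_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self.results.append("".join(self._buffer))
            self._open_tag = None


def extract_needed_info(html):
    """Hypothetical helper: return every wanted piece of text found in html."""
    parser = NeededInfoParser()
    parser.feed(html)
    parser.close()
    return parser.results
```

Calling extract_needed_info(web_source_code) returns a list, so the first hit (if any) is element 0.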

katrielalex
ah - I love this.
Tim
+1  A: 

Regular expressions are not your best choice for parsing HTML, but for the sake of education, here is a possible answer to your question:

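start = '<(?P<tag>font|span) (?:color="red"|style="font-weight:bold;")>'
end = '</(?P=tag)>'
expression = start + '(.*?)' + end
# the named group "tag" is group 1, so the needed info ends up in group 2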
WoLpH
+1  A: 
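expression = '(<font color="red">(.*?)</font>|<span style="font-weight:bold;">(.*?)</span>)'
match = re.compile(expression).search(web_source_code)
# group 2 is filled by the <font> branch, group 3 by the <span> branch
needed_info = match.group(2) if match.group(2) is not None else match.group(3)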

This would get the job done, but you really shouldn't be using regex to parse HTML.

Ed
A: 

I haven't used Python, but if you set expression to the following, it should work:

/(?P<open><(font|span)[^>]*>)(?P<info>[^<]+)(?P<close><\/(font|span)>)/gi

Then just access your needed info with the name "info".
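
In Python, the /.../gi wrapper is not used, so the pattern above needs a small adaptation. A possible sketch follows: re.IGNORECASE stands in for the "i" flag and finditer for the "g" flag. The function name find_info is hypothetical. Note that, like the original pattern, this matches any font or span tag regardless of its attributes, and it does not check that the closing tag matches the opening one.

```python
import re

# Same pattern as above, without the /.../gi delimiters and flags.
pattern = re.compile(
    r'(?P<open><(font|span)[^>]*>)(?P<info>[^<]+)(?P<close></(font|span)>)',
    re.IGNORECASE,
)


def find_info(web_source_code):
    """Return the text of every font/span element found, in order."""
    return [m.group('info') for m in pattern.finditer(web_source_code)]
```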

PS - I also agree about the "not parsing HTML with regex" rule, but if you know that it will appear in either font or span tags, then so be it...

Also, why use the font tag? I haven't used a font tag since I learned CSS.

Tim
+2  A: 

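You can join two alternatives with a vertical bar; wrap each pair in a non-capturing group so the alternation stays contained when the pieces are concatenated:

start = '(?:<font color="red">|<span style="font-weight:bold;">)'
end = '(?:</font>|</span>)'

since you know that a font tag will always be closed by </font>, and a span tag always by </span>.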
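
One thing to watch when these pieces are glued around '(.*?)': the vertical bar binds more loosely than concatenation, so without a group around each alternation the combined pattern can match a bare <font color="red"> and leave group 1 unset. A minimal sketch of the full lookup, with get_needed_info as a made-up helper name:

```python
import re

# Alternatives taken from the question; the (?:...) groups keep each
# alternation contained after concatenation.
start = '(?:<font color="red">|<span style="font-weight:bold;">)'
end = '(?:</font>|</span>)'
expression = re.compile(start + '(.*?)' + end)


def get_needed_info(web_source_code):
    """Return the first piece of needed info, or None if neither tag is present."""
    match = expression.search(web_source_code)
    return match.group(1) if match else None
```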

However, consider using a solid HTML parser such as BeautifulSoup rather than rolling your own regular expressions; HTML is, in general, particularly unsuitable for parsing with regular expressions.

Alex Martelli
Thanks, it works.
Ando
+1 unlike mine, an actually helpful answer =)
katrielalex
+1  A: 

Regex and HTML are not such a good match: HTML has too many potential variations that will trip up your regex. BeautifulSoup is the standard tool to employ here, but I find pyparsing can be just as effective, and sometimes even simpler to construct when trying to locate a particular tag relative to a particular previous tag.

Here is how to address your question using pyparsing:

html = """ need to get info from a website that outputs it between <font color="red">needed-info-here</font> OR <span style="font-weight:bold;">needed-info-here</span>, randomly.
<font color="white">but not this info</font> and 
<span style="font-weight:normal;">dont want this either</span>
"""

from pyparsing import makeHTMLTags, withAttribute, SkipTo

font,fontEnd = makeHTMLTags("FONT")
# only match <font> tags with color="red"
font.setParseAction(withAttribute(color="red"))
# only match <span> tags with given style
span,spanEnd = makeHTMLTags("SPAN")
span.setParseAction(withAttribute(style="font-weight:bold;"))

# define full match patterns, define "body" results name for easy access
fontpattern = font + SkipTo(fontEnd)("body") + fontEnd
spanpattern = span + SkipTo(spanEnd)("body") + spanEnd

# now create a single pattern, matching either of the other patterns
searchpattern = fontpattern | spanpattern

# call searchString, and extract body element from each match
for text in searchpattern.searchString(html):
    print(text.body)

Prints:

needed-info-here
needed-info-here
Paul McGuire