Hello, I have the content of a web page and need to get some data out of it. It looks like:

<div class="deg">DATA</div>

As I understand it, I have to use a regex, but I can't work out which one.

I tried the code below but got no results. Please correct me:

regexHandler = re.compile('(<div class="deg">(?P<div class="deg">.*?)</div>)')
result = regexHandler.search( pageData )
+2  A: 

If you want the div tags included in the matched item:

regexHandler = re.compile('(<div class="deg">.*?</div>)')

If you don't want the div tags included, only the DATA portion:

regexHandler = re.compile('<div class="deg">(.*?)</div>')

Then to run the match and get the result:

result = regexHandler.search( pageData )
matchedText = result.groups()[0]
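
Put together, a minimal self-contained sketch of the second variant might look like the following. The sample `pageData` string is made up for illustration, and the `None` check is an addition, since `search()` returns `None` when nothing matches.

```python
import re

# Hypothetical sample input; real page data would come from the downloaded page.
pageData = '<html><body><div class="deg">DATA</div></body></html>'

# Capture only the text between the tags (non-greedy).
regexHandler = re.compile(r'<div class="deg">(.*?)</div>')

result = regexHandler.search(pageData)
# search() returns None when there is no match, so check before using the result.
matchedText = result.group(1) if result else None
print(matchedText)
```

`result.group(1)` is equivalent to `result.groups()[0]` here; both return the first capture group rather than the whole match.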
Amber
Your second version returns the whole string, including the <div..> tags. Any ideas?
Ockonal
Ah, sorry - it needed to be `result.groups()[0]` instead of `result.group()` in order to get the capture match instead of the entire string matched. :)
Amber
Thanks. It works ;)
Ockonal
+5  A: 

I suggest using a good HTML parser (such as BeautifulSoup -- but for your purposes, i.e. with well-formed HTML as input, the ones that come with the Python standard library, such as HTMLParser, should also work well) rather than raw REs to parse HTML.

If you want to persist with the raw RE approach, the pattern:

r'<div class="deg">([^<]*)</div>'

looks like the simplest way to get the string 'DATA' out of the string '<div class="deg">DATA</div>' -- assuming that's what you're after. You may need to add one or more \s* in spots where you need to tolerate optional whitespace.
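
As a rough sketch of the parser route using only the standard library, something like the following could work. It assumes Python 3, where the module lives at `html.parser`; the class name `DegExtractor` and the sample markup are hypothetical, and the class attribute is compared exactly against `"deg"`, so a div with several classes would not match.

```python
from html.parser import HTMLParser


class DegExtractor(HTMLParser):
    """Collects the text of every <div class="deg"> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a matching div (counts nested divs)
        self.results = []   # one string per matching div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # a nested div inside a matching one
        elif dict(attrs).get("class") == "deg":
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


parser = DegExtractor()
parser.feed('<div class="deg">DATA1</div><p>skip</p><div class="deg">DATA2</div>')
parser.close()
print(parser.results)
```

Counting nested divs in `depth` is what lets this version stay balanced where a bare regular expression would not.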

Alex Martelli
Why the more complex `([^<]*)` group? A non-greedy `.*?` should work fine.
Amber
`.*?` will tolerate (and absorb) embedded tags, and get out of balance if the div contains another div inside (grabbing the start but not the end of the inner div), while the pattern I suggest will only match when the div contains pure textual data, i.e., no embedded tags, which seems sounder in the absence of clear specs. Such complications are part of why I start by recommending *avoiding* bare REs for HTML parsing, and reusing instead, for the purpose, any of the many excellent existing modules, both in the standard library and third-party ones.
Alex Martelli
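
The difference between the two patterns discussed above can be seen in a small sketch. The `nested` and `plain` strings are invented inputs, not data from the question.

```python
import re

# Hypothetical inputs: one div with a nested div, one with plain text only.
nested = '<div class="deg">outer <div>inner</div> tail</div>'
plain = '<div class="deg">DATA</div>'

loose = re.compile(r'<div class="deg">(.*?)</div>')
strict = re.compile(r'<div class="deg">([^<]*)</div>')

# The non-greedy pattern stops at the first </div>, which closes the inner div.
loose_nested = loose.search(nested).group(1)
print(loose_nested)            # outer <div>inner

# The strict pattern refuses to match at all when tags are embedded.
print(strict.search(nested))   # None

# On tag-free content both patterns agree.
print(loose.search(plain).group(1), strict.search(plain).group(1))
```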
Different approaches to a suboptimal treatment of the problem, I suppose - in the absence of a true parser, you choose to go the stricter route, I choose to go the looser route; either one has its advantages and disadvantages. I agree that some form of actual DOM parser would be more ideal for a general-case of this problem, but that's up to Ockonal to make the call on, since he's the one with the best knowledge of the data he'll be getting. Thanks for the good discussion though. :)
Amber
@Dav, yep, good points. I should point out that REs (unless extended to be way more than REs, like in recent versions of Perl) notoriously can't do "parsing with balanced parentheses"... and that's what XML and HTML are all about, making REs _especially_ unsuitable for such tasks (even though people keep trying!-).
Alex Martelli
+1 Use html parsers for html!
TokenMacGuy
Thank you. Your post is very informative for me. I'll learn about it later.
Ockonal
A: 

You can use simple string functions in Python; there is no need for a regex.

mystr = """< div class="deg">DATA< /div>"""
if "div" in mystr and "class" in mystr and "deg" in mystr:
    s = mystr.split(">")
    for n,item in enumerate(s):
        if "deg" in item:
            print(s[n+1][:s[n+1].index("<")])

My approach: find something to split on. In the example above, I split on ">". Then go through the split items, check for "deg", and take the item after it, since "deg" appears just before the data you want. Of course, this is not the only approach.
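
A variant of the same idea using `str.find` instead of `split` might look like this. The helper name `text_after_marker` and its default arguments are hypothetical, and the sketch assumes the opening tag is written exactly as given, with no extra whitespace or attributes.

```python
def text_after_marker(page, marker='<div class="deg">', end="</div>"):
    """Return the text between marker and end, or None if either is missing."""
    start = page.find(marker)
    if start == -1:
        return None
    start += len(marker)          # move past the opening tag
    stop = page.find(end, start)  # first closing tag after it
    if stop == -1:
        return None
    return page[start:stop]


print(text_after_marker('<p>x</p><div class="deg">DATA</div>'))
```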

ghostdog74
A: 

While it is OK to use regex for quick and dirty HTML processing, a much better and cleaner way is to use an HTML parser like lxml.html and to query the parsed tree with XPath or CSS selectors.

html = """<html><body><div class="deg">DATA1</div><div class="deg">DATA2</div></body></html>"""

import lxml.html

page = lxml.html.fromstring(html)
#page = lxml.html.parse(url)

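# find every <div class="deg"> with an XPath-style query
for element in page.findall('.//div[@class="deg"]'):
    print(element.text)

# using CSS selectors (recent lxml releases need the separate cssselect package for this)
from lxml.cssselect import CSSSelector
sel = CSSSelector("div.deg")

for element in sel(page):
    print(element.text)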
Peter Hoffmann