ansaurus

Question

Answer 1

+2 A:

If you want the div tags included in the matched item:

regexpHandler = re.compile('(<div class="deg">.*?</div>)')

If you don't want the div tags included, only the DATA portion:

regexpHandler = re.compile('<div class="deg">(.*?)</div>')

Then to run the match and get the result:

result = regexpHandler.search(pageData)
matchedText = result.groups()[0]
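
Put together, a minimal self-contained sketch of the steps above might look like this. The sample `pageData` string is made up for illustration, and the `None` check is an addition: `search()` returns `None` when nothing matches, so calling `.groups()` on it directly would raise an error.

```python
import re

# Hypothetical sample input; in practice pageData would be the fetched page.
pageData = '<html><body><div class="deg">DATA</div></body></html>'

# Capture only the text between the tags (non-greedy).
regexpHandler = re.compile(r'<div class="deg">(.*?)</div>')

result = regexpHandler.search(pageData)
# search() returns None if there is no match, so guard before using it.
# result.group(1) is the same value as result.groups()[0].
matchedText = result.group(1) if result else None
print(matchedText)
```

If the content of the div can span several lines, passing `re.DOTALL` to `re.compile` lets `.` match newlines as well.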

Amber 2009-08-09 21:25:09

Your second pattern returns the whole string, including the <div ...> tags. Any ideas?

Ockonal 2009-08-09 21:32:45

Ah, sorry - it needed to be `result.groups()[0]` instead of `result.group()` in order to get the capture match instead of the entire string matched. :)

Amber 2009-08-09 22:41:34

Thanks. It works ;)

Ockonal 2009-08-10 06:55:46

Answer 2

+5 A:

I suggest using a good HTML parser (such as BeautifulSoup -- but for your purposes, i.e. with well-formed HTML as input, the ones that come with the Python standard library, such as HTMLParser, should also work well) rather than raw REs to parse HTML.
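
As a rough sketch of the standard-library route, assuming Python 3 (where the module is `html.parser`; it was called `HTMLParser` in Python 2): the class name `DegDivParser` and the sample input are made up for illustration, and the class check assumes the attribute is exactly `class="deg"` with no other classes.

```python
from html.parser import HTMLParser  # Python 3 name; was the HTMLParser module in Python 2


class DegDivParser(HTMLParser):
    """Collects the text of every <div class="deg"> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a matching div
        self.chunks = []    # text pieces of the div currently being read
        self.results = []   # finished texts, one per matching div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # a div nested inside a matching one
        elif dict(attrs).get("class") == "deg":
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.chunks))
                self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


parser = DegDivParser()
parser.feed('<div class="deg">DATA1</div><p>skip</p><div class="deg">DATA2</div>')
parser.close()
print(parser.results)
```

Tracking the nesting depth is what lets this handle a div inside the target div, which is exactly the case a plain RE struggles with.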

If you want to persist with the raw RE approach, the pattern:

r'<div class="deg">([^<]*)</div>'

looks like the simplest way to get the string 'DATA' out of the string '<div class="deg">DATA</div>' -- assuming that's what you're after. You may need to add one or more \s* in spots where you need to tolerate optional whitespace.
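
For instance, a whitespace-tolerant variant of the pattern could look roughly like this. Where exactly to put the `\s*` depends on your input, so the placement below is just one guess, and the sample string is made up.

```python
import re

# \s* / \s+ tolerate optional whitespace around the attribute and the text.
# The group is non-greedy here so that trailing whitespace is left out of it.
pattern = re.compile(r'<div\s+class\s*=\s*"deg"\s*>\s*([^<]*?)\s*</div\s*>')

sample = '<div  class = "deg" >\n   DATA  \n</div>'  # made-up input
match = pattern.search(sample)
print(match.group(1) if match else None)
```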

Alex Martelli 2009-08-09 21:26:14

Why the more complex `([^<]*)` group? A non-greedy `.*?` should work fine.

Amber 2009-08-09 21:28:14

`.*?` will tolerate (and absorb) embedded tags, and get out of balance if the div contains another div inside (grabbing the start but not the end of the inner div), while the pattern I suggest will only match when the div contains pure textual data, i.e., no embedded tags, which seems sounder in the absence of clear specs. Such complications are part of why I start by recommending *avoiding* bare REs for HTML parsing, and reusing instead, for the purpose, any of the many excellent existing modules, both in the standard library and third-party ones.
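
To make the difference concrete, here is a small comparison of the two patterns on a made-up input with a nested div (the variable names are only for illustration):

```python
import re

loose = re.compile(r'<div class="deg">(.*?)</div>')     # non-greedy, accepts embedded tags
strict = re.compile(r'<div class="deg">([^<]*)</div>')  # pure text only

nested = '<div class="deg">outer <div>inner</div> tail</div>'  # made-up input
plain = '<div class="deg">DATA</div>'

loose_nested = loose.search(nested)
strict_nested = strict.search(nested)

# The loose pattern stops at the first </div>, which closes the inner div,
# so it captures the start of the inner div but not its end.
print(repr(loose_nested.group(1)))

# The strict pattern refuses to match at all when tags are embedded.
print(strict_nested)

# On plain text content both patterns agree.
print(loose.search(plain).group(1), strict.search(plain).group(1))
```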

Alex Martelli 2009-08-09 21:32:17

Different approaches to a suboptimal treatment of the problem, I suppose - in the absence of a true parser, you choose to go the stricter route, I choose to go the looser route; each has its advantages and disadvantages. I agree that some form of actual DOM parser would be better for the general case of this problem, but that's up to Ockonal to make the call on, since he's the one with the best knowledge of the data he'll be getting. Thanks for the good discussion though. :)

Amber 2009-08-09 22:45:00

@Dav, yep, good points. I should point out that REs (unless extended to be way more than REs, like in recent versions of Perl) notoriously can't do "parsing with balanced parentheses"... and that's what XML and HTML are all about, making REs _especially_ unsuitable for such tasks (even though people keep trying!-).

Alex Martelli 2009-08-09 22:48:28

+1 Use html parsers for html!

TokenMacGuy 2009-08-10 06:12:42

Thank you. Your post is very informative for me. I'll learn about it later.

Ockonal 2009-08-10 06:56:23

Answer 3

A:

You can use simple string functions in Python; there is no need for a regex.

mystr = """<div class="deg">DATA</div>"""
if "div" in mystr and "class" in mystr and "deg" in mystr:
    s = mystr.split(">")
    for n, item in enumerate(s):
        if "deg" in item:
            # the piece after the tag that mentions "deg", up to the next "<"
            print(s[n+1][:s[n+1].index("<")])

My approach: find something to split on. In the example above, I split on ">", then go through the resulting pieces, look for the one containing "deg", and take the piece after it, since "deg" appears just before the data you want. Of course, this is not the only approach.

ghostdog74 2009-08-10 05:40:46

Answer 4

A:

While it is OK to use regex for quick-and-dirty HTML processing, a much better and cleaner way is to use an HTML parser such as lxml.html and to query the parsed tree with XPath or CSS selectors.

html = """<html><body><div class="deg">DATA1</div><div class="deg">DATA2</div></body></html>"""

import lxml.html

page = lxml.html.fromstring(html)
#page = lxml.html.parse(url)

for element in page.findall('.//div[@class="deg"]'):
    print(element.text)

# using CSS selectors (recent lxml versions need the separate cssselect package installed)
from lxml.cssselect import CSSSelector
sel = CSSSelector("div.deg")

for element in sel(page):
    print(element.text)

Peter Hoffmann 2009-08-10 13:06:44


Python and web-tags regex
