I'm beginning to learn Python. My Python version is 3.1.

I've never learnt OOP before, so I'm confused by the HTMLParser.

from html.parser import HTMLParser

class parser(HTMLParser):
    def handle_data(self, data):
        print(data)

p = parser()
page = """<html><h1>title</h1><p>I'm a paragraph!</p></html>"""
p.feed(page)

I'll get this:

title

I'm a paragraph!

I want this data passed to a function. What should I do?

Sorry for my poor English, and thank you for your help!

+1  A: 

Just an example:

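from html.parser import HTMLParser

def my_global_fun(data):
    # An ordinary module-level function
    print("processing", data)

class parser(HTMLParser):
    def my_member_fun(self, data):
        # A method on the parser itself
        print("processing", data)

    def handle_data(self, data):
        # Called by feed() for each piece of text between tags
        self.my_member_fun(data)
        # or
        my_global_fun(data)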
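If you would rather not hard-code the function name inside the class, one option is to hand the function to the parser when you create it. This is only a rough sketch for Python 3: CallbackParser and on_data are names I made up for illustration, not part of HTMLParser.

```python
from html.parser import HTMLParser

class CallbackParser(HTMLParser):
    # CallbackParser and on_data are hypothetical names for this sketch.
    def __init__(self, on_data):
        super().__init__()        # let HTMLParser set up its own state
        self.on_data = on_data    # remember which function to call

    def handle_data(self, data):
        # Forward each piece of text to whatever function was passed in
        self.on_data(data)

collected = []
p = CallbackParser(collected.append)  # any one-argument callable works here
p.feed("<html><h1>title</h1><p>I'm a paragraph!</p></html>")
p.close()
print(collected)
```

Here the callback is just collected.append, so the text ends up in a list, but you could pass my_global_fun the same way.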

Good luck learning OOP!

ron
You should also check out lxml (http://codespeak.net/lxml/) later for sanitizing real-world HTML, or alternatives like BeautifulSoup.
ron
Thank you! Your answer really helped a lot, but I'm still wondering: if parser.feed("html file") is called from a function func0, how can func0 get the data generated by the parser?
zjk
+3  A: 

I did not look into the HTMLParser module itself, but I can see that feed inherently calls handle_data, which in your derived class does a print. @ron's answer suggests passing the data directly to your function, which is totally OK. However, since you are new to OOP, maybe take a look at this code.

This is Python 2.x. In Python 3, the module is html.parser instead of HTMLParser, and print is a function, so the last line becomes print(p.output).

from HTMLParser import HTMLParser

class MyParser(HTMLParser):
    def handle_data(self, data):
        self.output.append(data)
    def feed(self, data):
        self.output = []
        HTMLParser.feed(self, data)


p = MyParser()
page = """<html><h1>title</h1><p>I'm a paragraph!</p></html>"""
p.feed(page)

print p.output

output
['title', "I'm a paragraph!"]

Here I am overriding the feed method of HTMLParser. When p.feed(page) is called, it runs my method, which sets an instance variable called output to an empty list, then calls the feed method of the base class (HTMLParser), which proceeds as it normally does. So by overriding feed I was able to do some extra work (adding the new output variable). The handle_data method is likewise an override. In fact, the handle_data method of HTMLParser does nothing at all (according to the docs).

So, just to clarify...

1. You call p.feed(page), which calls the MyParser.feed method.
2. MyParser.feed sets self.output to an empty list, then calls HTMLParser.feed.
3. HTMLParser.feed calls handle_data for each piece of text it finds, and handle_data appends that text to the end of the output list.

You now have access to the data via the p.output attribute.
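To connect this back to the func0 question in the comments, here is a rough Python 3 sketch of the same idea. func0 is the name from that comment; its signature and the html_text parameter are my assumptions.

```python
from html.parser import HTMLParser

class MyParser(HTMLParser):
    def feed(self, data):
        self.output = []       # start with a fresh list on every feed
        super().feed(data)     # then let HTMLParser do the real parsing

    def handle_data(self, data):
        self.output.append(data)

def func0(html_text):
    # Assumed signature: takes the HTML as a string, returns the text pieces.
    p = MyParser()
    p.feed(html_text)
    p.close()
    return p.output            # the caller gets the collected data back

result = func0("""<html><h1>title</h1><p>I'm a paragraph!</p></html>""")
print(result)
```

Since the parser object lives inside func0, the function simply reads p.output after feeding and returns it. One thing to keep in mind: because feed resets the list, this version is meant for feeding a whole document in one call, not in several chunks.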

sberry2A
Your explanation is incredibly clear. This should be in the textbook. Thanks a lot! I was just about to give the accepted answer to Ron because your score is higher than his, but since your answer is so good that it may help others, I think I should give the accepted answer to you.
zjk