Somewhat related to my earlier question. I'm making a simple HTML parser to play around with in Python 2.7. I would like to have multiple parse types, i.e. it can parse for links, script tags, images, etc. I'm using the HTMLParser module, so my initial thought was to make a separate class for each thing I want to parse, but that seemed rather silly. Is there a way to do this without creating multiple classes? I am more familiar with C#, so I figured I'd just pass a parameter to the __init__ method to specify what exactly to parse for, just like I would in .NET. However, I don't seem to be doing it correctly: it doesn't work, and it just doesn't 'look' right. How would I modify this so I can have just the one class, where the parameters that are passed indicate the type of HTML tags to parse? Here's the current working code:

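from HTMLParser import HTMLParser  # Python 2.7 module name
import urllib2

class LinksParser(HTMLParser):
  def __init__(self, url):
    HTMLParser.__init__(self)
    # Fetch the page and parse it straight away
    req = urllib2.urlopen(url)
    self.feed(req.read())

  def handle_starttag(self, tag, attrs):
    # Only anchor tags are of interest here
    if tag != 'a': return
    for name, value in attrs:
      print("Found Link --> [{0}]{1}".format(name, value))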
A: 

Something like this:

class MyParser(HTMLParser):
    def __init__(self, url, tags):
        HTMLParser.__init__(self)
        self.tags = tags
        req = urllib2.urlopen(url)
        self.feed(req.read())

    def handle_starttag(self, tag, attrs):
        if tag not in self.tags: return
        for name, value in attrs:
            print("Found Tag --> [{0}]{1}".format(name, value))

Instantiate the class with something like:

p = MyParser("http://www.google.com", [ 'a', 'img' ])
RC
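
A variant of the above, as a minimal sketch: it keeps the `tags` parameter but leaves the fetching out of `__init__` and collects matches in a list instead of printing them, which makes the class easy to try on a literal string. The name `TagCollector`, the `found` attribute, and the import fallback (so the snippet also runs on Python 3, where the module is `html.parser`) are assumptions, not part of the answer.

```python
try:
    from HTMLParser import HTMLParser   # Python 2.7, as in the question
except ImportError:
    from html.parser import HTMLParser  # Python 3 fallback (assumption)


class TagCollector(HTMLParser):
    """Hypothetical variant: records (tag, attributes) for the requested tags."""

    def __init__(self, tags):
        HTMLParser.__init__(self)
        self.tags = set(tags)  # tag names to look for, e.g. ['a', 'img']
        self.found = []        # list of (tag, dict of attributes)

    def handle_starttag(self, tag, attrs):
        if tag not in self.tags:
            return
        self.found.append((tag, dict(attrs)))


# Feed a literal string so no network access is needed
html = '<p><a href="/x">x</a> <img src="pic.png"> <script src="s.js"></script></p>'
collector = TagCollector(['a', 'img'])
collector.feed(html)
collector.close()
for tag, attrs in collector.found:
    print("Found Tag({0}) --> {1}".format(tag, attrs))
```

To parse a live page, the caller would read the response itself (for example with `urllib2.urlopen(url).read()` on 2.7) and pass the text to `feed`.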
A: 

class TagParser(HTMLParser):

    def __init__(self, url, tag):
        HTMLParser.__init__(self)
        self.tag = tag
        req = urllib2.urlopen(url)
        self.feed(req.read())

    def handle_starttag(self, tag, attrs):
        if tag != self.tag: return
        for name, value in attrs:
            print("Found Tag({2}) --> [{0}]{1}".format(name, value, self.tag))
Alex Martelli
Forgot about using self.everything in Python. Thanks.
Stev0
@Stev0, you're welcome!
Alex Martelli
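
For the use case in the question (links, images, script tags), the same idea could be taken one step further by passing a mapping from tag name to the attribute of interest, so that only `href` or `src` is reported rather than every attribute. This is a rough sketch under that assumption; `AttrParser`, `wanted`, and `results` are made-up names, and the import fallback is only there so it runs on Python 3 as well as 2.7.

```python
try:
    from HTMLParser import HTMLParser   # Python 2.7
except ImportError:
    from html.parser import HTMLParser  # Python 3 fallback (assumption)


class AttrParser(HTMLParser):
    """Hypothetical parser: reports one chosen attribute per requested tag."""

    def __init__(self, wanted):
        HTMLParser.__init__(self)
        self.wanted = dict(wanted)  # e.g. {'a': 'href', 'img': 'src'}
        self.results = []           # list of (tag, attribute value)

    def handle_starttag(self, tag, attrs):
        attr_name = self.wanted.get(tag)
        if attr_name is None:
            return  # this tag was not requested
        for name, value in attrs:
            if name == attr_name:
                self.results.append((tag, value))


sample = ('<a href="http://example.com" class="ext">e</a>'
          '<img src="logo.png" alt="logo">'
          '<script src="app.js"></script>')
parser = AttrParser({'a': 'href', 'script': 'src'})
parser.feed(sample)
parser.close()
for tag, value in parser.results:
    print("Found Tag({0}) --> {1}".format(tag, value))
```

One class then covers each "parse type": links are `{'a': 'href'}`, images are `{'img': 'src'}`, and so on.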