views: 156
answers: 2

When using the HTMLParser class in Python, is it possible to abort processing within a handle_* function? Early in the processing, I get all the data I need, so it seems like a waste to continue processing. There's an example below of extracting the meta description for a document.

from HTMLParser import HTMLParser

class MyParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        in_meta = False
        if tag == 'meta':
            for attr in attrs:
                if attr[0].lower() == 'name' and attr[1].lower() == 'description':
                    in_meta = True
                if in_meta and attr[0].lower() == 'content':
                    print(attr[1])
                    # Would like to tell the parser to stop now,
                    # since I have all the data that I need
+1  A: 

You can raise an exception and wrap your .feed() call in a try block.
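Something like this rough sketch, for example (untried here, assuming Python 2's HTMLParser as in the question, with the page's HTML already read into htmlsrc as in the other answer):

from HTMLParser import HTMLParser

class StopParsing(Exception):
    """Raised from a handler to signal that we already have what we need."""
    pass

class DescriptionParser(HTMLParser):
    description = None

    def handle_starttag(self, tag, attrs):
        if tag == 'meta':
            attrs = dict(attrs)
            if (attrs.get('name') or '').lower() == 'description':
                self.description = attrs.get('content')
                raise StopParsing()  # stop feeding: the description was found

parser = DescriptionParser()
try:
    parser.feed(htmlsrc)
except StopParsing:
    pass
print(parser.description)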

You can also call self.reset() once you decide that you are done (I have not actually tried it, but according to the documentation it will "Reset the instance. Loses all unprocessed data.", which sounds like precisely what you need).

shylent
An exception doesn't sound like a nice idea here: exceptions should be used only for exceptional conditions, and in this case you're proposing to use one as a control-flow tool. As for the reset() method, I've considered it too, but I can't figure out whether it's really relevant here.
Eli Bendersky
re: "exceptions .. for exceptional conditions" - not so true for python. Do you know, that StopIteration is raised whenever an iterator "runs out of" iterations? That's not much of an "exceptional condition", now is it? In fact it is distinctly similar to the condition, that the questioner wants to handle, - a "break now" kind of condition.
shylent
@shylent: true about StopIteration, but it is rarely handled manually; rather, it is wrapped so that the user almost never sees it directly. Nevertheless, you're making a good point.
Eli Bendersky
A: 

If you use pyparsing's scanString method, you have more control over how far you actually go through the input string. In your example, we create an expression that matches a <meta> tag, and add a parse action that ensures that we only match the tag with name="description". This code assumes that you have read the page's HTML into the variable htmlsrc:

from pyparsing import makeHTMLTags, withAttribute

# makeHTMLTags creates both open and closing tags, only care about the open tag
metaTag = makeHTMLTags("meta")[0]
metaTag.setParseAction(withAttribute(name="description"))

try:
    # scanString is a generator that returns each match as it is found
    # in the input
    tokens,startloc,endloc = metaTag.scanString(htmlsrc).next()

    # attributes can be accessed like object attributes if they are 
    # valid Python names
    print tokens.content

    # if the attribute name clashes with a Python keyword, or is 
    # otherwise unsuitable as an identifier, use dict-like access instead
    print tokens["content"]

except StopIteration:
    print "no matching meta tag found"
Paul McGuire
Thanks for the answer. I'm sure this works as well and I appreciate having somewhat of an introduction to pyparsing. I would mark both correct if I could.
Michael Mior