Can anyone give me an example of how to use http://code.google.com/p/streamhtmlparser to parse out the href of every A tag in an HTML document?
(Either C++ or Python code is OK, but I would prefer an example using the Python bindings.)
I can see how it works in the Python tests, but they expect special tokens already embedded in the HTML, and they check state values at those points. I don't see how to get the proper callbacks during state changes when feeding the parser plain HTML.
I can get some of the information I am looking for with the following code, but I need to feed it blocks of HTML, not just one character at a time, and I need to know when it has finished with a tag, attribute, etc., not just whether it is currently in a tag, attribute, or value.
import py_streamhtmlparser
parser = py_streamhtmlparser.HtmlParser()
html = """<html><body><a href='http://google.com'>link</a></body></html>"""
for index, character in enumerate(html):
    parser.Parse(character)
    print index, character, parser.Tag(), parser.Attribute(), parser.Value(), parser.ValueIndex()
You can see a sample run of this code here.
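To make the behaviour I'm after concrete, here is a rough sketch using Python 3's standard-library `html.parser` instead of streamhtmlparser. It is a different parser, and I'm not suggesting streamhtmlparser exposes anything like this. The `HrefCollector` class and the chunk size are made up for illustration. The point is that it accepts arbitrary blocks of HTML and fires a callback once a start tag is complete, even when the tag is split across two blocks:

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Called once per complete start tag; attrs is a list of
        # (name, value) pairs with the surrounding quotes removed.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


html = """<html><body><a href='http://google.com'>link</a></body></html>"""

collector = HrefCollector()
# Feed the document in blocks; 16 characters is an arbitrary size chosen
# so that the <a ...> tag is split across several feed() calls.
chunk_size = 16
for start in range(0, len(html), chunk_size):
    collector.feed(html[start:start + chunk_size])
collector.close()

print(collector.hrefs)
```

What I'm looking for is the equivalent of that `handle_starttag` notification, or at least a reliable way to detect the end of an attribute value, from streamhtmlparser.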