Can anyone give me an example of how to use http://code.google.com/p/streamhtmlparser to parse out all the A tag hrefs from an HTML document? (Either C++ or Python code is OK, but I would prefer an example using the Python bindings.)

I can see how it works in the Python tests, but they expect special tokens already in the HTML, at which points they check state values. I don't see how to get the proper callbacks during state changes when feeding the parser plain HTML.

I can get some of the information I am looking for with the following code, but I need to feed it blocks of HTML, not just one character at a time, and I need to know when it has finished with a tag, attribute, etc., not just whether it is in a tag, attribute, or value.

import py_streamhtmlparser
parser = py_streamhtmlparser.HtmlParser()
html = """<html><body><a href='http://google.com'>link</a></body></html>"""
for index, character in enumerate(html):
   parser.Parse(character)
   print index, character, parser.Tag(), parser.Attribute(), parser.Value(), parser.ValueIndex()

You can see a sample run of this code here.

A: 
import py_streamhtmlparser
parser = py_streamhtmlparser.HtmlParser()
html = """<html><body><a href='http://google.com' id=100>
        link</a><p><a href=heise.de/></body></html>"""
cur_tag = cur_attr = cur_value = None
for index, character in enumerate(html):
   parser.Parse(character)
   if parser.State() == py_streamhtmlparser.HTML_STATE_VALUE:
      # we are in an attribute value. Record what we got so far
      cur_tag = parser.Tag()
      cur_attr = parser.Attribute()
      cur_value = parser.Value()
      continue
   if cur_value:
      # we are not in the value anymore, but have seen one just before
      print "%r %r %r" % (cur_tag, cur_attr, cur_value)
      cur_value = None

gives

'a' 'href' 'http://google.com'
'a' 'id' '100'
'a' 'href' 'heise.de/'

If you only want the href attributes, also check that cur_attr is 'href' (and, if it matters, that cur_tag is 'a') before printing.
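The href check can be folded into a small generator. This is only a sketch: iter_attribute_values and iter_hrefs are hypothetical helper names, not part of the bindings, and the code assumes the parser object exposes Parse, State, Tag, Attribute and Value exactly as they are used in the loop above. It still feeds the parser one character at a time.

```python
def iter_attribute_values(parser, html, value_state):
    """Yield (tag, attribute, value) each time the parser leaves a value.

    Hypothetical helper. `parser` is assumed to behave like
    py_streamhtmlparser.HtmlParser(), and `value_state` is the constant
    that State() returns while inside an attribute value.
    """
    pending = None
    for character in html:
        parser.Parse(character)
        if parser.State() == value_state:
            # Still inside a value: remember what has been seen so far.
            pending = (parser.Tag(), parser.Attribute(), parser.Value())
        elif pending is not None:
            # The parser just left the value, so it is complete.
            yield pending
            pending = None


def iter_hrefs(parser, html, value_state):
    """Yield only the href values found on <a> tags."""
    for tag, attribute, value in iter_attribute_values(parser, html, value_state):
        if tag == "a" and attribute == "href":
            yield value
```

With the bindings from the answer, this would presumably be called as list(iter_hrefs(py_streamhtmlparser.HtmlParser(), html, py_streamhtmlparser.HTML_STATE_VALUE)).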

Edit: The Python bindings currently don't support any kind of event callbacks. So the only output available is the state at the end of processing the respective input. To change that, htmlparser.c:exit_attr (etc.) could be augmented with a callback function. However, this is really not the purpose of streamhtmlparser - it is meant as a templating engine, where you have markers in the source, and you process the input character by character.
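Until such a callback exists in the C code, the closest approximation from Python is to poll the state after every character and invoke a function yourself. The following is a minimal sketch of that idea, not something the bindings provide: AttributeEventAdapter, feed and on_attribute are hypothetical names, and the parser is assumed to expose Parse, State, Tag, Attribute and Value as in the code above. It accepts blocks of HTML, but internally it still hands them to the parser one character at a time, so it changes the interface and not the performance.

```python
class AttributeEventAdapter(object):
    """Hypothetical wrapper that turns state polling into a callback.

    `parser` is assumed to behave like py_streamhtmlparser.HtmlParser(),
    `value_state` is the constant State() returns inside an attribute
    value, and `on_attribute(tag, attribute, value)` is called once per
    completed attribute value.
    """

    def __init__(self, parser, value_state, on_attribute):
        self._parser = parser
        self._value_state = value_state
        self._on_attribute = on_attribute
        self._pending = None  # kept between feed() calls

    def feed(self, block):
        """Accept a block of HTML of any size, including partial tags."""
        for character in block:
            self._parser.Parse(character)
            if self._parser.State() == self._value_state:
                # Inside a value: record the text accumulated so far.
                self._pending = (self._parser.Tag(),
                                 self._parser.Attribute(),
                                 self._parser.Value())
            elif self._pending is not None:
                # The value has just ended: report it exactly once.
                self._on_attribute(*self._pending)
                self._pending = None
```

Because the pending value is stored on the instance, an attribute value that is split across two feed() calls is still reported once, after it is complete.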

Martin v. Löwis
Is there any way aside from feeding in single characters, though? Feeding in individual characters is very good performance-wise.
Jehiah
I don't understand this question: do you want to feed individual characters, or not? Why do you believe that feeding individual characters has very good performance? I would expect that it behaves relatively poorly.
Martin v. Löwis
Oops, typo: by 'good' I meant 'bad'. I would rather feed in a whole HTML document at once, not a character at a time, as I believe that would be more efficient.
Jehiah
See my edit. What you want is currently not supported.
Martin v. Löwis