ansaurus

Question

Answer 1

A:

import py_streamhtmlparser

parser = py_streamhtmlparser.HtmlParser()
html = """<html><body><a href='http://google.com' id=100>
        link</a><p><a href=heise.de/></body></html>"""
cur_tag = cur_attr = cur_value = None
# Feed the document one character at a time and inspect the parser state after each one.
for character in html:
    parser.Parse(character)
    if parser.State() == py_streamhtmlparser.HTML_STATE_VALUE:
        # We are inside an attribute value. Record what we have so far.
        cur_tag = parser.Tag()
        cur_attr = parser.Attribute()
        cur_value = parser.Value()
        continue
    if cur_value:
        # We have just left an attribute value, so the recorded value is complete.
        print("%r %r %r" % (cur_tag, cur_attr, cur_value))
        cur_value = None

gives

'a' 'href' 'http://google.com'
'a' 'id' '100'
'a' 'href' 'heise.de/'

If you only want the href attributes, also check that cur_attr is 'href' at the point of the print.
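As a rough illustration of that filter, the loop can be wrapped in a generator that takes the whole document as a string and optionally keeps only one attribute name. This is only a sketch: iter_attribute_values, value_state and wanted_attr are hypothetical names, and the parser object and the state constant are passed in as arguments so the snippet itself does not import the binding. The only parser methods it assumes are the ones used in the answer (Parse, State, Tag, Attribute, Value).

```python
def iter_attribute_values(parser, html, value_state, wanted_attr=None):
    """Yield (tag, attribute, value) tuples for each completed attribute value.

    parser      -- an object with Parse/State/Tag/Attribute/Value, as in the answer
    value_state -- the constant meaning "inside an attribute value"
                   (HTML_STATE_VALUE in the answer; passed in here as an assumption)
    wanted_attr -- hypothetical filter: if given, only this attribute name is yielded
    """
    cur_tag = cur_attr = cur_value = None
    for character in html:
        # The parser is still fed one character at a time, exactly as in the answer.
        parser.Parse(character)
        if parser.State() == value_state:
            # Inside a value: remember the latest snapshot.
            cur_tag = parser.Tag()
            cur_attr = parser.Attribute()
            cur_value = parser.Value()
            continue
        if cur_value:
            # Just left a value: report it if it passes the filter.
            if wanted_attr is None or cur_attr == wanted_attr:
                yield cur_tag, cur_attr, cur_value
            cur_value = None
```

With the binding installed, a call might look like list(iter_attribute_values(py_streamhtmlparser.HtmlParser(), html, py_streamhtmlparser.HTML_STATE_VALUE, wanted_attr='href')), which for the document above should leave only the two href entries. The caller passes the document in one piece, but the work per character is unchanged, so this is a convenience rather than a speed-up.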

Edit: The Python bindings currently don't support any kind of event callbacks. So the only output available is the state at the end of processing the respective input. To change that, htmlparser.c:exit_attr (etc.) could be augmented with a callback function. However, this is really not the purpose of streamhtmlparser - it is meant as a templating engine, where you have markers in the source, and you process the input character by character.
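Until such callbacks exist in the C code, they can be approximated on the Python side by watching for the moment the parser leaves the value state. The following is a minimal sketch under that assumption, not part of streamhtmlparser: AttributeEventFeeder, feed, close and on_attribute are hypothetical names, and the parser and state constant are again passed in rather than imported.

```python
class AttributeEventFeeder(object):
    """Call on_attribute(tag, attr, value) each time an attribute value ends.

    Hypothetical helper: it only relies on the parser methods used in the
    answer (Parse, State, Tag, Attribute, Value).
    """

    def __init__(self, parser, value_state, on_attribute):
        self._parser = parser
        self._value_state = value_state    # e.g. HTML_STATE_VALUE from the binding
        self._on_attribute = on_attribute  # user-supplied callback
        self._pending = None               # latest snapshot of an unfinished value

    def feed(self, chunk):
        """Accept a chunk of any size; the parser still sees single characters."""
        for character in chunk:
            self._parser.Parse(character)
            if self._parser.State() == self._value_state:
                self._pending = (self._parser.Tag(),
                                 self._parser.Attribute(),
                                 self._parser.Value())
            elif self._pending is not None:
                self._flush()

    def close(self):
        """Report a value that was still open when the input ended (a design choice)."""
        if self._pending is not None:
            self._flush()

    def _flush(self):
        tag, attr, value = self._pending
        self._pending = None
        if value:  # skip empty values, as the original loop does
            self._on_attribute(tag, attr, value)
```

State is kept between feed calls, so the document can arrive in pieces of any size. The cost per character is the same as in the plain loop; only the calling convention changes.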

Martin v. Löwis 2009-08-14 17:05:35

Is there any way aside from feeding in single characters, though? Feeding in individual characters is very good performance-wise.

Jehiah 2009-08-14 21:30:12

I don't understand this question: do you want to feed individual characters, or not? Why do you believe that feeding individual characters has very good performance? I would expect that it behaves relatively poorly.

Martin v. Löwis 2009-08-14 22:12:25

Oops, typo: by 'good' I meant 'bad'. I would rather feed in a whole HTML document at once, not a character at a time, as I believe that would be more efficient.

Jehiah 2009-08-15 03:00:12

See my edit. What you want is currently not supported.

Martin v. Löwis 2009-08-18 07:42:31


example for using streamhtmlparser
