I need one specific 'div' tag (identified by its 'id') from an HTML page. To parse the page I'm using CyberNeko.

    def doc = new XmlParser( new org.cyberneko.html.parsers.SAXParser() ).parse(htmlFile)
    divTag = doc.depthFirst().DIV.find{ it['@id'] == tagId  }

So far no problem, but in the end I don't need XML; I need the original content of the whole 'div' tag. Unfortunately I can't figure out how to do this...

A: 

EDIT: Responding to first comment.

This works:

    def html = """
      <body>
            <div id="breadcrumbs">
                <b>
                crumb1
                </b>
            </div>
    </body>
    """

    def doc = new XmlSlurper(new org.cyberneko.html.parsers.SAXParser()).parseText(html)
    divTag = doc.BODY.DIV.find { it.@id == 'breadcrumbs'  }
    println "" << new groovy.xml.StreamingMarkupBuilder().bind {xml -> xml.mkp.yield divTag}

It looks like cyberneko will return a well-formed HTML document, regardless of whether the original markup was well formed. That is, doc's root will be an HTML element, and there will also be a HEAD element. Neat.
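
To inspect the structure described above, a minimal sketch along these lines may help. It assumes Groovy 3 or later (on Groovy 2.x, drop the `groovy.xml.XmlSlurper` import, since the class lives in `groovy.util` there) and assumes the NekoHTML jar is fetched with `@Grab` under the Maven coordinates shown, which are not part of the original post. Searching with `depthFirst()` avoids depending on the exact path from the root to the DIV.

```groovy
// Assumption: NekoHTML is resolved from Maven Central under these coordinates.
@Grab('net.sourceforge.nekohtml:nekohtml:1.9.22')
import groovy.xml.XmlSlurper
import org.cyberneko.html.parsers.SAXParser

// Deliberately incomplete markup: no html, head, or body element.
def html = '<div id="breadcrumbs"><b>crumb1</b></div>'
def doc = new XmlSlurper(new SAXParser()).parseText(html)

// Print the root element name and the names of its direct children
// to see which elements the parser added around the fragment.
println doc.name()
println doc.children().collect { it.name() }

// depthFirst() walks the whole tree, so the DIV is found wherever it sits.
def divTag = doc.depthFirst().find { it.name() == 'DIV' && it.@id == 'breadcrumbs' }
assert divTag.text().trim() == 'crumb1'
```

Because the DIV is nested below the root rather than being its direct child, a lookup such as `doc.DIV` comes back empty, while `doc.BODY.DIV` or a depth-first search finds it.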

noah
Wow, that's a very cool/groovy idea! :) Unfortunately it does not seem to work (yet): the returned 'doc' does not contain any tags, only the text of the original content. Therefore 'doc.DIV.find{}' will never find a 'DIV'. For readability, I'll add a simple test in a new post/answer.
domi
Noah: thanks, this does the trick! But I still have another problem: since the original HTML page defines a namespace (xmlns="http://www.w3.org/1999/xhtml"), all tags in the result carry it too. To get rid of it, I had to add this to the bind: bind{xml.mkp.declareNamespace(tag0:null); xml.mkp.yield divTag}. I'm sure this is a hack, but unfortunately I can't find good documentation on this special 'mkp' namespace, nor any other solution...
domi
@domi I believe it's cyberneko that is adding the namespace. Did you check the docs for it?
noah
Noah: you're right. I found the feature that needs to be set on the org.cyberneko.html.parsers.SAXParser: 'parser.setFeature("http://xml.org/sax/features/namespaces", false)'. Now everything works perfectly! :)
domi
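
Putting the comments above together, the whole approach might look roughly like the following sketch. It assumes Groovy 3 or later and assumes the NekoHTML jar is fetched with `@Grab` under the coordinates shown (neither is stated in the thread). The sample markup is made up for illustration. The feature URI and the `mkp.yield` call are the ones quoted above.

```groovy
// Assumption: NekoHTML is resolved from Maven Central under these coordinates.
@Grab('net.sourceforge.nekohtml:nekohtml:1.9.22')
import groovy.xml.StreamingMarkupBuilder
import groovy.xml.XmlSlurper
import org.cyberneko.html.parsers.SAXParser

// Turn off SAX namespace processing before handing the parser to XmlSlurper,
// as described in the comment above.
def parser = new SAXParser()
parser.setFeature('http://xml.org/sax/features/namespaces', false)

// Illustrative input that declares the XHTML namespace on the root element.
def html = '''<html xmlns="http://www.w3.org/1999/xhtml">
  <body><div id="breadcrumbs"><b>crumb1</b></div></body>
</html>'''

def doc = new XmlSlurper(parser).parseText(html)
def divTag = doc.BODY.DIV.find { it.@id == 'breadcrumbs' }

// Serialize the selected element back to markup.
def markup = new StreamingMarkupBuilder().bind { mkp.yield divTag }.toString()
println markup
```

With the feature disabled there should be no need for the `declareNamespace(tag0:null)` workaround, since the elements no longer carry a namespace URI.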
A: 

This is a simple test based on noah's answer. Unfortunately it does not (yet) work :(

    def html = """
      <body>
            <div id="breadcrumbs">
                <b>
                crumb1
                </b>
            </div>
    </body>
    """

    def doc = new XmlSlurper( new org.cyberneko.html.parsers.SAXParser() ).parseText(html)
    println "document: $doc"
    def htmlTag = doc.DIV.find {
        println "-> $it"
        it['@id'] == "breadcrumbs"
    }
    println htmlTag
    assert htmlTag
domi
I've edited my answer to use this test case. One thing to note is that the toString of a GPathResult (which doc is) does not return the XML it contains, just the text content (I don't know why; that's pretty much useless). That's why the StreamingMarkupBuilder().bind {xml... trick is necessary.
noah
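
The difference between the text content and the markup can be seen without cyberneko at all. The sketch below uses a plain `XmlSlurper` on well-formed input (so element names stay lowercase) and assumes Groovy 3 or later; on Groovy 2.x, drop the `groovy.xml.XmlSlurper` import.

```groovy
import groovy.xml.StreamingMarkupBuilder
import groovy.xml.XmlSlurper

def doc = new XmlSlurper().parseText(
    '<body><div id="breadcrumbs"><b>crumb1</b></div></body>')
def divTag = doc.div.find { it.@id == 'breadcrumbs' }

// toString() and text() both give only the concatenated text nodes.
assert divTag.toString() == 'crumb1'
assert divTag.text() == 'crumb1'

// Yielding the node into a StreamingMarkupBuilder writes it out as markup.
def markup = new StreamingMarkupBuilder().bind { mkp.yield divTag }.toString()
assert markup.contains('<b>crumb1</b>')
println markup
```

This is also why printing `doc` in the test above shows only text even though the elements are present in the parsed tree.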