Hi All,

I have started using Jython as it seems to be an excellent language, and it has proved to be so far.

I am using dom4j to manipulate and retrieve data from the DOM of a bunch of HTML files I have on disk. I have written the script below to check through the DOM using XPath for H1 tags and grab their text; if an H1 tag is not present in the DOM, it then searches for the title tag and grabs the text from that.

I am very new to Jython, but I am sure there is a way to perform this task a lot more gracefully than the method below. If I am right in thinking this, can someone show me a better way to do it?

elemHolder = dom.createXPath('//xhtml:h1')
elemHolder.setNamespaceURIs(map)
elem = elemHolder.selectSingleNode(dom)
if elem != None:
    h1 = elem.getText()
else:
    elemHolder = dom.createXPath('//xhtml:title')
    elemHolder.setNamespaceURIs(map)
    elem = elemHolder.selectSingleNode(dom)
    if elem != None:
        title = elem.getText()
    else:
        title = "Page does not contain a H1 or title tag"

If anyone could help it would be great. Cheers

+2  A: 

How about this (I don't claim to know much about Python, by the way, but this looks like an obvious first step):

def find_heading(dom, map):
    # Try each XPath in order; the first match wins.
    for path in ('//xhtml:h1', '//xhtml:title'):
        elemHolder = dom.createXPath(path)
        elemHolder.namespaceURIs = map
        elem = elemHolder.selectSingleNode(dom)
        if elem is not None:
            return (elem.localName, elem.text)

    return (None, "Page does not contain h1 or title tag")
Chris Jester-Young
I got the concept and tweaked it to work. Cheers mate
Eef
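For anyone who wants to try the "first matching path wins" idea without dom4j on the classpath, here is a minimal sketch of the same fallback using Python's standard xml.etree.ElementTree instead. This is a swapped-in library, not the dom4j API from the question, and the names first_heading, XHTML_NS and the sample page are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping, playing the role of `map` in the question (assumed value).
XHTML_NS = {'xhtml': 'http://www.w3.org/1999/xhtml'}

def first_heading(root, namespaces=XHTML_NS):
    """Return (tag, text) for the first h1, else the title, else (None, message)."""
    for tag in ('h1', 'title'):
        # './/' searches all descendants of root for the namespaced tag.
        elem = root.find('.//xhtml:%s' % tag, namespaces)
        if elem is not None:
            return (tag, elem.text)
    return (None, "Page does not contain h1 or title tag")

# Hypothetical sample page with a title but no h1.
page = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>Fallback title</title></head>'
    '<body><p>No heading here</p></body></html>'
)
print(first_heading(page))
```

Looping over the candidate tags keeps the lookup logic in one place, so adding a third fallback (say h2) would only mean extending the tuple.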
A: 

That looks like it would work perfectly. The only other thing is that I will be passing the value to a database, and depending on what was found it is put in the appropriate column.

If it's an H1 tag it will be put in the H1 column, and if it's a title tag it will be put in the title column.

Is there a way to determine which tag was found as well? Does this make sense?

Eef
Yes, I've now made the function return a tuple, the first element of which is the tag name, and the second element of which is the result.
Chris Jester-Young
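To make the tuple idea concrete, here is a rough sketch of how the (tag name, text) result could be turned into per-column values before the database insert. The function name to_columns and the column names 'h1' and 'title' are assumptions for illustration; the actual table layout is not shown in the thread.

```python
def to_columns(result):
    """Map a (tag, text) tuple onto a dict of column values.

    Columns 'h1' and 'title' are assumed names; the column matching the
    tag receives the text and the other stays None.
    """
    tag, text = result
    columns = {'h1': None, 'title': None}
    if tag in columns:
        columns[tag] = text
    return columns

# Hypothetical result tuple, as returned when an h1 was found.
print(to_columns(('h1', 'Welcome')))
```

When neither tag was found (tag is None), both columns stay None, so the caller can decide whether to skip the row or store the message elsewhere.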