views:

266

answers:

2

I am using the Python ElementTree module to manipulate HTML. I want to emphasize certain words, and my current solution is:

for e in tree.getiterator():
    for attr in 'text', 'tail':
        words = (getattr(e, attr) or '').split()
        change = False
        for i, word in enumerate(words):
            word = clean_word.sub('', word)
            if word.lower() in glossary:
                change = True
                words[i] = word.replace(word, '<b>' + word + '</b>')
        if change:
            setattr(e, attr, ' '.join(words))

The above examines the text of each element and emphasizes the important words it finds. However it does this by embedding HTML tags in the text attributes, which is escaped when rendering so that I need to counter with:

html = etree.tostring(tree).replace('&gt;', '>').replace('&lt;', '<')

This makes me uncomfortable so I want to do it properly. However to embed a new Element I would need to shift around the 'text' and 'tail' attributes so that the emphasized text appeared at the same position. And this would be really tricky when iterating as above.

Any advice how to do this properly would be appreciated. I am sure there is something I have missed in the API!

+1  A: 

Although ElementTree is very easy to use for most XML processing tasks, it's also inconvenient for mixed content. I suggest using DOM parser:

from xml.dom import minidom
import re

ws_split = re.compile(r'\s+', re.U).split

def processNode(parent):
    doc = parent.ownerDocument
    for node in parent.childNodes[:]:
        if node.nodeType==node.TEXT_NODE:
            words = ws_split(node.nodeValue)
            new_words = []
            changed = False
            for word in words:
                if word in glossary:
                    text = ' '.join(new_words+[''])
                    parent.insertBefore(doc.createTextNode(text), node)
                    b = doc.createElement('b')
                    b.appendChild(doc.createTextNode(word))
                    parent.insertBefore(b, node)
                    new_words = ['']
                    changed = True
                else:
                    new_words.append(word)
            if changed:
                text = ' '.join(new_words)
                print text
                parent.replaceChild(doc.createTextNode(text), node)
        else:
            processNode(node)

Also I used regexp to split words to avoid them sticking together:

>>> ' '.join(ws_split('a b '))
'a b '
>>> ' '.join('a b '.split())
'a b'
Denis Otkidach
+3  A: 

You can also use xslt and a custom xpath function to do this.

Shown below is an example. It still needs some work, for example cleaning up extra whitespace at the end of elements and handling mixed-text, but it's another idea.

given this input:


<html>
<head>
</head>
<body>
<p>here is some text to bold</p>
<p>and some more</p>
</body>
</html>
and glossary contains the two words: some, bold then example output is :

<?xml version="1.0"?>
<html>
<head/>
<body>
<p>here is <b>some</b> text to <b>bold</b> </p>
<p>and <b>some</b> more </p>
</body>
</html>
Here's the code, I have also posted it at http://bkc.pastebin.com/f545a8e1d

from lxml import etree

stylesheet = etree.XML("""
    <xsl:stylesheet version="1.0"
         xmlns:btest="uri:bolder"
         xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

        <xsl:template match="@*">
            <xsl:copy />
        </xsl:template>

        <xsl:template match="*">
            <xsl:element name="{name(.)}">
                <xsl:copy-of select="@*" />
                <xsl:apply-templates select="text()" />
                <xsl:apply-templates select="./*" />
            </xsl:element>
        </xsl:template>

        <xsl:template match="text()">
            <xsl:copy-of select="btest:bolder(.)/node()" />
        </xsl:template>         
     </xsl:stylesheet>
""")

glossary = ['some', 'bold']

def bolder(context, s):
    results = []
    r = None
    for word in s[0].split():
        if word in glossary:
            if r is not None:
                results.append(r)
            r = etree.Element('r')
            b = etree.SubElement(r, 'b')
            b.text = word
            b.tail = ' '
            results.append(r)
            r = None
        else:
            if r is None:
                r = etree.Element('r')
            r.text = '%s%s ' % (r.text or '', word)

        if r is not None:
            results.append(r)
    return results

def test():
    ns = etree.FunctionNamespace('uri:bolder') # register global namespace
    ns['bolder'] = bolder # define function in new global namespace
    transform = etree.XSLT(stylesheet)
    print str(transform(etree.XML("""<html><head></head><body><p>here is some text to bold</p><p>and some more</p></body></html>""")))

if __name__ == "__main__":
    test()


bkc