I have an XML file that contains a number of <TEXT> </TEXT> elements enclosing text. Here is one of them:

<TEXT>

<!-- PJG STAG 4703 -->

<!-- PJG ITAG l=94 g=1 f=1 -->

<!-- PJG /ITAG -->

<!-- PJG ITAG l=69 g=1 f=1 -->

<!-- PJG /ITAG -->

<!-- PJG ITAG l=50 g=1 f=1 -->


<USDEPT>DEPARTMENT OF AGRICULTURE</USDEPT>

<!-- PJG /ITAG -->

<!-- PJG ITAG l=18 g=1 f=1 -->

<USBUREAU>Packers and Stockyards Administration</USBUREAU>
<!-- PJG 0012 frnewline -->

<!-- PJG /ITAG -->

<!-- PJG ITAG l=55 g=1 f=1 -->
Amendment to Certification of Central Filing System_Oklahoma
<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->

<!-- PJG /ITAG -->

<!-- PJG ITAG l=11 g=1 f=1 -->
The Statewide central filing system of Oklahoma has been previously certified, pursuant to section 1324 of the Food
Security Act of 1985, on the basis of information submitted by Hannah D. Atkins, Secretary of State, for farm products
produced in that State (52 FR 49056, December 29, 1987).
<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->
The certification is hereby amended on the basis of information submitted by John Kennedy, Secretary of State, for
additional farm products produced in that State as follows: Cattle semen, cattle embryos, milo.
<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->
This is issued pursuant to authority delegated by the Secretary of Agriculture.
<!-- PJG /ITAG -->

<!-- PJG QTAG 04 -->
<!-- PJG /QTAG -->

<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->

<!-- PJG ITAG l=21 g=1 f=1 -->

<!-- PJG /ITAG -->

<!-- PJG ITAG l=21 g=1 f=4 -->
Authority:
<!-- PJG /ITAG -->

<!-- PJG ITAG l=21 g=1 f=1 -->
 Sec. 1324(c)(2), Pub. L. 99-198, 99 Stat. 1535, 7 U.S.C. 1631(c)(2); 7 CFR 2.18(e)(3), 2.56(a)(3), 55 FR 22795.
<!-- PJG /ITAG -->

<!-- PJG QTAG 02 -->
<!-- PJG /QTAG -->

<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->

<!-- PJG ITAG l=21 g=1 f=1 -->
Dated: January 21, 1994
<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->

<!-- PJG /ITAG -->

<SIGNER>
<!-- PJG ITAG l=06 g=1 f=1 -->
Calvin W. Watkins, Acting Administrator,
<!-- PJG 0012 frnewline -->

<!-- PJG /ITAG -->
</SIGNER>
<SIGNJOB>
<!-- PJG ITAG l=04 g=1 f=1 -->
Packers and Stockyards Administration.
<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->

<!-- PJG /ITAG -->
</SIGNJOB>
<FRFILING>
<!-- PJG ITAG l=40 g=1 f=1 -->
[FR Doc. 94-1847 Filed 1-27-94; 8:45 am]
<!-- PJG 0012 frnewline -->

<!-- PJG /ITAG -->
</FRFILING>
<BILLING>
<!-- PJG ITAG l=68 g=1 f=1 -->
BILLING CODE 3410-KD-P
<!-- PJG /ITAG -->
</BILLING>

<!-- PJG 0012 frnewline -->

<!-- PJG 0012 frnewline -->

<!-- PJG /STAG -->
</TEXT>

My task is to extract text out of each of these TEXT nodes. This is what I'm doing:

def getTextFromXML():
    global Text, xmlDoc
    TextNodes = xmlDoc.getElementsByTagName("TEXT")
    docstr = ''
    #Text = [TextFromNode(textNode) for textNode in TextNodes]
    for textNode in TextNodes:
        for cNode in textNode.childNodes:
            if cNode.nodeType == Node.TEXT_NODE:
                docstr+=cNode.data
            else:
                for ccNode in cNode.childNodes:
                    if ccNode.nodeType == Node.TEXT_NODE:
                        docstr+=ccNode.data                
        Text.append(docstr)

The problem is that this takes a very long time, so I guess my function is not efficient. Can anyone advise me on how to improve it?

EDIT: The file I'm dealing with contains more than 6000 <TEXT> elements.

A:

lxml is much easier to use than the XML libraries included in the Python standard library. It's a binding for the C library libxml2, so I'm assuming it's also faster.

I'd do something like this (using your variable names):

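from lxml import etree

# Open in binary mode so the parser handles the file's encoding itself.
with open('some-file.xml', 'rb') as f:
    xmlDoc = etree.parse(f)

Text = []
# '//TEXT' matches <TEXT> at any depth, like getElementsByTagName("TEXT").
for textNode in xmlDoc.xpath('//TEXT'):
    # Text directly under <TEXT> and under its immediate children,
    # with whitespace-only pieces dropped.
    pieces = (text.strip() for text in textNode.xpath('*/text() | text()'))
    docstr = '\n'.join(piece for piece in pieces if piece)
    Text.append(docstr)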
ma3
A: 

If you use lxml (or xml.etree.ElementTree in Python 2.7 or later), you can use the .itertext() method on an element, e.g.:

s = ''.join(elem.itertext())
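
For illustration, here is a minimal sketch of that approach using only the standard library. The `<DOCS>` wrapper, the sample data, and the `extract_texts` helper are assumptions made up for the example, not part of the question's file. It builds a fresh string for every element; in the question's loop, by contrast, `docstr` is created once before the loop and never reset, so each entry appended to `Text` also carries the text of all earlier elements, which is a likely contributor to the slowdown.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real file has thousands of <TEXT> elements.
SAMPLE = """<DOCS>
<TEXT>
<!-- PJG ITAG l=50 g=1 f=1 -->
<USDEPT>DEPARTMENT OF AGRICULTURE</USDEPT>
<!-- PJG /ITAG -->
Dated: January 21, 1994
</TEXT>
<TEXT>
<BILLING>BILLING CODE 3410-KD-P</BILLING>
</TEXT>
</DOCS>"""


def extract_texts(root):
    """Return one whitespace-normalised string per <TEXT> element."""
    texts = []
    for elem in root.iter("TEXT"):
        # itertext() walks the whole subtree; a new string is built for
        # every element, so nothing carries over between iterations.
        raw = "".join(elem.itertext())
        texts.append(" ".join(raw.split()))
    return texts


root = ET.fromstring(SAMPLE)
texts = extract_texts(root)
print(texts)
```

The default ElementTree parser discards comments, so the `<!-- PJG ... -->` lines never reach the output.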

With lxml, you could probably also use the XPath string() function (it may be faster because all the work is done by libxml2 itself, and not in Python):

s = elem.xpath('string()')
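
A rough sketch of the string() variant applied to every <TEXT> element follows. It assumes lxml is installed, and the `<DOCS>` wrapper and sample data are invented for the example. The XPath string value of an element is the concatenation of its descendant text nodes, so comments are not included.

```python
from lxml import etree  # third-party package, assumed to be installed

# Hypothetical sample; the real file has thousands of <TEXT> elements.
SAMPLE = b"""<DOCS>
<TEXT>
<!-- PJG ITAG l=50 g=1 f=1 -->
<USDEPT>DEPARTMENT OF AGRICULTURE</USDEPT>
<!-- PJG /ITAG -->
Dated: January 21, 1994
</TEXT>
</DOCS>"""

root = etree.fromstring(SAMPLE)

texts = []
for elem in root.iter("TEXT"):
    # string() is evaluated by libxml2 and returns the element's
    # concatenated text content as a single string.
    raw = elem.xpath("string()")
    # Collapse the newlines left behind between the original lines.
    texts.append(" ".join(raw.split()))
print(texts)
```

Whether this is actually faster than itertext() on a given file is worth measuring rather than assuming.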
Steven