views:

92

answers:

5

Hello,

I'm trying to complete a simple task in Python and I'm new to the language (I'm C++). I hope someone might be able to point me in the right direction.

Problem: I have an XML file (12mb) full of data and within the file there are start tags 'xmltag' and end tags '/xmltag' that represent the start and end of the data sections I would like to pull out.

I would like to navigate through this open file with a loop and for each instance locate a start tag and copy the data within the section to a new file until the end tag. I would then like to repeat this to the end of the file.

I'm happy with the file I/O but not the most efficient looping, searching and extracting of the data.

I really like the look of the language and hopefully I'm going to get more involved so I can give back to the community.

Big thanks!

+3  A: 

Check BeautifulSoup

from BeautifulSoup import BeautifulSoup

with open('bigfile.xml', 'r') as xml:
    soup = BeautifulSoup(xml):
    for xmltag in soup('xmltag'):
        print xmltag.contents
eumiro
+1 - great answer.
duffymo
+2  A: 

Dive Into Python 3 have a great chapter about this:

It'a great free book about python, worth reading !

Alois Cochard
A: 
xml=open("xmlfile").read()
x=xml.split("</xmltag>")
for block in x:
    if "<xmltag>" in block:
        print block.split("<xmltag>")[-1]
ghostdog74
not really nice...
eumiro
nice is subjective ! the requirement is simple, using simple Python string methods is sufficient.
ghostdog74
OP did not state whether `xmltag` has some attributes.
eumiro
that's right. He did not state anything other than wanting to find the start and end of specified tags. With this information, my solution is simple and straightforward without having to download anything, for now at least.
ghostdog74
Then it's OK. But still subjectively not nice ;-)
eumiro
subjectively, its also not nice to me using a parser when its only simple requirement.
ghostdog74
No, this is objectively not nice. It's only a "simple requirement" because OP doesn't understand XML. Anything that perpetuates the idea that it's appropriate to use string manipulation to process XML is, like the idea itself, wrong unless and until specific details in the requirements indicate otherwise.
Robert Rossney
You are not the OP, it still may only be a simple requirement. All you really do is guess so what makes you think that all the answers requiring the use of parser are definitely correct?
ghostdog74
A: 

No need to install BeautifulSoup, Python includes the ElementTree parser in its standard library.

from xml.etree import cElementTree as ET
tree = ET.parse('myfilename')
new_tree = ET('new_root_element')
for element in tree.findall('.//xmltag'):
    new_tree.append(tree.element)
print ET.tostring(new_tree)
Daniel Roseman
+1  A: 

The BeautifulSoup answer is good but this executes faster and doesn't require an external library:

import xml.etree.cElementTree as ET
tree = ET.parse('xmlfile.xml')
results = (elem for elem in tree.getiterator('xmltag'))

# in Python 2.7+, getiterator() is deprecated; use tree.iter('xmltag')
ianmclaury