Hey gurus,

I'm trying to copy the sections of a file that sit between a set of XML tags:

> <tag>I want to copy the data here</tag>

Please note: I found out the data around the tags is not valid XML, so I can't use a normal XML library and have to find the sections via string comparison :(

There are multiple sections of text I want to extract from the file, so I'm trying to loop through the file to find each one. I wanted to do this line by line until I figured out how to parse out the unwanted text, and came up with the following code:

InputFile=open('xml_input_File.xml','r')
OutputFile=open('xml_output_file.xml', 'w')
check = 0

for line in InputFile.readlines():
      if line.find("<STARTTAG>"):
          check = 1
      elif line.find(r"<//STARTTAG>"):
          check = 0
      if check == 1:
          OutputFile.write(line)

The problem I'm having is that it simply copies the whole file, not just the sections I want.

I know the code is not very pretty, but I'm still learning. It's probably going to be a "d'oh!" moment, but thanks for your help!!

Cheers

+1  A: 

There are a few issues with your code:

  • If input is really in the format of "<STARTTAG> ... </STARTTAG>", capturing lines isn't going to cut it as you're going to grab at least the <STARTTAG> instance.
  • You're using a raw string prefix (r"<//STARTTAG>") with two forward slashes. From your example above, it looks like the closing tags only have one forward slash, and I'm not sure why you need the raw string prefix at all here. If the double slash is wrong, the closing tag is never matched, which is probably why the check variable is never set to 0 (hence, the code copies the whole file).

Edit: the point other posters have made about the return value of find() is very valid as well. Using the in keyword is likely a better bet.

You need to look into splitting up your input (parsing), either manually (via split()) or with regular expressions. Alternatively, you could try to groom your input into compliant XML and then use one of the many freely available libraries to handle this sort of thing.
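As a rough sketch of the regular-expression route: assuming the tags are literally <STARTTAG> and </STARTTAG> (as in the question's code) and never nest, something like the following would pull out every section regardless of what surrounds it. The name extract_sections and the sample string are made up for illustration.

```python
import re

# Assumed tag names, taken from the question's code; adjust to the real ones.
SECTION_RE = re.compile(r"<STARTTAG>(.*?)</STARTTAG>", re.DOTALL)


def extract_sections(text):
    """Return the text between each <STARTTAG>...</STARTTAG> pair."""
    # Non-greedy (.*?) stops at the first closing tag; DOTALL lets '.'
    # match newlines, so a section may span several lines.
    return SECTION_RE.findall(text)


# Hypothetical input: wanted sections surrounded by text that is not valid XML.
sample = (
    "junk <STARTTAG>first</STARTTAG> more junk\n"
    "<STARTTAG>second\nline</STARTTAG> & bad xml"
)
sections = extract_sections(sample)
```

This reads the whole input as one string (for example via InputFile.read()) rather than line by line, which is what lets a section cross line boundaries.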

Hope this helps!

Joshua Barron
A: 
Help on method_descriptor:

find(...)
    S.find(sub[, start[, end]]) -> int

    Return the lowest index in S where substring sub is found,
    such that sub is contained within s[start:end].  Optional
    arguments start and end are interpreted as in slice notation.

    Return -1 on failure.

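-1 is also a true value: any non-zero integer is truthy, so a bare "if line.find(...)" passes even when the substring is not found.

Try: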
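A small illustration of why the bare find() test misbehaves, using made-up sample lines. Only an index of 0 is falsy, so the condition is true when the tag is missing and false only when the tag starts the line:

```python
line_with_tag_inside = "some text <STARTTAG> data"
line_with_tag_first = "<STARTTAG> data"
line_without_tag = "nothing to see here"

inside = line_with_tag_inside.find("<STARTTAG>")   # index 10, truthy
first = line_with_tag_first.find("<STARTTAG>")     # index 0, falsy
missing = line_without_tag.find("<STARTTAG>")      # -1 (not found), truthy

# What "if line.find(...)" effectively evaluates for each line:
truthiness = [bool(inside), bool(first), bool(missing)]
```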

if "<STARTTAG>" in line:

etc.

Also, a forward slash doesn't need to be escaped (even less so in raw strings!).
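Spelled out, a minimal sketch of the question's loop using "in" and a single-slash closing tag might look like this. It assumes whole lines are wanted and that the tag lines themselves should be kept (the original elif would have dropped the closing line). The name copy_sections and the in-memory sample are hypothetical, used so the example runs without real files.

```python
import io


def copy_sections(lines, out, start="<STARTTAG>", end="</STARTTAG>"):
    """Write every line from a start-tag line through its end-tag line."""
    copying = False
    for line in lines:
        if start in line:
            copying = True
        if copying:
            out.write(line)
        if end in line:
            # Checked after writing so the closing-tag line is kept too.
            copying = False


# Hypothetical input standing in for the contents of the input file.
source = io.StringIO(
    "junk\n<STARTTAG>\nwanted 1\n</STARTTAG>\nmore junk\n"
    "<STARTTAG>wanted 2</STARTTAG>\ntrailing junk\n"
)
result = io.StringIO()
copy_sections(source, result)
```

With real files, the same function can be handed the two objects returned by open(), ideally inside a with block so both files are closed afterwards.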

fortran
A: 

find returns the index of the substring in the line, or -1 if it is not found. The start tag is probably at the beginning of the line (index zero), so the if doesn't work as it should.

Try:

if line.find("<STARTTAG>") != -1:

or even better

if "<STARTTAG>" in line:

or use an XML parser for Python.

Klark