ansaurus

Question

Answer 1

+2 A:

Is the same code producing the whole file? If so, use an XML library to generate it, and all tags will be nested correctly. If not, fix the code that produces it so that it emits valid XML.

Regexes and XML do not go together well.
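
A minimal sketch of the library-based approach, assuming Python 3 and the standard `xml.etree.ElementTree` module. The `sentences` list is made-up sample data, not the asker's actual generator output.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for the generated sentences.
sentences = ["some text", "some more text", "even more text"]

# Build the tree in memory; the serializer closes every element it opens.
story = ET.Element("Story")
for number, text in enumerate(sentences, start=1):
    sentence = ET.SubElement(story, "Sentence", id=str(number))
    sentence.text = text

# Serialize to a str (encoding="unicode" returns text rather than bytes).
xml_text = ET.tostring(story, encoding="unicode")
print(xml_text)
```

Because the closing tags come from the serializer, there is no separate step in which `</Story>` could be forgotten or misplaced.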

Mark 2010-06-24 06:17:01

The code that I am using generates a bunch of Sentence tags. I am simply trying to put a root tag around them so that the result is valid XML. Placing the opening `<Story>` tag was no big deal. I am stuck on its closing tag.

afs 2010-06-24 06:28:16

@afs: some reason you can't use `'<Story>' + sentences + '</Story>'`?
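
A rough sketch of that suggestion, assuming Python 3 and that the generated sentences are available as one string. The `sentences` value below is invented for illustration. Parsing the result with `xml.etree.ElementTree` is just a quick well-formedness check.

```python
import xml.etree.ElementTree as ET

# Hypothetical output of the sentence generator: a flat run of tags.
sentences = (
    '<Sentence id="1"> some text </Sentence>\n'
    '<Sentence id="2"> some text </Sentence>\n'
    '<Sentence id="3"> some text </Sentence>\n'
)

# Wrap the fragment in a single root element.
document = "<Story>\n" + sentences + "</Story>"

# fromstring raises ParseError if the document is not well-formed.
root = ET.fromstring(document)
print(root.tag, len(root))
```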

David Zaslavsky 2010-06-24 06:32:02

Answer 2

+1 A:

You really should use a parser like BeautifulSoup to do the job. BeautifulSoup can parse very incorrect HTML/XML and tries to make them look correct. Your code could look like this (I'm assuming you have some tags before and after your incorrect Story tag, or else you would follow the advice from David's comment):

from BeautifulSoup import BeautifulStoneSoup

html = '''
<Document>
<PrevTag></PrevTag>
<Story>
 <Sentence id="1"> some text </Sentence>   
 <Sentence id="2"> some text </Sentence>   
 <Sentence id="3"> some text </Sentence>
<EndTag></EndTag>
</Document> 
'''
# Parse the document:
soup = BeautifulStoneSoup(html)

Look how BeautifulSoup parsed it:

print soup.prettify()

#<document>
# <prevtag>
# </prevtag>
# <story>
#  <sentence id="1">
#   some text
#  </sentence>
#  <sentence id="2">
#   some text
#  </sentence>
#  <sentence id="3">
#   some text
#  </sentence>
#  <endtag>
#  </endtag>
# </story>
#</document>

Notice that BeautifulSoup closed the Story right before the closing of the tag that surrounded it (Document), so you have to move the closing tag next to the last sentence.

# Find the last sentence:
last_sentence = soup.findAll('sentence')[-1]

# Find the Story tag:
story = soup.find('story')

# Move all tags after the last sentence outside the Story tag:
sib = last_sentence.nextSibling
while sib:
    story.parent.append(sib.extract())
    sib = last_sentence.nextSibling

print soup.prettify()

#<document>
# <prevtag>
# </prevtag>
# <story>
#  <sentence id="1">
#   some text
#  </sentence>
#  <sentence id="2">
#   some text
#  </sentence>
#  <sentence id="3">
#   some text
#  </sentence>
# </story>
# <endtag>
# </endtag>
#</document>

The end result should be exactly what you wanted. Note that this code assumes there is only one Story in the document -- if not, it should be modified slightly. Good luck!

DzinX 2010-06-24 07:20:00

Answer 3

A:

If all you need is to find the last occurrence of the tag, you can:

import re

reSentenceClose = re.compile('</Sentence> *')
match = None
# Exhaust the iterator so that match ends up holding the last occurrence:
for match in reSentenceClose.finditer(your_text):
    pass

if match:  # it was found
    print(match.end())  # the index in your_text just past the end of the last match
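
To go from that index to a fixed document, one option is to splice the closing tag in at `match.end()`. This is only a sketch in Python 3 syntax: the sample `your_text` and the name `fixed` are assumptions, and it presumes a single unclosed Story.

```python
import re

# Hypothetical input: an opened <Story> that is never closed.
your_text = (
    "<Document>\n"
    "<Story>\n"
    ' <Sentence id="1"> some text </Sentence>   \n'
    ' <Sentence id="2"> some text </Sentence>   \n'
    ' <Sentence id="3"> some text </Sentence>\n'
    "<EndTag></EndTag>\n"
    "</Document>\n"
)

re_sentence_close = re.compile(r"</Sentence> *")

# Keep only the final match produced by the iterator.
match = None
for match in re_sentence_close.finditer(your_text):
    pass

if match:
    # Splice the closing tag in right after the last </Sentence>.
    end = match.end()
    fixed = your_text[:end] + "\n</Story>" + your_text[end:]
else:
    # No sentence found: leave the text unchanged.
    fixed = your_text

print(fixed)
```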

ΤΖΩΤΖΙΟΥ 2010-06-25 12:02:53

Answer 4

A:

Why not match all three (or however many) <Sentence> elements and plug them back in with a group reference?

# Python's re module spells backreferences \g<0> and \1, not $0 and $1.
re.sub(r'(?:(\r?\n) *<Sentence.*?</Sentence> *)+',
       r'\g<0>\1</Story>',
       line)
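
Here is a small runnable sketch of that substitution in Python 3, with made-up sample input. Python's `re` module writes the whole match as `\g<0>` and group 1 as `\1` in the replacement string. Note that every separate run of Sentence elements would get its own `</Story>`, so this assumes a single run.

```python
import re

# Hypothetical input with an unclosed <Story> tag.
text = (
    "<Story>\n"
    ' <Sentence id="1"> some text </Sentence>   \n'
    ' <Sentence id="2"> some text </Sentence>   \n'
    ' <Sentence id="3"> some text </Sentence>\n'
    "<EndTag></EndTag>\n"
)

# \g<0> re-emits the whole matched run of sentences; \1 is the newline
# captured in the last repetition, reused so line endings stay consistent.
fixed = re.sub(
    r"(?:(\r?\n) *<Sentence.*?</Sentence> *)+",
    r"\g<0>\1</Story>",
    text,
)

print(fixed)
```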

Alan Moore 2010-06-25 13:34:33


Capture the last occurrence of a tag
