views:

44

answers:

4

My text is of the form:

<Story>
 <Sentence id="1"> some text </Sentence>   
 <Sentence id="2"> some text </Sentence>   
 <Sentence id="3"> some text </Sentence>   

My task is to insert a closing tag </Story> after the last </Sentence>. In the text, every </Sentence> is followed by 3 spaces. I tried capturing the last </Sentence> using the regex </Sentence>(?!.*<Sentence) and used re.DOTALL too. But its not working.

Actual code used is
line = re.sub(re.compile('</Sentence>(?!.*<Sentence)',re.DOTALL),'</Sentence></Story>',line)

Please help. Thanks.

+2  A: 

Is the same code producing the whole file - if so then use an xml library to generate it then all tags will be nested correctly - if not fix the code producing it so that it is valid XML.

regexes and xml do not go together well.

Mark
The code that I am using generates a bunch of Sentence tags. I am trying to simply put a root tag around so that it becomes a valid xml. Placing the `<Story` tag was no bid deal. I am stuck with its closing tag.
afs
@afs: some reason you can't use `'<Story>' + sentences + '</Story>'`?
David Zaslavsky
+1  A: 

You really should use a parser like BeautifulSoup to do the job. BeautifulSoup can parse very incorrect HTML/XML and tries to make them look correct. Your code could look like this (I'm assuming you have some tags before and after your incorrect Story tag, or else you would follow the advice from David's comment):

from BeautifulSoup import BeautifulStoneSoup

html = '''
<Document>
<PrevTag></PrevTag>
<Story>
 <Sentence id="1"> some text </Sentence>   
 <Sentence id="2"> some text </Sentence>   
 <Sentence id="3"> some text </Sentence>
<EndTag></EndTag>
</Document> 
'''
# Parse the document:
soup = BeautifulStoneSoup(html)

Look how BeautifulSoup parsed it:

print soup.prettify()

#<document>
# <prevtag>
# </prevtag>
# <story>
#  <sentence id="1">
#   some text
#  </sentence>
#  <sentence id="2">
#   some text
#  </sentence>
#  <sentence id="3">
#   some text
#  </sentence>
#  <endtag>
#  </endtag>
# </story>
#</document>

Notice that BeautifulSoup closed the Story right before the closing of the tag that surrounded it (Document), so you have to move the closing tag next to the last sentence.

# Find the last sentence:
last_sentence = soup.findAll('sentence')[-1]

# Find the Story tag:
story = soup.find('story')

# Move all tags after the last sentence outside the Story tag:
sib = last_sentence.nextSibling
while sib:
    story.parent.append(sib.extract())
    sib = last_sentence.nextSibling

print soup.prettify()

#<document>
# <prevtag>
# </prevtag>
# <story>
#  <sentence id="1">
#   some text
#  </sentence>
#  <sentence id="2">
#   some text
#  </sentence>
#  <sentence id="3">
#   some text
#  </sentence>
# </story>
# <endtag>
# </endtag>
#</document>

The end result should be exactly what you wanted. Note that this code assumes there is only one Story in the document -- if not, it should be modified slightly. Good luck!

DzinX
A: 

If all you need is to find the last occurrence of the tag, you can:

reSentenceClose= re.compile('</Sentence> *')
match= None
for match in reSentenceClose.finditer(your_text):
    pass

if match: # it was found
    print match.end() # the index in your_text where the pattern was found
ΤΖΩΤΖΙΟΥ
A: 

Why not match all three (or however many) <Sentence> elements and plug them back in with a group reference?

re.sub(r'(?:(\r?\n) *<Sentence.*?</Sentence> *)+',
       r'$0$1</Story>',
       line)
Alan Moore