Hello,

I've got a big XML file I'm editing with BBEdit.

Within the XML file, which is a digital recreation of an old diary, is text that is enclosed in note tags.

<note>Example of a note.</note>

Some note tags, however, have quotations enclosed in quote tags nested in them.

<note>Example of a note, but <quote>"here is a quotation within the note"</quote></note>

I need to remove all instances of quote from the note tags, whilst keeping the actual content of the quote tags. So the example would become:

<note>Example of a note, but "here is a quotation within the note"</note>

I've used GREP in BBEdit to successfully remove some of these, but I'm beginning to get stuck with the more complicated note tags that go over several lines or have text between the two different sets of tags. For example:

<note>Example of a note, <quote>"with a quotation"</quote> and a <quote>"second quotation"</quote> along with some text outside of the quotation before the end of the note.</note>

Some quotations can go on for over 10 lines. Using \r in my regex doesn't seem to help.

I should also say that quote tags can exist outside of note tags, which rules out the possibility of just bulk finding /?quote and deleting it. I still need to use the quote tags within the document, just not within note tags.

Many thanks for any help.

+2  A: 

This is really easy with XSLT:

<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*" />
    </xsl:copy>
  </xsl:template>

  <!-- Unwrap only quote elements inside a note; quotes elsewhere are copied unchanged. -->
  <xsl:template match="note//quote">
    <xsl:apply-templates select="node()" />
  </xsl:template>
</xsl:stylesheet>

Apply this stylesheet to your XML file with an XSLT processor of your choice. Command-line tools such as xsltproc can do this, for example.

Tomalak
A: 

Without restrictions on how the XML is formed, I'm pretty sure that this goes beyond the scope of regular languages and into context-free ones, which means regular expressions alone are not going to help you. If the structure of the XML is simple (no notes nested in notes or quotes nested in quotes), you might be able to do something along the lines of a global replace of <note>((?:(?!</note>).)*?)<quote>(.*?)</quote>((?:(?!</note>).)*?)</note> with <note>\1\2\3</note>. For that to work across line breaks the dot has to match newlines (in PCRE-style engines, prefix the pattern with (?s)), and you have to repeat the replace until nothing matches, because each pass removes only one quote per note. Even so, you're probably using the wrong tool for the job. As one of the other answers notes, XSLT could help you, or you could use an XML parsing library to write a simple program to strip out the tags you're looking for.
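As a rough sketch of the "XML parsing library" route, here is how it might look in Python using only the standard library's xml.etree.ElementTree. The function names (unwrap_children, strip_inside, strip_quotes_in_notes) are made up for this illustration, and it assumes the element names note and quote from the question. Note that ElementTree drops comments and the DOCTYPE when parsing by default, so treat this as a starting point, not a finished tool.

```python
import xml.etree.ElementTree as ET


def unwrap_children(parent, tag):
    """Replace each direct child named `tag` with its own content, in place."""
    i = 0
    while i < len(parent):
        child = parent[i]
        if child.tag != tag:
            i += 1
            continue
        # The child's leading text moves to whatever comes just before it:
        # the parent's text (first child) or the previous sibling's tail.
        lead = child.text or ""
        if i == 0:
            parent.text = (parent.text or "") + lead
        else:
            parent[i - 1].tail = (parent[i - 1].tail or "") + lead
        # The text following the child moves to the end of its content.
        grandchildren = list(child)
        tail = child.tail or ""
        if grandchildren:
            grandchildren[-1].tail = (grandchildren[-1].tail or "") + tail
        elif i == 0:
            parent.text = (parent.text or "") + tail
        else:
            parent[i - 1].tail = (parent[i - 1].tail or "") + tail
        # Splice the grandchildren into the child's position.
        del parent[i]
        for offset, grandchild in enumerate(grandchildren):
            parent.insert(i + offset, grandchild)
        # i is not advanced, so spliced-in elements are examined too.


def strip_inside(elem, tag):
    """Unwrap every `tag` element at any depth below `elem`."""
    unwrap_children(elem, tag)
    for child in elem:
        strip_inside(child, tag)


def strip_quotes_in_notes(xml_text):
    """Return the XML with quote tags removed inside note elements only."""
    root = ET.fromstring(xml_text)
    for note in list(root.iter("note")):
        strip_inside(note, "quote")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        "<diary><note>Example of a note, <quote>\"with a quotation\"</quote>"
        " and a <quote>\"second quotation\"</quote> along with more text.</note>"
        "<quote>kept outside</quote></diary>"
    )
    print(strip_quotes_in_notes(sample))
```

Because the parser works on the tree and not on lines, quotations that run over many lines and notes with several quotations need no special handling, and quote elements outside a note are left alone.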

ngroot