I have written a simple C++ shell program to parse large XML files and fix syntax errors.

So far I have covered everything I can think of except unescaped quotes inside a quoted attribute value (strings within strings), for example:

<ROOT>
  <NODE attribute="This is a "string within" a string" />
</ROOT>

My program loops through the entire XML file character by character (keeping only a few characters in memory at a time for efficiency). It looks for characters such as &, < and > and escapes them as &amp;, &lt; and &gt; respectively. A basic example of what I am doing can be found in the accepted answer to this question: http://stackoverflow.com/questions/1817073/escaping-characters-in-large-xml-files

The question is: what conditions or logic can I use to detect the inner quotes around "string within", so that I can escape them like this:

<ROOT>
  <NODE attribute="This is a &quot;string within&quot; a string" />
</ROOT>

Is it even possible at all?

+1  A: 

I think it's difficult to decide where one attribute ends and the next begins. You need to restrict the input you are willing to parse, otherwise you will hit ambiguous cases such as this one:

<ROOT>
  <NODE attribute="This is a "string within" a string" attribute2="This is another "string within" a string" />
</ROOT>

These are either two attributes or one attribute.

One assumption you could make is that a new attribute begins after an even number of double quotes followed by a name and an equals sign. Then you simply replace all the inner double quotes with your escape string. Alternatively, treat any equals sign that follows two or more double quotes as the start of a new attribute. The same assumption can be applied to the end of the node.
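A rough sketch of one variant of that heuristic, assuming the input string holds exactly one start tag and that values are delimited by double quotes only. A quote inside a value is treated as the closing quote only if it is followed by the end of the tag or by whitespace and something that looks like `name="`; every other quote is escaped. The names `escapeInnerQuotes`, `looksLikeClosingQuote`, `isNameChar` and `skipSpace` are made up for this example, and the name check is ASCII-only.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Simplified, ASCII-only test for characters allowed in an attribute name.
static bool isNameChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) ||
           c == '_' || c == '-' || c == ':' || c == '.';
}

// Returns the index of the first non-whitespace character at or after i.
static std::size_t skipSpace(const std::string& s, std::size_t i) {
    while (i < s.size() && std::isspace(static_cast<unsigned char>(s[i]))) ++i;
    return i;
}

// Heuristic: the quote at s[pos] closes the value if it is followed by the
// end of the tag, or by whitespace and then `name="`.
static bool looksLikeClosingQuote(const std::string& s, std::size_t pos) {
    std::size_t i = skipSpace(s, pos + 1);
    if (i >= s.size()) return true;
    // ">" , "/>" or "?>" counts only if nothing but whitespace follows it.
    if (s[i] == '>') return skipSpace(s, i + 1) == s.size();
    if ((s[i] == '/' || s[i] == '?') && i + 1 < s.size() && s[i + 1] == '>')
        return skipSpace(s, i + 2) == s.size();
    if (i == pos + 1) return false;  // attributes must be separated by space
    std::size_t nameStart = i;
    while (i < s.size() && isNameChar(s[i])) ++i;
    if (i == nameStart) return false;
    i = skipSpace(s, i);
    if (i >= s.size() || s[i] != '=') return false;
    i = skipSpace(s, i + 1);
    return i < s.size() && s[i] == '"';
}

// Escapes double quotes that appear inside attribute values of one tag.
std::string escapeInnerQuotes(const std::string& tag) {
    std::string out;
    bool inValue = false;
    for (std::size_t i = 0; i < tag.size(); ++i) {
        if (tag[i] != '"') {
            out += tag[i];
        } else if (!inValue) {
            inValue = true;            // opening quote of a value
            out += '"';
        } else if (looksLikeClosingQuote(tag, i)) {
            inValue = false;           // closing quote of a value
            out += '"';
        } else {
            out += "&quot;";           // quote inside the value
        }
    }
    return out;
}
```

For the two-attribute example above this yields two attributes, each with its inner quotes escaped. It is still a guess: a value that really contains text like `" name="` will be split at that point, which is exactly the ambiguity described.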

Sebastian
+3  A: 

The better solution would be to fix these kinds of errors before they are created. XML is designed to be super strict precisely to avoid having to make these kinds of guesses. If the XML is invalid, the only thing you should do is reject it and output a helpful error message.

Who's to say that your correction:

<NODE attribute="This is a &quot;string within&quot; a string" />

is better than

<NODE attribute="This is a " string-within=" a string" />

Obviously, with the benefit of understanding English, we can be pretty certain that it's the former, but when you're taking an automated approach to it, there's no way to be certain that you're not covering up a more serious error.

The place to fix escaping issues is where the XML file is created.

Eclipse
The problem is that I have no power over how the XML is generated. It's given to me like this and I have to fix it myself. It's not a big problem, but I'd like to automate as much as I can.
Grym