The file format my application uses is XML-based. I just got a customer with a botched XML file: it contains nearly 90,000 lines and, for some reason, about 20 "=" symbols randomly interspersed.

I get an XmlException for most of them, with a line number and character position, which lets me find the offending characters and remove them manually. I've just started writing a small app that automates this process, but I was wondering whether there are better ways to repair damaged XML files.

Example of botched line:

<item name="InstanceGuid" typ=e_name="gh_guid" type_code="9">ee330f9f-a1e2-451a-8c6d-723f066a6bd4</item>
                             ↑ (this is supposed to be [type_name])
+1  A: 

You could search for any equal sign that isn't followed by a double quote. A regular expression (regex) would be pretty simple to write up.

Or you could just open the file in an advanced text editor that supports regex find/replace, search for any equal sign not followed by a double quote, and remove it.

Of course, I'd keep a copy of the original, since any equal signs in the inner XML (element text, for example) would match too and the replace would mess those up.
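
One way to limit that risk is to let the parser tell you which lines are actually broken and only strip equal signs there. Here's a rough sketch in Python (the question is about .NET, so treat it as an illustration to port; the names `repair_rogue_equals` and `max_fixes` are made up for this example). It assumes each rogue = sits on the line the parser reports and is not followed by a double quote:

```python
import re
import xml.etree.ElementTree as ET

# An "=" that is NOT immediately followed by a double quote.
ROGUE_EQ = re.compile(r'=(?!")')

def repair_rogue_equals(text, max_fixes=100):
    """Parse repeatedly; on each parse error, remove one suspicious "="
    from the reported line. Returns (repaired_text, fixed_line_numbers)."""
    lines = text.split("\n")
    fixed_lines = []
    while True:
        try:
            ET.fromstring("\n".join(lines))
            return "\n".join(lines), fixed_lines
        except ET.ParseError as err:
            if len(fixed_lines) >= max_fixes:
                raise  # safety cap: too many errors, stop guessing
            line_no = err.position[0]  # 1-based line number of the error
            new_line, count = ROGUE_EQ.subn("", lines[line_no - 1], count=1)
            if count == 0:
                raise  # some other kind of damage; needs a human
            lines[line_no - 1] = new_line
            fixed_lines.append(line_no)

broken = (
    '<root>\n'
    '<item name="InstanceGuid" typ=e_name="gh_guid" type_code="9">x</item>\n'
    '<note>a=b</note>\n'
    '</root>'
)
repaired, fixed = repair_rogue_equals(broken)
print(fixed)
print(repaired)
```

Lines that already parse (like the `<note>` line here) are never touched, so legitimate equal signs in element text on healthy lines survive. A broken line that also has an = in its text could still lose the wrong one, so the backup copy is still a good idea.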

Jim W
Thanks Jim, this will find most of the errors.
David Rutten
+1  A: 

Use a regular expression to clean the XML first.

Something like:

s/([^\s"]+)=([^\s"]+="[^"]*")/\1\2/

Obviously this would need to be ported to your Regex engine of choice :)
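
To give an idea of what a port might look like, here is the same substitution in Python (picked only for illustration; `fix_line` and `SPLIT_ATTR` are made-up names, and this is still a sketch rather than something battle-tested):

```python
import re

# Python port of the sed-style expression above:
#   group 1: the attribute-name fragment before the stray "="
#   group 2: the rest of the name together with its ="value" part
SPLIT_ATTR = re.compile(r'([^\s"]+)=([^\s"]+="[^"]*")')

def fix_line(line):
    # re.sub replaces every match on the line, as sed would with the "g" flag.
    return SPLIT_ATTR.sub(r'\1\2', line)

botched = ('<item name="InstanceGuid" typ=e_name="gh_guid" type_code="9">'
           'ee330f9f-a1e2-451a-8c6d-723f066a6bd4</item>')
print(fix_line(botched))
```

Because the pattern insists on a name fragment on both sides of the stray =, well-formed attributes such as `name="InstanceGuid"` are left alone.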

OJ
Thanks OJ, seems more and more of my problems these days can be fixed by RegEx.
David Rutten
I advise applying the above statement with caution. ;)
TrueWill
Without a doubt. The goal was to give an idea, not a production quality implementation. Hence the statement "something like" :)
OJ
+1  A: 

In TextPad, if you search using the regular expression =[^"] you will find any = signs not followed by a ".

This should find the locations in the document where the rogue = signs have appeared. To replace them, first open the document in TextPad. Then press F8.

In the dialog enter the following:

Find what: =\([^"]\)

Replace with: \1

Check the "Regular expressions" box, select "All documents" and click "Replace All".

This should match every = that isn't followed by a " and replace the two-character match with just the character that followed it, which drops the =.

typename="test" typ=ename="test"

will become

typename="test" typename="test"
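
If you'd rather script it than use an editor, the same find and replace can be sketched in Python (my assumption for illustration, not anything TextPad-specific; `find_rogue_equals` and `remove_rogue_equals` are made-up names). The first function only reports locations so you can review them before changing anything:

```python
import re

# Same idea as the TextPad search: "=" followed by any character except a double quote.
PATTERN = re.compile(r'=([^"])')

def find_rogue_equals(text):
    """Return (line_number, column) pairs, both 1-based, for each match."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in PATTERN.finditer(line):
            hits.append((line_no, match.start() + 1))
    return hits

def remove_rogue_equals(text):
    # Keep the character that followed the "=" (group 1) and drop the "=" itself.
    return PATTERN.sub(r'\1', text)

sample = 'typename="test" typ=ename="test"'
print(find_rogue_equals(sample))
print(remove_rogue_equals(sample))
```

Running the report first gives you a list to check against the parser's error positions before committing to the replace.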

Stuart Thompson