I have a large XML file which in the middle contains the following:

<ArticleName>Article 1 <START  </ArticleName>

Obviously libxml and other XML libraries can't read this because the less-than sign opens a new tag which is never closed. My question is, is there anything I can do to fix issues like this automatically (preferably in Ruby)? The solution should of course work for any field which has an error like this. Someone said SAX parsing could do the trick but I'm not sure how that would work.

+2  A: 

You could do a regular expression search-and-replace, looking for <(?=[^<>]*<) and replacing with &lt;.

In Ruby,

result = subject.gsub(/<(?=[^<>]*<)/, '&lt;')

The rationale is that you want to find every < that has no corresponding >. The regex therefore matches a < only if it is followed by another < with no > in between.
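
As a quick check of that rationale, here is a minimal sketch that applies the same regex to the line from the question and to a well-formed line. The helper name escape_stray_lt and the constant STRAY_LT are made up for illustration; only the regex and the gsub call come from the answer.

```ruby
# Illustrative helper (the name is hypothetical): escape every "<" that is
# followed by another "<" before any ">" appears, i.e. a "<" that opens a
# tag which is never closed.
STRAY_LT = /<(?=[^<>]*<)/

def escape_stray_lt(xml)
  xml.gsub(STRAY_LT, '&lt;')
end

# The broken line from the question: only the "<" before START is escaped.
broken = '<ArticleName>Article 1 <START  </ArticleName>'
puts escape_stray_lt(broken)
# => <ArticleName>Article 1 &lt;START  </ArticleName>

# A well-formed line passes through unchanged.
well_formed = '<ArticleName>Article 1</ArticleName>'
puts escape_stray_lt(well_formed) == well_formed
# => true
```

Note that a stray < which happens to be followed by a > before the next < (for example "a < b > c") would not be caught by this pattern.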

EDIT: Improved the regex by using lookahead. I first thought Ruby didn't support lookahead, but it does. It's lookbehind that isn't supported (in Ruby 1.8; Ruby 1.9 and later do support it).

Tim Pietzcker
Thanks a lot, it works great! Of course just after posting the question I came up with the idea of using a regular expression but it would have taken me the whole day to figure out the correct one so thank you :)
peku
I'm not sure if it's a concern, but embedded CDATA will be munged... subject = '<ArticleName><![CDATA[ a < b ]]></ArticleName>'; result = subject.gsub(/<(?=[^<>]*<)/, '&lt;') # => "<ArticleName>&lt;![CDATA[ a < b ]]></ArticleName>"
Greg
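
If CDATA sections do occur in the real input, one possible workaround for the caveat in the comment above is to let the regex consume whole CDATA sections first and copy them through untouched. This is only a sketch: the helper name escape_stray_lt_outside_cdata and the constant are hypothetical, and it assumes CDATA sections are themselves well-formed (each `<![CDATA[` has a closing `]]>`).

```ruby
# Hypothetical helper: the first alternative matches a complete CDATA section,
# so its contents are consumed and returned unchanged; the second alternative
# is the stray-"<" pattern from the accepted answer.
CDATA_OR_STRAY_LT = /<!\[CDATA\[.*?\]\]>|<(?=[^<>]*<)/m

def escape_stray_lt_outside_cdata(xml)
  # The block receives the matched text: a lone "<" is escaped,
  # anything else (a CDATA section) is kept as-is.
  xml.gsub(CDATA_OR_STRAY_LT) { |match| match == '<' ? '&lt;' : match }
end

cdata = '<ArticleName><![CDATA[ a < b ]]></ArticleName>'
puts escape_stray_lt_outside_cdata(cdata) == cdata
# => true

broken = '<ArticleName>Article 1 <START  </ArticleName>'
puts escape_stray_lt_outside_cdata(broken)
# => <ArticleName>Article 1 &lt;START  </ArticleName>
```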
Regex as the accepted answer?? The `<center>` cannot hold it is too late http://stackoverflow.com/questions/1732348#1732454
Andrew Grimm
@Andrew: Yes, the center cannot hold. But I think in this case since XML parsers choke and die on the malformed input, using a regex to clean it up is not the worst possible choice. The only caveat has been pointed out by Z.E.D., and if that's not a problem with the real input, then why not use a regex?
Tim Pietzcker
Check out my updated sample for Nokogiri. *SOME* XML parsers are written so they can deal with real-world XML. :-)
Greg
@Andrew: As stated, the solution worked for me so I'm not going to change the accepted answer.
peku
+2  A: 

Nokogiri supports some options for handling bad XML. These might help:

http://rubyforge.org/pipermail/nokogiri-talk/2009-February/000066.html
http://nokogiri.org/tutorials/ensuring_well_formed_markup.html

I just messed around with the broken fragment and Nokogiri handles it very nicely:

#!/usr/bin/ruby

require 'rubygems'
require 'nokogiri'

doc = Nokogiri::XML('<?xml version="1.0"?><ArticleName>Article 1 <START  </ArticleName></xml>')
doc.to_s  # => "<?xml version=\"1.0\"?>\n<ArticleName>Article 1 <START/></ArticleName>\n"
doc.errors # => [#<Nokogiri::XML::SyntaxError: error parsing attribute name

Greg
+1 for a more robust solution than regex.
Tim Pietzcker
Thank you for this input also, seems to be a more correct way to do things even though the regexp-answer worked in my case.
peku