ansaurus

Question

Repairing broken XML file - removing extra less-than/greater-than signs

Answer 1

+2 A:

You could do a regular expression search-and-replace, looking for <(?=[^<>]*<) and replacing it with &lt;.

In Ruby,

result = subject.gsub(/<(?=[^<>]*<)/, '&lt;')

The rationale is that you want to find every < that doesn't have a corresponding >. Therefore, the regex only matches a < if it is followed by another < without any > in between.
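As a quick illustration of that rationale, here is a minimal sketch that runs the same regex over the broken fragment quoted elsewhere in this thread. The constant name STRAY_LT and the variable names are only illustrative assumptions, not part of the original answer.

```ruby
# Matches a '<' that is followed by another '<' before any '>' appears,
# i.e. a '<' that is never closed. (STRAY_LT is a hypothetical name.)
STRAY_LT = /<(?=[^<>]*<)/

broken = '<ArticleName>Article 1 <START  </ArticleName>'
fixed  = broken.gsub(STRAY_LT, '&lt;')

# Well-formed markup contains no such '<', so it should pass through unchanged.
intact = '<ArticleName>Article 1</ArticleName>'.gsub(STRAY_LT, '&lt;')

puts fixed
puts intact
```

Note that the lookahead needs a later < to trigger, so a stray < that is the last < in the input would not be caught by this approach.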

EDIT: Improved the regex by using lookahead. I first thought Ruby didn't support lookahead, but it does. Just not lookbehind...

Tim Pietzcker 2010-03-26 08:44:00

Thanks a lot, it works great! Of course just after posting the question I came up with the idea of using a regular expression but it would have taken me the whole day to figure out the correct one so thank you :)

peku 2010-03-26 09:00:06

I'm not sure if it's a concern, but embedded CDATA will be munged...

subject = '<ArticleName><![CDATA[ a < b ]]></ArticleName>'
result = subject.gsub(/<(?=[^<>]*<)/, '&lt;') # => "<ArticleName>&lt;![CDATA[ a < b ]]></ArticleName>"
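One possible way around this, sketched here as an assumption rather than something proposed in the thread, is to let the regex consume whole CDATA sections first and hand them back untouched, escaping only the stray < characters found elsewhere. The names CDATA_OR_STRAY_LT and escape_stray_lt are hypothetical.

```ruby
# First alternative: an entire CDATA section (non-greedy, may span lines).
# Second alternative: the stray '<' pattern from the answer above.
CDATA_OR_STRAY_LT = /<!\[CDATA\[.*?\]\]>|<(?=[^<>]*<)/m

# Hypothetical helper: escapes stray '<' but copies CDATA sections verbatim.
def escape_stray_lt(xml)
  xml.gsub(CDATA_OR_STRAY_LT) do |match|
    match.start_with?('<![CDATA[') ? match : '&lt;'
  end
end

with_cdata = '<ArticleName><![CDATA[ a < b ]]></ArticleName>'
broken     = '<ArticleName>Article 1 <START  </ArticleName>'

puts escape_stray_lt(with_cdata)
puts escape_stray_lt(broken)
```

Because the CDATA alternative is tried first at each position, the opening <![CDATA[ is matched as part of the whole section and never reaches the stray-< branch.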

Greg 2010-03-26 10:29:38

Regex as the accepted answer?? The `<center>` cannot hold it is too late http://stackoverflow.com/questions/1732348#1732454

Andrew Grimm 2010-03-30 00:50:41

@Andrew: Yes, the center cannot hold. But I think in this case since XML parsers choke and die on the malformed input, using a regex to clean it up is not the worst possible choice. The only caveat has been pointed out by Z.E.D., and if that's not a problem with the real input, then why not use a regex?

Tim Pietzcker 2010-03-30 06:06:26

Check out my updated sample for Nokogiri. *SOME* XML parsers are written so they can deal with real-world XML. :-)

Greg 2010-03-31 07:19:36

@Andrew: As stated, the solution worked for me so I'm not going to change the accepted answer.

peku 2010-03-31 11:00:22

Answer 2

+2 A:

Nokogiri supports some options for handling bad XML. These might help:

http://rubyforge.org/pipermail/nokogiri-talk/2009-February/000066.html
http://nokogiri.org/tutorials/ensuring_well_formed_markup.html

I just messed around with the broken fragment and Nokogiri handles it very nicely:

#!/usr/bin/ruby

require 'rubygems'
require 'nokogiri'

doc = Nokogiri::XML('<?xml version="1.0"?><ArticleName>Article 1 <START  </ArticleName></xml>')
doc.to_s  # => "<?xml version=\"1.0\"?>\n<ArticleName>Article 1 <START/></ArticleName>\n"
doc.errors # => [#<Nokogiri::XML::SyntaxError: error parsing attribute name

Greg 2010-03-26 09:45:31

+1 for a more robust solution than regex.

Tim Pietzcker 2010-03-31 08:25:47

Thank you for this input also, seems to be a more correct way to do things even though the regexp-answer worked in my case.

peku 2010-03-31 10:58:53
