tags:

views:

23

answers:

1

I have an XML file (actually a Visual C# project file) that I want to manipulate using a Ruby script. I want to read the XML into memory, do some work on them that includes changing some attributes and some text (fixing up some path references), and then write the XML file back out. This isn't so hard.

The hard part is, I want the file I write to look the same as the file I read in, except where I made changes. If the input file used double quotes, I want the output to use double quotes. If the input had a space before />, I want the output to do the same. Basically, I want the output to be the same as the input, except where I explicitly made changes (which, in my case, will only be to attribute values, or to the text content of an element).

I want minimal diffs because this project file is checked into version control -- and because the next time I make a change in Visual Studio, it's going to rewrite it in its preferred format anyway. I want to avoid checking in a bunch of meaningless diffs that will then be changed back again in the near future. I also want to avoid having to open the project in Visual Studio, make a change, and save, before I can commit my Ruby script's changes. I want my Ruby script to just make its changes, nothing more.

I originally just parsed the file with regexes, but ran into cases where I really needed an XML library because I needed to know more about child elements. So I switched to REXML. But it makes the following undesirable changes to my formatting:

  • It changes all the attributes from double quotes to single quotes.
  • It escapes all the apostrophes inside attribute values (changing them to ').
  • It removes the space before />.
  • It sorts each element's attributes alphabetically, rather than preserving the original order.

I'm working around this by doing a bunch of gsub calls on REXML's output, but is there a Ruby XML-manipulation library that's a better fit for "minimal diff" scenarios?

+1  A: 

You can build your own SAX parser (using Nokogiri, for example, it's very easy and I recommend to use it) to parse your XML file, change some data in it, and flush the processed XML file with your own customized, built from scratch, XML generator. The bad news is, you have to build a tiny XML library and generator routine in this case, so it is not an ordinary task.

Another way: don't build the SAX parser, but write an XML generator. Parse XML with your favourite library, change what you need to change and generate anything you want. You just need to recursively walk through all nodes in your document and output them within your conventions.

floatless
I need to be able to look ahead (at an element's children) in order to decide what I need to rewrite, so using a streaming API wouldn't be enough by itself. I would need to load the document, make a plan of what I want to rewrite, and then do another pass to rewrite the file using the streaming API. It could probably be made to work, but it's on the painful side of what I was hoping for.
Joe White