I need to manage XML documents in Subversion, but I don't want to manage their formatting, which may differ depending on who edits the file.

I see two solutions: either reformat the file to a known format before each check-in, or give svn a diff program that ignores formatting. Ideally, that diff program would also support three-way merges that ignore XML formatting.

What do you recommend?

(The same reasoning usually applies to source code files, but the problem is harder there.)

A: 

If by "formatting" you mean "whitespace", you can make svn diff ignore whitespace by passing the -w switch to the diff command:

$ svn diff -x -w [file]

See svn help diff for more information.

sirlancelot
I'm afraid that is not entirely enough. The document may be formatted to sit on a single line, for example. I think it's a de facto standard for some documents to respect newlines only in <xsl:text/> nodes.
Hugo
Although that's a very good and simple answer of course.
Hugo
There are many issues other than whitespace, such as encoding. If you save a UTF-8 document as UTF-16, it is the same infoset but every byte is different. See ftp://ftp.logilab.org/pub/xmldiff/ for an XML-aware diff program.
bortzmeyer
I cannot find a -w option in svn diff (see the manual you linked to). It requires an external diff such as GNU diff.
bortzmeyer
@bortzmeyer, the `-x` is for passing arguments into the diff program used. `-w` tells diff to ignore whitespace: http://www.linuxmanpages.com/man1/diff.1.php
sirlancelot
+1  A: 

Do you consider the following two xml fragments to be the same...?

Fragment1:

<foo xmlns="http://foo.com/foo">
    <bar>Hello</bar>
</foo>

Fragment2:

<ns1:foo xmlns:ns1="http://foo.com/foo">
    <ns1:bar>Hello</ns1:bar>
</ns1:foo>

... because if you do (as these fragments have the same xml infoset) then you need to consider writing your own diff tool.
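If prefix differences are the only concern, a full custom diff tool may not be needed. As a minimal sketch, assuming Python 3.8 or later is available, the standard library's xml.etree.ElementTree.canonicalize can rewrite every namespace prefix to a generated one before comparing. The helper name same_modulo_prefixes is made up for this illustration.

```python
from xml.etree.ElementTree import canonicalize  # available since Python 3.8

FRAGMENT_1 = """<foo xmlns="http://foo.com/foo">
    <bar>Hello</bar>
</foo>"""

FRAGMENT_2 = """<ns1:foo xmlns:ns1="http://foo.com/foo">
    <ns1:bar>Hello</ns1:bar>
</ns1:foo>"""


def same_modulo_prefixes(a: str, b: str) -> bool:
    """Hypothetical helper: compare two documents after replacing
    every namespace prefix with a generated one."""
    return (canonicalize(a, rewrite_prefixes=True)
            == canonicalize(b, rewrite_prefixes=True))


if __name__ == "__main__":
    # Plain canonicalization keeps the original prefixes.
    print(canonicalize(FRAGMENT_1) == canonicalize(FRAGMENT_2))
    # With prefix rewriting, both fragments use the same generated prefix.
    print(same_modulo_prefixes(FRAGMENT_1, FRAGMENT_2))
```

This only decides whether two documents are equivalent; it does not produce a diff or a three-way merge.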

Martin Peck
Writing your own diff tool would be stupid; there are already XML-aware diff tools such as xmldiff (ftp://ftp.logilab.org/pub/xmldiff/). The original question was how to integrate them.
bortzmeyer
Good point. To an XML parser those are exactly the same. It depends, though, on whether I want to version-control a change between the two formats. I wouldn't want to if an editor made the change automatically, but I would if it was a human-edited XML file.
Hugo
+3  A: 

I don't have personal experience with such a setup.

For the second method (a custom diff), what I've found is an example, "API description for Netopeer repository library" which is a detailed description of a setup with Subversion and, among other things, xmldiff.

For the other approach, converting to a known format before storing in Subversion, I recommend Canonical XML as the format. The xmllint tool, for instance, can convert to this format:

% cat complique.xml
<?xml version="1.0" encoding="utf-8"?>
<toto   >
    <truc      a="1" >Machin &#x43; </truc >café</toto>

% xmllint --c14n complique.xml    
<toto>
    <truc a="1">Machin C </truc>café</toto>

To integrate with Subversion, you could test in pre-commit that the submitted file is equal to the canonical file. See the enforcer script for an example.
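The pre-commit test itself could look roughly like the following sketch. It uses Python's xml.etree.ElementTree.canonicalize (Python 3.8+) instead of xmllint, which is an assumption on my part: Python implements C14N 2.0, so its output may differ from xmllint --c14n on some inputs. The function name is_canonical and the tolerance for one trailing newline are also illustrative choices. In a real hook the text would presumably come from svnlook cat run against the pending transaction.

```python
from xml.etree.ElementTree import ParseError, canonicalize  # Python 3.8+


def is_canonical(text: str) -> bool:
    """Hypothetical check: True if `text` is already in the canonical
    form that ElementTree would produce for it."""
    try:
        canonical = canonicalize(text)
    except ParseError:
        # Not well-formed XML, so it cannot be canonical.
        return False
    # Assumption: accept one trailing newline, which many editors add.
    return text in (canonical, canonical + "\n")


if __name__ == "__main__":
    submitted = '<toto   ><truc      a="1" >Machin &#x43; </truc >café</toto>'
    if not is_canonical(submitted):
        # A hook script would write this to stderr and exit non-zero.
        print("rejected: file is not in canonical form")
```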

bortzmeyer
That's a really really good answer. Unfortunately the canonical format retains white space, which makes it useless for me. I bet it can be extended to canonicalize text nodes too. Then it will be perfect.
Hugo
Of course: in XML, you cannot know if the whitespace is significant or not without knowing the schema. Canonical XML is schema-less and so cannot canonicalize whitespace. You'll have to write your own canonicalizer.
bortzmeyer
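For documents where whitespace around text is known not to matter, a small home-made canonicalizer of the kind described above might be sketched as follows. It relies on the strip_text option of Python's xml.etree.ElementTree.canonicalize (Python 3.8+), which strips whitespace before and after text content without consulting any schema. The name normalize and the sample documents are made up for this illustration.

```python
from xml.etree.ElementTree import canonicalize  # Python 3.8+

PRETTY = """<toto>
    <truc a="1">Machin</truc>
</toto>"""

COMPACT = '<toto><truc a="1">Machin</truc></toto>'


def normalize(text: str) -> str:
    """Hypothetical normalizer: canonical form with leading and
    trailing whitespace removed from all text content."""
    return canonicalize(text, strip_text=True)


if __name__ == "__main__":
    # Both layouts collapse to the same string.
    print(normalize(PRETTY) == normalize(COMPACT))
```

Because no schema is involved, this also strips whitespace that may be significant (for example inside <xsl:text/>), so it only suits documents where that is acceptable.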
A: 

Just get all your developers on the same page. They should all be coding to the same standard anyway. If your coders are all using different formatting standards, you've got bigger problems to solve.

KOGI
Thanks for your pragmatic answer. This is the easiest way to go, but it will not help me if it cannot be enforced or if the tools producing the XML do not produce deterministic formatting.
Hugo
Not a bad idea and you *can* enforce it. Decide on a standard, document it and ask the programmers to follow it. Then, in pre-commit, test if the file is "canonical" and reject the commit if not.
bortzmeyer