Is there any reason why XML such as this :

<person>    
    <firstname>Joe</firstname>    
    <lastname>Plumber</lastname>
</person>

couldn't be compressed like this for client/server transfer.

<person>    
    <firstname>Joe</>    
    <lastname>Plumber</>
</>

It would be smaller - and slightly faster to parse.

Assuming that there are no edge conditions meaning this wouldn't work - are there any libraries to do such a thing?

This is a hard thing to google, it turns out:

Your search - </> - did not match any documents.

Suggestions:

Try different keywords.

Edit: There seems to be some confusion about what I'm asking. I am talking about my own form of compression. I am fully aware that as it stands this is NOT XML. The server and client would have to be 'in on the scheme'. It would be especially helpful for schemas that have very long element names, because the bandwidth taken up by those element names would be roughly halved.

+5  A: 

That's not valid XML. Closing tags must be named. It's potentially error prone otherwise and frankly I think it'd be less readable your way.

In reference to your clarification that this is a deliberate violation of the XML standard to save a few bytes, it is an incredibly bad idea for several reasons:

  1. It's nonstandard and may have to be supported far into the future;
  2. Standards exist for a reason. Standards and conventions have a lot of power and having "custom XML" ranks up there with Ivory Tower graphic designers who force programmers to write a custom button replacement because the standard one can't do whatever weird, wonderful and confusing behaviour was dreamt up;
  3. Gzip compression is easy, far more effective, and won't break standards. If you see a gzip octet stream, there's no mistaking it for XML. The real problem with the shorthand scheme you've got is that it still looks like XML at the top, so some poor unsuspecting parser may make the mistake of thinking it's valid and bomb out with a different, misleading error;
  4. Information theory: compression works by removing redundant information. If you remove that redundancy by hand, gzip compression simply has less left to remove, because the same amount of information is represented either way;
  5. There is significant overhead in converting documents to and from this scheme. It can't be done with a standard XML parser, so you'd effectively have to write your own XML parser and serializer that understand this scheme (conversion to this format can actually be done with a standard parser; getting it back is more difficult), which is a lot of work (and a lot of bugs).
cletus
@cletus obviously it's not valid XML, and that's why I'm calling it compression. That's why I asked for a library - which both ends would have to use. It's not supposed to be readable. And I know there are PLENTY of ways to compress data, but I just randomly thought of this and had never heard of it.
Simon_Weaver
Yep - you cannot tell which tag is being ended.
icelava
@icelava but you'd be closing the most recently opened tag. Tags are nested - that's how XML works.
Simon_Weaver
@Simon: you want compression in XML? Use gzip. Cutting a few characters off closing tags is nothing compared to the 90-95% (typical) compression rate you get with gzip (or equivalent) on plaintext XML.
cletus
@cletus I'm playing devil's advocate a little with this question. Definitely gzip is much more important. I'd be curious actually how much difference it would make to remove all the end tags - 2%? 5%? Definitely somewhat negligible.
Simon_Weaver
While you're at it, try changing person = p, firstname = f, ...
Orkun Balkancı
A: 

Sorry, that's not in the spec. If you have a big XML file, you'd better compress it via zip, gzip, and the like.

+5  A: 

If you need better compression and easier parsing, you may try using XML attributes:

<person firstname="Joe" lastname="Plumber" />
Boris Pavlović
Thanks. This was just the shortest piece of XML I could come up with. I much prefer the use of attributes for logical properties belonging to the element - it definitely helps with compression too.
Simon_Weaver
+1  A: 

You may be interested to read about the different tag formats in SGML. For example, the following could be valid SGML:

<p/This paragraph contains a <em/bold/ word./

Fortunately, the designers of XML chose to omit this particular chapter of madness.

Greg Hewgill
A: 

Is there any reason you aren't using YAML or JSON?

Karan
No, I'm just talking theoretically, but thanks for the mention.
Simon_Weaver
+2  A: 

Even if this were possible, it could only take longer to parse, because now the parser has to work out what's being closed and has to keep checking that that's correct.

If you want compression, XML is highly gzip'able.

annakata
Actually 100% wrong - an XML parser has to check whether the closing tag matches the top of the parse stack. This scheme doesn't have to check; </> always closes the top of the parse stack.
MSalters
But how does it know if that's correct? If you get a mismatch in opens/closes it's impossible to tell where that happened.
annakata
A: 

Yes, XML is a kind of heavy format, but it has certain advantages.

If you think XML is too heavy for your use, have a look at JSON instead. It is lightweight but has less functionality than XML.

And if you want really small files, use a binary format ;-).

Gamecat
+8  A: 

If you wrote a compression routine which did that, then yes, you could compress a stream and restore it at the other end.

The reasons this isn't done are:

  • much better XML-agnostic compression schemes already exist, in terms of compression ratio and probably in terms of CPU and space - for example, a contrived 7N-byte UTF-8 document of N nested single-character elements would get about 14% compression from this scheme but require at least 2N bytes of space to decompress, rather than the constant space required by most decompression algorithms (a small sketch after this list illustrates the arithmetic);
  • much better XML-aware compression schemes already exist (google 'binary xml'). For schema-aware compression, the schemes based on ASN.1 do much better than merely halving the space devoted to indicating the element type;
  • the decompressor must parse the non-standard XML and keep a stack of the open tags it has encountered, so unless you're plugging it in in place of a parser, you have doubled the parsing cost. If you do plug it in in place of the parser, you're mixing different layers, which is liable to cause confusion at some point.
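
A small sketch of that pathological case in Python (my own illustration, not from the answer, assuming single-character element names):

# N nested single-character elements, then the matching close tags.
N = 100_000
doc = b"<a>" * N + b"</a>" * N      # 7N bytes of standard XML
short = b"<a>" * N + b"</>" * N     # 6N bytes under the proposed scheme

print(1 - len(short) / len(doc))    # ~0.14, i.e. about 14% smaller
# Restoring the names needs a stack remembering all N open tags (at least
# ~2N bytes), whereas gzip/deflate decompression runs in constant memory.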
Pete Kirkham
+1 for good reasoning
David Schmitt
+4  A: 

If size of the data is any issue at all, XML is not for you.

peterchen
+5  A: 

As you say, this isn't XML, so why even make it look like XML? You've already lost the ability to use any XML parsers or tools. I would either:

  • Use XML, and compress it on the wire as you'll see far greater savings than with your own scheme
  • Use another more compact format like YAML or JSON
Paul Dixon
Very good point. It's at least human-readable, but then who cares, right?
Simon_Weaver
YAML is readable too though...
Paul Dixon
It is directly mappable to XML at parse time (assuming you're building a DOM), but perhaps that is a moot point.
Simon_Weaver
A: 

If you're not using gzip or anything like that, I'd simply replace each tag with a shorter tag name before sending, and map it back before using the XML on the receiving end. Thus you'd get something like this:

<a>
    <b>Joe</b>
    <c>Plumber</c>
</a>

This makes it very easy to use any standard parser to iterate through all the nodes and replace the node names accordingly.
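
A minimal sketch of that mapping in Python, assuming both ends agree on the dictionary (the names here are illustrative, not from any existing library):

import xml.etree.ElementTree as ET

# Shared between client and server; both ends must agree on it.
SHORT = {"person": "a", "firstname": "b", "lastname": "c"}
LONG = {v: k for k, v in SHORT.items()}

def rename(xml_text, mapping):
    """Rewrite every element name through the given mapping."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.tag = mapping.get(elem.tag, elem.tag)
    return ET.tostring(root, encoding="unicode")

wire = rename("<person><firstname>Joe</firstname><lastname>Plumber</lastname></person>", SHORT)
# wire == '<a><b>Joe</b><c>Plumber</c></a>'
restored = rename(wire, LONG)       # back to the original element names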

svinto
A: 

Do not bother with in-text optimizations of your XML that degrade reading/writing performance and simplicity. Use deflate compression on the payload between the client and the server. I made some tests, and compressing a normal 10k XML file results in a 2.5k blob. Removing all the end tag names lowers the original file size to 9k, but once deflated it's again 2.5k. This is a very good example of how dictionary-based compression is the simple way to compress payloads between endpoints: a long end tag like "</firstname>" and the bare "</>" will (almost) use the same space in the compressed data.

The only exception would be very small files/data, which are less compressible.
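
For illustration, that kind of comparison can be reproduced with a few lines of Python's zlib; the file name below is hypothetical and the exact numbers will vary with the document:

import re
import zlib

with open("payload.xml", "rb") as f:    # any sample XML document
    xml = f.read()

# Strip the names from end tags, as in the question's scheme.
stripped = re.sub(rb"</[^>]+>", b"</>", xml)

print("original:           ", len(xml))
print("end tags stripped:  ", len(stripped))
print("original, deflated: ", len(zlib.compress(xml)))
print("stripped, deflated: ", len(zlib.compress(stripped)))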

Martin Plante
+3  A: 

What you are describing is SGML, which uses </> to end the most recently opened element.

dalle
+4  A: 

Is there any reason why

Taking your question philosophically, SGML did allow </> close tags. There was debate about allowing this into the XML standard. The reasoning for rejecting it was that omitting the names from end tags would sometimes result in less readable XML. So, that is a "reason why".

It's hard to beat existing text compression rates, but one advantage of your "compression" scheme is that the XML remains human-readable on the wire. Another advantage is that if you have to enter XML by hand (e.g. for testing), it's a (minor) convenience not to have to type out end tag names. That is, it's more human-writable than standard XML. I say "minor", because most editors will do string completion for you (e.g. ^n and ^p in vim).

To strip the close tags, the simplest approach is a substitution like this: s_</[a-zA-Z0-9_$]+>_</>_ (that's not the right QName regex, but you get the idea).

To add them back: you need a special parser, because SAX and other XML parsers won't recognize this (as it's not "XML"). But the (simplest) parsing just needs to recognize open tag names and close tag names.

have a stack.
scan the XML, and output it, as-is.
if you recognize an open tag, push its name.
if you recognize close tag, pop to get its name, and
  insert that in the output (you can do this even when there is a proper close tag).

BTW (in response to a comment above), this works because in XML a close tag can only ever correspond to the most recent open tag. Same as nested parentheses.
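
A minimal sketch of that restore step in Python (the regex, function name, and handling are my own; it assumes plain elements with no prolog, comments, CDATA, or '>' inside attribute values):

import re

TOKEN = re.compile(r"<(/?)([^\s/>]*)([^>]*)>")

def restore_end_tags(text):
    """Re-insert element names into anonymous </> end tags."""
    out, pos, stack = [], 0, []
    for m in TOKEN.finditer(text):
        out.append(text[pos:m.start()])
        slash, name, rest = m.groups()
        if slash:                                 # close tag: pop and name it
            out.append("</%s>" % stack.pop())
        else:
            out.append(m.group(0))                # open tag: copy as-is
            if not rest.rstrip().endswith("/"):   # skip self-closing <tag/>
                stack.append(name)
        pos = m.end()
    out.append(text[pos:])
    while stack:                                  # close anything left open at EOF
        out.append("</%s>" % stack.pop())
    return "".join(out)

print(restore_end_tags("<person><firstname>Joe</><lastname>Plumber"))
# <person><firstname>Joe</firstname><lastname>Plumber</lastname></person>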

However, I think you're right, that someone has surely done this already. Maybe check Python or Perl repositories?

EDIT: You can further omit trailing </>, so your example becomes (when the parser sees EOF, it adds close tags for whatever's left on the stack):

<person>    
    <firstname>Joe</>    
    <lastname>Plumber
13ren
Great historical info, thanks
Thomas Ahle