views: 160

answers: 5
I did some quick searching on the site and couldn't find the answer I was looking for, so: what are some best practices for passing large XML files across a network? My thought on the matter is to stream the file across the network in manageable chunks, but I am looking for other approaches and best practices. I realize that "large" is a relative term, so I will let you choose an arbitrary value to be considered large.

In case there is any confusion, the question is: "What are some best practices for sending large XML files across networks?"

Edit:

I am seeing compression mentioned a lot. Is there a particular compression algorithm that could be used, and what about decompressing the files on the receiving end? I have no desire to roll my own when I know there are proven algorithms out there. Also, I appreciate the responses so far.

+1  A: 

Compression is an obvious approach. XML is verbose and highly repetitive text, so it will shrink dramatically under even a general-purpose compressor.
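To make the compression step concrete, here is a minimal sketch in Java using only the standard library's gzip support; the file names are placeholders, and gzip is just one reasonable default (see the compression discussion in the other answers).

    import java.io.*;
    import java.nio.file.*;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    public class GzipXml {
        // Compress an XML file to .gz before transmission.
        static void compress(Path xml, Path gz) throws IOException {
            try (InputStream in = Files.newInputStream(xml);
                 OutputStream out = new GZIPOutputStream(Files.newOutputStream(gz))) {
                in.transferTo(out);   // Java 9+; copy in a loop on older JDKs
            }
        }

        // Decompress a received .gz file back to XML.
        static void decompress(Path gz, Path xml) throws IOException {
            try (InputStream in = new GZIPInputStream(Files.newInputStream(gz));
                 OutputStream out = Files.newOutputStream(xml)) {
                in.transferTo(out);
            }
        }

        public static void main(String[] args) throws IOException {
            compress(Path.of("large.xml"), Path.of("large.xml.gz"));
            decompress(Path.of("large.xml.gz"), Path.of("roundtrip.xml"));
        }
    }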

Hamish Grubijan
+2  A: 

Depending on how large it is, you might want to consider compressing it first. This, of course, depends on how often the same data is sent and how often it changes.

To be honest, the vast majority of the time, the simplest solution works fine. I'd recommend transmitting it the easiest way first (which is probably all at once), and if that turns out to be problematic, keep on segmenting it until you find a size that's rarely disrupted.

Eli
Let's say, for the sake of learning the proper approaches, that I have one large file that changes constantly and another large file that rarely changes.
Woot4Moo
Testing is really the best way to see what works best. Gzip then send, repeat 1000 times, see what the total time is. Compare to sending without zipping. Also be sure to account for transmission errors.
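A rough sketch of that kind of measurement, assuming the file fits in memory: it only times the compression loop and reports sizes, and the actual send step (socket or HTTP) would be timed the same way with and without gzip. The file name is a placeholder.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.zip.GZIPOutputStream;

    public class CompressionBenchmark {
        public static void main(String[] args) throws IOException {
            byte[] xml = Files.readAllBytes(Path.of("large.xml")); // placeholder file name

            long start = System.nanoTime();
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            for (int i = 0; i < 1000; i++) {
                buf.reset();
                try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
                    gz.write(xml);
                }
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;

            System.out.printf("raw: %d bytes, gzip: %d bytes, 1000 compressions took %d ms%n",
                    xml.length, buf.size(), elapsedMs);
            // Time the send itself (socket/HTTP) the same way, with and without gzip.
        }
    }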
Eli
A: 

If you can keep a local copy and two copies at the server, you could use diffxml to reduce what you have to transmit down to only the changes, and then bzip2 the diffs. That would reduce the bandwidth requirement a lot, at the expense of some storage.

Andrew McGregor
A: 

Are you reading the XML with a proper XML parser, or are you reading it with expectations of a specific layout?

For XML data feeds, waiting for the entire file to download can be a real waste of memory and processing time. You could write a custom parser, perhaps using a regular expression search, that looks at the XML line by line, provided you can guarantee that the XML will not have any line breaks within tags.

If you have code that can digest the XML a node at a time, then emit it a node at a time, using something like Transfer-Encoding: chunked. You write the length of each chunk (in hex), followed by the chunk data, then the next chunk, and finish with a zero-length chunk. To save bandwidth, gzip each chunk.
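Here is a sketch of that idea in Java, with the endpoint URL and file name as placeholders: StAX forwards the XML one event (node) at a time, and HttpURLConnection's chunked streaming mode produces Transfer-Encoding: chunked without buffering the whole document. Note that this sketch gzips the whole body rather than each chunk individually, which is the more common HTTP arrangement.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.zip.GZIPOutputStream;
    import javax.xml.stream.*;

    public class StreamXmlChunked {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://example.com/upload");   // placeholder endpoint
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setChunkedStreamingMode(8192);                // Transfer-Encoding: chunked
            conn.setRequestProperty("Content-Type", "application/xml");
            conn.setRequestProperty("Content-Encoding", "gzip");

            XMLInputFactory inFactory = XMLInputFactory.newInstance();
            XMLOutputFactory outFactory = XMLOutputFactory.newInstance();

            try (OutputStream gz = new GZIPOutputStream(conn.getOutputStream())) {
                XMLEventReader reader =
                        inFactory.createXMLEventReader(Files.newInputStream(Path.of("large.xml")));
                XMLEventWriter writer = outFactory.createXMLEventWriter(gz);
                while (reader.hasNext()) {
                    writer.add(reader.nextEvent());            // forward one node/event at a time
                }
                writer.close();
                reader.close();
            }
            System.out.println("server responded: " + conn.getResponseCode());
        }
    }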

Shanti
Sorry, I had to downvote because of the suggestion of using a regex on XML, which from what I have seen is a bad idea. Thanks for the second part, though.
Woot4Moo
+2  A: 

Compressing and reducing XML size has been an issue for more than a decade now, especially in mobile communications where both bandwidth and client computation power are scarce resources. The final solution used in wireless communications, which is what I prefer to use if I have enough control on both the client and server sides, is WBXML (WAP Binary XML Spec).

This spec defines how to convert XML into a binary format which is not only compact, but also easy to parse. This is in contrast to general-purpose compression methods, such as gzip, that require high computational power and memory on the receiver side to decompress and then parse the XML content. The only downside to this spec is that an application token table must exist on both sides: a statically defined code table holding binary values for all possible tags and attributes in the application-specific XML content. Today, this format is widely used in mobile communications for transmitting configuration and data in most applications, such as OTA configuration and Contact/Note/Calendar/Email synchronization.

For transmitting large XML content using this format, you can use a chunking mechanism similar to the one proposed in SyncML protocol. You can find a design document here, describing this mechanism in section "2.6. Large Objects Handling". As a brief intro:

This feature provides a means to synchronize an object whose size exceeds that which can be transmitted within one message (e.g. the maximum message size, declared in the MaxMsgSize element, that the target device can receive). This is achieved by splitting the object into chunks that will each fit within one message and sending them contiguously. The first chunk of data is sent with the overall size of the object and a MoreData tag signaling that more chunks will follow. Every subsequent chunk is sent with a MoreData tag, except for the last one.
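The following is not the SyncML message format itself, just a small Java illustration of the splitting logic the quoted section describes: every chunk but the last carries a MoreData-style flag, and the overall object size is known up front (SyncML sends it only with the first chunk; here it is carried on every chunk for simplicity).

    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    public class LargeObjectSplitter {
        // A chunk carries its payload plus a flag mirroring SyncML's MoreData element.
        record Chunk(int totalSize, byte[] data, boolean moreData) {}

        // Split an object into pieces that each fit within maxChunkSize bytes.
        static List<Chunk> split(byte[] object, int maxChunkSize) {
            List<Chunk> chunks = new ArrayList<>();
            for (int offset = 0; offset < object.length; offset += maxChunkSize) {
                int end = Math.min(offset + maxChunkSize, object.length);
                byte[] piece = new byte[end - offset];
                System.arraycopy(object, offset, piece, 0, piece.length);
                boolean more = end < object.length;   // every chunk but the last says "more data"
                chunks.add(new Chunk(object.length, piece, more));
            }
            return chunks;
        }

        public static void main(String[] args) {
            byte[] payload = "<contacts>...a large XML document...</contacts>"
                    .getBytes(StandardCharsets.UTF_8);
            for (Chunk c : split(payload, 16)) {
                System.out.printf("chunk of %d bytes (object total %d), moreData=%b%n",
                        c.data().length, c.totalSize(), c.moreData());
            }
        }
    }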

Amir Moghimi
As this was the most descriptive and informative post, I selected it as the answer.
Woot4Moo