tags:

views: 188

answers: 5

Hi!

The case: there is a large zipped XML file that needs to be parsed by a .NET program. The main issue is the file's size: it is too big to be fully loaded into memory and unzipped.

The file needs to be read part by part, in such a way that the parts are "consistent" after unzipping. If a part contains only half of a node, it will not be possible to parse it as any XML structure.

Any help will be appreciated. :)

Edit: The current solution extracts the whole zip file part by part and writes it to disk as an XML file, then reads and parses that XML file. No better ideas so far on my side :).

+1  A: 

Have you tried the DotNetZip library (click on this link)?

In reply to your recent edit:
What you are doing is the standard flow / way of doing it.
As far as I know, there are no alternatives to this.

infant programmer
+1  A: 

You could give SharpZipLib a try and then use an XmlReader to start parsing it.

Rubens Farias
A: 

Regarding your edit: unless you actually want to have that XML file on disk (which could of course be the case in some scenarios), I would extract it to a MemoryStream instead.

Svish
Here is the problem: the file is too big to extract into memory. Imagine a really big file...
Alex
Ah, that big :p Then I suppose not. Unless you could cook up some way of just streaming the contents: kind of unzipping, reading, using, and throwing away, all in a stream. But I don't know whether you can do that with zip files or not...?
Svish
In fact it can be done with zip files; I just don't know how much to read at a time to get valid XML. Put another way, the algorithm you described breaks at the "using" step :).
Alex
Breaks on using?
Svish
Well, I mean following your algorithm: streaming the content - OK; unzipping - OK; reading - OK; using - problem, if the part we've read is not valid XML.
Alex
Yeah, but you would just have to read until you have a valid chunk, then stop reading, deal with that chunk, and then read a new one.
Svish
There are classes that allow you to read zips as streams, as for example with DotNetZip. Using those classes, there's no need for your code to make sure "the part you've read is valid XML". The XmlReader takes care of that for you: it reads until it gets what it needs. See the code I provided in my answer. http://stackoverflow.com/questions/2040824/read-a-zipped-xml-with-net/2042969#2042969
Cheeso
well there you go :)
Svish
A: 

Hmmm, you have two problems here: unzipping the file in a manner that gives you chunks of data, and a method for reading the XML from just those chunks at a time. This is different to how most of us are used to dealing with XML, where we read it all into memory in one go, but you say that's not an option.

This means you are going to have to use streams, which are built for just this case. This solution will work, but it might be limited depending on what you are hoping to do with the XML data. You say it needs to be parsed, but the only way you will be able to do that (as you can't keep it in memory) is to read it in a "fire hose" manner, stepping through each node as it's parsed. Hopefully that's enough to pull out the data you need or to process it however you need to (poke it into a DB, extract only the sections you are interested in and save them into a smaller in-memory XML doc, etc.).
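To make the "fire hose" idea concrete, here is a minimal sketch of a forward-only XmlReader loop. The document and the `order` element name are made up for illustration; in the real scenario the reader would be created over the zip stream instead of a string:

```csharp
using System;
using System.IO;
using System.Xml;

public class FireHoseDemo
{
    public static void Main()
    {
        // Hypothetical document standing in for the huge XML file.
        string xml = "<orders><order id=\"1\"><total>10</total></order>"
                   + "<order id=\"2\"><total>25</total></order></orders>";

        int count = 0;
        // XmlReader is forward-only: each node is visited once and then
        // discarded, so memory use stays flat no matter how big the input is.
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "order")
                {
                    count++; // process the node here: poke it into a DB, copy a subtree, etc.
                }
            }
        }

        Console.WriteLine(count); // 2
    }
}
```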

So, first job: get a stream from your zip file, which is quite easy to do with SharpZipLib (+1 to Rubens). Add a reference to the SharpZipLib DLL in your project. Here's some code that creates a stream from a zip and then copies it into a memory stream (you might not want to do that bit, but it shows how I use it to get back a byte[] of data; you just want the stream):

using System;
using System.IO;
using ICSharpCode.SharpZipLib.Zip;
using System.Diagnostics;
using System.Xml;

namespace Offroadcode.Compression
{
    /// <summary>
    /// Number of handy zip functions for compressing/decompressing zip data.
    /// </summary>
    public class Zip
    {

        /// <summary>
        /// Decompresses a byte array of data previously compressed by the Compress method, or by any zip program for that matter.
        /// </summary>
        /// <param name="bytes">Compressed data as a byte array</param>
        /// <returns>Byte array of uncompressed data</returns>
        public static byte[] Decompress( byte[] bytes ) 
        {
            Debug.Write( "Decompressing byte array of size: " + bytes.Length  );

            using( ICSharpCode.SharpZipLib.Zip.Compression.Streams.InflaterInputStream stream = new ICSharpCode.SharpZipLib.Zip.Compression.Streams.InflaterInputStream( new MemoryStream( bytes ) ) ) 
            {
                // Left this bit in to show how to read from "stream" and save the data to another stream, "mem"
                using ( MemoryStream mem = new MemoryStream() ) 
                {
                    byte[] buffer = new byte[4096];
                    int size;
                    while ( ( size = stream.Read( buffer, 0, buffer.Length ) ) > 0 )
                    {
                        mem.Write( buffer, 0, size );
                    }

                    bytes = mem.ToArray();
                }
            }

            Debug.Write( "Complete, decompressed size: " + bytes.Length );

            return bytes;
        }
    }
}

Then, if you follow this article from Microsoft (http://support.microsoft.com/kb/301228), you should be able to merge the two pieces of code and start reading your XML from a zip stream :)
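The merged flow (a decompression stream feeding an XmlReader directly, so the document never sits in memory whole) can be sketched as below. This sketch uses the in-box GZipStream as a stand-in so it is self-contained; SharpZipLib's InflaterInputStream is consumed the same way, though zip and gzip framing differ, so treat this as an illustration rather than a drop-in:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml;

public class StreamedXmlDemo
{
    public static void Main()
    {
        // Build a small compressed XML payload in memory so the sketch is
        // self-contained; in the real scenario this would be the zip entry.
        byte[] compressed;
        using (var ms = new MemoryStream())
        {
            using (var gzip = new GZipStream(ms, CompressionMode.Compress))
            {
                byte[] xml = Encoding.UTF8.GetBytes("<items><item>a</item><item>b</item></items>");
                gzip.Write(xml, 0, xml.Length);
            }
            compressed = ms.ToArray();
        }

        // Chain the streams: XmlReader pulls bytes through the decompressor
        // on demand, one buffer at a time.
        int items = 0;
        using (var input = new MemoryStream(compressed))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (XmlReader reader = XmlReader.Create(gzip))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "item")
                {
                    items++;
                }
            }
        }

        Console.WriteLine(items); // 2
    }
}
```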

Pete Duncanson
Yes, this code lets us decompress a file in memory in separate parts, but it still does not help us choose the size of those parts. In the best case, every part would be valid XML on its own. That is the sticking point...
Alex
Hmm, "define the size" - you can do that by setting the buffer size? I'm rather confused as to what the problem is now. As I understood it, you have one single huge XML file which can't possibly fit into memory. This method allows you to process the whole file a chunk at a time, while your code can still treat it as one huge XML file, rattling its way through it all and doing whatever needs doing as it comes across each and every node. Is that not what you want to do? If not, please provide more details of what you want to do with the XML, or of the make-up of the XML.
Pete Duncanson
Also, did you read the article from MS?
Pete Duncanson
+1  A: 

Using DotNetZip you can do this:

using (var zip = ZipFile.Read("c:\\data\\zipfile.zip"))
{
    using (Stream s = zip["NameOfXmlFile.xml"].OpenReader())
    {
        // Create the XmlReader object.
        using (XmlReader reader = XmlReader.Create(s))
        {
            while (reader.Read()) 
            {
                ....
            }
        }
    }
}
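As a sketch of what could go inside that loop (the `title` element name and the document here are made up, and the StringReader stands in for the stream returned by `OpenReader()`): XmlReader can also skip straight to each named element with ReadToFollowing, which keeps the forward-only, low-memory behaviour:

```csharp
using System;
using System.IO;
using System.Xml;

public class ReaderBodyDemo
{
    public static void Main()
    {
        // Stand-in for the stream returned by zip["NameOfXmlFile.xml"].OpenReader().
        string xml = "<catalog><book><title>First</title></book>"
                   + "<book><title>Second</title></book></catalog>";

        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            // ReadToFollowing advances to each named element without
            // buffering the rest of the document.
            while (reader.ReadToFollowing("title"))
            {
                Console.WriteLine(reader.ReadElementContentAsString());
            }
        }
    }
}
```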
Cheeso
acceptable answer .. This is what I was talking about .. +1
infant programmer