I need to generate an XML file and stick as much data into it as possible, BUT there is a file size limit. So I need to keep inserting data until something says no more. How do I figure out the XML file size without repeatedly writing it to a file?

A: 

You can ask the XmlTextWriter for its BaseStream and check its Position. As the others pointed out, you may need to reserve some headroom to properly close the XML.
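
A minimal sketch of that idea, under some assumptions: the helper name `WriteUntilFull`, the `items`/`item` element names, and the `reserveBytes` parameter are made up for illustration. The writer buffers its output, so the sketch calls `Flush()` before reading `BaseStream.Position`. `reserveBytes` is the headroom: it has to cover the largest single entry plus the closing tag.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

public static class SizeLimitedXml
{
    // Hypothetical helper: writes <item> elements until fewer than
    // reserveBytes remain below maxBytes, then closes the document.
    // Returns the number of items actually written.
    public static int WriteUntilFull(Stream output, IEnumerable<string> items,
                                     long maxBytes, int reserveBytes)
    {
        var written = 0;

        // UTF8Encoding(false) avoids a byte order mark. The writer is not
        // disposed here because closing it would also close the stream.
        var writer = new XmlTextWriter(output, new UTF8Encoding(false));
        writer.WriteStartDocument();
        writer.WriteStartElement("items");

        foreach (var item in items)
        {
            // Flush first, otherwise Position lags behind what was written.
            writer.Flush();
            if (writer.BaseStream.Position + reserveBytes > maxBytes)
                break;

            writer.WriteElementString("item", item);
            written++;
        }

        writer.WriteEndElement();
        writer.WriteEndDocument();
        writer.Flush();
        return written;
    }
}
```

Passing a MemoryStream as `output` keeps everything in memory until the document is complete; the result can then be written to disk once. The check is conservative, so the file may end up somewhat smaller than the limit.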

jdv
In general, it will not be possible to properly close the XML. Just adding end tags for any open tags will not produce valid XML. There may be missing required elements.
John Saunders
I actually just tried this out and the base stream doesn't seem to be written to until you call `writer.Close();` - the stream position/length are always 0 in the VS2k8 debugger.
Jake
@John: true, unless the XML is very simple. The requirements smell like some kind of log file format to me, in which case it would work.
jdv
@Jake: Well, try `Flush()` on the xmlwriter.
jdv
+1  A: 

In general, you cannot break XML documents at arbitrary locations, even if you close all open tags.

However, if what you need is to split an XML document over multiple files, each of no more than a certain size, then you should create your own subtype of the Stream class. This "PartitionedFileStream" class could write to a particular file, up to the size limit, then create a new file, and write to that file, up to the size limit, etc.

This would leave you with multiple files which, when concatenated, make up a valid XML document.
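
A rough sketch of what such a stream could look like, assuming a write-only stream is enough. Apart from the class name, everything here (the `pathFormat` parameter, the `FileCount` property, the field names) is hypothetical and only meant to show the rollover logic.

```csharp
using System;
using System.IO;

// Hypothetical write-only stream that rolls over to a new file
// whenever the current file reaches maxBytesPerFile.
public class PartitionedFileStream : Stream
{
    private readonly string _pathFormat;     // e.g. "out.part{0:000}.xml"
    private readonly long _maxBytesPerFile;
    private FileStream _current;
    private long _currentLength;
    private long _totalWritten;
    private int _fileIndex;

    public PartitionedFileStream(string pathFormat, long maxBytesPerFile)
    {
        if (maxBytesPerFile <= 0)
            throw new ArgumentOutOfRangeException("maxBytesPerFile");
        _pathFormat = pathFormat;
        _maxBytesPerFile = maxBytesPerFile;
    }

    // Number of files opened so far.
    public int FileCount { get { return _fileIndex; } }

    public override void Write(byte[] buffer, int offset, int count)
    {
        while (count > 0)
        {
            if (_current == null || _currentLength >= _maxBytesPerFile)
                OpenNextFile();

            // Write only as much as still fits into the current file.
            var chunk = (int)Math.Min(_maxBytesPerFile - _currentLength, count);
            _current.Write(buffer, offset, chunk);
            offset += chunk;
            count -= chunk;
            _currentLength += chunk;
            _totalWritten += chunk;
        }
    }

    private void OpenNextFile()
    {
        if (_current != null) _current.Dispose();
        _current = File.Create(string.Format(_pathFormat, _fileIndex++));
        _currentLength = 0;
    }

    public override void Flush()
    {
        if (_current != null) _current.Flush();
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing && _current != null)
        {
            _current.Dispose();
            _current = null;
        }
        base.Dispose(disposing);
    }

    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return true; } }
    public override long Length { get { return _totalWritten; } }
    public override long Position
    {
        get { return _totalWritten; }
        set { throw new NotSupportedException(); }
    }
    public override int Read(byte[] buffer, int offset, int count)
    {
        throw new NotSupportedException();
    }
    public override long Seek(long offset, SeekOrigin origin)
    {
        throw new NotSupportedException();
    }
    public override void SetLength(long value)
    {
        throw new NotSupportedException();
    }
}
```

An XmlWriter created with `XmlWriter.Create(stream)` can write to this stream like any other; the split happens at byte boundaries, so the individual files are not valid XML on their own.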


In the general case, closing tags will not work. Consider an XML format that must contain one element A followed by one element B. If you closed the tags after writing element A, then you do not have a valid document - you need to have written element B.

However, in the specific case of a simple site map file, it may be possible to just close the tags.

John Saunders
I can only have one file. I am creating a sitemap. I'm considering only having the most recent X url elements to keep the size down. Not the best solution, but it's probably much easier than size counting.
acidzombie24
@acidzombie24: so why is there a file size limit? If your site is large, then your sitemap will be large.
John Saunders
Further to that, arbitrarily truncating a site map would only serve to make the site more difficult for search engines to index and probably result in lower rankings over time. Seems like a silly idea to me.
Aaronaught
John Saunders: A sitemap has a limit of 50K and 10MB. @Aaronaught: You don't need to provide every link that ever existed, AFAIK. Just the current ones and the time.
acidzombie24
@acidzombie24: where does this limit come from? Google? If this is the limit, then don't make your site so large - break it into smaller sites, don't index the lower levels, whatever. But it makes no sense to break the sitemap at some arbitrary point.
John Saunders
Oh, I get it, it's a Google thing, SEO junk, the 50K is 50,000 distinct URLs. But I think if your site map is bigger than that, it's probably not a very well-designed site... either that or you're trying to include dynamic content in the sitemap, which is just insane.
Aaronaught
@Aaronaught: One of the reasons for sitemaps IS for dynamic content. And IIRC, SO has a huge sitemap of its last 50k questions.
acidzombie24
@acidzombie24: I bet they simply make no attempt to write more than 50k entries into that sitemap.
John Saunders
Based on my math, the text inside url and changefreq together must be < 208 bytes. My URLs are long. I hope sitemaps are still valid if URLs are redirected with 301 (I hear redirects, not 301 specifically, are invalid/rejected).
acidzombie24
@acidzombie24: I'm not sure what you're responding to. I would still say, "so, don't write so many URLs".
John Saunders
+1  A: 

I agree with John Saunders. Here's some code that does basically what he's talking about, but as a wrapper around XmlSerializer rather than a Stream subclass, using a MemoryStream as intermediate storage. It may be more effective to extend Stream, though.

using System;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

public class PartitionedXmlSerializer<TObj>
{
    private readonly int _fileSizeLimit;

    public PartitionedXmlSerializer(int fileSizeLimit)
    {
        if (fileSizeLimit <= 0)
            throw new ArgumentOutOfRangeException("fileSizeLimit");
        _fileSizeLimit = fileSizeLimit;
    }

    public void Serialize(string filenameBase, TObj obj)
    {
        using (var memoryStream = new MemoryStream())
        {
            // serialize the object into the memory stream
            using (var xmlWriter = XmlWriter.Create(memoryStream))
                new XmlSerializer(typeof(TObj))
                    .Serialize(xmlWriter, obj);

            memoryStream.Seek(0, SeekOrigin.Begin);

            var extensionFormat = GetExtensionFormat(memoryStream.Length);

            // the limit is in bytes, so copy raw bytes rather than chars
            var buffer = new byte[_fileSizeLimit];

            var i = 0;
            int readLength;
            // split the stream into files of at most _fileSizeLimit bytes
            while ((readLength = memoryStream.Read(buffer, 0, _fileSizeLimit)) > 0)
            {
                var filename
                    = Path.ChangeExtension(filenameBase,
                        string.Format(extensionFormat, i++));
                using (var fileStream = File.Create(filename))
                    fileStream.Write(buffer, 0, readLength);
            }
        }
    }

    /// <summary>
    /// Gets a file extension format string, zero-padded so that every
    /// part index has the same width, based on the length of the
    /// serialized data and the max file length.
    /// </summary>
    /// <param name="fileLength">length of the serialized data in bytes</param>
    private string GetExtensionFormat(long fileLength)
    {
        // round up so that a trailing partial file is counted
        var numFiles = Math.Max(1L, (fileLength + _fileSizeLimit - 1) / _fileSizeLimit);
        // part indices run from 0 to numFiles - 1
        var zeros = new string('0', (numFiles - 1).ToString().Length);
        return string.Format("xml.part{{0:{0}}}", zeros);
    }
}

To use it, you'd initialize it with the max file length and then serialize using the base file path and then the object.

public class MyType
{
    public int MyInt;
    public string MyString;
}

public void Test()
{
    var myObj = new MyType { MyInt = 42, 
                             MyString = "hello there this is my string" };
    new PartitionedXmlSerializer<MyType>(2)
        .Serialize("myFilename", myObj);
}

This particular example will generate an XML file partitioned into

myFilename.xml.part001
myFilename.xml.part002
myFilename.xml.part003
...
myFilename.xml.part110
James Kolpack
I think everyone misunderstood what I meant, but your solution is definitely worth the read.
acidzombie24