ansaurus

Question

Answer 1

A:

Take a look to this project which splits XML files into smaller ones to conquer your issue:

Split large XML files into small files: http://www.codeproject.com/KB/XML/SplitLargeXMLintoSmallFil.aspx

atconway 2010-08-06 17:16:23

I've already looked at that and ruled it out. The source code is not supplied in the download. The code displayed is not complete (method calls off to code not supplied). I'm also not sure of the speed advantage of that method vs how I'm already splitting the xml.

Josh 2010-08-06 17:22:31

Answer 2

+2 A:

You're going to want a SAXReader for handling large XML files.

peterJ 2010-08-06 17:32:16

Answer 3

+1 A:

Have you looked into using a SAX parser? There isn't one distributed by Microsoft, but there are a handful of examples on the web. With a SAX parser, you essentially read the file as a stream and events fire that you can listen for vs loading the whole thing into an in-memory DOM which you obviously can't do. I don't know too much about using SAX parsers, so I don't have a specific recommendation, but a lot of Java folks have done XML this way for years.

mattmc3 2010-08-06 17:37:42

XmlReader as used in the question code is similar to a SAX parser only it's pull rather than push. It wouldn't make any difference to this particular problem.

Simon Steele 2010-08-06 17:44:28

Answer 4

+5 A:

It looks like you are re-reading into the XML file over and over again each step, each time you use the from p in SimpleStreamAxis bit you are re-reading and scanning into the file. Also by calling Count() you are walking the full page each time.

Try something like this:

var full = (from p in SimpleStreamAxis(fileName, "product") select p);
int current = 0;

while (full.Any() > 0)
{
    var page = full.Take(pageSize);

    XElement xml = new XElement("catalog",
    from p in page
    select p);

    SubmitXml(connection, fileName, xml.ToString());

    //if the maximum count is set, only load the maximum (in one page)
    if (maximumCount != 0)
        break;

    current++;
    full = full.Skip(pageSize);
}

Note this is untested, but you should hopefully get the idea. You need to avoid enumerating through the file more than once, operations like Count() and Take/Skip are going to take a long time on an 8gb xml file.

Update: I think the above will still iterate through the file more times than we want, you need something a bit more predictable like this:

var full = (from p in SimpleStreamAxis(fileName, "product") select p);
int current = 0;

XElement xml = new XElement("catalog");
int pageIndex = 0;

foreach (var element in full)
{
    xml.Add(element);

    pageIndex++;
    if (pageIndex == pageSize)
    {
        SubmitXml(connection, fileName, xml.ToString());
        xml = new XElement("catalog");
        pageIndex = 0;
    }

    //if the maximum count is set, only load the maximum (in one page)
    if (maximumCount != 0)
        break;

    current++;
}

    // Submit the remainder
if (xml.Elements().Any())
{
    SubmitXml(connection, fileName, xml.ToString());
}

Simon Steele 2010-08-06 17:52:24

Answer 5

+1 A:

If you're using MS SQL Server, use XML Bulk Load for exactly this.
Knowledgebase Article

bowenl2 2010-08-06 18:50:11

ansaurus

tags:

views:

answers:

Problems with HUGE XML files

related questions