views: 412
answers: 10

I am using .NET to parse an XML file with about 20 million lines (1.56 GB), building LINQ objects out of the data, then inserting it into a SQL database. It is taking a really long time.

To improve performance I am considering asking for a pipe delimited file. I was also wondering if Perl might be any faster. Does anyone have suggestions for speeding up this process?

+1  A: 

The pipe-delimited file might help you improve performance. If your task is CPU-bound, you can split the file into n chunks (where n is the number of processors you have), and run the jobs in parallel.

I doubt Perl will run faster than .NET, but you'll probably get your code written and working sooner.

jrockway
Thanks. The program is a console app. How would I break it up into separate processes?
Bryan
jrockway
(Obviously you need to ensure that this is possible for your program and for your data -- the data in that.file.2 shouldn't depend on any data in that.file.1, etc., etc.)
jrockway
So I would first have to split the file up and save it into different sections, then call another program for each file. Development time is limited, but I like the idea.
Bryan
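
A minimal sketch of the split-and-run-in-parallel idea from a .NET console app, assuming the chunk files already exist and that a hypothetical worker build of the importer (myimporter.exe) processes one chunk file passed on its command line:

    using System.Collections.Generic;
    using System.Diagnostics;

    class ParallelLauncher
    {
        static void Main()
        {
            // Chunk files produced by a separate splitting step (hypothetical names).
            string[] chunks = { "data.pipe.1", "data.pipe.2", "data.pipe.3", "data.pipe.4" };
            var workers = new List<Process>();

            foreach (string chunk in chunks)
            {
                // One worker process per chunk, so the chunks are imported in parallel.
                workers.Add(Process.Start("myimporter.exe", chunk));
            }

            // Wait for every worker to finish before exiting.
            foreach (Process worker in workers)
            {
                worker.WaitForExit();
            }
        }
    }
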
+4  A: 

I don't think you'll be able to improve LINQ performance on that file to your satisfaction. My recommendation is to use XmlTextReader and read the file node by node and do your own processing of the data.

mjmarsh
I am processing the file line by line to populate the LINQ object. Is there something more I can do?
Bryan
Well, then I think the only option is to write your own SQL insert code, using the raw SQL database provider in .NET rather than having LINQ do the DB stuff. You could play with inserting N records at a time in a transaction to see if that has any benefit as well.
mjmarsh
@mjmarsh: why XmlTextReader and not XmlReader?
John Saunders
I am currently using XmlReader. I was considering inserting 1000 records at a time or would 100 or 10 be better?
Bryan
@Bryan: the XmlTextReader question was for mjmarsh. There are few reasons to use XmlTextReader post .NET 2.0. How many rows per batch is best very much depends. SSIS will actually help you tune that process, so you can optimize over time.
John Saunders
@John No reason, XmlReader would be just fine. My mind still lives in a 2.0 world since I still do a lot of production stuff with it, but my point was that a forward-only, non-caching reader would be better for the size of data we're talking about.
mjmarsh
@mjmarsh: XmlReader.Create _is_ .NET 2.0. "new XmlTextReader" is .NET 1.1
John Saunders
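
A minimal sketch of the raw ADO.NET route with N inserts per transaction, as suggested above; the Listings table, its columns, and the batch size are assumptions to adapt:

    using System.Collections.Generic;
    using System.Data.SqlClient;

    static class BatchInserter
    {
        // Inserts rows in transactions of batchSize statements; table and column
        // names (Listings, City, SqFt) are hypothetical placeholders.
        public static void Insert(string connectionString, IEnumerable<string[]> rows)
        {
            const int batchSize = 1000;   // try 100 / 1000 / 10000 and measure
            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();
                SqlTransaction tx = conn.BeginTransaction();
                int count = 0;

                foreach (string[] row in rows)
                {
                    using (SqlCommand cmd = conn.CreateCommand())
                    {
                        cmd.Transaction = tx;
                        cmd.CommandText =
                            "INSERT INTO Listings (City, SqFt) VALUES (@city, @sqft)";
                        cmd.Parameters.AddWithValue("@city", row[0]);
                        cmd.Parameters.AddWithValue("@sqft", row[1]);
                        cmd.ExecuteNonQuery();
                    }

                    if (++count % batchSize == 0)
                    {
                        tx.Commit();                    // finish this batch
                        tx = conn.BeginTransaction();   // start the next one
                    }
                }

                tx.Commit();                            // commit the final partial batch
            }
        }
    }
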
+4  A: 

If you can get the data in a pipe-separated format, you can use SQL tools to import it straight into the DB. I guess you could use BCP to do that.

Why go the route of file -> LINQ objects -> DB?
Also, try to get the file split up instead of one big file with 20 million records.

shahkalpesh
There is manipulation of the data that must take place and I only want to insert records that belong to certain US cities.
Bryan
Could you give an example of that?
shahkalpesh
based on the size of the file, it sure seems like a good candidate for bulk-load interfaces.
Mike Ellery
Shahkalpesh, I just edited the question to provide an example.
Bryan
shahkalpesh
The columns in the temp table could be listingtype, lastnode, and description, and the rows inside it could be (RCON, "birdseyeurl", "url"), (RCON, "county", "los angeles"), or (RCON, "sqft", "2000 sq ft").
shahkalpesh
Yes, but this is just a fraction of the manipulation. Performing a pipe-delimited direct insert into a second database, then manipulating from the second database into the first, might be fast, right?
Bryan
You could use XSLT to coerce the data into the format required for BCP. This is all assuming that it's the inserts that are slowing things down, not the conversion logic.
Jacob
At the very least, you should have data ready in the format so that it can be inserted into the final table. John's suggestion of SSIS/DTS could also help.
shahkalpesh
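
If the data can be staged as a pipe-delimited text file, a bcp invocation along these lines would load it straight into the table (database, table, server, and file names here are placeholders); BULK INSERT or SSIS would work as well:

    bcp MyDatabase.dbo.Listings in listings.psv -c -t"|" -S myserver -T
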
+1  A: 

First of all, I'm concerned to hear you talking about number of lines when processing XML. In most cases, an XML document, without the line breaks, is still the same document. Lines don't matter.

Second, you didn't say which RDBMS you're using, but I'll assume SQL Server 2005 or 2008. In this case, if this is a process that you will need to repeat often, then I suggest you do this in SQL Server Integration Services (SSIS). SSIS is optimized for this sort of thing. In particular, it can actually be reading from the XML file, performing your per-row modifications, and writing batches of prepared rows to the database, all at the same time.

The only potential issue would be with the size of the file (BTW, you said it's 20 million lines, but how many MB is that?). SSIS tries to load the entire document into memory at once. This can be a problem. To combat this, I found it relatively easy to create my own custom Source component. This looked under the root element of the document, and returned one child element (and its descendants) at a time. This was simple to do by using the XmlReader.ReadSubtree method, which returns a new XmlReader that only processes the current element and all of its children.

John Saunders
When I open the 3rd party XML file in LTFViewer it says the number of lines in the status bar.
Bryan
Lines don't matter in XML, in most cases. You could get rid of all the line terminators, and the document would still have the identical meaning.
John Saunders
Good point. It is large enough that Visual Studio won't open it. The zipped XML file that my console app extracts and saves is 67 MB.
Bryan
Just downloaded and extracted ... 1.56 GB. WOW! I did not know that zipping could compress a file that much.
Bryan
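
A stripped-down sketch of the ReadSubtree approach described above, assuming a hypothetical layout where each record is a <Listing> child of the root:

    using System.Xml;
    using System.Xml.Linq;

    static class ListingStreamer
    {
        // Streams one child element at a time; the <Listing> element name and the
        // file layout are hypothetical.
        public static void Stream(string path)
        {
            using (XmlReader reader = XmlReader.Create(path))
            {
                while (reader.ReadToFollowing("Listing"))
                {
                    // ReadSubtree returns a reader scoped to the current element, so
                    // only one record is ever materialized in memory at a time.
                    using (XmlReader sub = reader.ReadSubtree())
                    {
                        XElement listing = XElement.Load(sub);
                        // ... filter on city, shape the row, hand it to the batch writer ...
                    }
                }
            }
        }
    }
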
+3  A: 

Before you go off fixing things, figure out which part is slow. There are three major parts here:

  • Parsing the XML
  • Making the LINQ objects
  • Inserting it into a database

Which one is taking up all of the time? You might suspect that it is parsing the XML, but verify that before you waste a lot of time. It seems that every time I think I know the answer, I turn out to be wrong. :) I can't suggest a .NET profiler, but Stack Overflow seems to have answers for that.

I'm not a .NET guy, but I would really be surprised if anything Perl could offer would be significantly faster. It's not a language problem since even the fast Perl stuff is an interface to common libraries.

brian d foy
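
A low-tech way to find out which stage dominates, before reaching for a profiler, is to time each phase separately with Stopwatch; the Parse/Build/Insert methods below are empty stand-ins for the real stages:

    using System;
    using System.Diagnostics;

    static class StageTimer
    {
        public static void Main()
        {
            var sw = Stopwatch.StartNew();
            Parse();                                   // stand-in for the XML parsing stage
            Console.WriteLine("Parse:  {0}", sw.Elapsed);

            sw = Stopwatch.StartNew();
            Build();                                   // stand-in for building the LINQ objects
            Console.WriteLine("Build:  {0}", sw.Elapsed);

            sw = Stopwatch.StartNew();
            Insert();                                  // stand-in for the database inserts
            Console.WriteLine("Insert: {0}", sw.Elapsed);
        }

        // Placeholders: replace with the real stages of the importer.
        static void Parse() { }
        static void Build() { }
        static void Insert() { }
    }
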
+2  A: 

Here's a radical thought and I honestly don't know if it'll improve your performance but it has a good chance to do so. I'll bet that you're instantiating your context object once and then using it to insert all the records, right? Doing so means that the context will keep track of all those objects until it is disposed and that might explain the degrading performance over time.

Now, you could clear the context cache but I have a nuttier idea. Those context objects are designed to have minimal overhead on instantiation (it says so in the documentation anyway, I haven't tested that assertion), so it might help if you instantiate the context on each iteration. i.e. re-create the context at the same time you create the object. Or, and this is a better idea, maintain an internal list of your data objects and kick the list to a separate method every n iterations to commit to the database. That method should both instantiate and dispose the context. Make sense?

Jacob Proffitt
Good idea, though I think you might rather just turn off object tracking globally to address this.
Wyatt Barnett
That's a good thought Wyatt. I didn't realize that was an option. I'd definitely do that first, though I'd look into batching updates a bit anyway. Once you had your updates batched, you could look at other ways of speeding things up (like kicking the updates off on a separate thread).
Jacob Proffitt
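
A sketch of the batch-and-recreate-the-context idea; MyDataContext and Listing are placeholders for the generated LINQ to SQL types, and the batch size is just a starting point to tune:

    using System.Collections.Generic;

    static class LinqBatchInserter
    {
        // MyDataContext and Listing stand in for the generated LINQ to SQL types.
        public static void Insert(IEnumerable<Listing> listings, string connectionString)
        {
            const int batchSize = 1000;
            var batch = new List<Listing>(batchSize);

            foreach (Listing item in listings)
            {
                batch.Add(item);
                if (batch.Count == batchSize)
                {
                    Flush(batch, connectionString);
                    batch.Clear();
                }
            }
            if (batch.Count > 0)
            {
                Flush(batch, connectionString);
            }
        }

        static void Flush(List<Listing> batch, string connectionString)
        {
            // A fresh, short-lived context per batch keeps its change tracker small.
            using (var db = new MyDataContext(connectionString))
            {
                db.Listings.InsertAllOnSubmit(batch);
                db.SubmitChanges();
            }
        }
    }

Note that DataContext.ObjectTrackingEnabled = false, mentioned in the comments, makes the context read-only, so it applies to query paths rather than to inserts like these.
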
+1  A: 

Brian has the right idea--no need to start breaking stuff before establishing what is broken.

One other crazy trick--see if you can run it on the DB server, which takes any network delays out of the picture. We had a few rather large batch processes that saw massive perf improvements from location changes alone.

Wyatt Barnett
+1  A: 

This question on importing the Stack Overflow data dump might be of some use to you. You can probably write the data out to a CSV file (or files) using XmlTextReader, and then use your database's bulk importing mechanism to load the data very quickly.

Kibbee
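
A minimal sketch of that write-out step, assuming the records have already been filtered and flattened to string fields (the helper name and layout are hypothetical); the resulting pipe-delimited file can then be fed to bcp, BULK INSERT, or SSIS:

    using System.Collections.Generic;
    using System.IO;

    static class DelimitedWriter
    {
        // Writes already-filtered records as pipe-delimited text; field layout is
        // hypothetical and should match the destination table.
        public static void Write(string outputPath, IEnumerable<string[]> rows)
        {
            using (StreamWriter writer = new StreamWriter(outputPath))
            {
                foreach (string[] fields in rows)
                {
                    writer.WriteLine(string.Join("|", fields));
                }
            }
        }
    }
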
+1  A: 

First off, decide what is an acceptable amount of time for completion of the import process.

Then, as Brian and Wyatt suggested, do some analysis of where the bulk of the time is being spent. I'm guessing in the LINQ to SQL stuff (although guessing where performance issues are is itself a very risky business).

You can follow one of two routes: use the ETL tools that are part of MS SQL Server to do the import, or continue with some custom .NET code.

If you choose the latter, I would suggest:

  1. Use ADO.NET only and remove LINQ to SQL stuff;
  2. Run the import program on the database server--eliminating any network traffic issues;
  3. It is unlikely that the XML parser will be any kind of bottleneck, but if it is, consider parsing the XML directly using simple pattern checks as opposed to a full-on compliant parser (a rough sketch follows this answer)--of course, this will require the XML to be simple in format, and laid out in line-by-line fashion.

Perl will be slower than .NET since .NET is compiled and Perl isn't.

Eric Smith
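
As a rough illustration of point 3 in the answer above, and only valid under the assumption that each element of interest sits on its own line (element names and helpers here are hypothetical):

    using System.IO;

    static class LinePatternScanner
    {
        // Only valid if each element of interest sits on its own line, e.g.
        // <city>Los Angeles</city>. Element names are hypothetical.
        public static string ExtractElementValue(string line, string elementName)
        {
            string open = "<" + elementName + ">";
            string close = "</" + elementName + ">";
            int start = line.IndexOf(open);
            if (start < 0) return null;
            start += open.Length;
            int end = line.IndexOf(close, start);
            return end < 0 ? null : line.Substring(start, end - start);
        }

        public static void Scan(string path)
        {
            using (StreamReader reader = new StreamReader(path))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    string city = ExtractElementValue(line, "city");
                    if (city != null)
                    {
                        // ... keep only the cities of interest, build the record ...
                    }
                }
            }
        }
    }
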
A: 

If you find out that data access is the bottleneck, you might want to try inserting batches of XML objects using SqlBulkCopy. I have XML objects with 18 nodes each, and this works for me.

        // required namespaces: System.Data, System.Data.SqlClient,
        // System.IO, System.Configuration

        // read the XML schema and data into a DataSet
        DataSet ds = new DataSet();

        string AppDataPath = ConfigurationManager.AppSettings["AppDataPath"];
        string dbSchema = AppDataPath + "/" + "CLBulkInsertSchema.xml";

        //Create a FileStream to the XML Schema file in Read mode
        FileStream finschema = new FileStream(dbSchema, FileMode.Open,
                               FileAccess.Read, FileShare.Read);

        //Read the Schema into the DataSet
        ds.ReadXml(finschema);

        //Close the FileStream
        finschema.Close();

        //Create a FileStream to the Xml Database file in Read mode
        FileStream findata = new FileStream(tempFilePathName, FileMode.Open,
                             FileAccess.Read, FileShare.ReadWrite);

        //Read the DataBase into the DataSet
        ds.ReadXml(findata);

        //Close the FileStream
        findata.Close();

        DataTable callList = ds.Tables["PhoneBook"];

        string conn = ConfigurationManager.ConnectionStrings["dbConnectionString"].ToString();

        using (SqlConnection connection =
               new SqlConnection(conn))
        {
            connection.Open();

            using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection))
            {
                bulkCopy.DestinationTableName =
                    "dbo.CallBatchItems";

                // mappings required because we're skipping the BatchItemId column
                // and letting SQL Server handle auto incrementing of primary key.
                // mappings not required if order of columns is exactly the same
                // as destination table definition. 

                bulkCopy.ColumnMappings.Add("BatchId", "BatchId");
                bulkCopy.ColumnMappings.Add("FullName", "FullName");
                ...
                // Write from the source to the destination.
                bulkCopy.WriteToServer(callList);
            }
        }