views: 412
answers: 10

I am using .NET to parse an XML file with about 20 million lines (1.56 GB), building LINQ objects out of the data, then inserting it into a SQL database. It is taking a really long time.

To improve performance I am considering asking for a pipe delimited file. I was also wondering if Perl might be any faster. Does anyone have suggestions for speeding up this process?

+1  A: 

The pipe-delimited file might help you improve performance. If your task is CPU-bound, you can split the file into n chunks (where n is the number of processors you have), and run the jobs in parallel.

I doubt Perl will run faster than .NET, but you'll probably get your code written and working sooner.

jrockway
Thanks. The program is a console app. How would I break it up into separate processes?
Bryan
jrockway
(Obviously you need to ensure that this is possible for your program and for your data -- the data in that.file.2 shouldn't depend on any data in that.file.1, etc., etc.)
jrockway
So I would first have to split the file up and save it into different sections, then call another program for each file. Development time is limited, but I like the idea.
Bryan
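
A minimal sketch of the split-and-run-in-parallel idea from a .NET console app, assuming the chunk files already exist and that a hypothetical worker build of the importer (myimporter.exe) processes one chunk file passed on its command line:

    using System.Collections.Generic;
    using System.Diagnostics;

    class ParallelLauncher
    {
        static void Main()
        {
            // Chunk files produced by a separate splitting step (hypothetical names).
            string[] chunks = { "data.pipe.1", "data.pipe.2", "data.pipe.3", "data.pipe.4" };
            var workers = new List<Process>();

            foreach (string chunk in chunks)
            {
                // One worker process per chunk, so the chunks are imported in parallel.
                workers.Add(Process.Start("myimporter.exe", chunk));
            }

            // Wait for every worker to finish before exiting.
            foreach (Process worker in workers)
            {
                worker.WaitForExit();
            }
        }
    }
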
+4  A: 

I don't think you'll be able to improve LINQ performance on that file to your satisfaction. My recommendation is to use XmlTextReader and read the file node by node and do your own processing of the data.

mjmarsh
I am processing the file line by line to populate the LINQ object. Is there something more I can do?
Bryan
Well, then I think the only option is to write your own SQL insert code, using the raw SQL database provider in .NET rather than having LINQ do the DB stuff. You could play with inserting N records at a time in a transaction to see if that has any benefit as well.
mjmarsh
@mjmarsh: why XmlTextReader and not XmlReader?
John Saunders
I am currently using XmlReader. I was considering inserting 1000 records at a time or would 100 or 10 be better?
Bryan
@Bryan: the XmlTextReader question was for mjmarsh. There are few reasons to use XmlTextReader post .NET 2.0. How many rows per batch is best very much depends. SSIS will actually help you tune that process, so you can optimize over time.
John Saunders
@John No reason, XmlReader would be just fine. My mind still lives in a 2.0 world since I still do a lot of production stuff with it, but my point was that a forward-only, non-caching reader would be better for the size of data we're talking about.
mjmarsh
@mjmarsh: XmlReader.Create _is_ .NET 2.0. "new XmlTextReader" is .NET 1.1
John Saunders
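
A minimal sketch of the raw ADO.NET route with N inserts per transaction, as suggested above; the Listings table, its columns, and the batch size are assumptions to adapt:

    using System.Collections.Generic;
    using System.Data.SqlClient;

    static class BatchInserter
    {
        // Inserts rows in transactions of batchSize statements; table and column
        // names (Listings, City, SqFt) are hypothetical placeholders.
        public static void Insert(string connectionString, IEnumerable<string[]> rows)
        {
            const int batchSize = 1000;   // try 100 / 1000 / 10000 and measure
            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();
                SqlTransaction tx = conn.BeginTransaction();
                int count = 0;

                foreach (string[] row in rows)
                {
                    using (SqlCommand cmd = conn.CreateCommand())
                    {
                        cmd.Transaction = tx;
                        cmd.CommandText =
                            "INSERT INTO Listings (City, SqFt) VALUES (@city, @sqft)";
                        cmd.Parameters.AddWithValue("@city", row[0]);
                        cmd.Parameters.AddWithValue("@sqft", row[1]);
                        cmd.ExecuteNonQuery();
                    }

                    if (++count % batchSize == 0)
                    {
                        tx.Commit();                    // finish this batch
                        tx = conn.BeginTransaction();   // start the next one
                    }
                }

                tx.Commit();                            // commit the final partial batch
            }
        }
    }
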
+4  A: 

If you can get the data in a pipe-separated format, you can use SQL tools to import it straight into the DB. I guess you could use BCP to do that.

Why go the route of file -> LINQ objects -> DB?
Also, try to get the file split up instead of one big file with 20 million records.

shahkalpesh
There is manipulation of the data that must take place and I only want to insert records that belong to certain US cities.
Bryan
Could you give an example of that?
shahkalpesh
based on the size of the file, it sure seems like a good candidate for bulk-load interfaces.
Mike Ellery
Shahkalpesh, I just edited the question to provide an example.
Bryan
shahkalpesh
The columns in the temp table could be listingtype, lastnode, and description, and the rows inside it could be (RCON, "birdseyeurl", "url"), (RCON, "county", "los angeles"), or (RCON, "sqft", "2000 sq ft").
shahkalpesh
Yes, but this is just a fraction of the manipulation. Performing a pipe-delimited direct insert into a second database, then manipulating from the second database into the first, might be fast, right?
Bryan
You could use XSLT to coerce the data into the format required for BCP. This is all assuming that it's the inserts that are slowing things down, not the conversion logic.
Jacob
At the very least, you should have data ready in the format so that it can be inserted into the final table. John's suggestion of SSIS/DTS could also help.
shahkalpesh
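
If the data can be staged as a pipe-delimited text file, a bcp invocation along these lines would load it straight into the table (database, table, server, and file names here are placeholders); BULK INSERT or SSIS would work as well:

    bcp MyDatabase.dbo.Listings in listings.psv -c -t"|" -S myserver -T
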
+1  A: 

First of all, I'm concerned to hear you talking about number of lines when processing XML. In most cases, an XML document, without the line breaks, is still the same document. Lines don't matter.

Second, you didn't say which RDBMS you're using, but I'll assume SQL Server 2005 or 2008. In this case, if this is a process that you will need to repeat often, then I suggest you do this in SQL Server Integration Services (SSIS). SSIS is optimized for this sort of thing. In particular, it can actually be reading from the XML file, performing your per-row modifications, and writing batches of prepared rows to the database, all at the same time.

The only potential issue would be with the size of the file (BTW, you said it's 20 million lines, but how many MB is that?). SSIS tries to load the entire document into memory at once. This can be a problem. To combat this, I found it relatively easy to create my own custom Source component. This looked under the root element of the document, and returned one child element (and its descendants) at a time. This was simple to do by using the XmlReader.ReadSubtree method, which returns a new XmlReader that only processes the current element and all of its children.

John Saunders
When I open the 3rd party XML file in LTFViewer it says the number of lines in the status bar.
Bryan
Lines don't matter in XML, in most cases. You could get rid of all the line terminators, and the document would still have the identical meaning.
John Saunders
Good point. It is large enough that Visual Studio won't open it. The zipped XML file that my console app extracts and saves is 67 MB.
Bryan
Just downloaded and extracted ... 1.56 GB. WOW! I did not know that zipping could compress a file that much.
Bryan
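
A stripped-down sketch of the ReadSubtree approach described above, assuming a hypothetical layout where each record is a <Listing> child of the root:

    using System.Xml;
    using System.Xml.Linq;

    static class ListingStreamer
    {
        // Streams one child element at a time; the <Listing> element name and the
        // file layout are hypothetical.
        public static void Stream(string path)
        {
            using (XmlReader reader = XmlReader.Create(path))
            {
                while (reader.ReadToFollowing("Listing"))
                {
                    // ReadSubtree returns a reader scoped to the current element, so
                    // only one record is ever materialized in memory at a time.
                    using (XmlReader sub = reader.ReadSubtree())
                    {
                        XElement listing = XElement.Load(sub);
                        // ... filter on city, shape the row, hand it to the batch writer ...
                    }
                }
            }
        }
    }
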
+3  A: 

Before you go off fixing things, figure out which part is slow. There are three major parts here:

  • Parsing the XML
  • Making the LINQ objects
  • Inserting it into a database

Which one is taking up all of the time? You might suspect that it is parsing the XML, but verify that before you waste a lot of time. It seems that every time I think I know the answer, I turn out to be wrong. :) I can't suggest a .NET profiler, but Stack Overflow seems to have answers for that.

I'm not a .NET guy, but I would really be surprised if anything Perl could offer would be significantly faster. It's not a language problem since even the fast Perl stuff is an interface to common libraries.

brian d foy
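
A low-tech way to find out which stage dominates, before reaching for a profiler, is to time each phase separately with Stopwatch; the Parse/Build/Insert methods below are empty stand-ins for the real stages:

    using System;
    using System.Diagnostics;

    static class StageTimer
    {
        public static void Main()
        {
            var sw = Stopwatch.StartNew();
            Parse();                                   // stand-in for the XML parsing stage
            Console.WriteLine("Parse:  {0}", sw.Elapsed);

            sw = Stopwatch.StartNew();
            Build();                                   // stand-in for building the LINQ objects
            Console.WriteLine("Build:  {0}", sw.Elapsed);

            sw = Stopwatch.StartNew();
            Insert();                                  // stand-in for the database inserts
            Console.WriteLine("Insert: {0}", sw.Elapsed);
        }

        // Placeholders: replace with the real stages of the importer.
        static void Parse() { }
        static void Build() { }
        static void Insert() { }
    }
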
+2  A: 

Here's a radical thought and I honestly don't know if it'll improve your performance but it has a good chance to do so. I'll bet that you're instantiating your context object once and then using it to insert all the records, right? Doing so means that the context will keep track of all those objects until it is disposed and that might explain the degrading performance over time.

Now, you could clear the context cache but I have a nuttier idea. Those context objects are designed to have minimal overhead on instantiation (it says so in the documentation anyway, I haven't tested that assertion), so it might help if you instantiate the context on each iteration. i.e. re-create the context at the same time you create the object. Or, and this is a better idea, maintain an internal list of your data objects and kick the list to a separate method every n iterations to commit to the database. That method should both instantiate and dispose the context. Make sense?

Jacob Proffitt
Good idea, though I think you might rather just turn off object tracking globally to address this.
Wyatt Barnett
That's a good thought Wyatt. I didn't realize that was an option. I'd definitely do that first, though I'd look into batching updates a bit anyway. Once you had your updates batched, you could look at other ways of speeding things up (like kicking the updates off on a separate thread).
Jacob Proffitt
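
A sketch of the batch-and-recreate-the-context idea; MyDataContext and Listing are placeholders for the generated LINQ to SQL types, and the batch size is just a starting point to tune:

    using System.Collections.Generic;

    static class LinqBatchInserter
    {
        // MyDataContext and Listing stand in for the generated LINQ to SQL types.
        public static void Insert(IEnumerable<Listing> listings, string connectionString)
        {
            const int batchSize = 1000;
            var batch = new List<Listing>(batchSize);

            foreach (Listing item in listings)
            {
                batch.Add(item);
                if (batch.Count == batchSize)
                {
                    Flush(batch, connectionString);
                    batch.Clear();
                }
            }
            if (batch.Count > 0)
            {
                Flush(batch, connectionString);
            }
        }

        static void Flush(List<Listing> batch, string connectionString)
        {
            // A fresh, short-lived context per batch keeps its change tracker small.
            using (var db = new MyDataContext(connectionString))
            {
                db.Listings.InsertAllOnSubmit(batch);
                db.SubmitChanges();
            }
        }
    }

Note that DataContext.ObjectTrackingEnabled = false, mentioned in the comments, makes the context read-only, so it applies to query paths rather than to inserts like these.
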
+1  A: 

Brian has the right idea--no need to start breaking stuff before establishing what is broken.

One other crazy trick--see if you can run it on the DB server, which takes any network delays out of the picture. We had a few rather large batch processes that saw massive perf improvements from location changes alone.

Wyatt Barnett
+1  A: 

This question on importing the Stack Overflow data dump might be of some use to you. You can probably write the data out to a CSV file (or files) using XmlTextReader, and then use your database's bulk importing mechanism to load the data very quickly.

Kibbee
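
A minimal sketch of that write-out step, assuming the records have already been filtered and flattened to string fields (the helper name and layout are hypothetical); the resulting pipe-delimited file can then be fed to bcp, BULK INSERT, or SSIS:

    using System.Collections.Generic;
    using System.IO;

    static class DelimitedWriter
    {
        // Writes already-filtered records as pipe-delimited text; field layout is
        // hypothetical and should match the destination table.
        public static void Write(string outputPath, IEnumerable<string[]> rows)
        {
            using (StreamWriter writer = new StreamWriter(outputPath))
            {
                foreach (string[] fields in rows)
                {
                    writer.WriteLine(string.Join("|", fields));
                }
            }
        }
    }
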
+1  A: 

First off, decide what is an acceptable amount of time for completion of the import process.

Then, as Brian and Wyatt suggested, do some analysis of where the bulk of the time is being spent. I'm guessing in the LINQ to SQL stuff (although guessing where performance issues are is itself a very risky business).

You can follow one of two routes: use the ETL tools that are part of MS SQL Server to do the import, or continue with some custom .NET code.

If you choose the latter, I would suggest:

  1. Use ADO.NET only and remove LINQ to SQL stuff;
  2. Run the import program on the database server--eliminating any network traffic issues;
  3. It is unlikely that the XML parser will be any kind of bottleneck, but if it is, consider parsing the XML directly using simple pattern checks as opposed to a full-on compliant parser (a rough sketch follows this answer)--of course, this will require the XML to be simple in format, and laid out in line-by-line fashion.

Perl will be slower than .NET since .NET is compiled and Perl isn't.

Eric Smith
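
As a rough illustration of point 3 in the answer above, and only valid under the assumption that each element of interest sits on its own line (element names and helpers here are hypothetical):

    using System.IO;

    static class LinePatternScanner
    {
        // Only valid if each element of interest sits on its own line, e.g.
        // <city>Los Angeles</city>. Element names are hypothetical.
        public static string ExtractElementValue(string line, string elementName)
        {
            string open = "<" + elementName + ">";
            string close = "</" + elementName + ">";
            int start = line.IndexOf(open);
            if (start < 0) return null;
            start += open.Length;
            int end = line.IndexOf(close, start);
            return end < 0 ? null : line.Substring(start, end - start);
        }

        public static void Scan(string path)
        {
            using (StreamReader reader = new StreamReader(path))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    string city = ExtractElementValue(line, "city");
                    if (city != null)
                    {
                        // ... keep only the cities of interest, build the record ...
                    }
                }
            }
        }
    }
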
A: 

If you find out that data access is the bottleneck, you might want to try inserting batches of XML objects using SqlBulkCopy. I have XML objects with 18 nodes each, and this works for me.

        // required namespaces: System.Data, System.Data.SqlClient,
        // System.IO, System.Configuration

        // read the XML schema and data into a DataSet
        DataSet ds = new DataSet();

        string AppDataPath = ConfigurationManager.AppSettings["AppDataPath"];
        string dbSchema = AppDataPath + "/" + "CLBulkInsertSchema.xml";

        //Create a FileStream to the XML Schema file in Read mode
        FileStream finschema = new FileStream(dbSchema, FileMode.Open,
                               FileAccess.Read, FileShare.Read);

        //Read the Schema into the DataSet
        ds.ReadXml(finschema);

        //Close the FileStream
        finschema.Close();

        //Create a FileStream to the Xml Database file in Read mode
        FileStream findata = new FileStream(tempFilePathName, FileMode.Open,
                             FileAccess.Read, FileShare.ReadWrite);

        //Read the DataBase into the DataSet
        ds.ReadXml(findata);

        //Close the FileStream
        findata.Close();

        DataTable callList = ds.Tables["PhoneBook"];

        string conn = ConfigurationManager.ConnectionStrings["dbConnectionString"].ToString();

        using (SqlConnection connection =
               new SqlConnection(conn))
        {
            connection.Open();

            using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection))
            {
                bulkCopy.DestinationTableName =
                    "dbo.CallBatchItems";

                // mappings required because we're skipping the BatchItemId column
                // and letting SQL Server handle auto incrementing of primary key.
                // mappings not required if order of columns is exactly the same
                // as destination table definition. 

                bulkCopy.ColumnMappings.Add("BatchId", "BatchId");
                bulkCopy.ColumnMappings.Add("FullName", "FullName");
                ...
                // Write from the source to the destination.
                bulkCopy.WriteToServer(callList);
            }
        }