views: 864
answers: 8

Question

What is the fastest way to dump a large (> 1GB) XML file into a MySQL database?

Data

The data in question is the StackOverflow Creative Commons Data Dump.

Purpose

This will be used in an offline StackOverflow viewer I am building, since I am looking to do some studying/coding in places where I will not have access to the internet.

I would like to release this to the rest of the StackOverflow membership for their own use when the project is finished.

Problem

Originally, I was reading from the XML and writing to the DB one record at a time. That took about 10 hours to run on my machine. The hacktastic code I'm using now throws 500 records into an array, then builds a single insertion query to load all 500 at once (e.g. "INSERT INTO posts VALUES (...), (...), (...) ... ;"). While this is faster, it still takes hours to run. Clearly this is not the best way to go about it, so I'm hoping the big brains on this site will know of a better way.
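For reference, here's a stripped-down sketch of what that batching looks like (simplified to the two columns I'm currently playing with, and using parameters instead of pasting values into the string; the table and column names are just illustrative):

```
using System.Collections.Generic;
using System.Text;
using MySql.Data.MySqlClient;

static void InsertBatch(MySqlConnection conn, List<string[]> batch)
{
    // Build one multi-row INSERT, but parameterize the values instead of pasting them in.
    var sql = new StringBuilder("INSERT INTO posts (PostTypeId, Body) VALUES ");
    var cmd = new MySqlCommand();
    cmd.Connection = conn;

    for (int i = 0; i < batch.Count; i++)
    {
        if (i > 0) sql.Append(", ");
        sql.AppendFormat("(@type{0}, @body{0})", i);
        cmd.Parameters.AddWithValue("@type" + i, batch[i][0]);
        cmd.Parameters.AddWithValue("@body" + i, batch[i][1]);
    }

    cmd.CommandText = sql.ToString();
    cmd.ExecuteNonQuery();   // one round trip per 500-record batch
}
```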

Constraints

  • I am building the application using C# as a desktop application (i.e. WinForms).
  • I am using MySQL 5.1 as my database. This means that features such as "LOAD XML INFILE filename.xml" are not usable in this project, as this feature is only available in MySQL 5.4 and above. This constraint is largely due to my hope that the project would be useful to people other than myself, and I'd rather not force people to use Beta versions of MySQL.
  • I'd like the data load to be built into my application (i.e. no instructions to "Load the dump into MySQL using 'foo' before running this application.").
  • I'm using MySQL Connector/Net, so anything in the MySql.Data namespace is acceptable.

Thanks for any pointers you can provide!


Ideas so far

stored procedure that loads an entire XML file into a column, then parses it using XPath

  • This didn't work since the file size is subject to the limitations of the max_allowed_packet variable, which is set to 1 MB by default. This is far below the size of the data dump files.
A: 

Does this help at all? It's a stored procedure that loads an entire XML file into a column, then parses it using XPath and creates a table / inserts the data from there. Seems kind of crazy, but it might work.

Chris Shouts
I tried, but I've been running into issues getting it to work properly. I think the real deal breaker is max_allowed_packet, which is set to 1 MB by default. Since the procedure has to read the whole XML file, which is about 1.2 GB for posts.xml, I think this approach is unusable.
AgentConundrum
I see what you mean - it looks like the maximum value for max_allowed_packet is 1 GB, which still isn't large enough. Would it be possible to break up the XML document at logical partitions (say, one document per table) into multiple XML documents before feeding them to MySQL?
Chris Shouts
The problem persists even if we map the files 1-to-1 with the tables. The posts.xml file in the data dump is 1.19 GB for September, so I'd have to break at least that file apart. I'm also just not comfortable demanding that a potential user modify their MySQL installation to allow the 1 GB packet. I'd like to keep this as vanilla as possible so as not to cause issues. e.g. "give me your server address and credentials, and I'll do the rest... oh, and would you mind making sure everything is transmitted in utf8 since that's not nearly as default as the docs say? thanks.."
AgentConundrum
A: 

Not the answer you want, but the MySQL C API has the mysql_stmt_send_long_data function.

hjb417
+7  A: 

There are 2 parts to this:

  • reading the xml file
  • writing to the database

For reading the xml file, this link (http://csharptutorial.blogspot.com/2006/10/reading-xml-fast.html) shows that 1 MB can be read in about 2.4 seconds using a stream reader; that works out to roughly 2,400 seconds, or about 40 minutes (if my maths is working this late), for a 1 GB file.

From what I have read the fastest way to get data into MySQL is to use LOAD DATA.

http://dev.mysql.com/doc/refman/5.1/en/load-data.html

Therefore, if you can read the xml data, write it out to files that LOAD DATA can consume, and then run LOAD DATA, the total time may be well under the hours you are currently experiencing.
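Very roughly, the two steps might look something like this (a sketch only: the table name, the two columns, the `row` element layout, and the LOCAL INFILE permissions are all assumptions you would need to adjust for your setup):

```
using System.IO;
using System.Xml;
using MySql.Data.MySqlClient;

// Escape the characters that matter to LOAD DATA's default settings.
static string Escape(string s)
{
    return s.Replace("\\", "\\\\")
            .Replace("\t", "\\t")
            .Replace("\r", "\\r")
            .Replace("\n", "\\n");
}

static void BulkLoadPosts(string xmlPath, string connectionString)
{
    string tempFile = Path.GetTempFileName();

    // Pass 1: stream the XML and write a tab-delimited file.
    using (XmlReader reader = XmlReader.Create(xmlPath))
    using (StreamWriter writer = new StreamWriter(tempFile))
    {
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "row")
            {
                string postTypeId = reader.GetAttribute("PostTypeId");
                string body = reader.GetAttribute("Body");
                writer.WriteLine(Escape(postTypeId ?? "") + "\t" + Escape(body ?? ""));
            }
        }
    }

    // Pass 2: hand the whole file to MySQL in one statement.
    using (MySqlConnection conn = new MySqlConnection(connectionString))
    {
        conn.Open();
        string path = tempFile.Replace('\\', '/');   // avoid backslash escaping in the SQL
        MySqlCommand cmd = new MySqlCommand(
            "LOAD DATA LOCAL INFILE '" + path + "' INTO TABLE posts " +
            "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n' " +
            "(PostTypeId, Body)", conn);
        cmd.CommandTimeout = 0;   // the load itself can take minutes
        cmd.ExecuteNonQuery();
    }

    File.Delete(tempFile);
}
```

Note that LOAD DATA LOCAL may need to be enabled on the server and/or in the connection settings.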

Shiraz Bhaiji
It's actually a lot faster to read the XML file than you think. I'm using an `XmlReader` to read the file, and I quickly checked how long it takes to read the file and prepare the queries. By reading only PostTypeId and Body (the only two fields I'm playing with) into Strings (to ensure they are read, since I'm not sure when parameterized queries pull their data), and commenting out the `cmd.ExecuteNonQuery()` statement, the process took 1 minute, 12 seconds. The bottleneck is very much in the database portion, not the data read. I'll look into recreating LOAD DATA-readable files, thanks.
AgentConundrum
You may be able to further improve performance by first splitting the XML input into multiple XML files so that they can be processed in parallel.
LBushkin
Hi Shiraz, I just finally got around to trying your suggestion regarding creating temp files and using LOAD DATA to process them. It took 2m22s to create the file and 2m35s to load it into the database. There were a number of warnings I need to look into, but I'm sure these will be fixed with some minor tweaks. 5 minutes to load 1.2 million records is well beyond my expectations. Enjoy your bounty, it was well earned!
AgentConundrum
The errors were exactly what I thought they would be. I erroneously assumed that there would be no double-quotes in the Body field since that would close the xml attribute in the original file, and thus render the xml invalid. What I failed to consider was that the XmlReader I was using was decoding entities as it went, so there were double quotes present, which was screwing up the CSV file. I escaped all slashes, then additionally escaped the quotes, and everything was great. 3m34s to create the file, then 2m43s to do the MySQL load - total of 6m17s to load the PostTypeId and Body fields!
AgentConundrum
A: 

I noticed in one of your comments above that you are considering MSSQL, so I thought I'd post this. SQL Server has a utility called SQLXMLBulkLoad which is designed to import large amounts of XML data into a SQL Server database. Here is the documentation for the SQL Server 2008 version:

http://msdn.microsoft.com/en-us/library/ms171993.aspx

Earlier versions of SQL Server also have this utility.

RMorrisey
This is partially why I asked this question, actually. I was disappointed with the 10 hour load time, but somewhat shrugged it off as "well, it is a lot of data... besides, I'll have plenty of time to speed it up later." Then I saw a post on Meta complaining about an 8 MINUTE load time to MSSQL, and I got pissed. The guy put his code on GitHub, and this is what he was using - well, SqlBulkCopy, not the XML version you speak of. Thanks for the info, I'll definitely keep it in mind when I start playing with other databases (this is a toy project of mine, for learning/portfolio purposes). Thanks!
AgentConundrum
+1  A: 

Ok, I'm going to be an idiot here and answer your question with a question.

Why put it in a database?

What if ... just a what-if... you wrote the xml to files on the local drive and, if needed, wrote some indexing information to the database. This should be significantly faster than trying to load everything into a database, and would be much more portable. All you would need on top of it is a way to search and a way to index relational references. There should be plenty of existing help for the searching, and the relational aspect should be easy enough to build. You might even consider re-writing the information so that each file contains a single post with all the answers and comments right there.
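Something like this is the shape of what I mean (the `post_index` table, the folder layout, and the column names are all made up for illustration):

```
using System.IO;
using MySql.Data.MySqlClient;

static void SavePost(MySqlConnection conn, int postId, string postXml, string rootDir)
{
    // One file per post, fanned out into sub-folders so no single directory gets huge.
    string dir = Path.Combine(rootDir, (postId / 1000).ToString());
    Directory.CreateDirectory(dir);
    string path = Path.Combine(dir, postId + ".xml");
    File.WriteAllText(path, postXml);

    // The database only stores what you need to search and join on.
    MySqlCommand cmd = new MySqlCommand(
        "INSERT INTO post_index (PostId, FilePath) VALUES (@id, @path)", conn);
    cmd.Parameters.AddWithValue("@id", postId);
    cmd.Parameters.AddWithValue("@path", path);
    cmd.ExecuteNonQuery();
}
```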

Anyway, just my two-cents (and that is not worth a dime).

csharptest.net
How about something like the http://incubator.apache.org/lucene.net/ port of http://lucene.apache.org/
Si
I like the Lucene idea, but I would honestly expect pulling the data into Lucene to take *longer* than getting it into a database.
Frank Farmer
+1  A: 

I have a few thoughts to help speed this up...

  1. The size of the query may need to be tweaked. There's often a point where a big statement costs more in parsing time and so becomes slower; 500 may be optimal, but perhaps it is not and you could tweak it a little (it could be more, it could be less).

  2. Go multithreaded. Assuming your system isn't already flatlined on the processing, you could make some gains by breaking the data up into chunks and having threads process them (see the sketch after this list). Again, it's an experimentation thing to find the optimal number of threads, but a lot of people are using multicore machines and have CPU cycles to spare.

  3. On the database front, make sure that the table is as bare as it can be. Turn off any indexes and load the data before indexing it.
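For point 2, here is a rough producer/consumer sketch (the worker count, queue capacity, and batch shape are all things to experiment with; `InsertBatch` stands for a batched-insert helper like the one sketched in the question):

```
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;
using MySql.Data.MySqlClient;

static void LoadInParallel(IEnumerable<List<string[]>> batches, string connectionString, int workerCount)
{
    // Bounded queue so the XML reader can't run arbitrarily far ahead of the writers.
    var queue = new BlockingCollection<List<string[]>>(boundedCapacity: 16);

    // One producer reads/parses and enqueues batches...
    Task producer = Task.Run(() =>
    {
        foreach (var batch in batches)
            queue.Add(batch);
        queue.CompleteAdding();
    });

    // ...while several consumers drain the queue, each on its own connection.
    var consumers = new List<Task>();
    for (int i = 0; i < workerCount; i++)
    {
        consumers.Add(Task.Run(() =>
        {
            using (var conn = new MySqlConnection(connectionString))
            {
                conn.Open();
                foreach (var batch in queue.GetConsumingEnumerable())
                    InsertBatch(conn, batch);   // the batched INSERT from the question
            }
        }));
    }

    producer.Wait();
    Task.WaitAll(consumers.ToArray());
}
```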

John Cavan
A: 

SqlBulkCopy ROCKS. I used it to turn a 30 minute function into a 4 second one. However, this is applicable only to MS SQL Server.
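For the SQL Server case, the whole thing is only a few lines (the table name and batch size here are illustrative):

```
using System.Data;
using System.Data.SqlClient;

static void BulkCopyPosts(DataTable posts, string connectionString)
{
    using (SqlBulkCopy bulk = new SqlBulkCopy(connectionString))
    {
        bulk.DestinationTableName = "dbo.posts";   // whatever your table is called
        bulk.BatchSize = 5000;                     // tune for your data
        bulk.BulkCopyTimeout = 0;                  // no timeout on a big load
        bulk.WriteToServer(posts);                 // also accepts an IDataReader for streaming
    }
}
```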

Might I suggest you look at the constraints on the table you've created? If you drop all keys, constraints, etc., the database will do less work on each insertion and less work maintaining indexes and constraints as it goes.

Secondly, set up the tables with large initial sizes to avoid repeated resizing if you are inserting into a blank database.

Finally, see if there is a bulk copy style API for MySQL. SQL Server basically formats the data the way it will be laid out on disk, links the stream straight to the disk, and lets you pump the data in. It then performs one consistency check for all the data instead of one per insert, dramatically improving performance. Good luck ;)

Do you need MySQL? SQL Server makes your life easier if you are using Visual Studio and your database's performance and size requirements are modest.

Spence
I don't *need* MySQL; it's simply what I already had installed. This is, after all, primarily a personal/portfolio project. Having never worked with MSSQL, it'll be a secondary feature once the main project is done. I'll definitely be using SqlBulkCopy when the time comes. The nearest thing for MySQL seems to be a cover for the LOAD DATA syntax. As for indexes, etc. the table is bare since records will only be added once. I'll check into your 'big initial sizes' suggestion. Thanks for the ideas.
AgentConundrum
A: 

In PostgreSQL, the absolute fastest way to get bulk data in is to drop all indexes and triggers, use the equivalent of MySQL's LOAD DATA and then recreate your indexes/triggers. I use this technique to pull 5 GB of forum data into a PostgreSQL database in roughly 10 minutes.

Granted, this may not apply to MySQL, but it's worth a shot. Also, this SO question's answer suggests that this is in fact a viable strategy for MySQL.

A quick Google search turned up some tips on increasing the performance of MySQL's LOAD DATA.
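A rough translation of that pattern to MySQL might look like the sketch below (DISABLE/ENABLE KEYS only defers non-unique index maintenance on MyISAM tables; for InnoDB you would drop and re-create the secondary indexes instead; the table name and the load callback are illustrative):

```
using System;
using MySql.Data.MySqlClient;

static void LoadWithKeysDisabled(string connectionString, Action<MySqlConnection> doLoad)
{
    using (var conn = new MySqlConnection(connectionString))
    {
        conn.Open();

        // Defer non-unique index maintenance while the bulk load runs.
        new MySqlCommand("ALTER TABLE posts DISABLE KEYS", conn).ExecuteNonQuery();
        try
        {
            doLoad(conn);   // e.g. the LOAD DATA step from the answer above
        }
        finally
        {
            // Rebuilding the indexes in one pass beats maintaining them row by row.
            var rebuild = new MySqlCommand("ALTER TABLE posts ENABLE KEYS", conn);
            rebuild.CommandTimeout = 0;
            rebuild.ExecuteNonQuery();
        }
    }
}
```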

Frank Farmer