I'm working with a 20 gig XML file that I would like to import into a SQL database (preferably MySQL, since that is what I am familiar with). This seems like it would be a common task, but after Googling around a bit I haven't been able to figure out how to do it. What is the best way to do this?

I know this ability is built into MySQL 6.0, but that is not an option right now because it is an alpha development release.

Also, if I have to do any scripting I would prefer to use Python because that's what I am most familiar with.

Thanks.

+1  A: 

I've done this several times with Python, but never with such a big XML file. ElementTree is an excellent XML library for Python that would help here. If it were possible, I would divide the XML up into smaller files to make them easier to load into memory and parse.

Joel Hooks
SAX Parser is helpful for breaking up a big XML file into pieces.
S.Lott
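
For illustration, a rough sketch of the SAX approach S.Lott mentions: a handler that accumulates one flat <record> element at a time, so the 20 GB file never has to fit in memory. The 'record' element name and the callback are placeholders for whatever your data actually looks like.

import xml.sax

class RecordHandler(xml.sax.ContentHandler):
    """Hands each finished <record> element to a callback, one at a time."""
    def __init__(self, handle_record):
        xml.sax.ContentHandler.__init__(self)
        self.handle_record = handle_record  # called with a dict per record
        self.current = None                 # child-tag -> text for the open record
        self.text = []

    def startElement(self, name, attrs):
        if name == 'record':                # placeholder element name
            self.current = {}
        self.text = []

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        if name == 'record':
            self.handle_record(self.current)  # e.g. write to a smaller file or a database
            self.current = None
        elif self.current is not None:
            self.current[name] = ''.join(self.text)

# Usage: xml.sax.parse('data.xml', RecordHandler(my_callback))
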
+2  A: 

You can use the getiterator() function to iterate over the XML file without parsing the whole thing at once. You can do this with ElementTree, which is included in the standard library, or with lxml.

from xml.etree import ElementTree  # lxml.etree works the same way

root = ElementTree.parse('data.xml').getroot()  # 'data.xml' is your 20 GB file
for record in root.getiterator('record'):
    add_element_to_database(record) # Depends on your database interface.
                                    # I recommend SQLAlchemy.
Ryan Ginstrom
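
For illustration, add_element_to_database() could be backed by SQLAlchemy along these lines. The connection string, table, columns, and child tag names are all placeholders, not something from the original question:

from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String, Text

engine = create_engine('mysql://user:password@localhost/mydb')  # placeholder credentials
metadata = MetaData()
records = Table('records', metadata,
                Column('id', Integer, primary_key=True),
                Column('title', String(255)),   # placeholder columns
                Column('body', Text))
metadata.create_all(engine)

def add_element_to_database(record):
    # findtext() reads a child element's text; the tag names are assumptions.
    with engine.begin() as conn:
        conn.execute(records.insert().values(
            title=record.findtext('title'),
            body=record.findtext('body')))

For 20 GB of input you would want to batch the inserts (say, a few thousand rows per transaction) rather than commit once per record, but the shape is the same.
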
A: 

It may be a common task, but maybe 20GB isn't as common with MySQL as it is with SQL Server.

I've done this using SQL Server Integration Services and a bit of custom code. Whether you need either of those depends on what you need to do with 20GB of XML in a database. Is it going to be a single column of a single row of a table? One row per child element?

SQL Server has an XML datatype if you simply want to store the XML as XML. This type allows you to query it with XQuery, to create XML indexes over it, and to make the column "strongly typed" by associating it with a collection of XML schemas stored in the database.

John Saunders
I'd really like to get rid of the XML and just store the data as appropriately-named columns. It's a pretty simple format (documents from a public content site with a download API).
Jacob Lyles
A: 

Take a look at the iterparse() function from ElementTree or cElementTree (I guess cElementTree would be best if you can use it).

This piece describes more or less what you need to do: http://effbot.org/zone/element-iterparse.htm#incremental-parsing

This will probably be the most efficient way to do it in Python. Don't forget to call .clear() on the appropriate elements: you really don't want to build an in-memory tree of a 20 GB XML file. The .getiterator() method described in another answer is slightly simpler, but it does require parsing the whole tree first; I assume that poster actually had iterparse() in mind as well.
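
A rough sketch of that incremental pattern, following the effbot article; the 'record' element name and the processing call are placeholders:

from xml.etree import cElementTree as ElementTree  # plain ElementTree works too, just slower

context = ElementTree.iterparse('data.xml', events=('start', 'end'))
context = iter(context)
event, root = next(context)                   # grab the root element so we can clear it later

for event, elem in context:
    if event == 'end' and elem.tag == 'record':   # placeholder element name
        process(elem)                             # e.g. insert the record into the database
        root.clear()                              # discard finished elements to keep memory flat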

A: 

The MySQL documentation does not seem to indicate that XML import is restricted to version 6. It apparently works with 5, too.

bortzmeyer