I have 16 GB of XML files to process (about 700 files in total), and I already have a working PHP script that does it with XMLReader, but it's taking forever. Would parsing in Python be faster? Python is the only other language I'm proficient in, though I'm sure something in C would be faster still.

+2  A: 

I think both of them can rely on wrappers around fast C libraries (mostly libxml2), so there shouldn't be much difference in the parsing per se.

You could test whether the overhead of each language makes a difference, but beyond that it depends on what you're going to do with the XML. What are you parsing it for?

Jack
Parsing it to feed it into a MySQL database and MongoDB.
Michael
A: 

I can't tell you for sure if Python will end up performing better than PHP (because I'm not terribly familiar with the performance characteristics of PHP). I can, however, give you a few suggestions.

  1. If there's a huge difference between your understanding of Python and PHP (i.e. you know far more PHP than Python), stick with PHP. The worst thing for performance in any language is a lack of mastery.
  2. If you want to implement a Python solution, there's a lot in the library to work with, and depending on what you're looking for, you can find it here.
  3. Write a Python script to process the XML, and then run it on a single file. Compare that script's running time to the PHP script's time on the same file. If the Python script is much faster and you have faith that it is bug-free, use Python.
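The comparison in point 3 could be sketched roughly like this. It is only a minimal timing harness, not a port of your PHP script: `count_elements` and `time_one_file` are made-up names, and counting elements stands in for whatever per-element work your real script does. It uses the standard library's `xml.etree.ElementTree.iterparse`, which streams the file instead of loading it whole.

```python
import time
import xml.etree.ElementTree as ET


def count_elements(path):
    """Stream through one XML file and count its elements.

    Placeholder workload: replace the body of the loop with the
    real per-element processing before comparing against PHP.
    """
    count = 0
    for _event, elem in ET.iterparse(path, events=("end",)):
        count += 1
        elem.clear()  # drop the element's text and children to limit memory use
    return count


def time_one_file(path):
    """Return (element_count, elapsed_seconds) for a single file."""
    start = time.perf_counter()
    count = count_elements(path)
    elapsed = time.perf_counter() - start
    return count, elapsed
```

Run `time_one_file` on the same file you time the PHP script on, ideally a few times each, since the first run may be slower while the file is not yet in the OS cache.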

Also, if you have some knowledge of C, in Python you can identify bottlenecks in the code and easily reimplement them in C (though I suspect you won't have a chance to do this).

Rafe Kettler
A: 

There are actually three different performance problems here:

  • The time it takes to parse a file, which depends on the size of individual files.
  • The time it takes to handle the files and directories in the filesystem, if there's a lot of them.
  • Writing the data into your databases.

Where you should look for performance improvements depends on which one of these is the biggest bottleneck.
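One way to find out which of the three dominates is to time each stage separately. The following is a rough sketch, assuming your processing can be split into a read step, a parse step, and a store step; `timed`, `process_file`, `report`, and the `parse` and `store` callables are hypothetical names, not part of any library.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per stage, summed over all files.
timings = defaultdict(float)


@contextmanager
def timed(stage):
    """Add the time spent inside the with-block to timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] += time.perf_counter() - start


def process_file(path, parse, store):
    """Run one file through the three stages, timing each one.

    `parse` and `store` are placeholders for your own functions:
    parse(raw_bytes) -> records, store(records) -> None.
    """
    with timed("filesystem"):
        with open(path, "rb") as f:
            raw = f.read()  # whole-file read, for measurement only
    with timed("parse"):
        records = parse(raw)
    with timed("database"):
        store(records)


def report():
    """Return one formatted line per stage, slowest stage first."""
    total = sum(timings.values()) or 1.0
    ordered = sorted(timings.items(), key=lambda item: -item[1])
    return [
        f"{stage:<10} {seconds:8.3f}s {100 * seconds / total:5.1f}%"
        for stage, seconds in ordered
    ]
```

After running `process_file` over a handful of representative files, printing the lines from `report()` shows where the time actually goes before you decide what to optimize.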

My guess is that the last one is the biggest problem, because writes are almost always the slowest part: they can't be served from a cache, they have to go to disk, and if the data is stored sorted, it can take considerable time to find the right spot to write it.

You presume that the bottleneck is the first of these, the XML parsing. If that is the case, changing languages is not the first thing to try. Instead, check whether there is a SAX parser for your language. SAX parsing is much faster and more memory-efficient than DOM parsing.
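For reference, this is roughly what SAX-style parsing looks like in Python with the standard library's `xml.sax` module. It is a minimal sketch: `RecordCounter` and `count_records` are illustrative names, and the element name you pass in depends on your own files. The parser calls the handler as it reads, so the document is never held in memory as a tree.

```python
import xml.sax


class RecordCounter(xml.sax.ContentHandler):
    """Count how many times a given element name opens in the stream."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def startElement(self, name, attrs):
        # Called by the parser for every opening tag, in document order.
        if name == self.tag:
            self.count += 1


def count_records(path, tag):
    """Parse the file at `path` event by event and count `tag` elements."""
    handler = RecordCounter(tag)
    xml.sax.parse(path, handler)
    return handler.count
```

In a real script, the handler would collect each record's fields in `startElement`, `characters`, and `endElement` and hand the finished record to the database code, rather than just counting.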

Emil Vikström