Hiya,

I currently do a lot of data parsing and have toyed with PHP's XML functions, such as SimpleXML and a few others here and there.

But there always seems to be some sort of issue when dealing with them, mainly because of the way the data is presented.

The most reliable way I have found is simply to use preg_match_all and regular expressions to pull my data into the script for processing.

Does anyone see a problem with this? What are the cons of using regular expressions rather than ready-built XML parsers?

My main concerns are speed and server resource usage.

+1  A: 

If you use DOMDocument and DOMXPath, I suspect these will solve your problems.

See http://jp2.php.net/manual/en/class.domdocument.php and http://jp2.php.net/manual/en/class.domxpath.php
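As a rough illustration of what that looks like, here is a minimal sketch. The sample feed and its element names (`products`, `product`, `name`, `price`) are made up for the example and are not from the question.

```php
<?php
// Hypothetical sample feed; the element and attribute names are invented.
$xml = <<<XML
<products>
  <product id="1"><name>Widget</name><price>9.99</price></product>
  <product id="2"><name>Gadget</name><price>19.50</price></product>
</products>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);            // parse the string into a DOM tree
$xpath = new DOMXPath($doc);

$names = [];
// Select the <name> of every <product> whose <price> is above 10.
foreach ($xpath->query('//product[price > 10]/name') as $node) {
    $names[] = $node->textContent;
}

print_r($names);
```

The XPath expression does the filtering and the navigation in one step, so the PHP side only has to collect the results.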

Could you provide an example of what you are trying to do, though?

Edit
To directly answer your question, though: regular expressions are easy to mess up, especially when processing hierarchical structures like XML. Even if you get it right, it will likely be slower than using XPath.
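To make the "easy to mess up" part concrete, here is a small sketch. The two fragments carry the same data, but the second reorders the attributes, uses single quotes and wraps the text in CDATA. The pattern and the helper names `viaRegex` and `viaDom` are hypothetical, written only for this comparison, and the example says nothing about speed.

```php
<?php
// Two equivalent XML fragments with different surface layouts.
$a = '<item id="7" lang="en">Tea &amp; Coffee</item>';
$b = "<item lang='en' id='7'><![CDATA[Tea & Coffee]]></item>";

// A typical hand-written pattern (hypothetical), tuned to the first layout.
$pattern = '/<item id="(\d+)" lang="en">(.*?)<\/item>/s';

// Returns the captured text, or null when the pattern does not match.
function viaRegex(string $xml, string $pattern): ?string {
    return preg_match($pattern, $xml, $m) ? $m[2] : null;
}

// Returns the decoded text of the <item> with id 7, or null if absent.
function viaDom(string $xml): ?string {
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $node = (new DOMXPath($doc))->query('/item[@id="7"]')->item(0);
    return $node ? $node->textContent : null;
}

var_dump(viaRegex($a, $pattern)); // matches, but the entity is still encoded
var_dump(viaRegex($b, $pattern)); // no match for the second layout
var_dump(viaDom($a), viaDom($b)); // both yield the same decoded string
```

The regex only works while the producer of the file keeps emitting exactly the layout the pattern was written for; the parser handles both.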

Edit 2
Just to add: PHP's implementation of XPath, DOMXPath, only supports XPath 1.0. If you need to use regular expressions to evaluate the contents of an element or one of its attributes, then you'd need something supporting XPath 2.0... or go with a risky, error-prone regex.
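One possible middle ground, sketched here as an assumption rather than something the answer above proposes, is to let XPath 1.0 select the nodes and then apply the regex in PHP to the extracted text only. The sample document and the reference format (`AB-1001`) are invented.

```php
<?php
// Hypothetical order list; element names and values are invented.
$xml = <<<XML
<orders>
  <order><ref>AB-1001</ref></order>
  <order><ref>pending</ref></order>
  <order><ref>CD-2002</ref></order>
</orders>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

$matches = [];
foreach ($xpath->query('//order/ref') as $ref) {
    // The regex runs on an already-extracted string, not on raw markup.
    if (preg_match('/^[A-Z]{2}-\d{4}$/', $ref->textContent)) {
        $matches[] = $ref->textContent;
    }
}

print_r($matches);
```

This keeps the pattern away from tags, quoting and entities, which is where hand-written regexes against XML tend to go wrong.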

Jonathan Fingland
It varies, really. It's just down to working out whether there is a problem with me using regular expressions and whether it's worth using other functions.
Shadi Almosri
+1  A: 

XML parsing is a serious, high-overhead business. If your data stream is simple enough to parse with regular expressions, that's going to be the most efficient way to parse it.

If you want to do XML parsing while minimizing resources, the SAX parser is probably your best bet. It won't be as efficient as hand-crafted regexes, but it might be good enough.

http://www.brainbell.com/tutorials/php/Parsing%5FXML%5FWith%5FSAX.htm
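For a feel of the SAX style, here is a minimal sketch using PHP's built-in XML Parser functions. The document is fed in pieces to mimic reading a large file in blocks; the chunk contents and the `name` element are invented, and in practice the chunks would come from something like `fread()` on the downloaded file.

```php
<?php
// Hypothetical input, split at arbitrary points (even mid-text).
$chunks = [
    '<products><product><name>Wid',
    'get</name></product><product>',
    '<name>Gadget</name></product></products>',
];

$names  = [];
$inName = false;
$buffer = '';

$parser = xml_parser_create();
// Keep tag names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Start tag: begin collecting text when a <name> opens.
    function ($p, $tag, $attrs) use (&$inName, &$buffer) {
        if ($tag === 'name') { $inName = true; $buffer = ''; }
    },
    // End tag: store the collected text when </name> closes.
    function ($p, $tag) use (&$inName, &$buffer, &$names) {
        if ($tag === 'name') { $names[] = $buffer; $inName = false; }
    }
);

// Character data may arrive in several pieces, so append rather than assign.
xml_set_character_data_handler($parser, function ($p, $data) use (&$inName, &$buffer) {
    if ($inName) { $buffer .= $data; }
});

$last = count($chunks) - 1;
foreach ($chunks as $i => $chunk) {
    // The third argument tells the parser when the final chunk has arrived.
    if (!xml_parse($parser, $chunk, $i === $last)) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }
}

print_r($names);
```

Only the current chunk and whatever the handlers choose to keep are held in memory, which is the main reason to reach for this style with large files.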

The DOM and SimpleXML parsers require the whole document to be loaded; then they can verify the doc, convert it to a node structure, and only then can you use the data. Sounds like a lot of work for the parser? It is. But for many purposes, it's still appropriate.

For most of my work, I've given up on XML and am using JSON.

Vineel Shah
Yeah, JSON is fantastic if available, but we get a download of XML files over FTP and then process them. The problem with regex, I think, is that the whole file is loaded into memory and then split into the parts we need (so both copies are in memory for a little while). So lots of server resources are taken up. I'll check out SAX now...
Shadi Almosri