I need to parse a fairly large XML file in PHP (around 300 MB). How can I do it most efficiently?

In particular, I need to locate specific tags and extract their content into a flat TXT file, nothing more.

+1  A: 

The most efficient way to do that is to create a static XSLT stylesheet and apply it to your XML using XSLTProcessor. The method names are a bit misleading: even though you want to output plain text, you should use either transformToXML() if you need it as a string variable, or transformToURI() if you want to write a file.
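A minimal sketch of that approach, assuming the dom and xsl extensions are available. The element name `title` and the inline sample document are made up for illustration; a real run would load the big file with `$xml->load(...)` instead.

```php
<?php
// Sketch only: "title" and the sample XML below are hypothetical.
$xslSource = <<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- Emit the text of every <title> element, one per line. -->
  <xsl:template match="/">
    <xsl:for-each select="//title">
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
XSL;

$xml = new DOMDocument();
$xml->loadXML('<items><item><title>First</title></item><item><title>Second</title></item></items>');

$xsl = new DOMDocument();
$xsl->loadXML($xslSource);

$processor = new XSLTProcessor();
$processor->importStylesheet($xsl);

// Despite its name, transformToXml() returns plain text here because the
// stylesheet declares method="text". transformToUri($xml, 'out.txt') would
// write the same result to a file instead of returning it.
$text = $processor->transformToXml($xml);
echo $text;
```

Note that XSLTProcessor works on a DOMDocument, so the whole input is held in memory while the transform runs.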

vartec
A: 

Depending on your memory requirements, you can either load it up and parse it with XSLT (the memory-consuming route), or you can create a forward-only cursor and walk the tree yourself, printing the values you're looking for (the memory-efficient route).
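For the forward-only route, something along these lines with XMLReader might do. This is only a sketch: the function name `extractTagText`, the element name `title`, and the temp-file sample data are assumptions for illustration.

```php
<?php
// Sketch only: extractTagText(), "title" and the sample data are hypothetical.
function extractTagText(string $uri, string $tagName, $out): int
{
    $reader = new XMLReader();
    if (!$reader->open($uri)) {
        throw new RuntimeException("Cannot open $uri");
    }

    $count = 0;
    // read() moves the cursor forward one node at a time.
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $tagName) {
            // readString() returns the text content of the current element.
            fwrite($out, $reader->readString() . "\n");
            $count++;
        }
    }
    $reader->close();

    return $count;
}

// Demo on a small temporary file; a real run would pass the 300 MB file
// and an output handle such as fopen('output.txt', 'w').
$xmlFile = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents(
    $xmlFile,
    '<items><item><title>First</title></item><item><title>Second</title></item></items>'
);

$out = fopen('php://memory', 'w+');
$found = extractTagText($xmlFile, 'title', $out);
rewind($out);
$result = stream_get_contents($out);
fclose($out);
unlink($xmlFile);

echo $result;
```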

fatcat1111
+1  A: 

If it's a one-off or occasional job, I'd use XMLStarlet. But if you really want to do it in PHP, I'd recommend splitting the file into smaller chunks first and then processing those. If you load it via DOM as one big chunk, it will take a lot of memory. Also, run the script with the PHP CLI to speed things up.

raspi
+6  A: 

You can read and parse XML in chunks with an old-school SAX-based parsing approach using PHP's xml parser functions.

Using this approach, there's no real limit to the size of documents you can parse, as you simply read and parse a buffer-full at a time. The parser will fire events to indicate it has found tags, data etc.

There's a simple example in the manual which shows how to pick up the start and end of tags. For your purposes you might also want to use xml_set_character_data_handler so that you also pick up the text between tags.
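Roughly, the chunked read loop plus the three handlers could look like the sketch below. The function name `saxExtract`, the element name `title`, and the in-memory sample stream are assumptions for illustration; the tiny chunk size in the demo is only there to show that tags split across buffers are still handled.

```php
<?php
// Sketch only: saxExtract(), "title" and the sample data are hypothetical.
function saxExtract($in, string $tagName, $out, int $chunkSize = 8192): void
{
    $inside = false; // true while between <tagName> and </tagName>
    $buffer = '';    // text collected for the current element

    $parser = xml_parser_create();
    // Report element names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);

    xml_set_element_handler(
        $parser,
        function ($parser, string $name, array $attrs) use (&$inside, &$buffer, $tagName) {
            if ($name === $tagName) {
                $inside = true;
                $buffer = '';
            }
        },
        function ($parser, string $name) use (&$inside, &$buffer, $tagName, $out) {
            if ($name === $tagName) {
                fwrite($out, $buffer . "\n");
                $inside = false;
            }
        }
    );

    // Character data can arrive in several pieces, so accumulate it.
    xml_set_character_data_handler(
        $parser,
        function ($parser, string $data) use (&$inside, &$buffer) {
            if ($inside) {
                $buffer .= $data;
            }
        }
    );

    // Feed the parser one buffer-full at a time; the final call passes true.
    while (!feof($in)) {
        $chunk = fread($in, $chunkSize);
        if ($chunk === false) {
            throw new RuntimeException('Read error');
        }
        if (!xml_parse($parser, $chunk, feof($in))) {
            throw new RuntimeException(
                'XML error: ' . xml_error_string(xml_get_error_code($parser))
            );
        }
    }
}

// Demo with an in-memory stream and a deliberately tiny chunk size.
$in = fopen('php://memory', 'w+');
fwrite($in, '<items><item><title>First</title></item><item><title>Second</title></item></items>');
rewind($in);

$out = fopen('php://memory', 'w+');
saxExtract($in, 'title', $out, 16);
rewind($out);
$saxResult = stream_get_contents($out);

echo $saxResult;
```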

Paul Dixon
A: 

This is what SAX was designed for. SAX has a low memory footprint: it reads in a small buffer of data and fires events when it encounters elements, character data, etc.

It is not always obvious how to use SAX (it wasn't to me the first time I used it). In essence, you have to maintain your own state and your own view of where you are within the document structure. You will generally end up with variables describing which section of the document you are in, e.g. inFoo, inBar, etc., which you set when you encounter particular start/end elements.
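As a rough illustration of keeping your own state, the sketch below tracks a stack of open element names instead of individual inFoo/inBar flags, which is one possible variant of the same idea. The function name `collectAtPath`, the element names, and the sample document are all made up. It parses a string in one call for brevity; a large file would be fed to xml_parse() in chunks.

```php
<?php
// Sketch only: collectAtPath() and the element names are hypothetical.
function collectAtPath(string $xml, string $wantedPath): array
{
    $stack = [];     // names of the currently open elements, outermost first
    $values = [];    // collected text of every element at $wantedPath
    $current = null; // non-null while inside a wanted element

    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);

    xml_set_element_handler(
        $parser,
        function ($parser, string $name, array $attrs) use (&$stack, &$current, $wantedPath) {
            $stack[] = $name;
            if (implode('/', $stack) === $wantedPath) {
                $current = '';
            }
        },
        function ($parser, string $name) use (&$stack, &$current, &$values, $wantedPath) {
            if (implode('/', $stack) === $wantedPath) {
                $values[] = $current;
                $current = null;
            }
            array_pop($stack);
        }
    );

    xml_set_character_data_handler(
        $parser,
        function ($parser, string $data) use (&$current) {
            if ($current !== null) {
                $current .= $data;
            }
        }
    );

    if (!xml_parse($parser, $xml, true)) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }

    return $values;
}

// The same tag name appears in two places; only library/book/title is wanted.
$doc = '<library><book><title>Dune</title><author><title>Mr</title></author></book></library>';
$titles = collectAtPath($doc, 'library/book/title');
print_r($titles);
```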

There is a short description and example of a SAX parser here.

Steve Weet
A: 

Pull parsing is the way to go. It's both memory-efficient and easy to work with. I have been processing files of 50 MB or more this way.

A: 

Problem solved by generating the XSL with XMLStarlet and then applying it with Xalan/Xerces. Thanks for your help.

Kuroki Kaze