Is there a maximum file size the XMLReader can handle?

I'm trying to process an XML feed about 3GB in size. There are certainly no PHP errors: the script runs fine and successfully loads into the database after it's been run.

The script also runs fine with smaller test feeds of 1GB and below. However, when processing larger feeds the script stops reading the XML file after about 1GB and continues running the rest of the script.

Has anybody experienced a similar problem? And if so, how did you work around it?

Thanks in advance.

A: 

I've run into a similar issue when parsing large documents. What I wound up doing is breaking the feed into smaller chunks using filesystem functions, then parsing those smaller chunks. So if you have a bunch of <record> tags that you are parsing, extract them with string functions as a stream, and when you have a full record in the buffer, parse that using the XML functions. It sucks, but it works quite well (and is very memory-efficient, since you have at most one record in memory at any one time).
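
A rough sketch of that buffer-and-parse idea (the <record> tag name, the chunk size, and 'feed.xml' are placeholders, not from the answer above):

<?php
// Read the feed in fixed-size chunks and cut complete <record> elements
// out of a string buffer; only one record is ever parsed as XML at a time.
$fp = fopen('feed.xml', 'rb');
$buffer = '';
while (!feof($fp)) {
    $buffer .= fread($fp, 1048576); // 1MB at a time
    // extract every complete <record>...</record> currently in the buffer
    while (($start = strpos($buffer, '<record')) !== false
        && ($end = strpos($buffer, '</record>', $start)) !== false) {
        $record = substr($buffer, $start, $end + 9 - $start); // 9 = strlen('</record>')
        $buffer = substr($buffer, $end + 9);
        $xml = simplexml_load_string($record); // parse just this one record
        // ... process $xml ...
    }
}
fclose($fp);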

ircmaxell
Thanks, yes that's what I ended up doing as well. But as you mentioned, it sucks :o) Do you happen to know for a fact whether or not there is a max file size the XML reader can read?
A boy named Su
Thanks again for your suggestion. I discovered the source of the error, and a solution that has been working for me so far, and thought you might be able to use it. It turns out that there was a vertical tab in the feed (^K, or char 11), which is harmless as a byte but not a valid character for the document type I was using. I ran the feed through a sed find-and-replace before processing it (see the sketch below) and have since been able to parse feeds greater than 2GB. Thanks to everybody else for your suggestions.
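
The same clean-up could be done in PHP along these lines (a sketch only; 'feed.xml' and 'feed-clean.xml' are placeholder names):

<?php
// Strip vertical tabs (char 11 / ^K) from the feed before parsing;
// 0x0B is not an allowed character in XML 1.0 documents.
$in  = fopen('feed.xml', 'rb');
$out = fopen('feed-clean.xml', 'wb');
while (!feof($in)) {
    fwrite($out, str_replace("\x0B", '', fread($in, 1048576)));
}
fclose($in);
fclose($out);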
A boy named Su
+1  A: 

Splitting up the file will definitely help. Other things to try...

  1. adjust the memory_limit setting in php.ini: http://php.net/manual/en/ini.core.php
  2. rewrite your parser using SAX -- http://php.net/manual/en/book.xml.php . This is a stream-oriented parser that doesn't build the whole document tree in memory. Much more memory-efficient, but slightly harder to program (see the sketch below).
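
A minimal sketch of such a SAX-style parser using PHP's expat-based xml extension (the handler bodies and 'feed.xml' are placeholders):

<?php
// Stream-oriented (SAX-style) parsing: the file is fed to the parser in
// small chunks, so memory use stays flat no matter how large the feed is.
function startElement($parser, $name, $attrs) { /* e.g. note that a record started */ }
function endElement($parser, $name) { /* e.g. flush the finished record */ }

$parser = xml_parser_create();
xml_set_element_handler($parser, 'startElement', 'endElement');

$fp = fopen('feed.xml', 'rb');
while (!feof($fp)) {
    $chunk = fread($fp, 8192);
    if (!xml_parse($parser, $chunk, feof($fp))) {
        die(xml_error_string(xml_get_error_code($parser)));
    }
}
fclose($fp);
xml_parser_free($parser);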

Depending on your OS, there might also be a 2gb limit on the RAM chunk that you can allocate. Very possible if you're running on a 32-bit OS.

Vineel Shah
The XMLReader interface is supposed to handle large documents sequentially like a SAX parser, i.e. it doesn't (necessarily) load the entire document into memory.
VolkerK
@Vineel, thanks for that. I had already adjusted the memory limit. VolkerK is right as well: XMLReader reads in a similar manner to a SAX parser. I will try it with SAX if all else fails but would rather not have to rewrite the script.
A boy named Su
A: 

Do you get any errors with

libxml_use_internal_errors(true);
libxml_clear_errors();

// your parser stuff here....    
$r = new XMLReader();
$r->open(/* ... */);
// ....


foreach( libxml_get_errors() as $err ) {
   printf(". %d %s\n", $err->code, $err->message);
}

when the parser stops prematurely?

VolkerK
No, I don't get any. I'm putting together a standalone copy of the script that may shed some more light on the problem, but I'm quite certain it's not a problem with the XML or the PHP script itself. As long as the file is less than 1GB it runs the way it's supposed to with no problem. Even when larger, it runs fine, it just doesn't read all the XML. Thanks for the suggestion though.
A boy named Su
"but I'm quite certain it's not a problem with the XML or the PHP script itself." - Only to make sure: The libxml_get_errors() thingy was not to imply there's something wrong with the script or the xml document. I thought libxml might complain about a failed file seek or a text node that is larger than the allowed maximum (which by default is 10MB) or something like that. If you ran into the problem without libxml_get_errors() returning an error this idea is dead :(
VolkerK
:o) I know that's what you implied. I'm not sensitive - I wasn't being defensive. Sorry if I came across as such.
A boy named Su
A: 

Using Windows XP (NTFS) and PHP 5.3.2, there was no problem with this test script:

<?php
define('SOURCEPATH', 'd:/test.xml');

if ( 0 ) { // flip to 1 to (re)generate the test file first
  build();
}
else {
  echo 'filesize: ', number_format(filesize(SOURCEPATH)), "\n";
  timing('read');
}

function timing($fn) {
  $start = new DateTime();
  echo 'start: ', $start->format('Y-m-d H:i:s'), "\n";
  $fn();
  $end = new DateTime();
  echo 'end: ', $end->format('Y-m-d H:i:s'), "\n";
  echo 'diff: ', $end->diff($start)->format('%I:%S'), "\n";
}

function read() {
  $cnt = 0;
  $r = new XMLReader;
  $r->open(SOURCEPATH);
  while( $r->read() ) {
    if ( XMLReader::ELEMENT === $r->nodeType ) {
      if ( 0===++$cnt%500000 ) {
        echo '.';
      }
    }
  }
  echo "\n#elements: ", $cnt, "\n";
}

function build() {
  $fp = fopen(SOURCEPATH, 'wb');

  $s = '<catalogue>';
  //for($i = 0; $i < 500000; $i++) {
  for($i = 0; $i < 60000000; $i++) {
    $s .= sprintf('<item>%010d</item>', $i);
    if ( 0===$i%100000 ) {
      fwrite($fp, $s);
      $s = '';
      echo $i/100000, ' ';
    }
  }

  $s .= '</catalogue>';
  fwrite($fp, $s);
  fflush($fp);
  fclose($fp);
}

output:

filesize: 1,380,000,023
start: 2010-08-07 09:43:31
........................................................................................................................
#elements: 60000001
end: 2010-08-07 09:43:31
diff: 07:31

(as you can see I screwed up the end-time output in that run - timing() printed $start instead of $end - but I don't want to run this script for another 7+ minutes ;-))

Does this also work on your system?


As a side note: the corresponding C# test application took only 41 seconds instead of 7.5 minutes, and my slow hard drive might have been the (or at least a) limiting factor in this case.

filesize: 1.380.000.023
start: 2010-08-07 09:55:24
........................................................................................................................

#elements: 60000001

end: 2010-08-07 09:56:05
diff: 00:41

and the source:

using System;
using System.IO;
using System.Xml;

namespace ConsoleApplication1
{
  class SOTest
  {
    delegate void Foo();
    const string sourcepath = @"d:\test.xml";
    static void timing(Foo bar)
    {
      DateTime dtStart = DateTime.Now;
      System.Console.WriteLine("start: " + dtStart.ToString("yyyy-MM-dd HH:mm:ss"));
      bar();
      DateTime dtEnd = DateTime.Now;
      System.Console.WriteLine("end: " + dtEnd.ToString("yyyy-MM-dd HH:mm:ss"));
      TimeSpan s = dtEnd.Subtract(dtStart);
      System.Console.WriteLine("diff: {0:00}:{1:00}", s.Minutes, s.Seconds);
    }

    static void readTest()
    {
      XmlTextReader reader = new XmlTextReader(sourcepath);
      int cnt = 0;
      while (reader.Read())
      {
        if (XmlNodeType.Element == reader.NodeType)
        {
          if (0 == ++cnt % 500000)
          {
            System.Console.Write('.');
          }
        }
      }
      System.Console.WriteLine("\n#elements: " + cnt + "\n");
    }

    static void Main()
    {
      FileInfo f = new FileInfo(sourcepath);
      System.Console.WriteLine("filesize: {0:N0}", f.Length);
      timing(readTest);
      return;
    }
  }
}
VolkerK
A: 

It should be noted that PHP in general has a max file size. PHP integers are always signed and platform-sized - there are no unsigned or explicitly 64-bit integer types - so you're capped at 2^31 - 1 on 32-bit systems (2^63 - 1 on 64-bit ones). This is important because PHP uses an integer for the file pointer (your position in the file as you read through it), meaning it cannot address offsets in a file beyond that size.

However, that limit is well above 1 gigabyte. I ran into issues at two gigabytes (as expected, since 2^31 is roughly 2 billion).
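
A quick way to check the cap on a given build (PHP_INT_MAX and PHP_INT_SIZE are standard constants):

<?php
// 2147483647 (2^31 - 1) on 32-bit builds, 9223372036854775807 (2^63 - 1)
// on 64-bit builds; ftell()/filesize() results are subject to the same cap.
printf("PHP_INT_MAX: %d (%d-byte ints)\n", PHP_INT_MAX, PHP_INT_SIZE);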

Soup d'Campbells