I'm relatively new to parsing XML files and am attempting to read a large XML file with XMLReader.

<?xml version="1.0" encoding="UTF-8"?>
<ShowVehicleRemarketing environment="Production" lang="en-CA" release="8.1-Lite" xsi:schemaLocation="http://www.starstandards.org/STAR /STAR/Rev4.2.4/BODs/Standalone/ShowVehicleRemarketing.xsd">
  <ApplicationArea>
    <Sender>
      <Component>Component</Component>
      <Task>Task</Task>
      <ReferenceId>w5/cron</ReferenceId>
      <CreatorNameCode>CreatorNameCode</CreatorNameCode>
      <SenderNameCode>SenderNameCode</SenderNameCode>
      <SenderURI>http://www.example.com</SenderURI>
      <Language>en-CA</Language>
      <ServiceId>ServiceId</ServiceId>
    </Sender>
    <CreationDateTime>CreationDateTime</CreationDateTime>
    <Destination>
      <DestinationNameCode>example</DestinationNameCode>
    </Destination>
  </ApplicationArea>
...

I am receiving the following error:

ErrorException [ Warning ]: XMLReader::read() [xmlreader.read]: compress.zlib://D:/WebDev/example/local/public/../upload/example.xml.gz:2: namespace error : Namespace prefix xsi for schemaLocation on ShowVehicleRemarketing is not defined

I've searched around and can't find much useful information on using XMLReader to read XML files with namespaces. How would I go about defining a namespace, if that is in fact what I need to do? Any help or links to pertinent resources would be appreciated.

+4  A: 

There needs to be a definition of the xsi namespace. E.g.

<ShowVehicleRemarketing
  environment="Production"
  lang="en-CA"
  release="8.1-Lite"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.starstandards.org/STAR /STAR/Rev4.2.4/BODs/Standalone/ShowVehicleRemarketing.xsd"
>
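
To illustrate the effect of the declaration, here is a minimal sketch (my own illustration, not tested against the original feed) that parses a cut-down, self-closing version of the root element from a string. Once xmlns:xsi is present, the attribute can be read through its namespace URI; the variable names are only illustrative.

```php
<?php
// Cut-down root element with the xsi prefix declared (illustrative data only).
$xml = '<ShowVehicleRemarketing'
     . ' xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'
     . ' xsi:schemaLocation="http://www.starstandards.org/STAR /STAR/Rev4.2.4/BODs/Standalone/ShowVehicleRemarketing.xsd"/>';

$reader = new XMLReader;
$reader->XML($xml);
$reader->read(); // moves to the root element

// Look the attribute up by local name and namespace URI, not by prefix.
$location = $reader->getAttributeNs(
    'schemaLocation',
    'http://www.w3.org/2001/XMLSchema-instance'
);
echo $location, "\n";
$reader->close();
```

The prefix itself is arbitrary; what the parser checks is that it is bound to a namespace URI somewhere on the element or one of its ancestors.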

Update: You could write a user defined filter and then let the XMLReader use that filter, something like:

stream_filter_register('darn', 'DarnFilter');
$src = 'php://filter/read=darn/resource=compress.zlib://something.xml.gz';
$reader->open($src);

The contents read by the compress.zlib wrapper are then "routed" through the DarnFilter, which has to find the (first) location where it can insert the xmlns:xsi declaration. But this is quite messy and will take some effort to do right (e.g. theoretically bucket A could contain xs, bucket B i:schem and bucket C aLocation=").


Update 2: Here's an ad-hoc example of a filter in PHP that inserts the xsi namespace declaration. Mostly untested (it worked with the one test I ran ;-) ) and undocumented. Take it as a proof of concept, not production code.

<?php
stream_filter_register('darn', 'DarnFilter');
$src = 'php://filter/read=darn/resource=compress.zlib://d:/test.xml.gz';

$r = new XMLReader;
$r->open($src);
while($r->read()) {
  echo '.';
}

class DarnFilter extends php_user_filter {
  protected $buffer = '';
  protected $status = PSFS_FEED_ME;

  // the ": int" return type avoids a deprecation notice on PHP 8.1+
  public function filter($in, $out, &$consumed, $closing): int
  {
    while ( $bucket = stream_bucket_make_writeable($in) ) {
      $consumed += $bucket->datalen;
      if ( PSFS_PASS_ON == $this->status ) {
        // we're already done, just copy the content
        stream_bucket_append($out, $bucket);
      }
      else {
        // buffer the data until the complete start tag of the root element has arrived
        $this->buffer .= $bucket->data;
        if ( $this->insertXsiDeclaration() ) {
          // first element found (and patched if necessary)
          // send the current buffer
          $bucket->data = $this->buffer;
          $bucket->datalen = strlen($bucket->data);
          stream_bucket_append($out, $bucket);
          $this->buffer = '';
          // no need for further processing
          $this->status = PSFS_PASS_ON;
        }
      }
    }
    if ( $closing && PSFS_FEED_ME == $this->status && '' !== $this->buffer ) {
      // the stream ended before a root element was found: flush the buffer unchanged
      stream_bucket_append($out, stream_bucket_new($this->stream, $this->buffer));
      $this->buffer = '';
      return PSFS_PASS_ON;
    }
    return $this->status;
  }

  /* looks for the first (root) element in $this->buffer;
  *  if it doesn't contain an xsi namespace decl, inserts it.
  *  Skips <?...?> and <!...> constructs; does not handle a self-closing
  *  root element or a '>' inside an attribute value.
  */
  protected function insertXsiDeclaration() {
    $rc = false;
    if ( preg_match('~<([^?!>\s]+)\s?([^>]*)>~', $this->buffer, $m, PREG_OFFSET_CAPTURE) ) {
      $rc = true;
      if ( false===strpos($m[2][0], 'xmlns:xsi') ) {
        $in = '<'.$m[1][0]
          . ' xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
          . $m[2][0] . '>';
        $this->buffer = substr($this->buffer, 0, $m[0][1])
          . $in
          . substr($this->buffer, $m[0][1] + strlen($m[0][0]));
      }
    }
    return $rc;
  }
}

Update 3: And here's an ad-hoc solution written in C#:

// requires: using System.IO; using System.IO.Compression; using System.Xml;
XmlNamespaceManager nsmgr = new XmlNamespaceManager(new NameTable());
// prime the XmlReader with the xsi namespace
nsmgr.AddNamespace("xsi", "http://www.w3.org/2001/XMLSchema-instance");

using ( XmlReader reader = XmlReader.Create(
  new GZipStream(new FileStream(@"\test.xml.gz", FileMode.Open, FileAccess.Read), CompressionMode.Decompress),
  new XmlReaderSettings(),
  new XmlParserContext(null, nsmgr, null, XmlSpace.None)
)) {
  while (reader.Read())
  {
    System.Console.Write('.');
  }
}
VolkerK
Okay, so say the XML is remote and I can't change it. Is there a way to just ignore the fact that the document appears to be malformed, i.e. lacking a namespace definition?
FelixHCat
I don't think PHP's XMLReader has an option to ignore that kind of error or a means of "injecting" a namespace declaration. It looks like you have to alter the documents, maybe on the fly, but that will not exactly boost performance. Is PHP your only option? E.g. the .NET XmlReader can be initialized with an XmlParserContext that already "contains" predefined namespaces; see http://msdn.microsoft.com/en-us/library/xc8bact5.aspx
VolkerK
PHP is the only option. Is there a way, do you suppose, to alter the document before I attempt to read it without loading the whole thing into memory? A couple of further complications: it's gzipped and ~300 MB uncompressed. Things are starting to look complicated/hopeless.
FelixHCat
See the update. It sounds like the requirements are not within PHP's sweet spot. Feel free to explain why PHP is the only option (and feel free to refuse as well ;-) )
VolkerK
@Volker I suggested a Stream Wrapper in my comments too. You could str_replace the namespace declaration in it as well.
Gordon
@Gordon: Still it's ugly and I only tentatively suggest this solution.
VolkerK
@Volker I find StreamWrappers intriguing. I never got too deep into them, but the idea of having a transparent proxy doesn't sound too ugly to me. I mean, the str_replace'ing definitely is, but the StreamWrapper?
Gordon
@Gordon: According to the error message there's already a wrapper involved (compress.zlib). I.e. if you want to write a _wrapper_ for this you'd have to handle the compress in it as well. It would probably be more feasible (and port-/configurable) to write a _filter_. Anyway, you'd need to find a place to put the extra attribute in a somewhat reliable, flexible way. And then you'd have another call to a user defined method for each data chunk (+ some extra processing until you've inserted the attribute), which will make it even slower to process the 300MB of xml data. Let me try... ;-)
VolkerK
+1  A: 

You can load the XML with file_get_contents and patch it with str_replace before passing it to XMLReader.

Either insert the required namespace declaration for the xsi prefix:

$reader = new XMLReader;
$reader->xml(str_replace(
    '<ShowVehicleRemarketing',
    '<ShowVehicleRemarketing xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"',
    file_get_contents('http://example.com/data.xml')));

Another option would be to remove the schemaLocation attribute:

$reader->xml(str_replace(
    'xsi:schemaLocation="http://www.starstandards.org/STAR /STAR/Rev4.2.4/BODs/Standalone/ShowVehicleRemarketing.xsd"',
    '',
    file_get_contents('http://example.com/data.xml')));

However, if there are more undeclared prefixes in the document, you will have to handle each of them.

Gordon
*sigh* That would work fine if the file wasn't ~300 MB. Perhaps I should explore some option to rewrite <ShowVehicleRemarketing> without loading the whole file into memory?
FelixHCat
@Felix hmm, I've never tried that, but you might be able to use the [libxml functions](http://de.php.net/manual/en/function.libxml-set-streams-context.php) to register a custom stream filter that modifies the data before it is processed by XmlReader.
Gordon
A: 

Either fix whatever's writing out malformed XML, or write a separate tool to perform the fix later. (It doesn't have to read it all into memory at the same time, necessarily - stream the data in/out, perhaps reading and writing a line at a time.)

That way your reading code doesn't need to worry about trying to do something useful with the data and fixing it up at the same time.
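
A minimal sketch of such a separate fix-up tool in PHP, under a few assumptions: the zlib extension is available, the root element name is known in advance, and the first occurrence of "<" followed by that name in the file is the root start tag. The function name copyWithXsiDeclaration and its parameters are hypothetical, not part of any library.

```php
<?php
// Hypothetical helper: copies a gzipped XML file chunk by chunk and adds the
// xsi declaration to the root start tag if it is missing. Only the data up to
// the end of the root start tag is held in memory.
function copyWithXsiDeclaration(string $src, string $dst, string $rootName): void
{
    $decl = ' xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"';
    $in  = gzopen($src, 'rb');
    $out = gzopen($dst, 'wb');
    if ($in === false || $out === false) {
        throw new RuntimeException('cannot open input or output file');
    }

    $head    = '';     // data collected until the root start tag is complete
    $patched = false;  // true once the root start tag has been written
    while (!gzeof($in)) {
        $chunk = gzread($in, 65536);
        if ($chunk === false) {
            throw new RuntimeException('read error');
        }
        if ($patched) {
            gzwrite($out, $chunk);   // the rest of the file is copied unchanged
            continue;
        }
        $head .= $chunk;
        $start = strpos($head, '<' . $rootName);
        $end   = ($start === false) ? false : strpos($head, '>', $start);
        if ($end !== false) {
            $tag = substr($head, $start, $end - $start + 1);
            if (strpos($tag, 'xmlns:xsi') === false) {
                // insert the declaration right after "<RootName"
                $head = substr_replace($head, $decl, $start + 1 + strlen($rootName), 0);
            }
            gzwrite($out, $head);
            $head    = '';
            $patched = true;
        }
    }
    if ($head !== '') {
        gzwrite($out, $head);        // no complete root tag found: copy as-is
    }
    gzclose($in);
    gzclose($out);
}
```

It could be called as copyWithXsiDeclaration('example.xml.gz', 'example.fixed.xml.gz', 'ShowVehicleRemarketing'), after which the reading code opens the fixed file through compress.zlib:// as before. The trade-off is one extra pass over the data in exchange for keeping the reader free of fix-up logic.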

Jon Skeet
+1  A: 

The xsi prefix is conventionally bound to the XML Schema instance namespace:

xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'

If the prefix is not declared, your XML file is not XML+NS compliant and a namespace-aware parser will reject it. So you should solve that in the source document.

A note on xsi: it is even more vital than some possible other namespaces, because it directs a validating parser to the correct schema locations for the schema of your XML.

Abel