I'm trying to parse data from Archive.org's search functionality. The data looks like this:

 <doc>
   <float name="avg_rating">5.0</float>
   <arr name="collection"><str>U-Melt</str><str>etree</str></arr>
   <arr name="format"><str>Checksums</str><str>Flac</str><str>Flac FingerPrint</str>
     <str>Metadata</str><str>Ogg Vorbis</str><str>Text</str><str>VBR M3U</str>
     <str>VBR MP3</str><str>VBR ZIP</str></arr>
   <str name="identifier">umelt2009-09-19.main.km184.flac16</str>
   <str name="mediatype">etree</str>
   <int name="num_reviews">1</int>
 </doc>

Here's a link to the full XML.

PHP's SimpleXML has no trouble reaching each doc, and it reads the items labeled str and arr just fine. It's the items labeled float, int or long that it chokes on, and I can't figure out why.

My parsing code is as follows:

/* OPENING FILE */

//Check the file to make sure it's got XML in it before parsing
$xmlCheck = file_get_contents($pathname.$identifier_list);
$xmlCheck = substr($xmlCheck,0,4);

//Note: "!$xmlCheck == ..." negates $xmlCheck first, so compare with !== instead
if ($xmlCheck !== "<?xm") {
 die("<p>WARNING: ".$filename." doesn't look like XML, quitting. Check it to see what's wrong.");
}
else {

 $xml = simplexml_load_file($pathname.$identifier_list);
 $result = $xml->result;
 echo "<br/><br/>".$result['name']."<br/>";

 $counter = 1;

 foreach ($result->doc as $doc) {

  echo "<br/><b>Document ".$counter."</b>";
  $counter++;

  foreach ($doc->children() as $item) {
   echo $item->getName();
   switch ((string) $item['name']) {
    case 'identifier':
     echo "<br/>Identifier: ".$item."\n";
     break;
    case 'licenseurl':
     echo "<br/>License URL: ".$item."\n";
     break;
    case 'mediatype':
     echo "<br/>Mediatype: ".$item."\n";
     break;
    case 'downloads':
     echo "<br/>Downloads: ".$item."\n";
     break;
    case 'avg_rating':
     echo "<br/>Average Rating: ".$item."\n";
     break;
    case 'collection':
     echo "<br/>Collection: ".$item."\n";
     break;
   }
  }
  echo "<br/>";
 }
}

I've tried using ->children(), ->doc and ->long or ->int. None of these seem to pick up the long/int/float items. I'm beginning to think that it's because they're primitives, but I don't know how to fix this issue.

Thanks in advance for your help.

+1  A: 

Hi,

Taking a look at that XML data (the search.xml you linked to), I don't seem to have a problem.

For instance, if I do this:

$xml = simplexml_load_file('search.xml');
foreach ($xml->result->doc as $doc) {
    var_dump($doc);
}

I get several outputs, each looking like this:

object(SimpleXMLElement)[3]
  public 'float' => string '0.0' (length=3)
  public 'arr' => 
    array
      0 => 
        object(SimpleXMLElement)[5]
          public '@attributes' => 
            array
              'name' => string 'collection' (length=10)
          public 'str' => 
            array
              0 => string 'sijis' (length=5)
              1 => string 'netlabels' (length=9)
              2 => string 'netlabels' (length=9)
      1 => 
        object(SimpleXMLElement)[6]
          public '@attributes' => 
            array
              'name' => string 'format' (length=6)
          public 'str' => 
            array
              0 => string '256Kbps MP3' (length=11)
              1 => string 'Text' (length=4)
  public 'long' => string '4721' (length=4)
  public 'str' => 
    array
      0 => string 'sijis_SI8' (length=9)
      1 => string 'http://creativecommons.org/licenses/by-nc-sa/2.0/' (length=49)
      2 => string 'audio' (length=5)
  public 'int' => string '0' (length=1)

(I'm using Xdebug, which gives me nice var_dumps)

This shows that 'int', 'long' and the like are immediate children of the $doc used in the loop, which means you can use something like this:

$xml = simplexml_load_file('search.xml');
foreach ($xml->result->doc as $doc) {
    echo $doc->long . ' ; ' . $doc->float . '<br />';
}

That gets you the 'long' and 'float' data, and gives this kind of output for the first few documents:

4721 ; 0.0
;
2206 ; 0.0
1239 ; 3.5
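
Note that `$doc->long` only gives you the first `<long>` child, so if a document carries several fields of the same type, selecting by the `name` attribute is probably safer. A minimal sketch, assuming the field names from your sample; `fieldValue()` is a hypothetical helper (not part of SimpleXML) and the inline XML is just a stand-in for one doc:

```php
<?php
// Hypothetical helper: return the text of the direct child of $doc whose
// name attribute equals $name, or null when no such child exists.
function fieldValue(SimpleXMLElement $doc, string $name): ?string
{
    // Relative XPath: any direct child of $doc with a matching name attribute.
    // $name is assumed to be a trusted literal without double quotes.
    $matches = $doc->xpath('*[@name="' . $name . '"]');
    return $matches ? (string) $matches[0] : null;
}

// Stand-in document modelled on the sample in the question.
$doc = simplexml_load_string(
    '<doc>'
    . '<float name="avg_rating">5.0</float>'
    . '<str name="identifier">umelt2009-09-19.main.km184.flac16</str>'
    . '<int name="num_reviews">1</int>'
    . '</doc>'
);

echo fieldValue($doc, 'avg_rating'), "\n";   // 5.0
echo fieldValue($doc, 'num_reviews'), "\n";  // 1
var_dump(fieldValue($doc, 'downloads'));     // NULL: field absent in this doc
```

Returning null for a missing field also makes it easy to tell "not present in this doc" apart from an empty value.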

Does this help you?


Actually, your code seems to work fine for me. If I remove the "echo $item->getName();" line to get a clearer output, I get this for the first document:

Document 1
Average Rating: 0.0
Collection:
Downloads: 4721
Identifier: sijis_SI8
License URL: http://creativecommons.org/licenses/by-nc-sa/2.0/
Mediatype: audio

Which seems OK when compared with the XML?
For instance, the downloads count looks right.
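
The empty Collection line is expected, by the way: `collection` is an `arr` element whose values sit in nested `str` children, so casting the `arr` itself to a string yields nothing useful. A small sketch of one way to print those values; `arrValues()` is a hypothetical helper name, and the inline XML is a stand-in for one doc:

```php
<?php
// Hypothetical helper: collect the <str> children of an <arr> element as plain strings.
function arrValues(SimpleXMLElement $arr): array
{
    $values = [];
    foreach ($arr->str as $str) {
        $values[] = (string) $str;
    }
    return $values;
}

// Stand-in document modelled on the sample in the question.
$doc = simplexml_load_string(
    '<doc><arr name="collection"><str>U-Melt</str><str>etree</str></arr></doc>'
);

foreach ($doc->children() as $item) {
    if ((string) $item['name'] === 'collection') {
        echo "Collection: " . implode(', ', arrValues($item)) . "\n";
    }
}
// prints "Collection: U-Melt, etree"
```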

Pascal MARTIN
You provided a technically correct answer, even though it didn't solve the problem because there was underlying idiocy on my part. Thanks for your help, I've marked you as the correct answer.
Dean Putney
Thanks! Have fun :-)
Pascal MARTIN
A: 

Ahem. So it appears that the XML sample I was reading wasn't large enough to include the fields I was looking for. If I increase the number of rows, the data appears and my code is fine.

So, yay for my code working, boo for me being an idiot and not being able to figure it out earlier.
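
For anyone hitting the same thing: a quick way to see which fields a given result set actually contains is to tally the `name` attributes across all docs. A rough sketch; `countFieldNames()` is a hypothetical helper, and the inline XML (including the `response` root element) is an assumed stand-in for the real search results:

```php
<?php
// Hypothetical helper: count how many times each named field occurs across all docs.
function countFieldNames(SimpleXMLElement $result): array
{
    $counts = [];
    foreach ($result->doc as $doc) {
        foreach ($doc->children() as $item) {
            $name = (string) $item['name'];
            $counts[$name] = ($counts[$name] ?? 0) + 1;
        }
    }
    return $counts;
}

// Assumed stand-in for the search results: two docs, only one with a downloads field.
$xml = simplexml_load_string(
    '<response><result name="response">'
    . '<doc><str name="identifier">a</str><long name="downloads">4721</long></doc>'
    . '<doc><str name="identifier">b</str></doc>'
    . '</result></response>'
);

print_r(countFieldNames($xml->result));
```

A field that never shows up in the tally simply isn't in the rows you fetched, which is what tripped me up.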

Thanks for your help.

Dean Putney
huhu, ok ^^ Bad luck ^^
Pascal MARTIN