I'm parsing an XML file with LibXML and need to sort the entries by date. Each entry has two date fields, one for when the entry was published and one for when it was updated.

<?xml version="1.0" encoding="utf-8"?>
...
<entry>
  <published>2009-04-10T18:51:04.696+02:00</published>
  <updated>2009-05-30T14:48:27.853+03:00</updated>
  <title>The title</title>
  <content>The content goes here</content>
</entry>
...

The XML file is already ordered by date updated, with the most recent first. I can easily reverse that to put the older entries first:

my $parser = XML::LibXML->new();
my $doc = $parser->parse_file($file);
my $xc = XML::LibXML::XPathContext->new($doc->documentElement());

foreach my $entry (reverse($xc->findnodes('//entry'))) {
  ...
}

However, I need to reverse sort the file by date published, not by date updated. How can I do that? The timestamp looks a little wonky too. Would I need to normalize that first?

Thanks!

Update: After fiddling around with XPath namespaces and failing, I made a function that parsed the XML and stored the values I needed in a hash. I then used a bare sort to sort the hash, which works just fine now.

+5  A: 

One way would be changing your reverse to a sort statement (untested):

sub parse_date {
    # Transforms date from 2009-04-10T18:51:04.696+02:00 to 20090410
    my $date= shift;
    $date= join "", $date =~ m!\A(\d{4})-(\d{2})-(\d{2}).*!;
    return $date;
}

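sub by_published_date {
    # getChildrenByTagName returns nodes, so take the first one and read its text;
    # passing the node itself to parse_date would not match the date regex.
    my ($a_node) = $a->getChildrenByTagName('published');
    my ($b_node) = $b->getChildrenByTagName('published');
    my $a_published = parse_date( $a_node->textContent );
    my $b_published = parse_date( $b_node->textContent );

    # Putting $b_published first ensures descending order (day granularity only).
    return $b_published <=> $a_published;
}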

foreach my $entry ( sort by_published_date $xc->findnodes('//entry') ) {
    ...
}

Hope this helps a bit!

Igor
Ah, I see now, I think... $a and $b are two individual entries, right? How could I programmatically go through all of the entries, though? Some files have hundreds of entries...
Andrew
I'm still not getting where $a and $b come from...
Andrew
$a and $b are filled in by the sort function. All your function needs to do, for any two items in your list, is return -1 if $a should sort before $b, 1 if $b should sort before $a, and 0 otherwise. sort will handle the rest.
Chris Jester-Young
Just to clarify (because you asked "how you programmatically go through all the entries"): sort will call your function many times, each time with two values from your list (but in no specific order).
Chris Jester-Young
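To make the $a/$b mechanics described above concrete, here is a minimal sketch. The sample dates and the names newest_first and $calls are made up for illustration; the only point is that sort itself fills in $a and $b and calls the comparator as often as it needs to.

```perl
use strict;
use warnings;

# Hypothetical sample data: published dates already reduced to YYYYMMDD.
my @dates = (20090410, 20081201, 20090530, 20090102);

my $calls = 0;

# sort sets the package variables $a and $b before each call;
# the sub never receives them as arguments.
sub newest_first {
    $calls++;
    return $b <=> $a;    # $b on the left gives descending order
}

my @sorted = sort newest_first @dates;

print "@sorted\n";    # 20090530 20090410 20090102 20081201
print "comparator called $calls times\n";
```

The number of comparator calls depends on the sort implementation, so the sketch prints it rather than assuming a value.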
Alright... this is almost working, except that the XPath is a little more complex than just 'published': it's './post:published'. I have the namespace declared earlier as $xc->registerNs(post => 'http://www.w3.org/2005/Atom'); but once the nodes are passed to the sub as $a and $b, the namespace is lost. Any way to maintain the namespace inside the sub?
Andrew
Just to be a bit more specific: `$a` and `$b` are global variables used by the `sort` function. For more information about them, you can read `perldoc -f sort` on the command line, or http://perldoc.perl.org/functions/sort.html.
Igor
I'm not really sure what you are talking about. In the example I wrote, $a and $b are simply references to the elements returned by $xc->findnodes().
Igor
The problem is with XPath. In your example, the values being compared come from calling getChildrenByTagName('published') on $a and $b. In my XML file, though, the node has a namespace: <post:published>...</pu...>. If I pass the simple 'published' name as the argument, the script fails because both lookups come back empty. If I pass the full XPath as the argument, the script fails because the namespace is unknown. So I need to somehow reference the namespace somewhere in the sort function... I just can't figure out where...
Andrew
Perhaps you should use getChildrenByTagNameNS() instead. Did you check XML::LibXML::Element documentation about those methods?
Igor
Yeah, I've been messing with the different NS functions in LibXML::Element but nothing's working. I'm going to try a different approach...
Andrew
I don't have much experience with XML::LibXML; I prefer to use XML::Twig. Perhaps you could give it a try.
Igor
I think the problem is that you are omitting the http:// from the namespace. Make sure you use the same namespace URI that the document declares "post" to be. (And remember, you can't use the short names as namespaces -- those are shortcuts for the markup, not shortcuts for parsing. Always use the URI.)
jrockway
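A rough sketch of the namespace-aware variant suggested in the comments above, for anyone landing here later. It assumes the entries are in the Atom namespace (the URI from the registerNs call quoted earlier); the inline feed, the helper name published_day and the $ATOM_NS variable are invented for the example. Like the answer's parse_date, it only compares down to the day.

```perl
use strict;
use warnings;
use XML::LibXML;

# Assumption: the feed uses the Atom namespace URI mentioned in the comments.
my $ATOM_NS = 'http://www.w3.org/2005/Atom';

# Hypothetical sample document standing in for the real file.
my $xml = <<'XML';
<feed xmlns:post="http://www.w3.org/2005/Atom">
  <post:entry>
    <post:published>2009-04-10T18:51:04.696+02:00</post:published>
    <post:title>Older</post:title>
  </post:entry>
  <post:entry>
    <post:published>2009-05-30T14:48:27.853+03:00</post:published>
    <post:title>Newer</post:title>
  </post:entry>
</feed>
XML

my $doc = XML::LibXML->load_xml(string => $xml);
my $xc  = XML::LibXML::XPathContext->new($doc->documentElement());
$xc->registerNs(post => $ATOM_NS);

# Look the child up by namespace URI plus local name, not by prefix,
# so no XPath context is needed inside the comparator.
sub published_day {
    my ($entry) = @_;
    my ($node) = $entry->getChildrenByTagNameNS($ATOM_NS, 'published');
    my $text = $node ? $node->textContent : '';
    return $text =~ /\A(\d{4})-(\d{2})-(\d{2})/ ? "$1$2$3" : 0;
}

my @sorted = sort { published_day($b) <=> published_day($a) }
             $xc->findnodes('//post:entry');

my @titles = map {
    my ($t) = $_->getChildrenByTagNameNS($ATOM_NS, 'title');
    $t->textContent;
} @sorted;

print join(', ', @titles), "\n";    # Newer, Older
```

If you would rather stay with XPath, $xc->findvalue('./post:published', $node) with the node as the second argument should also work, since the prefix is registered on $xc and not on the node.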
+1  A: 

A bare sort may put times from different timezones out of order:

 print for sort "2009-06-15T08:00:00+07:00", "2009-06-15T04:00:00+00:00";

Here, the second time is 3 hours after the first, but sorts first.
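One possible way around this, as a sketch: convert each timestamp to epoch seconds in UTC and sort on that number. This uses only the core Time::Local module; the function name rfc3339_to_epoch is made up here, and the regex is an assumption that covers the format in the question (a numeric offset or Z), not every corner of the spec.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Convert an RFC 3339 timestamp to epoch seconds in UTC.
# Fractional seconds are kept as a fractional part of the result.
sub rfc3339_to_epoch {
    my ($stamp) = @_;
    my ($y, $mo, $d, $h, $mi, $s, $frac, $tz) = $stamp =~
        /\A(\d{4})-(\d{2})-(\d{2})[Tt](\d{2}):(\d{2}):(\d{2})(\.\d+)?([Zz]|[+-]\d{2}:\d{2})\z/
        or die "Not an RFC 3339 timestamp: $stamp";

    # Treat the wall-clock fields as UTC first, then correct by the offset.
    my $epoch = timegm($s, $mi, $h, $d, $mo - 1, $y);

    if ($tz !~ /\A[Zz]\z/) {
        my ($sign, $oh, $om) = $tz =~ /\A([+-])(\d{2}):(\d{2})\z/;
        my $offset = ($oh * 60 + $om) * 60;
        # A +07:00 local time is 7 hours ahead of UTC, so subtract.
        $epoch -= $sign eq '+' ? $offset : -$offset;
    }
    return $epoch + ($frac // 0);
}

my @stamps = ("2009-06-15T08:00:00+07:00", "2009-06-15T04:00:00+00:00");

# Plain string sort versus a sort on the normalized epoch value.
my @plain  = sort @stamps;
my @chrono = map  { $_->[1] }
             sort { $a->[0] <=> $b->[0] }
             map  { [ rfc3339_to_epoch($_), $_ ] } @stamps;

print "string sort:   @plain\n";
print "chronological: @chrono\n";
```

With the two timestamps from the example above, the string sort puts the +00:00 time first, while the epoch-based sort puts the +07:00 time (01:00 UTC) first.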

I'm not sure what you mean by "wonky". Your example just shows timestamps in RFC 3339 format.

ysth
Ah. I thought those timestamps were some proprietary thing and not an actual format. Thanks!
Andrew