I'm trying to parse a large XML file. I read it using XML::SAX (with the Expat parser, not the pure-Perl implementation) and put all the second-level and lower nodes into my "Node" class:

package Node;
use Moose;

has "name" => (
    isa    => "Str",
    reader => 'getName',
);

has "text" => (
    is  => "rw",
    isa => "Str",
);

has "attrs" => (
    is  => "rw",
    isa => "HashRef[Str]",
);

has "subNodes" => (
    is      => "rw",
    isa     => "ArrayRef[Node]",
    default => sub { [] },
);

# Return the single sub-node with the given name, or undef if there
# are zero or multiple matches.
sub subNode
{
    my ($self, $name) = @_;
    my $subNodeRef = $self->subNodes;
    my @matchingSubnodes = grep { $_->getName eq $name } @$subNodeRef;

    return $matchingSubnodes[0] if scalar(@matchingSubnodes) == 1;
    return undef;
}

1;

In the "end_element" sub, I check if this is a node I care about, and if it is, I do some further processing.

This all worked fine on my test files, but the day before yesterday I threw it at my real file, all 13 million lines of it, and it's taking forever. It's been running for over 36 hours. How do I tell if it's Moose or XML::SAX that's the bottleneck? Is Moose always this slow, or am I using it wrong?

Update: Profiling a 20,000-line subset of the data shows that Moose is the bottleneck, specifically Class::MOP::Class::compute_all_applicable_attributes (13.9%) and other Class::MOP and Moose internals.

+20  A: 

While Moose does quite a lot of work at startup, which can make it appear a little slow, the code it generates, especially things like attribute accessors, is generally quite a bit faster than what the average Perl programmer would write by hand. So given that your process runs for quite a long time, I doubt any overhead induced by Moose will be relevant.

However, from the code you've shown, I can't really tell what your bottleneck is, even though I firmly believe it isn't Moose. I also want to point out that calling __PACKAGE__->meta->make_immutable, to state that your class is now "finalised", allows Moose to do some further optimisations, but I still doubt that's what's causing you trouble.
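For example, it goes at the bottom of the class, just before the final 1;:

__PACKAGE__->meta->make_immutable;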

How about taking a smaller sample of your data, so your program finishes in a reasonable time, and looking at it in a profiler such as Devel::NYTProf? That will tell you exactly where the time in your program is spent, so you can optimise specifically those parts for the greatest possible gain.
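Running it under NYTProf is just (script and sample file names are placeholders):

perl -d:NYTProf yourscript.pl sample.xml
nytprofhtml    # writes an HTML report to ./nytprof/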

One possibility is that the type constraints you're using slow things down quite a bit. Validating instance attributes this thoroughly on every single write access (and on class instantiation) isn't something most programmers usually do. You could try simpler constraints, such as ArrayRef instead of ArrayRef[Node], if you're confident enough in the validity of your data. That way only the type of the attribute value itself is checked, not the type of every element in that array reference.
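For example, the subNodes declaration from your class would become:

has "subNodes" => (
    is      => "rw",
    isa     => "ArrayRef",   # was ArrayRef[Node]; skips the per-element Node check
    default => sub { [] },
);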

But still, profile your code. Don't guess.

rafl
+1 For profiling suggestion
T.E.D.
Is NYTProf that much better than DProf?
Paul Tomblin
No, much more than that.
rafl
The `__PACKAGE__->meta->make_immutable` thing made my subset go from 11 seconds down to 6 seconds. Now the profiler says it's spending 26% of its time in `Moose::Meta::TypeConstraint::ArrayRef[Node]` so I'm going to try your suggestion to loosen the constraint on that next.
Paul Tomblin
Loosening the constraints didn't have any appreciable difference - my 20,000 line test file went from 6.4 seconds to 6.3 seconds.
Paul Tomblin
@Paul: what does the time breakdown look like now? It should be mostly in XML processing, not in Moose innards.
Ether
`dprofpp -I -u` shows 64% in XML::SAX::Expat::_handle_start, of which 51% is in EADHandler::start_element (where I "new" the Node class), and 26% is Moose::Meta::TypeConstraint::ArrayRef[Node].
Paul Tomblin
@Paul: how does the percentage change if you switch to using a type constraint of just `ArrayRef`?
Ether
@Ether, almost exactly the same run time, but now it's 43% in `XML::SAX::Expat::_handle_start`, 26.5% in `EADHandler::start_element` and only 16.8% in `Node::new`.
Paul Tomblin
+2  A: 

I have successfully written large XML processing apps using XML::Twig; a 745 MB file takes less than an hour to run on a reasonably sized box.
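A minimal sketch of that approach (the element name and file name are placeholders):

use XML::Twig;

my $twig = XML::Twig->new(
    twig_handlers => {
        # fires once for each completed <record> element
        record => sub {
            my ($twig, $elt) = @_;
            # ...process $elt here...
            $twig->purge;    # release everything parsed so far to keep memory flat
        },
    },
);
$twig->parsefile('big.xml');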

But as other users have already mentioned, you need to profile your code to figure out exactly what's causing the issue.

cfaulkingham
+2  A: 

I highly suspect that your speed problem is not in Moose so much as in memory allocation and disk swapping. Even without ->meta->make_immutable, based on your times for the 20K subset, your script should finish in about two hours ((11 * (13_000_000 / 20_000)) / 60 == ~119 min). With ->meta->make_immutable, that drops to approximately 65 min.

Try running your big script again and watch what your memory and swap are doing; I suspect you're giving your disk an awful thrashing.
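For instance, watching swap activity while it runs (any similar monitoring tool works):

vmstat 5    # watch the si/so columns for swap-in/swap-out activity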

Stevan Little
Munin says I was hardly swapping at all during the first 36 hour run. See http://xcski.com/munin/xcski.com/allhats2.xcski.com-cpu.html, http://xcski.com/munin/xcski.com/allhats2.xcski.com-memory.html and http://xcski.com/munin/xcski.com/allhats2.xcski.com-swap.html from the evening of 10Oct to the mid morning of 12Oct. You can see more swap in use during the second run (starting about 10am on 12Oct) after I did make_immutable, but that may be related to other things I was doing at the same time.
Paul Tomblin
Upvoted for being Stevan Little.
jrockway