I'm trying to create a large XML tree in R. Here's a simplified version of the code:

library(XML)
N = 100000  # in practice N is larger: 10^8 to 10^9
seq = newXMLNode("sequence")
pars = as.character(1:N)
for(i in 1:N)
    newXMLNode("Parameter", parent=seq, attrs=c(id=pars[i]))

When N is about 10^6 this takes about a minute; 10^7 takes about forty minutes. Is there any way to speed this up?

Using the paste command:

par_tmp = paste('<Parameter id="', pars, '"/>', sep="")

takes less than a second.

+1  A: 

I would recommend profiling the function using Rprof or the profr package. This will show you where your bottleneck is, and then you can think about ways to either optimize the function or change the way that you're using it.

Your paste example is much faster in part because it's vectorized. For a fairer comparison, loop over paste one element at a time, as you are currently doing with newXMLNode, and compare the timings.
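As a rough sketch of that comparison (base R only; the names N_small, looped and vectorized are made up for illustration, N is kept small so the loop finishes quickly, and timings will differ from machine to machine):

```r
# Smaller N than in the question so the loop finishes quickly.
N_small <- 10000
pars <- as.character(seq_len(N_small))

# Looped: one paste() call per element, mirroring the newXMLNode() loop.
looped <- character(N_small)
t_loop <- system.time(
  for (i in seq_len(N_small))
    looped[i] <- paste('<Parameter id="', pars[i], '"/>', sep = "")
)

# Vectorized: a single paste() call over the whole vector.
t_vec <- system.time(
  vectorized <- paste('<Parameter id="', pars, '"/>', sep = "")
)

# Elapsed wall-clock seconds for each variant.
print(c(loop = t_loop[["elapsed"]], vectorized = t_vec[["elapsed"]]))
```

Both variants produce the same character vector, so any gap in elapsed time comes from the per-call overhead of the loop rather than from the work itself.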

Edit:

Here is the output from profiling your loop with profr.

library(profr)
xml.prof <- profr(for(i in 1:N) 
    newXMLNode("Parameter", parent=seq, attrs=c(id=pars[i])))
plot(xml.prof)

There is nothing especially obvious in here about places where you can improve this. I see that it spends a reasonable amount of time in the %in% function, so improving that would reduce the overall time somewhat (although you still need to call it on every iteration, so it won't make a huge difference). The best solution would be to rewrite newXMLNode as a vectorized function so you can skip the for loop entirely.
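This is not a vectorized newXMLNode, but one possible workaround in the same spirit is to build the whole document as text with paste and parse it once at the end. A minimal sketch, assuming the attribute values need no XML escaping (true for the numeric ids here); the names nodes and xml_text are invented for illustration, and the parsing step only runs if the XML package is installed:

```r
N_small <- 1000
pars <- as.character(seq_len(N_small))

# Build every <Parameter/> element in one vectorized paste() call.
nodes <- paste('<Parameter id="', pars, '"/>', sep = "")

# Wrap the children in the root element to get a single XML string.
xml_text <- paste("<sequence>", paste(nodes, collapse = ""), "</sequence>",
                  sep = "")

# Optional: parse once if a tree object is actually needed.
if (requireNamespace("XML", quietly = TRUE)) {
  doc <- XML::xmlParse(xml_text, asText = TRUE)
  print(XML::xmlSize(XML::xmlRoot(doc)))  # number of child nodes of the root
}
```

For N in the 10^8 range the string itself becomes very large, so writing the pieces straight to a file in chunks may be more practical than holding a single string or a parsed tree in memory.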

Shane