Let's say I have this XML file:

<?xml version="1.0" encoding="UTF-8" ?>
<TimeSeries>
  <timeZone>1.0</timeZone>
  <series>
    <header/>
    <event date="2009-09-30" time="10:00:00" value="0.0" flag="2"></event>
    <event date="2009-09-30" time="10:15:00" value="0.0" flag="2"></event>
    <event date="2009-09-30" time="10:30:00" value="0.0" flag="2"></event>
    <event date="2009-09-30" time="10:45:00" value="0.0" flag="2"></event>
    <event date="2009-09-30" time="11:00:00" value="0.0" flag="2"></event>
    <event date="2009-09-30" time="11:15:00" value="0.0" flag="2"></event>
  </series>
  <series>
    <header/>
    <event date="2009-09-30" time="08:00:00" value="1.0" flag="2"></event>
    <event date="2009-09-30" time="08:15:00" value="2.6" flag="2"></event>
    <event date="2009-09-30" time="09:00:00" value="6.3" flag="2"></event>
    <event date="2009-09-30" time="09:15:00" value="4.4" flag="2"></event>
    <event date="2009-09-30" time="09:30:00" value="3.9" flag="2"></event>
    <event date="2009-09-30" time="09:45:00" value="2.0" flag="2"></event>
    <event date="2009-09-30" time="10:00:00" value="1.7" flag="2"></event>
    <event date="2009-09-30" time="10:15:00" value="2.3" flag="2"></event>
    <event date="2009-09-30" time="10:30:00" value="2.0" flag="2"></event>
  </series>
  <series>
    <header/>
    <event date="2009-09-30" time="10:00:00" value="0.0" flag="2"></event>
    <event date="2009-09-30" time="10:15:00" value="0.0" flag="2"></event>
    <event date="2009-09-30" time="10:30:00" value="0.0" flag="2"></event>
    <event date="2009-09-30" time="10:45:00" value="0.0" flag="2"></event>
    <event date="2009-09-30" time="11:00:00" value="0.0" flag="2"></event>
  </series>
</TimeSeries>

Let's say I want to do something with its series elements, and I would like to put into practice the advice 'vectorize the vectorizable'. I load the XML package and do the following:

R> library("XML")
R> doc <- xmlTreeParse('/home/mario/Desktop/sample.xml')
R> TimeSeriesNode <- xmlRoot(doc)
R> seriesNodes <- xmlElementsByTagName(TimeSeriesNode, "series")
R> length(seriesNodes)
[1] 3
R> (function(x){length(xmlElementsByTagName(x[['series']], 'event'))}
+ )(seriesNodes)
[1] 6
R>

I don't understand why I only get the result of applying the function to the first element. I had expected three values, one per element of seriesNodes, something like this:

R> mapply(length, seriesNodes)
series series series 
     7     10      6

Oops! I've already come up with the answer, "use mapply":

R> mapply(function(x){length(xmlElementsByTagName(x, 'event'))}, seriesNodes)
series series series 
     6      9      5

But then I see the following problem: The R Inferno tells me that I'm "loop-hiding", not "vectorizing"! Can I avoid looping altogether?

+3  A: 

Since seriesNodes is a list of nodes, there is no easy way to avoid the implicit looping. Simple operations like getting the length are not computationally intensive, so I wouldn't lose any sleep over not being able to vectorise.
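As for why the first attempt returned a single value: the anonymous function was called once on the whole list, and `[[` with a name returns only the first element carrying that name. A minimal sketch, using a plain list with made-up contents as a stand-in for seriesNodes:

```r
# Hypothetical stand-in for seriesNodes: a plain list with repeated names
x <- list(series = 1:6, series = 1:9, series = 1:5)

# `[[` with a name picks only the first element called "series",
# so this yields one number, not three
first_only <- length(x[["series"]])

# Getting one length per element means visiting each element
per_element <- sapply(x, length)
```

Here first_only is a single value, while per_element has one entry per list element.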

Note that you can use sapply(seriesNodes, length) instead of mapply(length, seriesNodes), since length takes only one argument. The same applies to your event-counting function.
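A sketch of the sapply version of the event count, assuming a shortened, made-up document in place of sample.xml (the name countEvents is only for illustration). vapply is shown as a stricter alternative that checks each result is a single integer:

```r
library(XML)

# Hypothetical, shortened stand-in for sample.xml
txt <- '<TimeSeries>
  <series><header/><event value="0.0"/><event value="0.0"/></series>
  <series><header/><event value="1.0"/><event value="2.6"/><event value="6.3"/></series>
</TimeSeries>'

doc <- xmlTreeParse(txt, asText = TRUE)
seriesNodes <- xmlElementsByTagName(xmlRoot(doc), "series")

# Count the <event> children of one <series> node
countEvents <- function(node) length(xmlElementsByTagName(node, "event"))

# One call per node, results collected into a named vector
counts <- sapply(seriesNodes, countEvents)

# Same loop, but the result type is declared up front
counts_checked <- vapply(seriesNodes, countEvents, integer(1))
```

Both calls still loop over the nodes internally; they just keep the loop out of sight.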

The "proper R way" to do things is to use (s|m)apply calls to extract vectors of useful bits of data, then analyse those in the usual manner.
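As a rough illustration of that pattern, the sketch below pulls the value attribute of every event into one numeric vector per series and then summarises those vectors with ordinary vectorised functions. The input is a made-up, shortened document, and the variable names are assumptions for the example:

```r
library(XML)

# Hypothetical, shortened stand-in for sample.xml
txt <- '<TimeSeries>
  <series><header/><event value="0.0"/><event value="0.0"/></series>
  <series><header/><event value="1.0"/><event value="2.6"/><event value="6.3"/></series>
</TimeSeries>'

doc <- xmlTreeParse(txt, asText = TRUE)
seriesNodes <- xmlElementsByTagName(xmlRoot(doc), "series")

# Step 1: extract a plain numeric vector from each <series> node
values <- lapply(seriesNodes, function(s) {
  events <- xmlElementsByTagName(s, "event")
  as.numeric(sapply(events, xmlGetAttr, "value"))
})

# Step 2: analyse the vectors in the usual manner
means <- sapply(values, mean)
```

Once the data are in ordinary vectors, the XML tree is no longer involved and mean, sum and friends are genuinely vectorised.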

Finally, if you really are desperate to vectorise counting events, use names(unlist(seriesNodes)) and then count the occurrences of "series.children.event.name" between each occurrence of "series.name". This is undoubtedly uglier, and possibly slower, than the sapply call.

Richie Cotton
+3  A: 

You could also use xpathApply or xpathSApply. These functions, both provided by the XML package, extract a node set using an XPath expression and then apply a function to each node in the set. To use them, the XML document must be parsed with xmlInternalTreeParse, or with xmlTreeParse and its useInternalNodes option set to TRUE:

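library( XML )

# Count the <event> children of a single <series> node
countEvents <- function( series ){

  events <- xmlElementsByTagName( series, 'event' )
  return( length( events ) )

}

# XPath queries need internal nodes, hence useInternalNodes = TRUE
doc <- xmlTreeParse( "sample.xml", useInternalNodes = TRUE )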

xpathSApply( doc, '/TimeSeries/series', countEvents )
[1] 6 9 5

I don't know if it is any "faster", but the code is definitely cleaner and very explicit to anyone who knows the XPath syntax and how an apply function operates.
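The same function can also pull data out of the nodes directly. A minimal sketch, assuming a shortened, made-up document in place of sample.xml, that collects the value attribute of every event across all series with a single XPath expression:

```r
library(XML)

# Hypothetical, shortened stand-in for sample.xml
txt <- '<TimeSeries>
  <series><header/><event value="0.0"/><event value="0.0"/></series>
  <series><header/><event value="1.0"/><event value="2.6"/><event value="6.3"/></series>
</TimeSeries>'

# xmlParse returns internal nodes, which XPath queries require
doc <- xmlParse(txt, asText = TRUE)

# Apply xmlGetAttr to every <event> matched by the path;
# the extra argument "value" is passed through to xmlGetAttr
values <- as.numeric(
  xpathSApply(doc, "/TimeSeries/series/event", xmlGetAttr, "value")
)

# Total number of events in the document
n_events <- length(values)
```

The XPath expression does the node selection, so no explicit traversal of the series elements is needed in R code.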

Sharpie
The help page for xpathSApply is also particularly enlightening (and I'm using the XML package anyway!).
mariotomo