Hi, I am trying to parse the XML file below in R so that I can analyze the data. I am trying to get the mean and standard deviation of the price. I would also like to get the rate at which the share price changes over time. I have tried entering the data by hand but am having problems with the date structure (I have tried the following:

z <- strptime ("HH:MM:SS.ms, "%H:%m:%S.%f")

but it failed to work). I know the XML file only has a few values, but is this a process that could be automated, and if so, what packages would I need? (I am new to R.) Any help would be much appreciated.

Thanks, Anthony.

<?xml version = "1.0"?>
    <Company >
    <shareprice>
    <timeStamp> 12:00:00:01</timeStamp>
    <Price>  25.02</Price>
    </shareprice>



    <shareprice>
    <timeStamp> 12:00:00:02</timeStamp>
    <Price>  15</Price>
    </shareprice>



    <shareprice>
    <timeStamp> 12:00:00:025</timeStamp>
    <Price>  15.02</Price>
    </shareprice>



    <shareprice>
    <timeStamp> 12:00:00:031</timeStamp>
    <Price>  18.25</Price>
    </shareprice>



    <shareprice>
    <timeStamp> 12:00:00:039</timeStamp>
    <Price>  18.54</Price>
    </shareprice>



    <shareprice>
    <timeStamp> 12:00:00:050</timeStamp>
    <Price> 16.52</Price>
    </shareprice>


   <shareprice>
    <timeStamp> 12:00:01:01</timeStamp>
    <Price>  17.50</Price>
   </shareprice>
</Company>
+6  A: 

In

z <- strptime ("HH:MM:SS.ms, "%H:%m:%S.%f")

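you are missing a closing " after HH:MM:SS.ms, so it is invalid syntax. Note also that %m is the month (minutes are %M) and that R's strptime() has no %f; fractional seconds are read with %OS.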

Next, the data is non-standard: the usual convention separates seconds from sub-seconds with a dot, i.e. 12:23:34.567. Milliseconds written that way can be parsed like this:

> ts <- "12:00:00.050"
> strptime(ts, "%H:%M:%OS")
[1] "2010-07-09 12:00:00 CDT"
> 

So you need not only to get the data out of the XML first, but also to convert the string into the dotted form. Alternatively, you can parse the string and fill a POSIXlt time structure 'by hand'.
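A minimal sketch of that string conversion, assuming the digits after the last colon are a decimal fraction of a second (the file does not say whether `:01` means 0.01 s or 1 ms). The helper name `to_dotted` is made up for illustration:

```r
# Hypothetical helper: strip surrounding whitespace, then turn the
# final ':' into '.' so that %OS can read the fractional part.
# Assumption: "12:00:00:025" means 12:00:00 plus 0.025 seconds.
to_dotted <- function(x) sub(":([0-9]+)$", ".\\1", trimws(x))

raw    <- c(" 12:00:00:01", " 12:00:00:025", " 12:00:01:01")
dotted <- to_dotted(raw)
dotted

# strptime() fills in today's date; a fixed time zone avoids DST surprises
times <- as.POSIXct(strptime(dotted, "%H:%M:%OS", tz = "UTC"))

# Gaps between consecutive timestamps, in seconds
gaps <- as.numeric(diff(times), units = "secs")
gaps
```

If the trailing digits turn out to be a millisecond count instead, the regular expression would need to pad them before joining with a dot.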

Postscriptum: Forgot to mention that you need to enable printing of sub-second times:

> options("digits.secs"=3)         # shows milliseconds (three digits)
> strptime(ts, "%H:%M:%OS")
[1] "2010-07-09 12:00:00.05 CDT"   # suppresses trailing zero
> 
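If changing a global option is not desirable, the output format `%OSn` (with n from 0 to 6) should give the same effect for a single call. A small sketch; as far as I know R truncates rather than rounds here, so the last digit can occasionally differ from what was typed in:

```r
ts     <- "12:00:00.050"
parsed <- strptime(ts, "%H:%M:%OS", tz = "UTC")

# %OS3 requests three decimal places for this call only;
# options("digits.secs") is left untouched
out <- format(parsed, "%H:%M:%OS3")
out
```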

Postscriptum 2: You are also in luck with respect to your file thanks to the XML package:

> library(XML)
> xmlToDataFrame("c:/Temp/foo.xml")     # save your data as c:/Temp/foo.xml
      timeStamp   Price
1   12:00:00:01   25.02
2   12:00:00:02      15
3  12:00:00:025   15.02
4  12:00:00:031   18.25
5  12:00:00:039   18.54
6  12:00:00:050   16.52
7   12:00:01:01   17.50
> 
Dirk Eddelbuettel
Hi Dirk, thanks for the quick reply. I followed the steps you gave and added the following:

library(XML)
test.df <- xmlToDataFrame("c:/Users/user/Desktop/shares.xml")
attach(test.df)
mean(Price)

I get the following:

[1] NA
Warning message:
In mean.default(Price) : argument is not numeric or logical: returning NA

Are there other commands I am missing?
Anthony Keane
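Perhaps try mean(as.numeric(as.character(Price))). The as.character() step matters if Price was read in as a factor, because as.numeric() on a factor returns the level codes rather than the values.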
Greg
Thanks Greg, that worked.
Anthony Keane
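Putting the pieces from this thread together, here is a rough sketch of the mean, standard deviation and rate of change asked for in the question. To keep it runnable without the XML package, the data frame is typed in by hand as a stand-in for the `xmlToDataFrame()` result; it assumes the columns came back as factors (the data frame default before R 4.0), which would explain the warning above, and that the digits after the last colon are a decimal fraction of a second. The names `shares`, `price`, `secs` and `rate` are only illustrative:

```r
# Stand-in for xmlToDataFrame("shares.xml"): both columns arrive as text
shares <- data.frame(
  timeStamp = c("12:00:00:01", "12:00:00:02", "12:00:00:025", "12:00:00:031",
                "12:00:00:039", "12:00:00:050", "12:00:01:01"),
  Price     = c("25.02", "15", "15.02", "18.25", "18.54", "16.52", "17.50"),
  stringsAsFactors = TRUE   # mimic factor columns
)

# Go through as.character() first: as.numeric() on a factor gives level codes
price <- as.numeric(as.character(shares$Price))
mean(price)
sd(price)

# Assumption: "12:00:00:025" means 12:00:00 plus 0.025 seconds
dotted <- sub(":([0-9]+)$", ".\\1", trimws(as.character(shares$timeStamp)))
secs   <- as.numeric(as.POSIXct(strptime(dotted, "%H:%M:%OS", tz = "UTC")))

# Price change per second between consecutive observations
rate <- diff(price) / diff(secs)
rate
```

With only seven observations spaced a few milliseconds apart, the per-second rates will be large and noisy, so they are best read as an illustration of the calculation rather than a meaningful statistic.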
+2  A: 

For more complex XML data, it might be useful to query the document with XPath through the XML package.

library(XML)

check <- xmlInternalTreeParse("/PathToXMLFile/checkXML.xml")
xpathSApply(check, "//timeStamp", xmlValue)
## [1] " 12:00:00:01"  " 12:00:00:02"  " 12:00:00:025" " 12:00:00:031"
## [5] " 12:00:00:039" " 12:00:00:050" " 12:00:01:01" 
apeescape
My last update also used `XML` via the easier-to-use `xmlToDataFrame()` function.
Dirk Eddelbuettel