I have a bunch of .xml files with nodes that are causing unnecessary complications. I would like to remove these nodes but ensure that their children are preserved (not the hierarchical structure, but the data). Eventually I want to take the data from each .xml file and build a dataframe. It seems like xmlTreeParse along with xmlToList will help, but the latter only works well with a flat structure. I have played around with unlisting the output from xmlToList and then converting it to a dataframe, but the output is a bit funky.
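Roughly, the sort of thing I have been playing with looks like this (the file name here is just a placeholder):

library(XML)
doc <- xmlTreeParse("survey1.xml", useInternalNodes = TRUE)
lst <- xmlToList(xmlRoot(doc))                                    # nested list mirroring the XML tree
df  <- as.data.frame(t(unlist(lst)), stringsAsFactors = FALSE)    # names become long dotted paths, so the result is messy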

I thought about simply writing a function to go through all the files and delete the tags I don't want, but I don't know how to do this in R.

Any suggestions?

A: 

Hi, see if this is what you are looking for. You can use the XML package from CRAN to parse the XML documents, and the following tactic to get only the <poop> tags:

library(XML)                                           # CRAN "XML" package
me <- xmlTreeParse(filename, useInternalNodes = TRUE)  # internal nodes are needed for XPath queries
pooptags <- xpathApply(me, "//poop")                   # list of all <poop> nodes

pooptags will contain the following information:

<poop>
  <P3a_Village1>dzemeni</P3a_Village1>
  <P4_HousholdNumber/>
  <P5_VisitNumber>2</P5_VisitNumber>
</poop> 

You can paste this together with the <?xml version='1.0' ?> declaration using the paste command in R and write it out to a truncated file, or you can extract individual fields such as P3a_Village1 from the XML using xpathApply, like this:

village <- xpathApply(me, "//poop/P3a_Village1")
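If the end goal is a flat data frame rather than the XML itself, one way to continue from there (just a sketch: saveXML(), xpathSApply() and xmlValue() all come from the XML package, the file name is made up, and it assumes a single <poop> node per file) would be:

# write just the <poop> node back out, preceded by an XML declaration
newdoc <- paste("<?xml version='1.0' ?>", saveXML(pooptags[[1]]), sep = "\n")
writeLines(newdoc, "poop_only.xml")

# or pull the text values straight into a data frame
df <- data.frame(
  village = xpathSApply(me, "//poop/P3a_Village1", xmlValue),
  visit   = xpathSApply(me, "//poop/P5_VisitNumber", xmlValue),
  stringsAsFactors = FALSE
)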

I hope the solution is what you are looking for. Please let me know if it helps.

Neo_Me
Thanks for the help. I think this would be a looong way to do it, so I decided to use an XSLT script. Oh well...
scottyaz
+3  A: 

It's simple to do in XSLT. Add this template to the identity transform (the stylesheet that copies every node and attribute through unchanged):

<xsl:template match="poop">
   <xsl:apply-templates select="node()"/>
</xsl:template>
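If you'd rather drive the transform from R than from a standalone XSLT processor, one possibility (a sketch assuming the xml2 and xslt packages from CRAN, where strip_poop.xsl is a hypothetical stylesheet containing the identity transform plus the template above) is:

library(xml2)
library(xslt)

doc   <- read_xml("input.xml")        # the file to clean up
style <- read_xml("strip_poop.xsl")   # identity transform + the <poop> template
out   <- xml_xslt(doc, style)         # apply the stylesheet
write_xml(out, "output.xml")          # write the flattened result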

Using regular expressions on XML hastens the coming of the Elder Gods and is not recommended.

Robert Rossney