Sometimes I need to quickly extract some arbitrary data from XML files and put it into CSV format. What are your best practices for doing this in the Unix terminal? I would love some code examples; for instance, how would you solve the following problem?

Example XML input:

<root>
<myel name="Foo" />
<myel name="Bar" />
</root>

My desired CSV output:

Foo,
Bar,
+2  A: 

Use a command-line XSLT processor such as xsltproc, saxon or xalan to parse the XML and generate CSV. For your case, the stylesheet would be:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="text"/>

 <xsl:template match="root">
  <xsl:apply-templates select="myel"/>
 </xsl:template>

 <xsl:template match="myel">
  <xsl:for-each select="@*">
   <xsl:value-of select="."/>
   <xsl:value-of select="','"/>
  </xsl:for-each>
  <xsl:text>&#10;</xsl:text>
 </xsl:template> 
</xsl:stylesheet>
Peter Hilton
+1  A: 

If you just want the name attributes of any element, here is a quick but incomplete solution.

(This assumes your example text is in a file named example.)

grep "name" example | cut -d"\"" -f2,2 | xargs -I{} echo "{},"

+1  A: 

Awk? I've never tried it for XML myself, but I've used it for a ton of other jobs. It seems like it should be up to the task.

I found this nice tutorial on processing XML with GAWK
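
As a rough illustration of the awk idea for the question's sample input, here is a minimal sketch. It is not a real XML parser: it assumes each myel element sits on one line and that name is its first attribute. The function name myel_names is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper (not from the tutorial): split each line on double
# quotes and print the first quoted value, followed by a comma.
# Assumes one <myel .../> per line with name as the first attribute.
myel_names() {
  awk -F'"' '/<myel / && NF >= 3 { print $2 "," }' "$@"
}

# Demo on the sample input from the question (read from stdin).
printf '%s\n' '<root>' '<myel name="Foo" />' '<myel name="Bar" />' '</root>' | myel_names
```

Pass a file name instead of piping, e.g. myel_names example, to run it on a file. For anything beyond simple, line-oriented XML, a real XML tool is the safer choice.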

Mark Renouf
+3  A: 

Peter's answer is correct, but it outputs a trailing line feed. This version omits it:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text"/>
  <xsl:template match="root">
    <xsl:for-each select="myel">
      <xsl:value-of select="@name"/>
      <xsl:text>,</xsl:text>
      <xsl:if test="not(position() = last())">
        <xsl:text>&#xA;</xsl:text>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

Just run e.g.

xsltproc stylesheet.xsl source.xml

to generate the CSV results into standard output.

jelovirt
A: 

Here's a little Ruby script that does exactly what your question asks (pull an attribute called 'name' out of elements called 'myel'). It should be easy to generalize.

#!/usr/bin/ruby -w

require 'rexml/document'

xml = REXML::Document.new(File.open(ARGV[0].to_s))
xml.elements.each("//myel") { |el| puts "#{el.attributes['name']}," if el.attributes['name'] }
AndrewR
+2  A: 

XMLStarlet is a command-line toolkit to query, edit, check and transform XML documents (for more information see http://xmlstar.sourceforge.net/).

No files to write, just pipe your file to xmlstarlet and apply an xpath filter.

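cat file.xml | xml sel -t -m 'xpathExpression' -v 'elemName' -o 'literal' -v 'elname' -n

Here -m matches an XPath expression, -v prints a value, -o includes a literal string, and -n emits a newline. On some systems the binary is named xmlstarlet rather than xml.

For your example, match //myel and print @name, which yields the two attribute values:

xml sel -t -m '//myel' -v '@name' -o ',' -n file.xml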

Very handy tool.

HTH

DaveP
A: 

Assuming your test file is test.xml (GNU sed):

sed -n 's/^\s*<myel\s*name="\([^"]*\)".*$/\1,/p' test.xml

It has its pitfalls. For example, if it is not guaranteed that each myel is on one line, you have to "normalize" the XML file first (so that each myel is on its own line).
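
One possible way to do that normalization is sketched below. It joins all lines and then starts a new line after every ">", so each tag ends up on its own line before sed sees it. This is only a sketch: it assumes no ">" appears inside attribute values, text, comments or CDATA, and the function names normalize_xml and myel_names are made up for this example. The sed pattern uses POSIX character classes instead of \s so it does not depend on GNU sed.

```shell
#!/bin/sh
# Hypothetical helper: put every tag on its own line.
# Sketch only: assumes ">" never occurs outside of tag delimiters.
normalize_xml() {
  tr '\n' ' ' | awk 'BEGIN { RS = ">" } NF { sub(/^[ \t]+/, ""); print $0 ">" }'
}

# Extract the name attribute of each myel element from normalized input.
myel_names() {
  normalize_xml | sed -n 's/^<myel[[:space:]]*name="\([^"]*\)".*$/\1,/p'
}

# Demo: the second myel element is split across two lines.
printf '%s\n' '<root><myel name="Foo" />' '<myel' '  name="Bar" /></root>' | myel_names
```

Usage on a file would be myel_names < test.xml. Like the one-liner above, this still expects name to be the first attribute of myel.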