In shell scripting, data typically lives in files of single-line records, such as CSV, which are simple to handle with grep and sed. But I often have to deal with XML, so I'd really like a way to script access to XML data from the command line. What are the best tools?

+5  A: 

At the moment, the best solution I've found is hpricot, which provides XPath and CSS selectors and a DOM. But it's only available in Ruby, so I can't easily use it in a shell script.

EDIT I've found some more promising tools:

  • fxgrep: Uses its own XPath-like syntax to query documents. Written in SML, so installation may be difficult.

  • LT XML: XML toolkit derived from SGML tools, including sggrep, sgsort, xmlnorm and others. Uses its own query syntax. The documentation is very formal. Written in C. LT XML 2 claims support of XPath, XInclude and other W3C standards.

  • xmlgrep2: simple and powerful searching with XPath. Written in Perl using XML::LibXML and libxml2.

  • XQSharp: Supports XQuery, the extension to XPath. Written for the .NET Framework.

  • xml-coreutils: Laird Breyer's toolkit equivalent to GNU coreutils. Discussed in an interesting essay on what the ideal toolkit should include.

  • xmldiff: Simple tool for comparing two xml files.

I haven't had a chance to try any of these, but xml-coreutils seems the best documented and the most Unix-oriented.

FURTHER EDIT

I've removed xmltk from this list. It doesn't seem to have a package in Debian, Ubuntu, Fedora, or MacPorts. It also hasn't had a release since 2007, and it uses non-portable build automation. I can't recommend it unless it becomes more portable.

Joseph Holsten
Couldn't you create a wrapper script for the Ruby program, and pass in the arguments' array in the script to hpricot? E.g., in a PHP shell script, something like the following should work: <?php /path/to/hpricot $argv ?>
alastairs
A: 

jEdit has a plugin called "XQuery" which provides querying functionality for XML documents.

Not quite the command line, but it works!

Ben
A: 

Decide on what operations you want to do on XML files and create a script (in Python, Perl perhaps) that exposes that functionality through arguments for shell scripts to use.
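
As a rough illustration of this approach, here is a minimal sketch of such a wrapper in Python, using only the standard library. The script name (xmlget), the `extract` helper and the `--attr` option are all hypothetical, and the path argument is limited to the XPath subset that `xml.etree.ElementTree` supports, so treat it as a starting point rather than a finished tool.

```python
#!/usr/bin/env python3
"""xmlget (hypothetical name): print values of XML elements read from stdin."""
import argparse
import sys
import xml.etree.ElementTree as ET


def extract(xml_text, path, attr=None):
    """Return one string per element matching path.

    Each string is the element's stripped text, or the value of attr when
    attr is given. Elements lacking that attribute are skipped.
    """
    root = ET.fromstring(xml_text)
    values = []
    for elem in root.findall(path):
        value = elem.get(attr) if attr else (elem.text or "").strip()
        if value is not None:
            values.append(value)
    return values


def main(argv=None):
    argv = sys.argv[1:] if argv is None else argv
    parser = argparse.ArgumentParser(description="Extract values from XML on stdin.")
    parser.add_argument("path", help="ElementTree path, e.g. './/item'")
    parser.add_argument("--attr", help="print this attribute instead of the element text")
    if not argv:
        # No arguments: show usage instead of blocking on stdin.
        parser.print_help()
        return
    args = parser.parse_args(argv)
    # One value per line, so the output feeds straight into grep, sed, sort, etc.
    for value in extract(sys.stdin.read(), args.path, args.attr):
        print(value)


if __name__ == "__main__":
    main()
```

Saved as an executable file, it could be used in a pipeline along the lines of `xmlget './/item' --attr id < data.xml | sort`, where data.xml stands for whatever file you are processing.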

ΤΖΩΤΖΙΟΥ
+15  A: 

I've found xmlstarlet to be pretty good at this sort of thing.

http://xmlstar.sourceforge.net/

Should be available in most distro repositories, too. An introductory tutorial is here:

http://www.ibm.com/developerworks/library/x-starlet.html

Russ
+1 I have done some amazingly powerful stuff with xmlstarlet. Used with the standard mix of Unix streams you can do a lot.
Elijah
A superb tool. I've used this to clean up tens of gigabytes of data stored in translation memories. It took a few days, but it gets the job done. Performance was not a requirement.
IanGilham
+2  A: 

Depends on exactly what you want to do.

XSLT may be the way to go, but there is a learning curve. Try xsltproc, and note that you can pass parameters to the stylesheet from the command line.

Adrian Mouat
+1  A: 

XQuery might be a good solution. It is (relatively) easy to learn and is a W3C standard.

I would recommend XQSharp for a command line processor.

Oliver Hallam
+3  A: 

To Joseph Holsten's excellent list, I would add the xpath command-line script, which comes with the Perl library XML::XPath. It is a great way to extract information from XML files:

 xpath -q -e '/entry[@xml:lang="fr"]' *xml
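
For comparison, a similar selection can be sketched in Python with the standard library. This is not what the xpath script does internally. The function name `entries_in_language` is hypothetical, and unlike the absolute path in the command above, this sketch matches `entry` elements at any depth.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the reserved "xml:" prefix to this namespace URI,
# so xml:lang has to be looked up under its fully qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def entries_in_language(xml_text, lang, tag="entry"):
    """Return the serialized <tag> elements whose xml:lang equals lang."""
    root = ET.fromstring(xml_text)
    return [
        ET.tostring(elem, encoding="unicode").strip()
        for elem in root.iter(tag)  # iter() also visits the root itself
        if elem.get(XML_LANG) == lang
    ]


if __name__ == "__main__":
    sample = (
        '<dict>'
        '<entry xml:lang="fr">bonjour</entry>'
        '<entry xml:lang="en">hello</entry>'
        '</dict>'
    )
    for entry in entries_in_language(sample, "fr"):
        print(entry)
```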
bortzmeyer
+2  A: 

There is also the xml2 / 2xml pair. It converts XML to a line-oriented format and back, which lets the usual string-editing tools process XML.

Example. q.xml:

<?xml version="1.0"?>
<foo>
    text
    more text
    <textnode>ddd</textnode><textnode a="bv">dsss</textnode>
    <![CDATA[ asfdasdsa <foo> sdfsdfdsf <bar> ]]>
</foo>

xml2 < q.xml

/foo=
/foo=   text
/foo=   more text
/foo=   
/foo/textnode=ddd
/foo/textnode
/foo/textnode/@a=bv
/foo/textnode=dsss
/foo=
/foo=    asfdasdsa <foo> sdfsdfdsf <bar> 
/foo=

xml2 < q.xml | grep textnode | sed 's!/foo!/bar/baz!' | 2xml

<bar><baz><textnode>ddd</textnode><textnode a="bv">dsss</textnode></baz></bar>

P.S. There are also html2 / 2html.
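
To make the idea of a flat "full path" representation concrete, here is a small Python sketch that produces lines in roughly the same style. It is only an approximation written for illustration, not a reimplementation of xml2: the `flatten` function is hypothetical, it ignores text that follows a child element, and it does not print the bare separator line (such as /foo/textnode above) that xml2 emits between repeated siblings.

```python
import xml.etree.ElementTree as ET


def flatten(elem, prefix=""):
    """Return 'path=value' lines for elem and all of its descendants."""
    path = f"{prefix}/{elem.tag}"
    # Attributes first, written as path/@name=value.
    lines = [f"{path}/@{name}={value}" for name, value in elem.attrib.items()]
    # Then the element's own leading text, if it has any.
    text = (elem.text or "").strip()
    if text:
        lines.append(f"{path}={text}")
    # Then each child, carrying the full path down.
    for child in elem:
        lines.extend(flatten(child, path))
    return lines


if __name__ == "__main__":
    root = ET.fromstring(
        '<foo><textnode>ddd</textnode><textnode a="bv">dsss</textnode></foo>'
    )
    # Every line carries its full path, so a plain substring test
    # plays the role of grep.
    for line in flatten(root):
        if "textnode" in line:
            print(line)
```

Because each line stands on its own, filtering or rewriting paths becomes ordinary string processing, which is the property the grep and sed pipeline above relies on.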

Vi
Are you talking about this xml2? http://www.ofb.net/~egnor/xml2/
Joseph Holsten
@Joseph Holsten Yes. It allows hacking with XML without thinking through XPath things.
Vi
Nice! I had been focusing on tools that don't use an intermediate format, but the idea of a high-fidelity, line-oriented representation of XML seems like a great way to keep using real grep and sed. Have you tried pyxie? How does it compare? Any other line-oriented representations? Would you consider this better than just replacing XML newlines with an entity ()? This would let you stick records on the same line at least. Oh, and could you edit your post to include a link to the project?
Joseph Holsten
@Joseph Holsten No, I don't think the pyxie format would be more useful than the xml2 format. xml2 provides the "full path" of nested XML elements, which allows more line-oriented matching and substitution. Also, `2xml` can easily recreate XML from partial (filtered) `xml2` output.
Vi