I have an XML file with the following data format:

<net NetName="abc" attr1="123" attr2="234" attr3="345".../>
<net NetName="cde" attr1="456" attr2="567" attr3="678".../>
....

Can anyone tell me how I could data mine the XML file using an awk one-liner? For example, if I ask for attr3 of abc, it should return 345.

+6  A: 

In general, you don't. XML/HTML parsing is hard enough without trying to do it concisely, and while you may be able to hack together a solution that succeeds with a limited subset of XML, eventually it will break.

Besides, there are many great languages with great XML parsers already written, so why not use one of them and make your life easier?

I don't know whether or not there's an XML parser built for awk, but I'm afraid that if you want to parse XML with awk you're going to get a lot of "hammers are for nails, screwdrivers are for screws" answers. I'm sure it can be done, but it's probably going to be easier for you to write something quick in Perl that uses XML::Simple (my personal favorite) or some other XML parsing module.

Just for completeness, I'd like to note that if your snippet is an example of the entire file, it is not well-formed XML. A well-formed XML document needs exactly one root element, with start and end tags wrapping everything else, like so:

<netlist>
  <net NetName="abc" attr1="123" attr2="234" attr3="345".../>
  <net NetName="cde" attr1="456" attr2="567" attr3="678".../>
  ....
</netlist>

I'm sure XML without a root element has its uses, but conforming XML parsers will reject it, so unless you're dead set on using an awk one-liner to try to half-ass "parse" your "XML," you may want to consider adding one.
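
To see the root-element problem concretely, here is a minimal sketch using Python's standard-library parser (not the Perl module discussed here). The sample data is hypothetical, modeled on the question's snippet with the "..." elisions dropped, and is_well_formed is just an illustrative helper name:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data modeled on the question's snippet.
FRAGMENT = (
    '<net NetName="abc" attr1="123" attr2="234" attr3="345"/>\n'
    '<net NetName="cde" attr1="456" attr2="567" attr3="678"/>'
)

def is_well_formed(text):
    """Return True if the parser accepts the text as a single XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Wrapping the fragment in one root element is all it takes.
WRAPPED = "<netlist>\n" + FRAGMENT + "\n</netlist>"

print(is_well_formed(FRAGMENT))  # prints False: two top-level elements
print(is_well_formed(WRAPPED))   # prints True
```

The bare fragment fails because the parser finds a second top-level element after the first one has closed.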

In response to your edits, I still won't do it as a one-liner, but here's a Perl script that you can use:

#!/usr/bin/perl

use strict;
use warnings;
use XML::Simple;

sub usage {
  die "Usage: $0 NetName [attr]\n";
}

# Fold the <net> elements into a hash keyed by their NetName attribute.
my $file = XMLin("file.xml", KeyAttr => { net => 'NetName' });

usage() if @ARGV == 0;

exists $file->{net}{$ARGV[0]}
  or die "$ARGV[0] does not exist.\n";

if(@ARGV == 2) {
  # Two arguments: print a single attribute of the given net.
  exists $file->{net}{$ARGV[0]}{$ARGV[1]}
    or die "NetName $ARGV[0] does not have attribute $ARGV[1].\n";
  print "$file->{net}{$ARGV[0]}{$ARGV[1]}\n";

} elsif(@ARGV == 1) {
  # One argument: list every attribute of the given net.
  print "$ARGV[0]:\n";
  print "  $_ = $file->{net}{$ARGV[0]}{$_}\n"
    for keys %{ $file->{net}{$ARGV[0]} };

} else {
  usage();
}

Run this script from the command line with 1 or 2 arguments. The first argument is the 'NetName' you want to look up, and the second is the attribute you want to look up. If no attribute is given, it should just list all the attributes for that 'NetName'.

Chris Lutz
My bad, I forgot to put in the complete file format.
It's cool. I was just checking to make sure the code you posted was just a snippet and not your full file.
Chris Lutz
I might have underestimated the difficulty of this, as I thought it could be done in a one-liner... :)
It probably can be done as a one-liner, but it'd be one hell of a one-liner. Quick-and-dirty one-liner (untested, fill in the NETNAME and ATTRIBUTE you want): perl -MXML::Simple -e 'undef $/; $ref = XMLin <>, KeyAttr=>{net=>"NetName"}; print "$ref->{net}{NETNAME}{ATTRIBUTE}\n";' < file.xml
Chris Lutz
+7  A: 

I have written a tool called xml_grep2, based on XML::LibXML, the Perl interface to libxml2.

You would find the value you're looking for by doing this:

xml_grep2 -t '//net[@NetName="abc"]/@attr3' to_grep.xml

The tool can be found at http://xmltwig.com/tool/
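
If xml_grep2 is not installed, roughly the same lookup can be sketched with Python's standard library. This is only an approximation of the XPath query above, not how xml_grep2 works internally: ElementTree supports just a subset of XPath, so the element is selected by attribute value and the attribute is then read separately. The sample document and the lookup helper are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document modeled on the question's data.
DOC = """<netlist>
  <net NetName="abc" attr1="123" attr2="234" attr3="345"/>
  <net NetName="cde" attr1="456" attr2="567" attr3="678"/>
</netlist>"""

def lookup(xml_text, net_name, attr):
    """Return the attribute value of the named net, or None if either is missing."""
    root = ET.fromstring(xml_text)
    # Select the <net> element whose NetName matches. Assumes net_name
    # contains no single quote, since it is pasted into the path as-is.
    node = root.find(f".//net[@NetName='{net_name}']")
    return None if node is None else node.get(attr)

print(lookup(DOC, "abc", "attr3"))  # prints 345
```

A missing net or a missing attribute both come back as None rather than raising an error.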

mirod
That is nice. I will check it out.
Alan Haggai Alavi
+4  A: 

xmlgawk can parse XML very easily.

$ xgawk -lxml 'XMLATTR["NetName"]=="abc"{print XMLATTR["attr3"]}' test.xml

This one-liner parses the XML and prints "345".

Hirofumi Saito
That looks quite nifty.
Chris Lutz
This is great... but too bad my company's Linux machines don't have xmlgawk installed.
+2  A: 

If you do not have xmlgawk and your XML format is fixed, plain awk will do.

$ nawk -F '[ ="]+' '/abc/{for(i=1;i<=NF;i++){if($i=="attr3"){print $(i+1)}}}' test.xml

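This script returns "345". But I think it is very dangerous, because plain awk does not understand XML: the /abc/ pattern matches that string anywhere on the line, and the field splitting only works while every attribute is written exactly in this layout.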
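
To make the danger concrete, here is a rough Python imitation of what the awk script does (substring match on the line, then split on the same [ ="]+ separator). It is a sketch, not an exact port of awk's field handling, and the second input line is a hypothetical net whose name merely contains "abc":

```python
import re

# Hypothetical input: the second net's name only contains "abc".
LINES = [
    '<net NetName="abc" attr1="123" attr2="234" attr3="345"/>',
    '<net NetName="abcd" attr1="111" attr2="222" attr3="999"/>',
]

def naive_lookup(lines, pattern, attr):
    """Imitate the awk script: match the line by substring, then split into fields."""
    hits = []
    for line in lines:
        if pattern not in line:  # like awk's /abc/ for a pattern without metacharacters
            continue
        fields = re.split(r'[ ="]+', line)
        # The value is taken to be whatever field follows the attribute name.
        for i, field in enumerate(fields[:-1]):
            if field == attr:
                hits.append(fields[i + 1])
    return hits

print(naive_lookup(LINES, "abc", "attr3"))  # prints ['345', '999']
```

The second value is a false positive: the "abcd" line matches because the text-based approach has no notion of which attribute the string "abc" belongs to.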

Hirofumi Saito