ansaurus

Question

How can I mine an XML document with awk, Perl, or Python?

Answer 1

+6 A:

In general, you don't. XML/HTML parsing is hard enough without trying to do it concisely, and while you may be able to hack together a solution that succeeds with a limited subset of XML, eventually it will break.

Besides, there are many great languages with great XML parsers already written, so why not use one of them and make your life easier?

I don't know whether or not there's an XML parser built for awk, but I'm afraid that if you want to parse XML with awk you're going to get a lot of "hammers are for nails, screwdrivers are for screws" answers. I'm sure it can be done, but it's probably going to be easier for you to write something quick in Perl that uses XML::Simple (my personal favorite) or some other XML parsing module.

Just for completeness, I'd like to note that if your snippet is an example of the entire file, it is not well-formed XML. A well-formed document needs a single root element whose start and end tags wrap everything else, like so:

<netlist>
  <net NetName="abc" attr1="123" attr2="234" attr3="345".../>
  <net NetName="cde" attr1="456" attr2="567" attr3="678".../>
  ....
</netlist>

I'm sure malformed XML has its uses, but a conforming XML parser will refuse to read it, so unless you're dead set on using an awk one-liner to try to half-ass "parse" your "XML," you may want to consider making your XML well-formed.
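To see how a parser reacts to a file without a root element, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The sample strings and the `is_well_formed` helper are made up for illustration, and the elided attributes from the snippet are simply left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data modeled on the snippet above.
FRAGMENT = (
    '<net NetName="abc" attr1="123" attr2="234" attr3="345"/>\n'
    '<net NetName="cde" attr1="456" attr2="567" attr3="678"/>\n'
)
# The same elements wrapped in a single root element.
WRAPPED = "<netlist>\n" + FRAGMENT + "</netlist>\n"


def is_well_formed(text):
    """Return True if the parser accepts text as one XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print("fragment:", is_well_formed(FRAGMENT))
    print("wrapped: ", is_well_formed(WRAPPED))
```

The bare fragment is rejected because it has two top-level elements, while the wrapped version parses.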

In response to your edits, I still won't do it as a one-liner, but here's a Perl script that you can use:

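#!/usr/bin/perl

use strict;
use warnings;
use XML::Simple;

sub usage {
  die "Usage: $0 NetName [attr]\n";
}

# Check the arguments before doing any parsing.
usage() unless @ARGV == 1 || @ARGV == 2;

my ($name, $attr) = @ARGV;

# Key the <net> elements by their NetName attribute. ForceArray keeps the
# folding consistent even when the file contains only one <net> element.
my $file = XMLin("file.xml",
                 KeyAttr    => { net => 'NetName' },
                 ForceArray => ['net']);

exists $file->{net}{$name}
  or die "$name does not exist.\n";

if (defined $attr) {
  # Two arguments: print a single attribute value.
  exists $file->{net}{$name}{$attr}
    or die "NetName $name does not have attribute $attr.\n";
  print "$file->{net}{$name}{$attr}\n";

} else {
  # One argument: list every attribute of that NetName.
  print "$name:\n";
  print "  $_ = $file->{net}{$name}{$_}\n"
    for sort keys %{ $file->{net}{$name} };
}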

Run this script from the command line with 1 or 2 arguments. The first argument is the 'NetName' you want to look up, and the second is the attribute you want to look up. If no attribute is given, it should just list all the attributes for that 'NetName'.
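Since the question also mentions Python, here is a rough counterpart as a sketch, not a drop-in replacement. It uses only the standard library's xml.etree.ElementTree; the function names `index_nets` and `lookup`, and the inline sample data, are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; in practice you would read your own file.
SAMPLE = """<netlist>
  <net NetName="abc" attr1="123" attr2="234" attr3="345"/>
  <net NetName="cde" attr1="456" attr2="567" attr3="678"/>
</netlist>"""


def index_nets(xml_text):
    """Map each NetName to a dict of that <net> element's other attributes."""
    root = ET.fromstring(xml_text)
    return {
        net.get("NetName"): {k: v for k, v in net.attrib.items() if k != "NetName"}
        for net in root.iter("net")
    }


def lookup(nets, name, attr=None):
    """Return one attribute value, or all attributes when attr is None."""
    if name not in nets:
        raise LookupError(f"{name} does not exist.")
    if attr is None:
        return nets[name]
    if attr not in nets[name]:
        raise LookupError(f"NetName {name} does not have attribute {attr}.")
    return nets[name][attr]


if __name__ == "__main__":
    nets = index_nets(SAMPLE)
    # Two-argument case: a single attribute value.
    print(lookup(nets, "abc", "attr3"))
    # One-argument case: every attribute of that NetName.
    print("cde:")
    for key, value in lookup(nets, "cde").items():
        print(f"  {key} = {value}")
```

To use it on a real file, read the file's contents and pass them to `index_nets` instead of `SAMPLE`.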

Chris Lutz 2009-05-26 05:47:28

My bad, I forgot to put in the complete file format.

2009-05-26 05:52:36

It's cool. I was just checking to make sure the code you posted was just a snippet and not your full file.

Chris Lutz 2009-05-26 05:53:24

I might have underestimated the difficulty of this, as I thought it could be done in a one-liner. :)

2009-05-26 08:10:01

It probably can be done as a one-liner, but it'd be one hell of a one-liner. Quick-and-dirty one-liner (untested, fill in the NETNAME and ATTRIBUTE you want): perl -MXML::Simple -e 'undef $/; $ref = XMLin(scalar(<>), KeyAttr=>{net=>"NetName"}); print "$ref->{net}{NETNAME}{ATTRIBUTE}\n";' < file.xml

Chris Lutz 2009-05-26 08:31:57

Answer 2

+7 A:

I have written a tool called xml_grep2, based on XML::LibXML, the Perl interface to libxml2.

You would find the value you're looking for by doing this:

xml_grep2 -t '//net[@NetName="abc"]/@attr3' to_grep.xml

The tool can be found at http://xmltwig.com/tool/
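If installing a tool is not an option, a similar query can be sketched in Python. Note that the standard library's xml.etree.ElementTree supports only a subset of XPath: it can filter elements with a predicate such as `[@NetName='abc']`, but it cannot select the attribute itself with `/@attr3`, so the value is read with `get()` afterwards. The `attr_values` helper and the sample data below are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data modeled on the question's format.
SAMPLE = """<netlist>
  <net NetName="abc" attr1="123" attr2="234" attr3="345"/>
  <net NetName="cde" attr1="456" attr2="567" attr3="678"/>
</netlist>"""


def attr_values(xml_text, net_name, attr):
    """Return attr for every <net> whose NetName equals net_name."""
    if "'" in net_name:
        # The predicate below quotes the name with single quotes.
        raise ValueError("net_name must not contain a single quote")
    root = ET.fromstring(xml_text)
    matches = root.findall(f".//net[@NetName='{net_name}']")
    # Skip matching elements that lack the requested attribute.
    return [net.get(attr) for net in matches if net.get(attr) is not None]


if __name__ == "__main__":
    print(attr_values(SAMPLE, "abc", "attr3"))
```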

mirod 2009-05-26 07:23:43

That is nice. I will check it out.

Alan Haggai Alavi 2009-05-26 08:34:47

Answer 3

+4 A:

xmlgawk can handle XML very easily.

$ xgawk -lxml 'XMLATTR["NetName"]=="abc"{print XMLATTR["attr3"]}' test.xml

This one-liner parses the XML and prints "345".

Hirofumi Saito 2009-05-26 12:53:00

That looks quite nifty.

Chris Lutz 2009-05-26 13:03:32

This is great... but too bad my company's Linux doesn't have xmlgawk installed.

2009-05-27 02:27:46

Answer 4

+2 A:

If you do not have xmlgawk and your XML format is fixed, plain awk can do the job.

$ nawk -F '[ ="]+' '/abc/{for(i=1;i<=NF;i++){if($i=="attr3"){print $(i+1)}}}' test.xml

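This script prints "345" for the sample data. But I think it is very dangerous, because plain awk does not understand XML: it only matches text line by line, so it can break as soon as the layout of the file changes.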
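To illustrate the risk, here is a small Python sketch that compares a text-splitting lookup, written to roughly imitate the awk one-liner, with a real parser. The function names and both sample documents are hypothetical; the second sample only moves some attributes onto a new line, which XML allows.

```python
import re
import xml.etree.ElementTree as ET


def text_lookup(xml_text, name, attr):
    """Roughly mimic the awk approach: split matching lines on spaces, equals signs and double quotes."""
    for line in xml_text.splitlines():
        if name in line:
            fields = re.split(r'[ ="]+', line)
            for i, field in enumerate(fields[:-1]):
                if field == attr:
                    return fields[i + 1]
    return None


def parser_lookup(xml_text, name, attr):
    """Find the same value with a real XML parser."""
    for net in ET.fromstring(xml_text).iter("net"):
        if net.get("NetName") == name:
            return net.get(attr)
    return None


# Hypothetical samples: identical data, different line layout.
SAME_LINE = (
    '<netlist>\n'
    '  <net NetName="abc" attr1="123" attr2="234" attr3="345"/>\n'
    '</netlist>'
)
SPLIT_LINES = (
    '<netlist>\n'
    '  <net NetName="abc" attr1="123"\n'
    '       attr2="234" attr3="345"/>\n'
    '</netlist>'
)

if __name__ == "__main__":
    for label, doc in (("same line", SAME_LINE), ("split lines", SPLIT_LINES)):
        print(label, "text:", text_lookup(doc, "abc", "attr3"),
              "parser:", parser_lookup(doc, "abc", "attr3"))
```

Both lookups agree on the first sample, but only the parser still finds the value once the element spans two lines.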

Hirofumi Saito 2009-05-26 13:16:07
