I have a peculiar problem replacing some text in an xml file using awk regex matching.

The XML files are simple. Each one has a paragraph of text in its <narrative> node, and the awk program replaces that text with another paragraph picked from the text file rtxt. But for some reason the text in rtxt labeled '42', which should replace the text in 42.xml, does not produce the proper substitution.

toxml.awk writes to stdout. It first prints the XML as it was read, and then the final result after the replacement.

I actually have a collection of these xml files where I do a replacement with text picked from a longer rtxt. It so happens that this particular replacement (for 42.xml) doesn't work. Instead of the text in the element being replaced, another tag gets nested within the existing one.


toxml.awk

BEGIN{
    srcfile = "rtxt"
    FS = "|"

    while (getline <srcfile) {
        xmlfile = $1 ".xml"
        rep = "<narrative>" $2 "</narrative>"

        ## read in the xml file in one go.
        ## (the last tag will be missing.)
        RS = "</topic>"
        FS = "</topic>"

        getline <xmlfile
        #print $0
        close(xmlfile)

        ## replace
        subs = gsub(/<narrative>.*<\/narrative>/, rep, $0)

        ## append the closing tag
        subs = gsub(/[ \n\r\t]+$/, "\n</topic>", $0)
        print $0

        ## restore them before reading rtxt.
        RS = "\n"
        FS = "|"
    }

    close(srcfile)
}

rtxt

42|Results showing details of training institutes for Java, and IT companies which provide Java solutions are also considered non-relevant. Java is a popular programming language developed at Sun Microsystems. I am interested to know about this programming language, and also to learn programming with it. To be relevant, a result should give information on history of Java & on different versions of Java, and on different concepts in Java. Its good if I find tutorials for learning Java. Results related only to Sun Microsystems but not Java are considered non-relevant. I like to find articles that discuss this programming language and various concepts & versions of it.


42.xml

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE topic SYSTEM "topic.dtd">
<topic id="2009042" ct_no="227">

  <title>sun java</title>

  <castitle>//article[about(.//language, java) or about(.,sun)]//sec[about(.//language, java)]</castitle>

  <phrasetitle>"sun java"</phrasetitle>

  <description>Find information about Sun Microsystem's Java language</description>

  <narrative>Java is a popular programming language developed at Sun Microsystems. I am interested to know about this programming language, and also to learn programming with it.    To be relevant, a result should give information on history of Java &amp; on different versions of Java, and on different concepts in Java. Its good if I find tutorials for learning Java. Results related only to Sun Microsystems but not Java are considered non-relevant. Results showing details of training institutes for Java, and IT companies which provide Java solutions are also considered non-relevant. I like to find articles that discuss this programming language and various concepts &amp; versions of it.  </narrative>

</topic>

A: 

Just a start

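#!/bin/bash

awk 'BEGIN{FS="|"}
# First file (rtxt): map each id to its replacement narrative.
FNR==NR{  nar[$1]=$2; next }
END{
  # ARGV[1] is rtxt; the xml files start at ARGV[2].
  for(i=2;i<ARGC;i++){
     xmlfile=ARGV[i]
     split(xmlfile,fname,".")
     print "Doing file: "xmlfile
     print "---------------------------------"
     while( (getline line < xmlfile ) > 0)  {
         # Plain concatenation, so "&" in the text is not special here.
         if ( line ~ /<narrative>/ ){
            line="<narrative>"nar[fname[1]]"</narrative>"
         }
         print line
     }
     close(xmlfile)
  }
}' rtxt 42.xml 71.xml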
ghostdog74
Modified it. Take a look.
sauparna
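
A likely cause of the nesting described in the question: in the replacement string of sub() and gsub(), an unescaped "&" stands for the text that the regex matched, and the rtxt entry for 42 contains "&" twice. The sketch below first reproduces the symptom with a shortened narrative, then shows one possible way around it that splices the string with match() and substr() instead of gsub(). The helper name replace_narrative and the sample strings are made up for illustration, and the helper assumes the whole <narrative> element sits on one input line, as it does in 42.xml.

```shell
#!/bin/bash

# Reproduce the symptom: an unescaped "&" in the replacement string is
# expanded to the matched text, so the old element ends up nested inside.
echo '<narrative>old text</narrative>' |
  awk '{ gsub(/<narrative>.*<\/narrative>/, "<narrative>A & B</narrative>"); print }'
# prints: <narrative>A <narrative>old text</narrative> B</narrative>

# Hypothetical helper (not from the original post): read XML on stdin and
# replace the <narrative> element with the text passed as the first argument.
replace_narrative() {
  awk -v rep="$1" '
    {
      # match() sets RSTART and RLENGTH for the matched element.
      if (match($0, /<narrative>.*<\/narrative>/)) {
        head = substr($0, 1, RSTART - 1)
        tail = substr($0, RSTART + RLENGTH)
        # Plain concatenation: "&" in rep stays a literal character.
        $0 = head "<narrative>" rep "</narrative>" tail
      }
      print
    }'
}

echo '<narrative>old text</narrative>' | replace_narrative 'A & B'
# prints: <narrative>A & B</narrative>
```

Keeping gsub() and escaping each "&" in rep beforehand should also work. Either way, a raw "&" in the output is not well-formed XML, so the text from rtxt would still need "&" written as "&amp;", as the original 42.xml does. Note that awk -v interprets backslash escapes in the assigned value, which matters only if the replacement text contains backslashes.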