Hi, I have dozens of XML files with
I would need this:

<p begin="00:06:28;12" end="00:00:02;26">

translated into this:

<p begin="628.12" end="631.08">

I know I need a simple awk or sed script to do this, but I'm new to both. Can someone help?

+6  A: 

An XSL stylesheet would be more reliable. You can run one from a shell script.

bmargulies
+1  A: 

I would recommend using Perl (or another scripting language) with an XML parsing module (see here for more details on Perl and XML).

That way you can reliably parse the XML and extract/manipulate the values in a programmatic form. Note the word reliably. Your XML may make use of character encodings that a simple sed/awk wouldn't respect (unlikely in this scenario, admittedly, but it's well worth being aware of such issues).

Brian Agnew
+1  A: 

Here's something to start with. I don't know how you want to add the decimal value, so that part is left for you.

awk '/.*<p[ ]+begin=.*[ ]+end=.*/{
    o=$0
    gsub(/.*begin=\042|\042|>/,"")
    m=split($0,s,"end=")
    gsub(/[:;]/," ",s[1])
    gsub(/[:;]/," ",s[2])
    b=split(s[1],begin," ")
    e=split(s[2],end," ")
    # do date maths here
    if (b>3){
        tbegin=(begin[1]*3600) + (begin[2]*60) + begin[3]  ##"."begin[4]
    }else{
        # mm:ss;ff form: fields are minutes, seconds, frames
        tbegin=(begin[1]*60) + begin[2]  ##"."begin[3]
    }
    # add the decimal yourself
    if(e>3) {
        tend = (end[1]*3600) +(end[2]*60)+end[3]+ tbegin ##"."end[4]
    }else{
        tend = (end[1]*60)+end[2]+ tbegin ##"."end[3]
    }
    string=gensub("(.*begin=\042).*( end=\042)(.*)\042>", "\\1" tbegin "\042\\2" tend"\042>","g",o)
    $0=string
}
{print}
' file

eg

$ cat file
<p begin="00:06:28;12" end="00:00:02;26">
<p begin="00:08:45;12" end="00:00:23;26">
<p begin="08:45;12" end="00:2;26">

$ ./shell.sh
<p begin="388" end="390">
<p begin="525" end="548">
<p begin="525" end="527">

If you are doing anything more complex than this, use a parser.

ghostdog74
+2  A: 

Ah, ghostdog74 beat me to it. However, mine also deals with the ms.

awk '
    function timeToMin(str) {
        time_re = "([0-9][0-9]):([0-9][0-9]):([0-9][0-9]);([0-9][0-9])"

        # Grab all the times in seconds. 
        s_to_s =  gensub(time_re, "\\3", "g", str);
        m_to_s = (gensub(time_re, "\\2", "g", str)+0)*60;
        h_to_s = (gensub(time_re, "\\1", "g", str)+0)*60*60;
        ms     =  gensub(time_re, "\\4", "g", str);

        # Create float.
        time_str = (h_to_s+m_to_s+s_to_s)"."ms;

        # Adding 0 converts the string to a number.
        return time_str+0; 
    }
    function addMins(aS, bS) {
        # Split by decimal point
        split(aS, aP, ".");
        split(bS, bP, ".");

        # Add the seconds and ms.
        min = aP[1]+bP[1];
        ms  = aP[2]+bP[2];
        if (ms > 59) {
            ms = ms-60;
            min++;
        }

        # Return addition.
        return (min"."ms)+0;
    }
    {
        re = "<p begin=\"(.+)\" end=\"(.+)\">";
        if ($0 ~ re) {
            # Pull out the data.
            strip_re = ".*"re".*";
            begin_str = gensub(strip_re, "\\1", "g");
            end_str   = gensub(strip_re, "\\2", "g");

            # Convert.
            begin = timeToMin(begin_str);
            end   = timeToMin(end_str);

            elapsed_end=addMins(begin, end);

            sub(re,"<p begin=\""begin"\" end=\""elapsed_end"\">");
        }

        print $0;
    }
' file
Jamie
If the downvote was for disapproval of using awk, usually I'd agree. However, if it's a one-off script then I can't see any reason not to.
Jamie
This solution doesn't seem to work. If the input is <p begin="19.22" end="03.01">, the output is <p begin="70364.4" end="81383.1">, which is not correct.
Ankur Chauhan
The regex should be time_re = "([0-9][0-9])?:?([0-9][0-9])?:?([0-9][0-9])?;?([0-9][0-9])?", and return time_str+0; should be return time_str; otherwise 32.20 + 5.01 will be taken as 32.2 + 5.01.
Ankur Chauhan
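To illustrate the trailing-zero point in the comment above: adding 0 turns the string into a number, and awk then prints it through its default output format, which drops the trailing zero. This is only a small demonstration; the function name demo_trailing_zero is made up for the example, and printf with "%.2f" is one possible way to keep two digits, not necessarily the fix the answer's author would choose.

```shell
demo_trailing_zero() {
    awk 'BEGIN {
        t = "32.20"
        print t + 0           # numeric value, printed via the default OFMT
        printf "%.2f\n", t    # explicit two-decimal formatting keeps the zero
    }'
}
demo_trailing_zero
```

The first line printed is 32.2 and the second is 32.20.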
This is a gawk script, but I am on a Mac and its awk doesn't recognize the gensub function. Are there any workarounds?
Ankur Chauhan
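One possible workaround, as a minimal sketch rather than a drop-in replacement for the answer above: gensub is a gawk extension, but split, match, substr, sub and sprintf are standard awk and should be available in the awk that ships with macOS. All names here (convert_timecodes, to_frames, fmt, attr, fps) are made up for the example. It assumes 30 frames per second, which is a guess based on the question's own example, where 12 + 26 frames carry over as one second plus 08. It also assumes each <p> tag sits on one line with begin before end, and that end holds a duration to be added to begin. The output is total seconds followed by a two-digit frame count, not the 628.12 style shown in the question.

```shell
convert_timecodes() {
    awk -v fps=30 '
    # Convert "HH:MM:SS;FF" or "MM:SS;FF" to a total frame count.
    # Returns -1 for anything else, so such lines are left alone.
    function to_frames(tc,    parts, n) {
        n = split(tc, parts, /[:;]/)
        if (n == 4) return ((parts[1] * 60 + parts[2]) * 60 + parts[3]) * fps + parts[4]
        if (n == 3) return (parts[1] * 60 + parts[2]) * fps + parts[3]
        return -1
    }
    # Format a frame count as "seconds.FF" with a two-digit frame field.
    function fmt(frames) {
        return sprintf("%d.%02d", int(frames / fps), frames % fps)
    }
    # Return the quoted value of attribute name in line, or "" if absent.
    function attr(line, name,    pat) {
        pat = name "=\"[^\"]*\""
        if (!match(line, pat)) return ""
        return substr(line, RSTART + length(name) + 2, RLENGTH - length(name) - 3)
    }
    /<p[ \t]+begin="/ {
        bf = to_frames(attr($0, "begin"))
        ef = to_frames(attr($0, "end"))
        if (bf >= 0 && ef >= 0) {
            sub(/begin="[^"]*"/, "begin=\"" fmt(bf) "\"")
            sub(/end="[^"]*"/, "end=\"" fmt(bf + ef) "\"")
        }
    }
    { print }
    ' "$@"
}
```

Called as convert_timecodes file, the line <p begin="00:06:28;12" end="00:00:02;26"> comes out as <p begin="388.12" end="391.08">. Lines whose attributes are not in a timecode form, such as begin="19.22", are printed unchanged.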