I have to deal with various input files, each with a number of fields that are arbitrarily arranged but consistently named and labelled with a header line. These files need to be reformatted so that all the desired fields are in a particular order, with irrelevant fields stripped and missing fields accounted for. I was hoping to use AWK to handle this, since it has served me so well when dealing with field-related dilemmata in the past.

After a bit of mucking around, I ended up with something much like the following (writing from memory, untested):

# imagine a perfectly-functional BEGIN {} block here

NR==1 {
  fldname[1] = "first_name"
  fldname[2] = "last_name"
  fldname[3] = "middle_name"
  maxflds = 3

  # this is just a sample -- my real script went through forty-odd fields

  for (i=1;i<=NF;i++)
    for (j=1;j<=maxflds;j++)
      if ($i == fldname[j]) fldpos[j]=i
}

NR!=1 {
  for (j=1;j<=maxflds;j++) {
    if (fldpos[j]) printf "%s",$fldpos[j]
    printf "%s","\t"
  }
  print ""
}

Now this solution works fine. I run it, I get my output exactly how I want it. No complaints there. However, for anything longer than three fields or so (such as the forty-odd fields I had to work with), it's a lot of painfully redundant code which always has and always will bother me. And the thought of having to insert a field somewhere else into that mess makes me shudder.

I die a little inside each time I look at it.

I'm sure there must be a more elegant solution out there. Or, if not, perhaps there is a tool better suited for this sort of task. AWK is awesome in its own domain, but I fear I may be stretching its limits some with this.

Any insight?

A: 

The only suggestion that I can think of is to move the initial array setup into the BEGIN block and read the ordered field names from a separate template file in a loop. Then your awk program consists only of loops with no embedded data. Your external template file would be a simple newline-separated list.

BEGIN {while ((getline < "fieldfile") > 0) fldname[++maxflds] = $0}

You would still read the header line in the same way you are now, of course. However, it occurs to me that you could use an associative array and reduce the nested for loops to a single for loop. Something like (untested):

BEGIN {while ((getline < "fieldfile") > 0) fldname[$0] = ++maxflds}

NR==1 {
    for (i=1;i<=NF;i++) fldpos[i] = fldname[$i]
}
Dennis Williamson
I like the idea of the associative array, but afaik there's no clean way to guarantee order of this array while printing, short of writing out a sort function from scratch (no gawk hence no asort, unfortunately). I'm thinking `fldpos[fldname[$i]] = i` in the header loop could work though, since it gives me an integer key to loop through while printing...
goldPseudo
@goldPseudo: Oops, I wasn't thinking about that. I think your idea would work.
Dennis Williamson
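
Putting the suggestions in this thread together, a minimal sketch might look like the following. It assumes tab-separated input. The file names `fieldfile` and `input.tsv`, the sample data, and the `reorder` shell wrapper are all made up for illustration. It reads the desired field order from the template file, uses the `fldpos[fldname[$i]] = i` mapping from the comments, and sticks to plain awk features (no `asort`).

```shell
# fieldfile: hypothetical template listing the desired output fields, one per line
printf '%s\n' first_name last_name middle_name > fieldfile

# input.tsv: hypothetical tab-separated input with columns in arbitrary order,
# one irrelevant column (age) and no middle_name column at all
printf 'last_name\tage\tfirst_name\n' >  input.tsv
printf 'Doe\t42\tJane\n'              >> input.tsv

reorder() {
  awk -F '\t' '
    BEGIN {
      # Map each desired field name to its output position (1, 2, 3, ...).
      while ((getline < "fieldfile") > 0) fldname[$0] = ++maxflds
      close("fieldfile")
    }
    NR == 1 {
      # For every wanted header name, remember which input column holds it.
      # The "in" test avoids creating empty fldname entries for unwanted names.
      for (i = 1; i <= NF; i++)
        if ($i in fldname) fldpos[fldname[$i]] = i
      next
    }
    {
      # Print the wanted fields in template order, each followed by a tab;
      # a field missing from the input is left empty.
      for (j = 1; j <= maxflds; j++) {
        if (j in fldpos) printf "%s", $(fldpos[j])
        printf "%s", "\t"
      }
      print ""
    }
  ' "$1"
}

reorder input.tsv
```

For the sample input this prints Jane and Doe in that order, followed by an empty middle_name column. Adding, removing or reordering output fields then only means editing `fieldfile`; the awk program itself contains no field names.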