Hello all,

I have some text files containing lines as follows:

07JAN01, -0.247297942769082E+07, -0.467133797284279E+07, 0.355810777473149E+07

07JAN02, -0.247297942405032E+07, -0.467133797586388E+07, 0.355810777517715E+07

07JAN03, -0.247297942584851E+07, -0.467133797727224E+07, 0.355810777627353E+07

. . . .

. . . .

I need to write a script that changes the date format to:

01/01/07, -0.247297942769082E+07, -0.467133797284279E+07, 0.355810777473149E+07

02/01/07, -0.247297942405032E+07, -0.467133797586388E+07, 0.355810777517715E+07

03/01/07, -0.247297942584851E+07, -0.467133797727224E+07, 0.355810777627353E+07

. . . .

. . . .

I was looking for an appropriate sed or grep command to extract only some characters of each line and store them in variables in my script. As I would like to "reorganize" the date, I was thinking of defining three variables which, for the first line, would be:

a=07

b=JAN (I think I need a "case" statement in the script to deal with this?)

c=01

I looked at some grep examples and tons of docs, but nothing really clear turned up. I found something about the cut command, but I'm not sure it's appropriate here.

My other question is about the output: since sed doesn't modify its input, how can I modify the files directly? Is there a way?

Any help would really be appreciated :)

+2  A: 

A bit clunky, but you could do:

sed -e 's/^\(..\)JAN\(..\)/\2\/01\/\1/'
sed -e 's/^\(..\)FEB\(..\)/\2\/02\/\1/'
...

In order to run sed in-place, see the -i commandline option:

sed -i -e ...
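
If running one sed process per month is a concern, the same substitutions can be chained in a single invocation. This is only a sketch: the function name convert_dates is made up for illustration, and | is used as the s delimiter so the slashes in the replacement need no escaping.

```shell
# Hypothetical helper (name is an assumption, not from the thread):
# rewrites a leading YYMONDD date to DD/MM/YY using one sed process.
# Reads stdin, writes stdout.
convert_dates() {
    sed -e 's|^\(..\)JAN\(..\)|\2/01/\1|' \
        -e 's|^\(..\)FEB\(..\)|\2/02/\1|' \
        -e 's|^\(..\)MAR\(..\)|\2/03/\1|' \
        -e 's|^\(..\)APR\(..\)|\2/04/\1|' \
        -e 's|^\(..\)MAY\(..\)|\2/05/\1|' \
        -e 's|^\(..\)JUN\(..\)|\2/06/\1|' \
        -e 's|^\(..\)JUL\(..\)|\2/07/\1|' \
        -e 's|^\(..\)AUG\(..\)|\2/08/\1|' \
        -e 's|^\(..\)SEP\(..\)|\2/09/\1|' \
        -e 's|^\(..\)OCT\(..\)|\2/10/\1|' \
        -e 's|^\(..\)NOV\(..\)|\2/11/\1|' \
        -e 's|^\(..\)DEC\(..\)|\2/12/\1|'
}
```

It can be used as a filter, e.g. convert_dates < input.txt > output.txt (both file names are placeholders). Once one expression has rewritten a line, none of the later ones can match it, so the order of the expressions does not matter.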

Edit

Just to point out that this answers a previous version of the question where AWK was not specified.

Draemon
Thanks ! I'll take a look at this.
Ackheron
I have to ask - why the downvote? I said it was clunky, but it took seconds to write, and it works. The AWK solutions are nice, but more complex.
Draemon
Ehm, sorry, I don't really understand how the votes work . . . I didn't accept your post as the final answer because I wanted to wait, but I don't think I "downvoted" it. If I did, it wasn't intentional. It's another way to solve the issue, so it's helpful! I've "upvoted" it now. :) Thx !
Ackheron
It's clunky because you also execute sed 12 times (once for each month), which makes it inefficient. I think that's why it got downvoted.
ghostdog74
Ackheron: No, you did the right thing - you accepted the best answer - it doesn't downvote unless you specifically click the down arrow. ghostdog74: sure it's clunky (like I said), but I'm sure the performance difference would be negligible in real terms.
Draemon
You are reading the file 12 times. If the file is huge, that would be a performance problem. You can "improve" on it by taking out the extra sed commands and running it just once.
ghostdog74
The OP didn't say the file was huge, or that performance was a primary concern. My solution answers the original question, performs perfectly well for most real-life cases, and overall is simple to understand. You could trade simplicity for performance, but not until you can justify it for the particular scenario in question.
Draemon
A good programmer/coder has to look at all possible scenarios to write good, resilient code. Why wait for things to happen?
ghostdog74
No. A good programmer does not write code for all possible scenarios. A good programmer writes code for all *probable* scenarios, and ensures his code is easy to change/extend when unexpected requirements emerge. Maybe you should have suggested handwritten assembly for ultimate performance? Clarity is way more important.
Draemon
+4  A: 

I don't think grep is the right tool for the job myself. You need something a little more expressive like Perl or awk:

echo '07JAN01, -0.24729E+07, -0.46713E+07, 0.35581E+07
07JAN02, -0.24729E+07, -0.46713E+07, 0.35581E+07
07AUG03, -0.24729E+07, -0.46713E+07, 0.35581E+07' | awk -F, '
{
    yy=substr($1,1,2);
    mm=substr($1,3,3);
    mm=(index(":JAN:FEB:MAR:APR:MAY:JUN:JUL:AUG:SEP:OCT:NOV:DEC",mm)+2)/4;
    dd=substr($1,6,2);
    printf "%02d/%02d/%02d,%s,%s,%s\n",dd,mm,yy,$2,$3,$4
}'

which generates:

01/01/07, -0.24729E+07, -0.46713E+07, 0.35581E+07
02/01/07, -0.24729E+07, -0.46713E+07, 0.35581E+07
03/08/07, -0.24729E+07, -0.46713E+07, 0.35581E+07

Obviously, that's just pumping some test data through a command line awk script. You'd be better off putting that into an actual awk script file and running your input through it.

If datechg.awk contains:

{
    yy=substr($1,1,2);
    mm=substr($1,3,3);
    mm=(index(":JAN:FEB:MAR:APR:MAY:JUN:JUL:AUG:SEP:OCT:NOV:DEC",mm)+2)/4;
    dd=substr($1,6,2);
    printf "%02d/%02d/%02d,%s,%s,%s\n",dd,mm,yy,$2,$3,$4
}

then:

echo '07JAN01, -0.24729E+07, -0.46713E+07, 0.35581E+07
07JAN02, -0.24729E+07, -0.46713E+07, 0.35581E+07
07AUG03, -0.24729E+07, -0.46713E+07, 0.35581E+07' | awk -F, -f datechg.awk

also produces:

01/01/07, -0.24729E+07, -0.46713E+07, 0.35581E+07
02/01/07, -0.24729E+07, -0.46713E+07, 0.35581E+07
03/08/07, -0.24729E+07, -0.46713E+07, 0.35581E+07

The way this works is as follows. Each line is split into fields (-F, sets the field separator to a comma) and we extract and process the relevant parts of field 1 (the date). By this I mean the year and day are reversed and the textual month is turned into a numeric month by searching a string for it and manipulating the index where it was found, so that it falls in the range 1 through 12.

This is the only (relatively) tricky bit and is done with some basic mathematics: the index function simply finds the position within the string of your month (where the first char is 1). So JAN is at position 2, FEB at 6, MAR at 10, ..., DEC at 46 (the set {2, 6, 10, ..., 46}). They're 4 apart so we're going to need to divide by 4 eventually to get consecutive month numbers but first we add 2 so the division will work well. Adding that 2 gives you the set {4, 8, 12, ..., 48}. Then you divide by 4 to get {1, 2, 3, ... 12} and there's your month number:

Text   Pos   +2   /4
----   ---   --   --
JAN      2    4    1
FEB      6    8    2
MAR     10   12    3
APR     14   16    4
MAY     18   20    5
JUN     22   24    6
JUL     26   28    7
AUG     30   32    8
SEP     34   36    9
OCT     38   40   10
NOV     42   44   11
DEC     46   48   12

Then we just output the new information. Obviously, this is likely to barf if you provide bad data but I'm assuming either:

  • the data is good; or
  • you'll add your own error checks.
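
As a rough idea of what such error checks might look like, here is a sketch that validates the first field before converting it. The function name convert_dates_checked, the wording of the message and the decision to pass bad lines through unchanged are all assumptions made for illustration. It also assumes your awk accepts "/dev/stderr" as an output target, which the common implementations do.

```shell
# Hypothetical helper (name and behaviour are assumptions): same conversion
# as the awk script above, but a line whose first field is not a valid
# YYMONDD date is reported on stderr and printed unchanged.
convert_dates_checked() {
    awk -F, '
    {
        mon = substr($1, 3, 3)
        # index() returns 0 when the month name is not in the lookup string
        pos = index(":JAN:FEB:MAR:APR:MAY:JUN:JUL:AUG:SEP:OCT:NOV:DEC", mon)
        if ($1 !~ /^[0-9][0-9][A-Z][A-Z][A-Z][0-9][0-9]$/ || pos == 0) {
            print "line " NR ": unrecognized date: " $1 > "/dev/stderr"
            print          # pass the line through untouched
            next
        }
        printf "%02d/%02d/%02d,%s,%s,%s\n",
            substr($1, 6, 2), (pos + 2) / 4, substr($1, 1, 2), $2, $3, $4
    }'
}
```

The pattern test ensures the field is exactly two digits, three capital letters and two digits, so a stray fragment such as "AN:" can never be looked up. It does not check that the day is valid for the month.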

Regarding modifying the files directly, the time-honored UNIX tradition is to use a shell script to save the current file elsewhere, process it to create a new file, then overwrite the old file with the new file (but not touching the saved file, in case something goes horribly wrong).

I won't make my answer any longer by detailing that, you've probably fallen asleep already :-)
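
For anyone who does want the details, a minimal sketch of that save/process/overwrite sequence could look like the following. The function name convert_file_in_place and the .bak suffix are choices made for this example, and mktemp is assumed to be available (it is on most current systems but is not in every old UNIX).

```shell
# Hypothetical helper (name and ".bak" suffix are assumptions):
# converts the dates in the file "$1" and keeps the original as "$1.bak".
convert_file_in_place() {
    file=$1
    tmp=$(mktemp) || return 1

    # 1. Save the current file elsewhere.
    cp -p "$file" "$file.bak" || { rm -f "$tmp"; return 1; }

    # 2. Process it into a new (temporary) file.
    if awk -F, '
    {
        yy = substr($1, 1, 2)
        mm = substr($1, 3, 3)
        mm = (index(":JAN:FEB:MAR:APR:MAY:JUN:JUL:AUG:SEP:OCT:NOV:DEC", mm) + 2) / 4
        dd = substr($1, 6, 2)
        printf "%02d/%02d/%02d,%s,%s,%s\n", dd, mm, yy, $2, $3, $4
    }' "$file" > "$tmp"
    then
        # 3. Overwrite the old file; the .bak copy is left alone.
        cat "$tmp" > "$file" && rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Writing back with cat rather than mv keeps the original file's permissions and ownership, since mktemp creates its file with restrictive permissions.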

paxdiablo
Thanks a lot ... I've just pasted the code, and it works perfectly. Now I need to study the syntax to understand how it really works ;) Especially this: mm=(index(":JAN:FEB:MAR:APR:MAY:JUN:JUL:AUG:SEP:OCT:NOV:DEC",mm)+2)/4; Thanks a LOT, pax! It's nice to see that some people are still willing to help newbies, with precise and concise answers ;)
Ackheron
@Ackheron: the index simply finds the position within the string of your month (first char is 1). So JAN = 2, FEB = 6, MAR = 10, ..., DEC = 46. Then you add 2 to get 4, 8, 12, ..., 48. Then you divide by 4 to get 1, 2, 3, ... 12. See the update.
paxdiablo
awk is the best. Take any line-based input with fields separated by whitespace and pipe it into awk; you can access each field individually using $1, $2, etc. ($0 is the whole line). E.g. cat myapachelog | awk '{print $10}' to display just the bytes transferred in a single column, or cat myapachelog | awk '{total += $10} END {print total}' to output the total bytes served from the logfile.
Flubba
@Pax: Thanks a lot for the update, it really helps me understand how it works now ! ;) Really clear. And that's smart! @Flubba: Thank you for suggesting AWK as the best, but I guess there must be some cases where AWK struggles to meet the programmer's needs ? Right ? ;)
Ackheron
+1  A: 
awk 'BEGIN{
    OFS=FS=","
    # create table of mapping of months to numbers
    s=split("JAN:FEB:MAR:APR:MAY:JUN:JUL:AUG:SEP:OCT:NOV:DEC",d,":")
    for(o=1;o<=s;o++){
        m=sprintf("%02d",o)   # zero-pad single-digit month numbers
        date[d[o]]=m
    }
}
{
    yr=substr($1,1,2)
    mth=substr($1,3,3)
    day=substr($1,6,2)
    $1=day"/"date[mth]"/"yr    
}1' file
ghostdog74
Thanks for your solution, ghostdog74. I'll give this one a try as well.
Ackheron