views: 17
answers: 1
Given a tab-delimited text file with two columns, "name" and "date":

Sam     20100101
Sam     20100101
Sam     20100102
George  20100101
George  20100101
George  20100102

Can I use awk to add a third column with a partitioned running total, i.e. start a counter at 0 for each distinct combination of columns one and two, and increment it on every occurrence of that combination?

Sam     20100101     1
Sam     20100101     2
Sam     20100102     1
George  20100101     1
George  20100101     2
George  20100102     1

I do this in SQL with:

select row_number() over (partition by column1, column2 order by column10) from table

but I need an alternative - not necessarily awk, it just came to mind - for text files that are not processed through SQL. The files are ~50 GB each, with about 200 million rows.

I use the counter to remove duplicate rows. I am aware that, for my example above, this can be done with:

sort myfile.txt | uniq -u

but my text files have additional columns, omitted from my example for brevity, that may keep otherwise-matching rows from being fully duplicated.

Should I be adding parameters to match only the name and date columns in the

sort myfile.txt | uniq -u

command? Awk? Something else?
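
For instance, would GNU sort's key options alone do it? A sketch of what I have in mind, assuming tab delimiters and that keeping the first row for each name/date pair is acceptable:

# GNU sort: with keys given, -u keeps only the first line of each run that compares equal on those keys
sort -t $'\t' -k1,1 -k2,2 -u myfile.txt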

+3  A: 

If all the lines have identical formatting (including whitespace), something as simple as this will do the trick:

awk '{print $0, ++a[$0]}'

I am not sure about the performance on GB-sized files, since this needs to store a number per unique row. If your data is sorted, you can get away with a fixed amount of storage:

awk '{ if ($0 != prev) n = 0; print $0, ++n; prev = $0 }'
schot
Thanks for this! Not all the lines have identical formatting. Can I switch the ++a[$0] to something that will match on the first two columns only? ++a[$1$2]?
M. Roessler
@M. Roessler Yes, that will work fine.
schot
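
One caveat on the key: plain concatenation like a[$1$2] can collide (fields "ab", "c" and "a", "bc" build the same string), whereas a[$1,$2] joins the fields with awk's SUBSEP character and avoids that. As a sketch of the full de-duplication step, assuming tab-separated input and that one copy of each name/date pair should survive (note that uniq -u would instead drop every line that repeats at all):

# unsorted input: print a row only the first time its name/date pair is seen
awk -F'\t' '++seen[$1,$2] == 1' myfile.txt

# pre-sorted input: same result with constant memory
awk -F'\t' '{ key = $1 SUBSEP $2; if (key != prev) print; prev = key }' myfile.txt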