Given a tab-delimited text file with two columns, "name" and "date":
Sam 20100101
Sam 20100101
Sam 20100102
George 20100101
George 20100101
George 20100102
Can I use awk to add a third column holding a partitioned running count: start a counter for each distinct combination of columns one and two, then increment it on every repeat of that combination?
Sam 20100101 1
Sam 20100101 2
Sam 20100102 1
George 20100101 1
George 20100101 2
George 20100102 1
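The running count above can be produced in one pass with an awk associative array. A minimal sketch, assuming the input is tab-separated and in a file named myfile.txt (both assumptions, adjust to taste):

```shell
# Append a running count per (name, date) pair.
# ++count[key] evaluates to 1 the first time a key is seen
# and increments on every subsequent occurrence.
awk -F'\t' -v OFS='\t' '{ print $0, ++count[$1 FS $2] }' myfile.txt
```

Because awk streams the file line by line and only keeps one array entry per distinct key, this avoids sorting a ~50 GB file; memory use is proportional to the number of distinct (name, date) pairs, not the number of rows.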
In SQL I would do this with:
select row_number() over (partition by column1, column2 order by column10) from table
but I need an alternative (not necessarily awk; it just came to mind) for text files that are not loaded into SQL. The text files are ~50 GB each, with about 200 million rows.
I use the counter to remove duplicate rows. I am aware that, given my example above, this can be done with:
sort myfile.txt | uniq -u
but my text files have additional columns, omitted from my example for brevity, so duplicate name/date pairs may not produce fully duplicated rows.
Should I add parameters to the
sort myfile.txt | uniq -u
command so that it compares only the name and date columns? Use awk? Something else?
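`uniq` compares whole lines; it can skip leading fields with `-f` or limit comparison to the first N characters with `-w`, but it cannot restrict the comparison to exactly the first two columns. An awk hash on just the key columns is a simpler way to express "keep one row per (name, date) pair". A sketch under the same assumptions as above (tab-separated input in myfile.txt):

```shell
# Keep only the first row seen for each (name, date) key.
# seen[key]++ is 0 (false) on the first occurrence, so the
# pattern !seen[...]++ prints a line once and suppresses repeats.
awk -F'\t' '!seen[$1 FS $2]++' myfile.txt
```

Two caveats worth noting: `uniq -u` prints only lines that occur exactly once, discarding every copy of a duplicated line, whereas plain `uniq` (or `sort -u`) keeps one copy; and if you do want a sort-based approach restricted to the key columns, `sort -t $'\t' -k1,1 -k2,2 -u myfile.txt` compares only the named keys when deciding uniqueness, at the cost of sorting the whole file.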