Q:

I have several large text files (CSVs) that have redundant entries on some lines. That is, due to the way they were merged, a certain field will often have the same value two or three times. The duplicates are not always adjacent or in the same order, though. For example:

BWTL, NEWSLETTER, NEWSLETTER
BWTL, NEWSLETTER, R2R, NEWSLETTER
MPWJ, OOTA HOST, OOTA HOST, OOTA HOST
OOTA HOST, ITOS, OOTA HOST

And so on. The entries that are next to each other are easy enough to clean up with sed:

sed -i "" 's/NEWSLETTER, NEWSLETTER/NEWSLETTER/g' *.csv

Is there a similar quick way to fix up the other duplicates?

A: 

You could do something like:

sed -i "" 's/^\(.*NEWSLETTER.*\), NEWSLETTER/\1/g' eNewsletter.csv_new.csv

It works by capturing everything up to the last duplicated NEWSLETTER: ^ means beginning of line, \( and \) delimit the capture group, and .* matches anything at all. It then replaces the whole matched string with just the captured part, dropping the trailing ", NEWSLETTER".
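
As a quick check, the substitution can be previewed on the sample lines without -i:

printf 'BWTL, NEWSLETTER, NEWSLETTER\nBWTL, NEWSLETTER, R2R, NEWSLETTER\n' |
    sed 's/^\(.*NEWSLETTER.*\), NEWSLETTER/\1/g'

BWTL, NEWSLETTER
BWTL, NEWSLETTER, R2R

Note that because the pattern is anchored with ^, it substitutes at most once per line per run, so a field repeated three times needs a second pass.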

deinst
That works great! It does, however, leave me with double commas or commas at the end of the line.
alex
Doh! I'll fix it
deinst
That change (adding the comma) means it no longer removes duplicates? Or rather, after running my sed line first, it appears it now only affects duplicates that are not next to each other.
alex
Double Doh! Fixed now.
deinst
How about OOTA HOST? They are duplicates as well.
ghostdog74
Perfect! Thank you so much!
alex
@ghostdog74: I just replaced NEWSLETTER with OOTA HOST (or any of the other duplicated phrases) and it cleared them out as well.
alex
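
That replacement generalizes to a small loop. A minimal sketch, assuming BSD sed (to match the -i "" syntax above), with the phrase list taken from the sample data; the phrases are assumed to contain no regex metacharacters:

# Remove a trailing duplicate of each phrase; run the loop twice if any
# field can appear three times on one line.
for phrase in NEWSLETTER 'OOTA HOST'; do
    sed -i "" "s/^\\(.*$phrase.*\\), $phrase/\\1/g" *.csv
done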
A: 
#!/bin/bash

awk -F"," '
{
    delete seen                    # reset the lookup table for each line
    out = ""
    for (i = 1; i <= NF; i++) {
        gsub(/^ +| +$/, "", $i)    # trim surrounding spaces from the field
        if (!($i in seen)) {       # keep only the first occurrence
            out = (out == "" ? $i : out ", " $i)
            seen[$i]               # mark the field as seen
        }
    }
    print out                      # print the deduplicated line
}' file
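
Run on the sample lines above, this keeps the first occurrence of each field and prints:

BWTL, NEWSLETTER
BWTL, NEWSLETTER, R2R
MPWJ, OOTA HOST
OOTA HOST, ITOS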
ghostdog74