I have a plain text file with words which are separated by commas, for example:

word1, word2, word3, word2, word4, word5, word3, word6, word7, word3

I want to delete the duplicates so that it becomes:

word1, word2, word3, word4, word5, word6, word7

Any ideas? I think egrep can help me, but I'm not sure exactly how to use it.

+1  A: 

I'd think you'll want to replace the spaces with newlines, use the uniq command to find unique lines, then replace the newlines with spaces again.

McWafflestix
uniq only compares adjacent lines, so this will not work.
Beano
it will when combined with sort
Jonik
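Putting the suggestion and the sort fix together, a sketch of the whole round trip (words.txt is a placeholder filename) could look like:

sed 's/, /\n/g' words.txt | sort | uniq | tr '\n' ' '

The sed splits on the ", " separator, sort | uniq removes the duplicates, and the final tr puts everything back onto a single line.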
+11  A: 

Assuming that the words are one per line, and the file is already sorted:

uniq filename

If the file's not sorted:

sort filename | uniq

If they're not one per line, and you don't mind them being one per line:

tr -s '[:space:]' '\n' < filename | sort | uniq

That doesn't remove punctuation, though, so maybe you want:

tr -s '[:space:][:punct:]' '\n' < filename | sort | uniq

But that removes the hyphen from hyphenated words. "man tr" for more options.

Randy Orrison
That works for me :) Thanks a lot. I only need to put all the words back on one row with: cat testfile_out.txt | tr "\n" " " > testfile_out2.txt
cupakob
"sort -u" would remove the need for uniq
Beano
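For example, applied to the whitespace-splitting pipeline from the answer above, that suggestion would shorten it to:

tr -s '[:space:]' '\n' < filename | sort -u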
+1  A: 

ruby -pi.bak -e '$_ = $_.chomp.split(/,\s*/).uniq.join(", ") + "\n"' filename ?

I'll admit the two kinds of quotations are ugly.

Oliver N.
Ruby isn't a Linux command! I presume by Linux command he means regular GNU programs.
Danny
@Danny, I saw that, and you could do this with some overzealous sed/awk alchemy, but really I think this is a job for a scripting language.
Oliver N.
+1 as this seems undeniably elegant, and more approachable for mortals compared to Igor Krivokon's Perl one-liner :)
Jonik
A: 

I presumed you wanted the words to be unique on a single line, rather than throughout the file. If this is the case, then the Perl script below will do the trick.

while (<DATA>)
{
    chomp;
    my %seen = ();
    my @words = split(m!,\s*!);                                 # split on a comma plus optional whitespace
    @words = grep { $seen{$_} ? 0 : ($seen{$_} = 1) } @words;   # keep only the first occurrence of each word
    print join(", ", @words), "\n";
}

__DATA__
word1, word2, word3, word2, word4, word5, word3, word6, word7, word3

If you want uniqueness over the whole file, you can just move the %seen hash outside the while (){} loop.

Beano
Perl isn't a Linux command! I presume by Linux command he means regular GNU programs. Then again Perl is installed everywhere... heh.
Danny
Could you please point out what your definition of a "Linux command" is (or rather @rbright's as you seem to know him)? Maybe a command found in Linux distributions?
Beano
I mean a command which is included in the default installation of the most popular distros... for example, something like grep.
cupakob
+2  A: 

Creating a unique list is pretty easy thanks to uniq, although most Unix commands like one entry per line instead of a comma-separated list, so we have to start by converting it to that:

$ sed 's/, /\n/g' filename | sort | uniq
word1
word2
word3
word4
word5
word6
word7

The harder part is putting this on one line again with commas as separators and not terminators. I used a perl one-liner to do this, but if someone has something more idiomatic, please edit me. :)

$ sed 's/, /\n/g' filename | sort | uniq | perl -e '@a = <>; chomp @a; print((join ", ", @a), "\n")'
word1, word2, word3, word4, word5, word6, word7
Ryan Bright
tr " " "\n" might be more efficient than sed in this case
florin
And that also works.
cupakob
Putting that on one line is quite simple: sed 's/, /\n/g' filename | sort | uniq | paste -s -d, | sed 's/,/, /g'. The command is paste, a very nice one!
Mapio
tr " " "\n" is different because it doesn't handle the commas and you can't just ignore the commas because the last word doesn't have one. With the example in the question, you'd end up uniq'ing "word3" and "word3,". Another answer has a tr command that would remove all whitespace and all punctuation if that's what you're after. I was just being specific.
Ryan Bright
A: 

G'day,

And don't forget the -c option for the uniq utility if you're interested in getting a count of the words as well.
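For example, using the sed/sort pipeline from an earlier answer on the question's sample line, the counted output would look something like:

$ sed 's/, /\n/g' filename | sort | uniq -c
      1 word1
      2 word2
      3 word3
      1 word4
      1 word5
      1 word6
      1 word7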

cheers,

Rob Wells
+1  A: 

Here's an awk script that will leave each line intact, only removing the duplicate words:

BEGIN { 
     FS=", " 
} 
{ 
    for (i=1; i <= NF; i++) 
        used[$i] = 1        # remember every word seen on this line
    for (x in used)         # note: "for (x in used)" visits the words in no particular order
        printf "%s, ", x
    printf "\n"
    split("", used)         # clear the array before the next line
}
mamboking
That works also, but it's not perfect ;) the output has a trailing comma after the last word... that is not a big problem :) Thanks a lot.
cupakob
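A possible variation on that script (a sketch, not from the thread) that keeps the words in their original order and avoids the trailing comma:

awk '
BEGIN { FS = ", " }
{
    split("", seen)              # start each line with an empty "already seen" set
    out = ""
    for (i = 1; i <= NF; i++)
        if (!($i in seen)) {     # keep only the first occurrence of each word
            seen[$i] = 1
            out = (out == "" ? $i : out ", " $i)
        }
    print out
}' filename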