To prefix unique words with "UNIQUE:" inside a file, I tried a Perl regex command like:

perl -e 'undef $/;while($_=<>){s/^(((?!\b\3\b).)*)\b(\w+)\b(((?!\b\3\b).)*)$/\1UNIQUE:\3\4/gs;print $_;}' demo

On a demo file containing:

watermelon banana
apple pear pineapple orange mango
strawberry cherry
kiwi pineapple lemon cranberry watermelon
orange plum cherry
kiwi banana plum
mango cranberry apple
lemon

The output is:

watermelon banana
apple pear pineapple orange mango
strawberry cherry
kiwi pineapple lemon cranberry watermelon
orange plum cherry
kiwi banana plum
mango cranberry apple
UNIQUE:lemon

Unfortunately, the \3 backreference doesn't seem to be handled if used in advance.

Is there another way to achieve this with another regex or with other usual commands available on a Linux box? (grep, sed, awk,...)

Many thanks

EDIT: Unfortunately, many of the solutions work only for the provided case, which was incomplete; my apologies for that. It should also work on a text like:

{watermelon || banana}
apple = ( pear pineapple orange mango )
strawberry cherry
kiwi = pineapple = lemon = cranberry = watermelon
orange - plum = cherry
kiwi = banana + plum
mango = cranberry && apple
lemon

If it simplifies the problem, words may be prefixed with something like $ or @.

A: 

Can you put every word on its own line? If so, you can use sort and uniq (uniq only collapses adjacent duplicates, so the input has to be sorted first):

sort yourfile | uniq -c

This way every unique word will have a count of 1.
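
As a sketch of that idea, assuming a word is any run of letters, digits or underscores (the function name list_unique_words is hypothetical): split the file into one word per line, sort so that duplicates become adjacent, and let uniq keep the words that occur once. This only lists the unique words; it does not mark them in place.

```shell
# Hypothetical helper: print the words that occur exactly once in the file given as $1.
list_unique_words() {
    # tr -cs: turn every run of non-word characters into a single newline.
    # grep -v '^$': drop the empty line produced when the file starts with punctuation.
    # sort | uniq -u: keep only the lines (words) that occur exactly once.
    tr -cs 'A-Za-z0-9_' '\n' < "$1" | grep -v '^$' | sort | uniq -u
}
```

Usage would be something like list_unique_words demo.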

Eduard Wirch
Unfortunately not, I really have to mark unique words with something like "UNIQUE:" inside the original files.
Patrick Allaert
uniq reports or filters out repeated lines in a file, not words.
Niels Castle
+13  A: 
Niels Castle
The replacement operation needs word boundaries. To see the problem, add another data item: 'berry'.
FM
Nice catch, I added word boundaries in the regexp
Niels Castle
+5  A: 

It's not possible to do this with a single execution of a regexp. After the first replacement is done, the internal cursor moves to the end of that match, and when matching resumes the engine forgets what is behind it. Dynamic (variable-length) look-behinds are not supported either, so you can't check whether "this word has already appeared before this matching position". What you can do, however, is replace one word per execution of the regexp, because that way you can always anchor at the start of the string. So run the following regexp for as long as it replaces something:

s/^.*?\K(?!UNIQUE:)\b(\w+)\b(?=(?:(?!\b\1\b).)*$)/UNIQUE:\1/s
reko_t
+1 for can't be done in a single execution of a regexp.
Jonathan Leffler
Good explanation!
Bart Kiers
+1  A: 

I don't know why "lemon" is unique, but assuming "unique" means a word that occurs only once in the file, here's an awk script:

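awk '{
 # First pass: count every whitespace-separated word and remember each line.
 for(i=1;i<=NF;i++){
    words[$i]++
 }
 a[++d]=$0
}END{
 # Second pass: re-split each stored line and prefix the words seen exactly once.
 for(i=1;i<=d;i++){
    m=split(a[i],t," ")
    for(k=1;k<=m;k++){
        if ( words[t[k]] == 1 ) {
            t[k]="UNIQUE:"t[k]
        }
    }
    # Print the m words of this line, separated by single spaces.
    for(w=1;w<=m;w++){
        printf "%s%s", t[w], (w<m ? " " : "")
    }
    print ""
 }
}' file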

output

$ more file
watermelon banana
apple pear pineapple orange mango
strawberry cherry
kiwi pineapple lemon cranberry watermelon
orange plum cherry
kiwi banana plum
mango cranberry apple
lemon

$ ./shell.sh
watermelon banana
apple UNIQUE:pear pineapple orange mango
UNIQUE:strawberry cherry
kiwi pineapple lemon cranberry watermelon
orange plum cherry
kiwi banana plum
mango cranberry apple
lemon
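
For the punctuated text from the question's EDIT, whitespace splitting is not enough, because tokens such as "{watermelon" and "watermelon" would count as different words. A rough two-pass sketch, assuming a word is a run of letters, digits or underscores (the function name mark_unique_awk is hypothetical, and the file is read twice, so it must be a regular file rather than a pipe):

```shell
# Hypothetical helper: prefix words that occur once in file $1, keeping punctuation intact.
mark_unique_awk() {
    awk '
    # Pass 1 (first read of the file): count every run of word characters.
    NR == FNR {
        line = $0
        while (match(line, /[A-Za-z0-9_]+/)) {
            count[substr(line, RSTART, RLENGTH)]++
            line = substr(line, RSTART + RLENGTH)
        }
        next
    }
    # Pass 2 (second read): rebuild each line, copying separators unchanged.
    {
        line = $0; out = ""
        while (match(line, /[A-Za-z0-9_]+/)) {
            word = substr(line, RSTART, RLENGTH)
            out = out substr(line, 1, RSTART - 1) (count[word] == 1 ? "UNIQUE:" : "") word
            line = substr(line, RSTART + RLENGTH)
        }
        print out line
    }' "$1" "$1"
}
```

Usage would be something like mark_unique_awk file. Because whole words are extracted before they are looked up, a unique "berry" cannot collide with "cranberry".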
ghostdog74
+2  A: 

Note that you need boundaries on the replacement operation; otherwise a unique apple could collide with a non-unique cranapple, for example.

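use strict;
use warnings;
use File::Slurp qw(read_file);   # CPAN module, not part of core Perl

my %words;
my $content = read_file(shift @ARGV);
# Count every word; splitting on non-word characters also covers whitespace.
$words{$_}++ for split /\W+/, $content;
# Keep the non-empty words that occur exactly once.
my @uniq = grep { $words{$_} == 1 and length } keys %words;
# Word boundaries keep a unique "apple" from matching inside "cranapple".
$content =~ s/\b$_\b/UNIQUE:$_/g for @uniq;
print $content;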
FM