ansaurus

Question

Regular expression to match unique words in files

Answer 1

A:

Can you put every word on its own line? If so, you can use the uniq command. It only compares adjacent lines, so sort the file first:

sort yourfile | uniq -c

Every unique word will then have a count of 1.
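For readers who want to see the counting idea without first reshaping the file, here is a minimal sketch in Python (not the shell approach suggested above). The function name `words_seen_once` and the sample text are made up for illustration; it only lists the words that occur once and does not mark them in the original text.

```python
from collections import Counter

def words_seen_once(text):
    """Return the whitespace-separated words that occur exactly once,
    in the order they first appear in the text."""
    counts = Counter(text.split())
    return [word for word, n in counts.items() if n == 1]

# Hypothetical sample input, not taken from the question.
sample = "watermelon banana\napple pear banana\nwatermelon apple"
print(words_seen_once(sample))  # ['pear']
```

Splitting on whitespace plays the role of putting every word on its own line, and the counter plays the role of the count column printed by uniq -c.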

Eduard Wirch 2009-11-03 08:34:02

Unfortunately not, I really have to mark unique words with something like "UNIQUE:" inside the original files.

Patrick Allaert 2009-11-03 08:38:33

uniq reports or filters out repeated lines in a file, not words.

Niels Castle 2009-11-03 12:14:19

Answer 2

+13 A:

Niels Castle 2009-11-03 08:41:19

The replacement operation needs word boundaries. To see the problem, add another data item: 'berry'.

FM 2009-11-07 12:08:44

Nice catch, I added word boundaries to the regexp.

Niels Castle 2009-11-07 20:55:52

Answer 3

+5 A:

This can't be done with a single execution of a regexp. After the first replacement the internal cursor moves to the end of that match, and when matching resumes the engine no longer considers what lies behind it. Dynamic (variable-length) look-behinds are not supported either, so you can't check whether "this word has already appeared before this matching position". What you can do, however, is replace one word per execution of the regexp, because that way you can always anchor at the start of the string. So run the following regexp repeatedly for as long as it replaces something.

s/^.*?\K(?!UNIQUE:)\b(\w+)\b(?=(?:(?!\b\1\b).)*$)/UNIQUE:\1/s

reko_t 2009-11-03 09:00:41

+1 for can't be done in a single execution of a regexp.

Jonathan Leffler 2009-11-03 09:08:39

Good explanation!

Bart Kiers 2009-11-07 21:11:42

Answer 4

+1 A:

I don't know why "lemon" is unique, but assuming that "unique" means a word that occurs only once, here's an awk script:

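awk '{
 # First pass: count every word and remember each line.
 for(i=1;i<=NF;i++){
    words[$i]++
 }
 a[++d]=$0
}END{
 # Second pass: prefix the words that were seen exactly once.
 for(i=1;i<=d;i++){
    m=split(a[i],t," ")
    for(k=1;k<=m;k++){
        if ( words[t[k]] == 1 ) {
            t[k]="UNIQUE:"t[k]
        }
    }
    # Print the m words of this line (not d, the number of lines).
    for(w=1;w<=m;w++){
        printf "%s ",t[w]
    }
    print ""
 }
}' file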

output

$ more file
watermelon banana
apple pear pineapple orange mango
strawberry cherry
kiwi pineapple lemon cranberry watermelon
orange plum cherry
kiwi banana plum
mango cranberry apple
lemon

$ ./shell.sh
watermelon banana
apple UNIQUE:pear pineapple orange mango
UNIQUE:strawberry cherry
kiwi pineapple lemon cranberry watermelon
orange plum cherry
kiwi banana plum
mango cranberry apple
lemon

ghostdog74 2009-11-03 09:28:13

Answer 5

+2 A:

Note that you need boundaries on the replacement operation; otherwise a unique apple could collide with a non-unique cranapple, for example.

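use strict;
use warnings;
use File::Slurp qw(read_file);

my %words;
my $content = read_file(shift @ARGV);
# Count every word; \W already covers whitespace.
$words{$_}++ for split /\W+/, $content;
# Keep the non-empty words that were seen exactly once.
my @uniq = grep { $words{$_} == 1 and length } keys %words;
# Tag each one; \b keeps substrings of longer words untouched.
$content =~ s/\b$_\b/UNIQUE:$_/g for @uniq;
print $content;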
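To make the boundary problem concrete, here is a small sketch in Python rather than Perl. The helper names `mark_naive` and `mark_bounded` are hypothetical, and the sample string simply reuses the apple/cranapple pair mentioned above; it illustrates the collision only and is not a full solution.

```python
import re

def mark_naive(text, word, prefix="UNIQUE:"):
    """Tag every occurrence of word, even inside longer words."""
    return re.sub(re.escape(word), lambda m: prefix + word, text)

def mark_bounded(text, word, prefix="UNIQUE:"):
    """Tag word only where it stands alone as a whole word."""
    # \b matches at the edge between a word character and a non-word character.
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.sub(pattern, lambda m: prefix + word, text)

sample = "apple cranapple"
print(mark_naive(sample, "apple"))    # UNIQUE:apple cranUNIQUE:apple
print(mark_bounded(sample, "apple"))  # UNIQUE:apple cranapple
```

Without the boundaries the tag lands inside cranapple as well; with them only the standalone apple is tagged.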

FM 2009-11-04 02:35:43
