This is related to another question (a code golf) I asked: http://stackoverflow.com/questions/3171552/code-golf-color-highlighting-of-repeated-text

I've got a file 'sample1.txt' with the following content:

LoremIpsumissimplydummytextoftheprintingandtypesettingindustry.LoremIpsumhasbeentheindustry'sstandarddummytexteversincethe1500s,whenanunknownprintertookagalleyoftypeandscrambledittomakeatypespecimenbook.

I've got a script generating the following array of strings which occur in the file (only a few shown for illustration):

LoremIpsum
LoremIpsu
dummytext
oremIpsum
LoremIps
dummytex
industry
oremIpsu
remIpsum
ummytext
LoremIp
dummyte
emIpsum
industr
mmytext

Working from the top of the list, I need to check whether 'LoremIpsum' occurs in sample1.txt. If so, I want to replace all occurrences of LoremIpsum with <T1>LoremIpsum</T1>. When the program moves on to the next word, 'LoremIpsu', it should NOT match the text already inside <T1>LoremIpsum</T1>. The same procedure should be repeated for every element of this 'array'. The next 'valid' entry would be 'dummytext', which should be tagged as <T2>dummytext</T2>.

I think it should be possible to do this in a Bash shell script rather than relying on Perl, Python, or Ruby programs.
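
The nesting problem described above can be reproduced with a naive loop that tags every entry in order without checking for existing tags. This is only a minimal sketch: the shortened sample string and the variable names `naive` and `repl` are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Naive approach: tag every entry in list order, with no check for text
# that has already been wrapped in a tag.
sample="LoremIpsumissimplydummytext"   # shortened sample, for illustration only
array=(LoremIpsum LoremIpsu dummytext)

naive=$sample
tag=0
for entry in "${array[@]}"; do
  tag=$((tag + 1))
  repl="<T${tag}>${entry}</T${tag}>"
  # Quoting "$entry" makes the search pattern literal rather than a glob.
  naive=${naive//"$entry"/$repl}
done

echo "$naive"
# prints: <T1><T2>LoremIpsu</T2>m</T1>issimply<T3>dummytext</T3>
```

The second entry lands inside the first tag, which is exactly what the requirement rules out, and 'dummytext' ends up as T3 instead of T2.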

A: 

Pure Bash (no externals)

At the Bash command line:

$ sample="LoremIpsumissimplydummytextoftheprintingandtypesettingindustry.LoremIpsumhasbeentheindustry'sstandarddummytexteversincethe1500s,whenanunknownprintertookagalleyoftypeandscrambledittomakeatypespecimenbook."
$ # or: sample=$(<sample1.txt)
$ array=(
LoremIpsum
LoremIpsu
dummytext
...
)
$ tag=0; for entry in "${array[@]}"; do test="<[^>/]*>[^>]*${entry}[^<]*</"; if [[ ! $sample =~ $test ]]; then ((tag++)); sample=${sample//${entry}/<T$tag>$entry</T$tag>}; fi; done; echo "Output:"; echo "$sample"
Output:
<T1>LoremIpsum</T1>issimply<T2>dummytext</T2>oftheprintingandtypesetting<T3>industry</T3>.<T1>LoremIpsum</T1>hasbeenthe<T3>industry</T3>'sstandard<T2>dummytext</T2>eversincethe1500s,whenanunknownprintertookagalleyoftypeandscrambledittomakeatypespecimenbook.
Dennis Williamson
marvelous btw!!
RubiCon10
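
The Bash one-liner above is easier to follow when unrolled. The following is a sketch of the same loop as a function; the name `tag_entries`, the local variable names, and the shortened sample string are hypothetical and chosen only for illustration. The guard regex checks whether the entry already sits between an opening tag and its closing tag, and the entry is skipped if so.

```shell
#!/usr/bin/env bash
# tag_entries is a hypothetical helper name, used here for illustration.
# Usage: tag_entries "text" entry1 entry2 ...
tag_entries() {
  local text=$1; shift
  local entry guard repl tag=0
  for entry in "$@"; do
    # Guard: opening tag, then the entry with no ">" before it and no "<"
    # after it, then the start of a closing tag.
    guard="<[^>/]*>[^>]*${entry}[^<]*</"
    if [[ ! $text =~ $guard ]]; then
      tag=$((tag + 1))
      repl="<T${tag}>${entry}</T${tag}>"
      text=${text//"$entry"/$repl}
    fi
  done
  printf '%s\n' "$text"
}

sample="LoremIpsumissimplydummytextoftheindustry.LoremIpsumhasdummytext"
result=$(tag_entries "$sample" LoremIpsum LoremIpsu dummytext oremIpsum industry industr)
echo "$result"
# prints: <T1>LoremIpsum</T1>issimply<T2>dummytext</T2>ofthe<T3>industry</T3>.<T1>LoremIpsum</T1>has<T2>dummytext</T2>
```

Two limitations of this approach are worth keeping in mind: an entry that occurs inside an existing tag is skipped entirely, even if it also occurs untagged elsewhere, and entries are interpolated into the guard as regular expressions, so entries containing regex metacharacters would need escaping first.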
A: 

Straightforward with Perl:

#! /usr/bin/perl

use warnings;
use strict;

my @words = qw/
  LoremIpsum
  LoremIpsu
  dummytext
  oremIpsum
  LoremIps
  dummytex
  industry
  oremIpsu
  remIpsum
  ummytext
  LoremIp
  dummyte
  emIpsum
  industr
  mmytext
/;

my $to_replace = qr/@{[ join "|" =>
                        sort { length $b <=> length $a }
                        @words
                     ]}/;

my $i = 0;
while (<>) {
  s|($to_replace)|++$i; "<T$i>$1</T$i>"|eg;
  print;
}

Sample run (wrapped to prevent horizontal scrolling):

$ ./tag-words sample.txt
<T1>LoremIpsum</T1>issimply<T2>dummytext</T2>oftheprintingandtypesetting<T3>indus
try</T3>.<T4>LoremIpsum</T4>hasbeenthe<T5>industry</T5>'sstandard<T6>dummytext</T
6>eversincethe1500s,whenanunknownprintertookagalleyoftypeandscrambledittomakeatyp
especimenbook.
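
The pattern built by the sort-and-join step can be inspected from the shell. Perl's alternation takes the first alternative that matches at a given position, so putting longer words first keeps LoremIp from pre-empting LoremIpsum. A rough sketch of the same construction with standard tools, using a hypothetical four-word list:

```shell
#!/usr/bin/env bash
words=(LoremIp dummytext LoremIpsum industr)   # illustrative subset

# Prefix each word with its length, sort numerically in descending order,
# strip the length again, and join the remaining lines with "|".
pattern=$(printf '%s\n' "${words[@]}" |
  awk '{ print length($0) "\t" $0 }' |
  sort -rn |
  cut -f2- |
  paste -sd'|' -)

echo "$pattern"
```

The result starts with LoremIpsum|dummytext|; the relative order of the two seven-letter words depends on how sort breaks ties.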

You may object that all the qr// and @{[ ... ]} business is on the baroque side. You could get the same effect with the /o regular-expression switch, as in:

# plain scalar rather than a compiled pattern
my $to_replace = join "|" =>
                 sort { length $b <=> length $a }
                 @words;

my $i = 0;
while (<>) {
  # o at the end for "compile (o)nce"
  s|($to_replace)|++$i; "<T$i>$1</T$i>"|ego;
  print;
}
Greg Bacon
Hi gbacon - umm - the second replacement should be "T2", the third "T3"... just FYI - I know it's a minor change for your code
RubiCon10
@RubiCon10 Ack! Thanks and fixed!
Greg Bacon