I have a large (600-odd) set of search-and-replace terms that I need to run as a sed script over some files. The problem is that the search terms are NOT orthogonal... but I think I can get away with it by sorting by line length (i.e. pulling out the longest matches first) and then alphabetically within each length. So given an unsorted set of:

aaba
aa
ab
abba
bab
aba

what I want is a sorted set such as:

abba
aaba
bab
aba
ab
aa

Is there a way of doing it by, say, prepending the line length and sorting by a field?

For bonus marks :-) !!! The search and replace is actually simply a case of replacing term with _term_, and the sed code I was going to use was s/term/_term_/g. How would I write the regex to avoid replacing terms already within _ pairs?

A: 

This will sort a file by line length, longest lines first:

cat file.txt | (while read LINE; do echo -e "${#LINE}\t$LINE"; done) | sort -rn | cut -f 2-

This will replace term with _term_ but won't turn _term_ into __term__:

sed -r 's/(^|[^_])term([^_]|$)/\1_term_\2/g'
sed -r -e 's/(^|[^_])term/\1_term_/g' -e 's/term([^_]|$)/_term_\1/g'

The first will work pretty well except it will miss out on _term and term_, mistakenly leaving those alone. Use the second if that's important. Here's my silly test case:

# echo here is _term_ and then a term you terminator haha _terminator and then _term_inator term_inator | sed -re 's/(^|[^_])term([^_]|$)/\1_term_\2/g'
here is _term_ and then a _term_ you _term_inator haha _terminator and then _term_inator term_inator
# echo here is _term_ and then a term you terminator haha _terminator and then _term_inator term_inator | sed -r -e 's/(^|[^_])term/\1_term_/g' -e 's/term([^_]|$)/_term_\1/g'
here is _term_ and then a _term_ you _term_inator haha __term_inator and then _term_inator _term__inator
John Kugelman
perfect! I'll give it a go!
Dycey
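For comparison with the two sed commands in the answer above, the same guard can be written with lookaround assertions, which sed's regex syntax does not offer. This is only a minimal Python sketch, not part of the original answer, and the function name wrap_term is made up for illustration. Like the first sed command, it leaves _term and term_ alone. Because lookarounds consume no characters, it should also catch two occurrences separated by a single character (e.g. "term term"), where the first sed command would skip the second one.

```python
import re

def wrap_term(text, term):
    """Wrap each occurrence of term in underscores unless it already touches one."""
    # (?<!_) means "not preceded by an underscore"; (?!_) means "not followed by one".
    pattern = r"(?<!_)" + re.escape(term) + r"(?!_)"
    # A function replacement avoids any backslash interpretation in the term.
    return re.sub(pattern, lambda m: "_" + m.group(0) + "_", text)

print(wrap_term("here is _term_ and then a term you terminator", "term"))
# prints: here is _term_ and then a _term_ you _term_inator
```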
+1  A: 

Just pipe your stream through this kind of script :

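#!/usr/bin/env python3
import sys

# Group the input lines by their length.
by_length = {}
for line in sys.stdin:
    line = line.rstrip()
    by_length.setdefault(len(line), []).append(line)

# Longest group first; reverse alphabetical order within each group.
for length in sorted(by_length, reverse=True):
    print("\n".join(sorted(by_length[length], reverse=True)))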

As for the bonus question: again, do it in Python (unless there really is a reason not to, in which case I'd be pretty curious to know what it is).

Gyom
Is that the shortest, or clearest way to do that sort, in Python?
Brad Gilbert
Maybe not; this was my first thought.
Gyom
Personally, this is quick-and-dirty enough that I'd rather use a Perl one-liner than write an entire Python script. Though if you insist on Python, it might be cleaner (if less efficient) to just slurp the file, then sort it, then spit it back out.
Chris Lutz
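The "slurp, sort, spit back out" idea from the comment above might look roughly like this. It is a sketch rather than anyone's posted solution, and the function name sort_longest_first is an assumption. It relies on Python comparing tuples element by element, so a (length, text) key sorts by length and breaks ties alphabetically, and reverse=True flips both orderings at once.

```python
def sort_longest_first(lines):
    """Return lines ordered longest first, then reverse alphabetically."""
    # The key (len(s), s) orders by length, then by text; reverse=True inverts both.
    return sorted(lines, key=lambda s: (len(s), s), reverse=True)

# The sample terms from the question.
terms = ["aaba", "aa", "ab", "abba", "bab", "aba"]
print("\n".join(sort_longest_first(terms)))
```

To use it as a filter, the list could be built from standard input instead, for example with [line.rstrip("\n") for line in sys.stdin] after importing sys.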
+2  A: 

You could compact it all into one regexp:

$ sed -e 's/\(aaba\|aa\|abba\)/_\1_/g'
testing words aa, aaba, abba.
testing words _aa_, _aaba_, _abba_.

If I understand your question correctly, this will solve all your problems: No "double replacement" and always matching the longest word.

Johannes Hoff
Shouldn't you still sort the items by length? Or will there be some kind of greedy match going on that will always match the longest possible string?
mobrule
... plus, that's a hell of a long line for 600 items ;-) but maybe I can split it into more lines...
Dycey
No need for that: a POSIX regular expression, which is what sed uses, will always find the leftmost-longest match.
Johannes Hoff
@JH Good to know. Thanks.
mobrule
@Dycey: Yeah, that would be quite long. You could put the script in a file in that case and do `sed -f regexpfile`.
Johannes Hoff
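One caveat when porting the single-alternation idea to other tools: the longest-match guarantee discussed above is a property of POSIX-style matching. Python's re module (like Perl) tries alternatives left to right and accepts the first one that matches, so there the terms still need to be ordered longest first. A minimal sketch under that assumption follows; the function name wrap_terms is hypothetical.

```python
import re

def wrap_terms(text, terms):
    """Wrap every listed term in underscores using a single alternation."""
    # Longest terms first, so "aaba" is tried before its prefix "aa".
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered))
    return pattern.sub(lambda m: "_" + m.group(0) + "_", text)

print(wrap_terms("testing words aa, aaba, abba.", ["aa", "aaba", "abba"]))
# prints: testing words _aa_, _aaba_, _abba_.
```

Without the sort, the pattern aa|aaba applied to "aaba" would match only the leading "aa" in Python and produce _aa_ba.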
A: 

This sorts by length first, then in reverse alphabetical order within each length:

for mask in $(tr -c "\n" "." < "$FILE" | sort -ur)
do
    grep "^$mask$" "$FILE" | sort -r
done

The tr usage replaces each character in $FILE with a period - which matches any single character in grep.

martin clayton
+3  A: 

You can do this in a one-line Perl script:

perl -e 'print sort { length $b<=>length $a || $b cmp $a } <>' input
mobrule
Should probably change `$a cmp $b` to be `$b cmp $a`, since he wanted it in reverse order.
Brad Gilbert
Thanks Brad, fixed.
mobrule
+1 Any task you might be using lots of shell scripting for can be done easier, shorter, and potentially clearer in Perl.
Chris Lutz
shorter doesn't mean clearer.
ghostdog74
I find this clearer than the Python solution http://stackoverflow.com/questions/1670397/_/1670454#1670454
Brad Gilbert
I would probably write it: `perl -E'say for sort { length $b<=>length $a } grep chomp, <>' input`
Brad Gilbert
+1  A: 
$ awk '{print length($1),$1}' file |sort -rn
4 abba
4 aaba
3 bab
3 aba
2 ab
2 aa

I leave it to you to get rid of the first column yourself.

ghostdog74