Hi,

I have a file of the format:

<a href="http://www.wowhead.com/?search=Superior Mana Oil">
<a href="http://www.wowhead.com/?search=Tabard of Brute Force">
<a href="http://www.wowhead.com/?search=Tabard of the Wyrmrest Accord">
<a href="http://www.wowhead.com/?search=Tattered Hexcloth Sack">

I need to take the text after search= but before the closing ", print it at the end of the line, and add a closing </a>, so that each line becomes, for example:

<a href="http://www.wowhead.com/?search=Superior Mana Oil">Superior Mana Oil</a>
<a href="http://www.wowhead.com/?search=Tabard of Brute Force">Tabard of Brute Force</a>
<a href="http://www.wowhead.com/?search=Tabard of the Wyrmrest Accord">Tabard of the Wyrmrest Accord</a>
<a href="http://www.wowhead.com/?search=Tattered Hexcloth Sack">Tattered Hexcloth Sack</a>

I'm not sure of the best way to do this from the Linux command line (probably sed or awk, but I'm not good with either). Ideally I'd like a script that I can just feed a filename, e.g. ./fixlink.sh brokenlinks.txt

Thanks

+2  A: 
awk 'BEGIN{ FS="=" }
{
    o=$NF                  # last =-separated field: the link text plus the trailing ">
    gsub(/\042>/,"",o)     # strip the "> (\042 is the octal escape for ")
    print $0 o "</a>"      # concatenate without a comma so no space is inserted
}' file

output

$ ./shell.sh
<a href="http://www.wowhead.com/?search=Superior Mana Oil">Superior Mana Oil</a>
<a href="http://www.wowhead.com/?search=Tabard of Brute Force">Tabard of Brute Force</a>
<a href="http://www.wowhead.com/?search=Tabard of the Wyrmrest Accord">Tabard of the Wyrmrest Accord</a>
<a href="http://www.wowhead.com/?search=Tattered Hexcloth Sack">Tattered Hexcloth Sack</a>
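
In awk, a comma between print arguments joins them with the output field separator (a single space by default), while writing the strings next to each other concatenates them with nothing in between. A minimal sketch of the difference on one sample line; the variable names are made up for illustration, and sub() with an anchored pattern is used here instead of gsub():

```shell
#!/bin/sh
line='<a href="http://www.wowhead.com/?search=Superior Mana Oil">'

# Comma: awk joins the two arguments with OFS, a single space by default.
with_comma=$(printf '%s\n' "$line" |
    awk -F= '{ o = $NF; sub(/">$/, "", o); print $0, o "</a>" }')

# No comma: the strings are concatenated directly, so no space is inserted.
concatenated=$(printf '%s\n' "$line" |
    awk -F= '{ o = $NF; sub(/">$/, "", o); print $0 o "</a>" }')

printf '%s\n' "$with_comma"
printf '%s\n' "$concatenated"
```

The second form gives the output asked for in the question, with the link text directly after the ">.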

If you are not good at something, read the docs. That's always the first step toward a solution. To learn about awk/gawk, go to the doc.

ghostdog74
+3  A: 

Assuming you can have one or more spaces after <a, and zero or more spaces around the = signs, the following should work:

$ cat in.txt
<a href="http://www.wowhead.com/?search=Superior Mana Oil">
<a href="http://www.wowhead.com/?search=Tabard of Brute Force">
<a href="http://www.wowhead.com/?search=Tabard of the Wyrmrest Accord">
<a href="http://www.wowhead.com/?search=Tattered Hexcloth Sack">
#
# The command to do the substitution
#
$ sed -e 's#<a[ \t][ \t]*href[ \t]*=[ \t]*".*search[ \t]*=[ \t]*\([^"]*\)">#&\1</a>#' in.txt
<a href="http://www.wowhead.com/?search=Superior Mana Oil">Superior Mana Oil</a>
<a href="http://www.wowhead.com/?search=Tabard of Brute Force">Tabard of Brute Force</a>
<a href="http://www.wowhead.com/?search=Tabard of the Wyrmrest Accord">Tabard of the Wyrmrest Accord</a>
<a href="http://www.wowhead.com/?search=Tattered Hexcloth Sack">Tattered Hexcloth Sack</a>

If you're sure you don't have the extra spaces, the pattern simplifies to:

s#<a href=".*search=\([^"]*\)">#&\1</a>#

In sed, s followed by any character (# in this case) starts a substitution. The pattern to match runs until the next appearance of that same character. So, in our second example, the pattern is: <a href=".*search=\([^"]*\)">. I used \([^"]*\) to mean any sequence of non-" characters, saved for later use as \1 (a \( \) pair forms a capture group, and \1 is the backreference to the first one). Finally, the text up to the next # is the replacement. & in sed stands for "whatever matched", which in this case is the whole line, and \1 expands to the link text.
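
To see what & and \1 each contribute, here is a small sketch that runs the simplified pattern twice on one sample line: once with only \1 in the replacement, once with &\1</a>. The variable names are made up for illustration.

```shell
#!/bin/sh
line='<a href="http://www.wowhead.com/?search=Superior Mana Oil">'

# \1 alone: the whole match is replaced, so only the captured text survives.
captured=$(printf '%s\n' "$line" | sed 's#<a href=".*search=\([^"]*\)">#\1#')

# & followed by \1: the matched text is kept and the captured text is appended.
fixed=$(printf '%s\n' "$line" | sed 's#<a href=".*search=\([^"]*\)">#&\1</a>#')

printf '%s\n' "$captured"
printf '%s\n' "$fixed"
```

The first command prints only the link text; the second prints the original line followed by the link text and </a>.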

Here's the pattern again:

's#<a[ \t][ \t]*href[ \t]*=[ \t]*".*search[ \t]*=[ \t]*\([^"]*\)">#&\1</a>#'

and its explanation:

'                       quote so as to avoid shell interpreting the characters
s                       substitute
#                       delimiter
<a[ \t][ \t]*           <a followed by one or more whitespace
href[ \t]*=[ \t]*       href followed by optional space, = followed by optional space
".*search[ \t]*=[ \t]*  " followed by as many characters as needed, followed by
                        search, optional space, =, followed by optional space
\([^"]*\)               a sequence of non-" characters, saved in \1
">                      followed by ">
#                       delimiter, replacement pattern starts
&\1                     the matched pattern, followed by backreference \1.
</a>                    end the </a> tag
#                       end delimiter
'                       end quote
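
The pattern explained above could be wrapped in the kind of script the question asks for. This is only a sketch: the function name fixlinks is made up, and [[:space:]] is swapped in for [ \t] on the assumption that the script may run under a sed other than GNU sed, where \t inside a bracket expression may not be read as a tab.

```shell
#!/bin/sh
# fixlink.sh (name taken from the question): append the link text and </a>
# to every line that looks like <a href="...search=TEXT">.
# Usage: ./fixlink.sh brokenlinks.txt

fixlinks() {
    # With no file arguments, sed reads standard input.
    sed 's#<a[[:space:]][[:space:]]*href[[:space:]]*=[[:space:]]*".*search[[:space:]]*=[[:space:]]*\([^"]*\)">#&\1</a>#' "$@"
}

# Only run when at least one file name was given on the command line.
if [ "$#" -gt 0 ]; then
    fixlinks "$@"
fi
```

The result goes to standard output, so something like ./fixlink.sh brokenlinks.txt > fixedlinks.txt keeps the original file untouched.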

If you're really sure that there will always be search= followed by the text you want, you can do:

$ sed -e 's#.*search=\(.*\)">#&\1</a>#'
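
The "really sure" caveat matters because \(.*\) is greedy. A sketch of what happens when a line carries more than the bare link; the input line with a trailing <b> tag is hypothetical and not from the question's file:

```shell
#!/bin/sh
# Hypothetical input: a second tag follows the link on the same line.
line='<a href="http://www.wowhead.com/?search=Superior Mana Oil"> <b class="x">'

# Greedy \(.*\) runs to the LAST "> on the line, so the capture swallows the extra tag.
greedy=$(printf '%s\n' "$line" | sed 's#.*search=\(.*\)">#&\1</a>#')

# \([^"]*\) stops at the first double quote, so only the link text is captured.
bounded=$(printf '%s\n' "$line" | sed 's#.*search=\([^"]*\)">#&\1</a>#')

printf '%s\n' "$greedy"
printf '%s\n' "$bounded"
```

On the four-line file from the question both forms give the same result; they only differ on lines like the one above.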

Hope that helps.

Alok
No downvote because of the heroic effort, but when one line of code requires 14 lines of explanation, it's probably too clever for the next person to maintain it.
Adam Liss
LOL @Adam: I was assuming that the OP didn't know anything about regular expressions. That, coupled with making a "robust" pattern resulted in a long explanation. Oh well, I tried. Hopefully, he learned *something* (if he didn't get bored 1/3rd of the way down my post that is!). :-)
Alok
@Alok: When I try to explain something technical at that level of detail, I usually find that I learn something myself - so it's never a wasted effort.
Adam Liss
A: 

then let's do it in sed.

replace.sh

#!/bin/bash
#<a href="http://www.wowhead.com/?search=Tattered Hexcloth Sack">
# =>
#<a href="http://www.wowhead.com/?search=Tattered Hexcloth Sack">Tattered Hexcloth Sack</a>
sed -r -e 's|(<a href=".*search=(.*))">|\1">\2</a>|' "$1"

./replace.sh input.txt

Dyno Fu
A: 

Use sed:

sed 's/\(.*search=\)\(.*\)\(".*\)/\1\2\3\2<\/a>/' brokenlinks.txt
codeape
+2  A: 

Nice awk! But

sed -n 's|=\([^"].*\)">|&\1</a>|p'

is shorter and will silently remove lines that don't match.
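
A quick sketch of that dropping behaviour, using a made-up three-line input in which the middle line is not a link. With -n and the p flag only substituted lines are printed; without them, unmatched lines pass through unchanged.

```shell
#!/bin/sh
# Sample input (the middle line is invented for this demo).
input='<a href="http://www.wowhead.com/?search=Superior Mana Oil">
this line is not a link
<a href="http://www.wowhead.com/?search=Tattered Hexcloth Sack">'

# -n suppresses automatic printing; p prints only lines where the substitution happened.
matched_only=$(printf '%s\n' "$input" | sed -n 's|=\([^"].*\)">|&\1</a>|p')

# Without -n and p, every line is printed, changed or not.
all_lines=$(printf '%s\n' "$input" | sed 's|=\([^"].*\)">|&\1</a>|')

printf '%s\n' "$matched_only"
printf '%s\n' "$all_lines"
```

Whether dropping unmatched lines is wanted depends on the file; for a list that should contain only links it doubles as a filter.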

martinwguy