views:

901

answers:

5

I want to extract the URL from within the anchor tags of an html file. This needs to be done in BASH using SED/AWK. No perl please.

What is the easiest way to do this?

Thanks a lot.

+1  A: 

An example, since you didn't provide any sample

awk 'BEGIN{
RS="</a>"
IGNORECASE=1
}
{
  for(o=1;o<=NF;o++){
    if ( $o ~ /href/){
      gsub(/.*href=\042/,"",$o)
      gsub(/\042.*/,"",$o)
      print $(o)
    }
  }
}' index.html
Does this work for '<a href="http://aktuell.de.selfhtml.org/" target="_blank">SELFHTML aktuell</a>'
Ralph Rickenbach
if i say it works, (maybe not 100%, but 99.99%) of the time, would you believe?? :). The best is to try out yourself on various pages and see.
+1  A: 

You can do it quite easily with the following regex, which is quite good at finding URLs:

\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))

I took it from John Gruber's article on how to find URLs in text.

That lets you find all URLs in a file f.html as follows:

cat f.html | grep -o \
    -E '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))'
nes1983
complicated, and fails when href is like this: ... HREF="http://www.somewhere.com/" ADD_DATE="1197958879" LAST_MODIFIED="1249591429"> ...
I tried it on the daringfireball page itself and it found all links. other solutions may fail because href= could be somewhere inside regular text. it's difficult to get this absolutely right without parsing the HTML according to its grammar.
nes1983
+1  A: 

I am assuming you want to extract a URL from some HTML text, and not parse HTML (as one of the comments suggests). Believe it or not, someone has already done this.

OT: The sed website has a lot of good information and many interesting/crazy sed scripts. You can even play Sokoban in sed!

Alok
+1  A: 

You asked for it:

$ wget -O - http://stackoverflow.com | \
  grep -o '<a href=['"'"'"][^"'"'"']*['"'"'"]' | \
  sed -e 's/^<a href=["'"'"']//' -e 's/["'"'"']$//'

This is a crude tool, so all the usual warnings about attempting to parse HTML with regular expressions apply.

Greg Bacon
+2  A: 

You could also do something like this (provided you have lynx installed):

lynx -dump -listonly my.html
Hardy