ansaurus

Question

What is the easiest way to extract the URLs from an HTML page using only sed or awk?

Answer 1

+1 A:

An example, since you didn't provide any sample input:

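awk 'BEGIN{
  RS="</a>"      # one record per anchor (multi-character RS needs gawk or mawk)
  IGNORECASE=1   # gawk only: also match HREF, Href, ...
}
{
  for(o=1;o<=NF;o++){
    if ( $o ~ /href/){
      gsub(/.*href=\042/,"",$o)   # strip everything up to the opening quote (\042 is ")
      gsub(/\042.*/,"",$o)        # strip the closing quote and whatever follows
      print $(o)
    }
  }
}' index.html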

2009-12-10 14:26:30

Does this work for '<a href="http://aktuell.de.selfhtml.org/" target="_blank">SELFHTML aktuell</a>'?

Ralph Rickenbach 2009-12-10 14:40:33

If I say it works (maybe not 100% of the time, but 99.99%), would you believe me? :) The best approach is to try it yourself on various pages and see.

2009-12-10 14:54:24

Answer 2

+1 A:

You can do it quite easily with the following regex, which is quite good at finding URLs:

\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))

I took it from John Gruber's article on how to find URLs in text.

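That lets you find all URLs in a file f.html as follows. The pattern uses Perl-style syntax (\w, \s, \d, (?:...)), so it needs a PCRE-capable grep such as GNU grep with -P; plain -E does not understand it:

grep -o -P \
    '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))' f.html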

nes1983 2009-12-10 14:28:41

Complicated, and it fails when the href looks like this: ... HREF="http://www.somewhere.com/" ADD_DATE="1197958879" LAST_MODIFIED="1249591429"> ...

2009-12-10 14:35:42

I tried it on the daringfireball page itself and it found all the links. Other solutions may fail because href= could appear somewhere inside regular text. It's difficult to get this absolutely right without parsing the HTML according to its grammar.

nes1983 2009-12-10 14:45:10
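For attribute-heavy input like the bookmark line quoted in the comment above, one option is to key on the href attribute itself instead of on the shape of the URL. The following is only a sketch, assuming double-quoted attribute values; the function name href_values is made up for illustration. It uses awk's match() in a loop, so it handles several links on one line and upper-case HREF without any gawk extensions. As noted above, it will still be fooled by href= appearing in ordinary text.

```shell
# Hypothetical helper: print the value of every double-quoted href attribute.
# Reads the files given as arguments, or stdin when there are none.
href_values() {
  awk '{
    line = $0
    # match() sets RSTART and RLENGTH to the position and length of the hit
    while (match(line, /[Hh][Rr][Ee][Ff]="[^"]*"/)) {
      # skip the 6-character href= prefix with its opening quote,
      # and leave off the closing quote at the end of the match
      print substr(line, RSTART + 6, RLENGTH - 7)
      # continue scanning after the attribute that was just printed
      line = substr(line, RSTART + RLENGTH)
    }
  }' "$@"
}

# Usage: href_values f.html
```

Single-quoted or unquoted attribute values are not handled; they would need extra patterns.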

Answer 3

+1 A:

I am assuming you want to extract a URL from some HTML text, and not parse HTML (as one of the comments suggests). Believe it or not, someone has already done this.

OT: The sed website has a lot of good information and many interesting/crazy sed scripts. You can even play Sokoban in sed!

Alok 2009-12-15 07:43:10

Answer 4

+1 A:

You asked for it:

$ wget -O - http://stackoverflow.com | \
  grep -o '<a href=['"'"'"][^"'"'"']*['"'"'"]' | \
  sed -e 's/^<a href=["'"'"']//' -e 's/["'"'"']$//'

This is a crude tool, so all the usual warnings about attempting to parse HTML with regular expressions apply.

Greg Bacon 2009-12-17 23:09:22
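Since the question asks for sed or awk only, the grep step in the pipeline above can be replaced by a second sed. This is a rough sketch rather than a drop-in equivalent: the function name sed_hrefs is invented here, only double-quoted href values are handled, and the same warnings about regular expressions and HTML apply. The first pass puts each anchor on its own line; the second keeps just the href value.

```shell
# Hypothetical helper: sed-only link extraction, double-quoted hrefs only.
# Pass 1 starts a new line before every "<a " or "<A " so that each anchor
# begins its own line (backslash-newline in the replacement is POSIX sed).
# Pass 2 prints only the href value of lines that start with an anchor tag.
sed_hrefs() {
  sed 's/<[Aa] /\
&/g' "$@" |
    sed -n 's/^<[Aa] [^>]*[Hh][Rr][Ee][Ff]="\([^"]*\)".*/\1/p'
}

# Usage: wget -q -O - http://stackoverflow.com | sed_hrefs
```

Unlike the grep version, this also accepts anchors where other attributes come before href.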

Answer 5

+2 A:

You could also do something like this (provided you have lynx installed):

lynx -dump -listonly my.html

Hardy 2010-01-04 13:06:42
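The lynx output is a numbered reference list rather than bare URLs, so a small awk filter can strip the numbering. This sketch assumes the usual dump layout, where each link appears on a line such as "   1. http://..." under a "References" heading; the function name lynx_links_only is hypothetical, and the exact layout may differ between lynx versions.

```shell
# Hypothetical helper: keep only the URL column of a lynx reference list.
# Assumes lines of the form "   1. http://example.com/" (number, dot, URL).
lynx_links_only() {
  # print the second field of lines whose first field is a number plus a dot
  awk '$1 ~ /^[0-9]+\.$/ { print $2 }' "$@"
}

# Usage (requires lynx): lynx -dump -listonly my.html | lynx_links_only
```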
