ansaurus

Question

Extract all the links between specified html tags from an html file with sed

Answer 1

+2 A:

As is posted every day on SO: You can't process HTML with regular expressions. http://stackoverflow.com/questions/701166

That goes double for a tool as limited as sed, with its Basic Regular Expressions.

If the kind of input you have is very limited such that every link is in the exact same format, it might be possible, in which case you'd have to post an example of that format. But for general HTML pages, it can't be done.

ETA given your example: at the simplest level, since each URL is already on its own line, you could select the ones that look right and throw away the bits you don't want:

#!/bin/sed -f
s/^<td><a href="\(.*\)">.*<\/a><\/td>$/\1/p
d

However, note that this would still leave the URLs in their HTML-encoded form. If the script that produced this file is correctly HTML-encoding its URLs, you would then have to replace any instances of the lt/gt/quot/amp entity references with their plain character forms ‘<>"&’. In practice the only one of those you're likely to meet is &/amp, which is very common indeed in URLs.

But! That's not all the HTML-encoding that might have occurred. Maybe there are other HTML entity references in there, like eacute (which would be valid now we have IRIs), or numerical character references (in both decimal and hex). There are two million-odd potential forms of encoding for characters including Unicode... replacing each one individually in sed would be a massive exercise in tedium.
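To illustrate the decoding step: in a language with an HTML library it is a single call rather than a long list of substitutions. A minimal sketch, assuming Python 3 is available; the sample URL is made up for illustration and is not taken from the question's file.

```python
from html import unescape

# Hypothetical href value as it might appear in the generated file:
# a named entity, &amp;, a decimal reference and a hex reference.
encoded = "http://example.com/caf&eacute;?a=1&amp;b=2&#38;c=3&#x26;d=4"

# unescape() decodes named entity references as well as decimal and
# hexadecimal character references in one pass.
decoded = unescape(encoded)
print(decoded)
```

This only covers the decoding; picking the href out of the line is still a separate step.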

Whilst you could possibly get away with it if you know that the generator script will never output any of those, an HTML parser is still best really. (Or, if you know it's well-formed XHTML, you can use a simpler XML parser, which tends to be built in to modern languages' standard libraries.)
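As a rough sketch of the parser route, here is what it could look like with the HTML parser in Python 3's standard library. The class name, function name and sample markup are hypothetical, and the sketch assumes the links of interest are <a> tags inside <td> cells, as in the sed script above.

```python
from html.parser import HTMLParser


class CellLinkParser(HTMLParser):
    """Collect href values of <a> tags that appear inside <td> cells."""

    def __init__(self):
        super().__init__()
        self.td_depth = 0   # how many <td> elements are currently open
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.td_depth += 1
        elif tag == "a" and self.td_depth > 0:
            # attrs is a list of (name, value) pairs; the parser has
            # already decoded entity references in the values.
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "td" and self.td_depth > 0:
            self.td_depth -= 1


def extract_links(html_text):
    """Return the hrefs of all links found inside table cells."""
    parser = CellLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Made-up sample input for illustration.
sample = """
<a href="http://example.com/outside">not in a cell</a>
<table>
<tr><td><a href="http://example.com/a?x=1&amp;y=2">first</a></td></tr>
<tr><td><a href="http://example.com/b">second</a></td></tr>
</table>
"""
print(extract_links(sample))
```

Because the parser decodes attribute values itself, the &amp; in the first URL comes back as a plain &, with no separate unescaping pass.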

bobince 2009-08-18 11:28:37

sed is Turing complete, so it is possible. Probably the wrong tool for the job, but possible.

Triptych 2009-08-18 11:35:34

A friend of mine told me that this is possible through perl but I don't have privileges to install that...

2009-08-18 11:38:09

So how am I supposed to do that? I can say that every link is in the exact same format because they are auto-generated.

2009-08-18 11:40:52

[added sed script]

bobince 2009-08-19 15:10:53

Answer 2

A:

Hello Mehmet,

if you have access to Python, I would recommend BeautifulSoup, a nice Python library for manipulating HTML. The following code collects the links from a given resource, which is the full address of a web page such as http://www.foo.com, and stores them in a file. Hope this helps.

import os
import sys
from urllib.request import urlopen

from bs4 import BeautifulSoup

fileLinksName = "links.dat"

if __name__ == "__main__":
    # load the links collected so far, if the file already exists
    links = []
    if os.path.exists(fileLinksName):
        with open(fileLinksName) as fileLinks:
            links = fileLinks.read().splitlines()

    # fetch and parse the page given on the command line
    htmlFileSoup = BeautifulSoup(urlopen(sys.argv[1]).read(), "html.parser")

    # keep the href of every anchor that has one, skipping duplicates
    for htmlAnchor in htmlFileSoup.find_all("a"):
        href = htmlAnchor.get("href")
        if href and href not in links:
            links.append(href)

    # store the links one per line, then print them
    with open(fileLinksName, "w") as fileLinks:
        fileLinks.write("\n".join(links) + "\n")
    for link in links:
        print(link)

da8 2009-08-18 12:07:04

Answer 3

A:

This might be possible if instead of trying to look at the tags you just look for the URLs.

If these are the only URLs in the page you can write a pattern to look for URLs between quotes, something like:

"[a-z]+://[^"]+"
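A minimal sketch of how that pattern behaves, written in Python 3 only for illustration (the same expression should carry over to grep or sed with their own escaping rules). The function name and the sample lines are made up; the only change to the pattern is a group around the part between the quotes.

```python
import re

# The pattern from above, with a capture group so that the surrounding
# quotes are matched but not returned.
URL_IN_QUOTES = re.compile(r'"([a-z]+://[^"]+)"')


def find_urls(text):
    """Return every quoted scheme://... string found in text."""
    return URL_IN_QUOTES.findall(text)


# Made-up sample input for illustration.
sample = (
    '<td><a href="http://example.com/page?id=1">one</a></td>\n'
    '<td><a href="ftp://example.org/file.txt">two</a></td>\n'
    '<td class="plain">no link here</td>'
)
print(find_urls(sample))
```

Quoted attribute values without a scheme, such as class="plain", are skipped because the pattern requires :// after the leading letters.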

Dave Webb 2009-08-18 12:13:12

Answer 4

A:

Do you have access to AWK? A combination of AWK and sed might do what you want, provided that:

The html is relatively simple
The html doesn't change suddenly (I mean in form, not in content)
The html is not excessively convoluted.

It's false that you can never process HTML with regular expressions. It's true that in the general case you can't process HTML (or XML) with regexes, because the markup allows arbitrary nesting and regexes don't handle recursion well, if at all. But if your HTML is relatively 'flat' you can certainly do a lot with regexes.

I can't tell you exactly what to do, because I've forgotten what little AWK and sed I learned in college, but this strikes me as something doable:

Find the string <div id="links">
Now find the string <table>
Now find the string <td>...</td> and get a link from it (this is the regex part).
Append it to var $links
Until you find the string </table>
Finally, print $links separating each link with \n.

Again, this is just pseudocode for the simple case. But it might just work.
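The steps above can be sketched as a small line-by-line state machine. This version is in Python 3 rather than AWK or sed, purely to show the logic; the function name, the cell pattern and the sample page are assumptions, and it presumes the markers appear literally as written in the steps and that each link sits in a <td><a href="..."> cell on one line.

```python
import re

# Matches the href of a link that directly follows an opening <td>.
CELL_LINK = re.compile(r'<td>\s*<a href="([^"]*)"')


def links_in_table(lines):
    """Collect cell links between <table> and </table> after the div."""
    state = "seek_div"
    links = []
    for line in lines:
        if state == "seek_div" and '<div id="links">' in line:
            state = "seek_table"
        if state == "seek_table" and "<table>" in line:
            state = "in_table"
        if state == "in_table":
            # The regex part: pull every cell link out of this line.
            links.extend(CELL_LINK.findall(line))
            if "</table>" in line:
                break
    return links


# Made-up sample page for illustration.
sample = """<html><body>
<td><a href="http://example.com/ignored">outside</a></td>
<div id="links">
<table>
<tr>
<td><a href="http://example.com/one">one</a></td>
<td><a href="http://example.com/two">two</a></td>
</tr>
</table>
</div>
<td><a href="http://example.com/late">after the table</a></td>
</body></html>"""

# Finally, print the links separated by newlines.
print("\n".join(links_in_table(sample.splitlines())))
```

Links before the div or after the closing </table> are ignored, which is the point of tracking the state instead of matching every line.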

I mention AWK because, even if you don't have access to Perl, sed and AWK both tend to be installed.

Finally, for a pure sed solution, you could also take a look at this sed recipe and adapt it to your needs.

Adriano Varoli Piazza 2009-08-18 13:03:42
