If there is at most one link per line and you don't mind some very ugly regular expression hacking, run the following code on your buffer:
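(defun getlinks ()
  "Reduce the current buffer to one url|description line per HTML link."
  ;; 1. Rewrite each line that contains a link as LINK:url|description.
  (goto-char (point-min))
  (replace-regexp "^.*<a href=\"\\([^\"]+\\)\"[^>]*>\\([^<]+\\)</a>.*$" "LINK:\\1|\\2")
  ;; 2. Blank every line that does not start with LINK. The \n in each
  ;;    set stops [^...] from matching a line break and eating the next line.
  (goto-char (point-min))
  (replace-regexp "^\\([^L\n]\\|L[^I\n]\\|LI[^N\n]\\|LIN[^K\n]\\).*$" "")
  ;; 3. Collapse runs of line breaks into one. The line breaks inside the
  ;;    two strings are literal, so the next two lines must stay unindented.
  (goto-char (point-min))
  (replace-regexp "
+" "
")
  ;; 4. Strip the LINK: marker.
  (goto-char (point-min))
  (replace-regexp "^LINK:\\(.*\\)$" "\\1"))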
It rewrites each line that contains a link as LINK:url|description, blanks every line that does not start with LINK, deletes the resulting empty lines, and finally removes the "LINK:" prefix.
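The same extraction can also be sketched as a function that leaves the buffer alone and returns the links as a list, which lifts the one-link-per-line restriction. This is only an illustration: the name my-collect-links is made up here, and the pattern is modeled on the first regexp above, so it carries the same assumptions about the markup (a double-quoted href as the first attribute, no nested tags inside the link text).

```elisp
(defun my-collect-links ()
  "Return a list of \"url|description\" strings for links in the current buffer.
Hypothetical helper for illustration; assumes <a href=\"...\">text</a> markup."
  (let ((links '()))
    (save-excursion
      (goto-char (point-min))
      ;; Group 1 is the URL, group 2 the link text.
      (while (re-search-forward
              "<a href=\"\\([^\"]+\\)\"[^>]*>\\([^<]+\\)</a>" nil t)
        (push (concat (match-string 1) "|" (match-string 2)) links)))
    ;; push builds the list back to front, so restore document order.
    (nreverse links)))
```

Evaluating (my-collect-links) with M-: in the HTML buffer shows the resulting list in the echo area.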
Detailed HOWTO: (1) correct the bug in your example HTML file by replacing <href with <a href, (2) copy the above function into the Emacs *scratch* buffer, (3) hit C-x C-e after the final ")" to evaluate the definition, (4) open your example HTML file, (5) run the function with M-: (getlinks).
Note that the linebreaks in the third replace-regexp are important. Don't indent those two lines.
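If the literal line breaks are awkward to paste, that pass can also be written with the \n escape, which Emacs Lisp reads as a newline character inside a string. A minimal sketch of that step on its own (the function name is made up for illustration):

```elisp
(defun my-squeeze-blank-lines ()
  "Collapse each run of newlines in the current buffer into a single newline.
Hypothetical helper for illustration."
  (save-excursion
    (goto-char (point-min))
    ;; "\n+" is the same string as a literal line break followed by "+".
    (while (re-search-forward "\n+" nil t)
      (replace-match "\n"))))
```

Because the escape sits inside the string, this version can be indented freely.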